API Reference#
This section provides detailed documentation for the core infrastructure classes and modules in the DataFrame Expectations library. For user-facing expectation methods, see Expectation Gallery.
Core Infrastructure#
Base Expectation Classes#
- class dataframe_expectations.core.expectation.DataFrameExpectation(tags: List[str] | None = None)[source]#
Bases: ABC
Base class for DataFrame expectations.
- __init__(tags: List[str] | None = None)[source]#
Initialize the base expectation with optional tags.
- Parameters:
tags – Optional tags as list of strings in “key:value” format. Example: [“priority:high”, “env:test”]
- classmethod infer_data_frame_type(data_frame: pandas.DataFrame | pyspark.sql.DataFrame) → DataFrameType[source]#
Infer the DataFrame type based on the provided DataFrame.
- classmethod num_data_frame_rows(data_frame: pandas.DataFrame | pyspark.sql.DataFrame) → int[source]#
Count the number of rows in the DataFrame.
- validate(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs)[source]#
Validate the DataFrame against the expectation, dispatching to validate_pandas or validate_pyspark based on the inferred DataFrame type.
- abstract validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate a pandas DataFrame against the expectation.
- abstract validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate a PySpark DataFrame against the expectation.
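For example, the helper classmethods can be called directly on the base class (a small sketch; the pandas DataFrame is illustrative):
    import pandas as pd
    from dataframe_expectations.core.expectation import DataFrameExpectation

    df = pd.DataFrame({"age": [21, 34, 18]})

    # Both helpers are classmethods, so no concrete subclass is needed here.
    DataFrameExpectation.infer_data_frame_type(df)  # a DataFrameType member for pandas
    DataFrameExpectation.num_data_frame_rows(df)    # 3

    # On a concrete subclass, validate() dispatches to validate_pandas()
    # or validate_pyspark() based on the inferred DataFrame type.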
Column Expectations#
- class dataframe_expectations.core.column_expectation.DataFrameColumnExpectation(expectation_name: str, column_name: str, fn_violations_pandas: Callable, fn_violations_pyspark: Callable, description: str, error_message: str, tags: List[str] | None = None)[source]#
Bases: DataFrameExpectation
Base class for DataFrame column expectations. This class validates a specific column in a DataFrame against a condition defined by the fn_violations_pandas and fn_violations_pyspark functions.
- __init__(expectation_name: str, column_name: str, fn_violations_pandas: Callable, fn_violations_pyspark: Callable, description: str, error_message: str, tags: List[str] | None = None)[source]#
Template for implementing DataFrame column expectations, where a column value is tested against a condition. The conditions are defined by the fn_violations_pandas and fn_violations_pyspark functions.
- Parameters:
expectation_name – The name of the expectation. This will be used during logging.
column_name – The name of the column to check.
fn_violations_pandas – Function to find violations in a pandas DataFrame.
fn_violations_pyspark – Function to find violations in a PySpark DataFrame.
description – A description of the expectation used in logging.
error_message – The error message to return if the expectation fails.
tags – Optional tags as list of strings in “key:value” format. Example: [“priority:high”, “env:test”]
- row_validation(data_frame_type: DataFrameType, data_frame: pandas.DataFrame | pyspark.sql.DataFrame, fn_violations: Callable, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate the DataFrame against the expectation.
- Parameters:
data_frame_type – The type of DataFrame (Pandas or PySpark).
data_frame – The DataFrame to validate.
fn_violations – The function to find violations.
- Returns:
DataFrameExpectationResultMessage indicating success or failure.
- validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate a pandas DataFrame against the expectation.
- validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate a PySpark DataFrame against the expectation.
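A construction sketch (hypothetical condition; this assumes each violations function receives the DataFrame and returns the subset of rows that violate the condition — check the library's built-in expectations for the exact contract):
    import pandas as pd
    from pyspark.sql import functions as F
    from dataframe_expectations.core.column_expectation import DataFrameColumnExpectation

    expectation = DataFrameColumnExpectation(
        expectation_name="ExpectationAgeNonNegative",  # used in logging
        column_name="age",
        fn_violations_pandas=lambda df: df[df["age"] < 0],
        fn_violations_pyspark=lambda df: df.filter(F.col("age") < 0),
        description="age must be non-negative",
        error_message="Found rows with negative age",
        tags=["priority:high"],
    )

    # validate() picks the pandas or PySpark path automatically.
    result = expectation.validate(pd.DataFrame({"age": [21, -3]}))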
Aggregation Expectations#
- class dataframe_expectations.core.aggregation_expectation.DataFrameAggregationExpectation(expectation_name: str, column_names: List[str], description: str, tags: List[str] | None = None)[source]#
Bases: DataFrameExpectation
Base class for DataFrame aggregation expectations. This class first aggregates the data and then validates the aggregation results.
- __init__(expectation_name: str, column_names: List[str], description: str, tags: List[str] | None = None)[source]#
Template for implementing DataFrame aggregation expectations, where data is first aggregated and then the aggregation results are validated.
- Parameters:
expectation_name – The name of the expectation. This will be used during logging.
column_names – The list of column names to aggregate on.
description – A description of the expectation used in logging.
tags – Optional tags as list of strings in “key:value” format. Example: [“priority:high”, “env:test”]
- abstract aggregate_and_validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Aggregate and validate a pandas DataFrame against the expectation.
Note: This method should NOT check for column existence; that is handled automatically by the validate_pandas method.
- abstract aggregate_and_validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Aggregate and validate a PySpark DataFrame against the expectation.
Note: This method should NOT check for column existence; that is handled automatically by the validate_pyspark method.
- validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate a pandas DataFrame against the expectation. Automatically checks column existence before calling the implementation.
- validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#
Validate a PySpark DataFrame against the expectation. Automatically checks column existence before calling the implementation.
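A subclass sketch (hypothetical; the result-message construction is elided because only DataFrameExpectationFailureMessage is documented in this reference — return the library's result types from the two abstract methods):
    from dataframe_expectations.core.aggregation_expectation import DataFrameAggregationExpectation

    class ExpectationMeanAgeAbove(DataFrameAggregationExpectation):
        """Hypothetical: expect mean(age) to exceed a threshold."""

        def __init__(self, threshold: float, tags=None):
            super().__init__(
                expectation_name="ExpectationMeanAgeAbove",
                column_names=["age"],
                description=f"mean(age) > {threshold}",
                tags=tags,
            )
            self.threshold = threshold

        def aggregate_and_validate_pandas(self, data_frame, **kwargs):
            # Column existence is already checked by validate_pandas().
            mean_age = data_frame["age"].mean()
            passed = mean_age > self.threshold
            ...  # build and return a DataFrameExpectationResultMessage based on `passed`

        def aggregate_and_validate_pyspark(self, data_frame, **kwargs):
            from pyspark.sql import functions as F
            mean_age = data_frame.agg(F.avg("age")).first()[0]
            passed = mean_age > self.threshold
            ...  # build and return a DataFrameExpectationResultMessage based on `passed`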
Expectation Registry#
- class dataframe_expectations.registry.DataFrameExpectationRegistry[source]#
Bases: object
Registry for dataframe expectations.
- classmethod get_all_metadata() → Dict[str, ExpectationMetadata][source]#
Get metadata for all registered expectations.
- Returns:
Dictionary mapping expectation names to their metadata.
- classmethod get_expectation(expectation_name: str, **kwargs) → DataFrameExpectation[source]#
Get an expectation instance by name.
Note: This method is kept for backward compatibility with tests. The suite uses get_expectation_by_suite_method() for better performance.
- Parameters:
expectation_name – The name of the expectation.
kwargs – Parameters to pass to the expectation factory function.
- Returns:
An instance of DataFrameExpectation.
- classmethod get_expectation_by_suite_method(suite_method_name: str, **kwargs) → DataFrameExpectation[source]#
Get an expectation instance by suite method name.
- Parameters:
suite_method_name – The suite method name (e.g., ‘expect_value_greater_than’).
kwargs – Parameters to pass to the expectation factory function.
- Returns:
An instance of DataFrameExpectation.
- Raises:
ValueError – If suite method not found.
- classmethod get_metadata(expectation_name: str) → ExpectationMetadata[source]#
Get metadata for a registered expectation.
- Parameters:
expectation_name – The name of the expectation.
- Returns:
Metadata for the expectation.
- Raises:
ValueError – If expectation not found.
- classmethod get_suite_method_mapping() → Dict[str, str][source]#
Get mapping of suite method names to expectation names.
- Returns:
Dictionary mapping suite method names (e.g., ‘expect_value_greater_than’) to expectation names (e.g., ‘ExpectationValueGreaterThan’).
- classmethod list_expectations() → list[source]#
List all registered expectation names.
- Returns:
List of registered expectation names.
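For example (a sketch; the expectation name and parameters are borrowed from the examples elsewhere on this page):
    from dataframe_expectations.registry import DataFrameExpectationRegistry

    DataFrameExpectationRegistry.list_expectations()         # all registered names
    DataFrameExpectationRegistry.get_suite_method_mapping()  # suite method -> expectation name
    DataFrameExpectationRegistry.get_metadata("ExpectationValueGreaterThan")

    # kwargs are forwarded to the expectation's factory function.
    exp = DataFrameExpectationRegistry.get_expectation(
        "ExpectationValueGreaterThan", column_name="age", value=18
    )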
- classmethod register(name: str, pydoc: str, category: ExpectationCategory, subcategory: ExpectationSubcategory, params_doc: Dict[str, str], suite_method_name: str | None = None)[source]#
Decorator to register an expectation factory function with metadata.
- Parameters:
name – Expectation name (e.g., ‘ExpectationValueGreaterThan’). Required.
pydoc – Human-readable description of the expectation. Required.
category – Category from ExpectationCategory enum. Required.
subcategory – Subcategory from ExpectationSubcategory enum. Required.
params_doc – Documentation for each parameter. Required.
suite_method_name – Override for suite method name. If not provided, auto-generated from expectation name.
- Returns:
Decorator function.
- classmethod remove_expectation(expectation_name: str)[source]#
Remove an expectation from the registry.
- Parameters:
expectation_name – The name of the expectation to remove.
- Raises:
ValueError – If expectation not found.
- dataframe_expectations.registry.register_expectation(name: str, pydoc: str, category: ExpectationCategory, subcategory: ExpectationSubcategory, params_doc: Dict[str, str], suite_method_name: str | None = None)#
Decorator to register an expectation factory function with metadata.
- Parameters:
name – Expectation name (e.g., ‘ExpectationValueGreaterThan’). Required.
pydoc – Human-readable description of the expectation. Required.
category – Category from ExpectationCategory enum. Required.
subcategory – Subcategory from ExpectationSubcategory enum. Required.
params_doc – Documentation for each parameter. Required.
suite_method_name – Override for suite method name. If not provided, auto-generated from expectation name.
- Returns:
Decorator function.
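A registration sketch (hypothetical; the category and subcategory values below are placeholders — substitute real ExpectationCategory / ExpectationSubcategory members, and return a DataFrameExpectation instance from the factory):
    from dataframe_expectations.registry import register_expectation

    @register_expectation(
        name="ExpectationValueNonNegative",  # hypothetical expectation name
        pydoc="Expect all values in a column to be >= 0.",
        category=my_category,        # placeholder: an ExpectationCategory member
        subcategory=my_subcategory,  # placeholder: an ExpectationSubcategory member
        params_doc={"column_name": "The column to check."},
        # suite_method_name omitted: auto-generated from the expectation
        # name, e.g. expect_value_non_negative
    )
    def value_non_negative_factory(column_name: str, **kwargs):
        ...  # build and return a DataFrameExpectation instance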
Result Messages#
- class dataframe_expectations.result_message.DataFrameExpectationFailureMessage(expectation_str: str, data_frame_type: DataFrameType, violations_data_frame: pandas.DataFrame | pyspark.sql.DataFrame | None = None, message: str | None = None, limit_violations: int = 5)[source]#
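A construction sketch (illustrative values; infer_data_frame_type() from the base expectation class is used here to obtain the required DataFrameType member):
    import pandas as pd
    from dataframe_expectations.core.expectation import DataFrameExpectation
    from dataframe_expectations.result_message import DataFrameExpectationFailureMessage

    violations = pd.DataFrame({"age": [-3]})
    failure = DataFrameExpectationFailureMessage(
        expectation_str="ExpectationAgeNonNegative",  # hypothetical expectation
        data_frame_type=DataFrameExpectation.infer_data_frame_type(violations),
        violations_data_frame=violations,
        message="Found rows with negative age",
        limit_violations=5,  # sample at most 5 violating rows (the default)
    )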
Utilities#
- dataframe_expectations.core.utils.requires_params(*required_params, types: Dict[str, Type | Tuple[Type, ...]] | None = None)[source]#
Decorator that validates required parameters and optionally checks their types.
- Parameters:
required_params – Required parameter names
types – Optional dict mapping parameter names to expected types
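A usage sketch (hypothetical factory; how violations are reported — e.g. which exception type is raised — is not specified in this reference):
    from dataframe_expectations.core.utils import requires_params

    @requires_params("column_name", "value", types={"value": (int, float)})
    def make_value_check(**kwargs):
        # Reached only when column_name and value are present and value
        # is an int or float.
        ...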
Suite Classes#
- class dataframe_expectations.suite.DataFrameExpectationsSuite(suite_name: str | None = None, violation_sample_limit: int = 5)[source]#
Bases: object
A builder for creating expectation suites for validating DataFrames.
Use this class to add expectations, then call build() to create an immutable runner that can execute the expectations on DataFrames.
Example:
suite = DataFrameExpectationsSuite(suite_name="user_validation")
suite.expect_value_greater_than(
    column_name="age",
    value=18,
    tags=["priority:high", "category:compliance"],
)
suite.expect_value_less_than(
    column_name="salary",
    value=100000,
    tags=["priority:medium", "category:budget"],
)
suite.expect_min_rows(
    min_rows=10,
    tags=["priority:low", "category:data_quality"],
)

# Build runner for all expectations (no filtering)
runner_all = suite.build()
runner_all.run(df)  # Runs all 3 expectations

# Build runner for high OR medium priority expectations (OR logic)
runner_any = suite.build(
    tags=["priority:high", "priority:medium"],
    tag_match_mode=TagMatchMode.ANY,
)
runner_any.run(df)  # Runs 2 expectations (age and salary checks)

# Build runner for expectations with both high priority AND compliance category (AND logic)
runner_and = suite.build(
    tags=["priority:high", "category:compliance"],
    tag_match_mode=TagMatchMode.ALL,
)
runner_and.run(df)  # Runs 1 expectation (age check - has both tags)
- __init__(suite_name: str | None = None, violation_sample_limit: int = 5)[source]#
Initialize the expectation suite builder.
- Parameters:
suite_name – Optional name for the suite (useful for logging/reporting).
violation_sample_limit – Max number of violation rows to include in results (default 5).
- build(tags: List[str] | None = None, tag_match_mode: TagMatchMode | None = None) → DataFrameExpectationsSuiteRunner[source]#
Build an immutable runner from the current expectations.
This creates a snapshot of the current expectations in the suite. You can continue to add more expectations to this suite and build new runners without affecting previously built runners.
- Parameters:
tags – Optional tag filters as list of strings in “key:value” format, e.g. [“priority:high”, “priority:medium”]. If None or empty, all expectations are included.
tag_match_mode – How to match tags: TagMatchMode.ANY (OR logic) or TagMatchMode.ALL (AND logic). Required if tags are provided; must be None if tags are not provided.
- TagMatchMode.ANY: include expectations that have ANY of the filter tags.
- TagMatchMode.ALL: include expectations that have ALL of the filter tags.
- Returns:
An immutable DataFrameExpectationsSuiteRunner instance.
- Raises:
ValueError – If no expectations have been added, if tag_match_mode validation fails, or if no expectations match the tag filters.
- exception dataframe_expectations.suite.DataFrameExpectationsSuiteFailure(total_expectations: int, failures: List[DataFrameExpectationFailureMessage], result: SuiteExecutionResult | None = None, *args)[source]#
Bases: Exception
Raised when one or more expectations in the suite fail.
- class dataframe_expectations.suite.DataFrameExpectationsSuiteRunner(expectations: List[Any], suite_name: str | None = None, violation_sample_limit: int = 5, tags: List[str] | None = None, tag_match_mode: TagMatchMode | None = None)[source]#
Bases: object
Immutable runner for executing a fixed set of expectations. This class is created by DataFrameExpectationsSuite.build() and runs the expectations on provided DataFrames.
- __init__(expectations: List[Any], suite_name: str | None = None, violation_sample_limit: int = 5, tags: List[str] | None = None, tag_match_mode: TagMatchMode | None = None)[source]#
Initialize the runner with a list of expectations and metadata.
- Parameters:
expectations – List of expectation instances.
suite_name – Optional name for the suite.
violation_sample_limit – Max number of violation rows to include in results.
tags – Optional tag filters as list of strings in “key:value” format, e.g. [“priority:high”, “priority:medium”]. If None or empty, all expectations will run.
tag_match_mode – How to match tags: TagMatchMode.ANY (OR logic) or TagMatchMode.ALL (AND logic). Required if tags are provided; must be None if tags are not provided.
- TagMatchMode.ANY: an expectation matches if it has ANY of the filter tags.
- TagMatchMode.ALL: an expectation matches if it has ALL of the filter tags.
- Raises:
ValueError – If tag_match_mode is provided without tags, or if tags are provided without tag_match_mode, or if tag filters result in zero expectations to run.
- property get_applied_tags: TagSet#
Return the applied tag filters for this runner.
- list_all_expectations() → List[str][source]#
Return a list of all expectation descriptions before filtering.
- Returns:
List of all expectation descriptions as strings in the format: “ExpectationName (description)”
- list_selected_expectations() → List[str][source]#
Return a list of selected expectation descriptions (after filtering).
- Returns:
List of selected expectation descriptions as strings in the format: “ExpectationName (description)”
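Continuing the suite example above (a sketch):
    runner = suite.build(tags=["priority:high"], tag_match_mode=TagMatchMode.ANY)
    print(runner.list_all_expectations())       # every expectation added to the suite
    print(runner.list_selected_expectations())  # only those matching the tag filter
    print(runner.selected_expectations_count)   # number of expectations that will run
    print(runner.get_applied_tags)              # TagSet of the applied filters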
- run(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, raise_on_failure: bool = True, context: Dict[str, Any] | None = None) → SuiteExecutionResult[source]#
Run all expectations on the provided DataFrame with PySpark caching optimization.
- Parameters:
data_frame – The DataFrame to validate.
raise_on_failure – If True (default), raises DataFrameExpectationsSuiteFailure on any failures. If False, returns SuiteExecutionResult instead.
context – Optional runtime context metadata (e.g., {“job_id”: “123”, “env”: “prod”}).
- Returns:
None if raise_on_failure=True and all expectations pass; a SuiteExecutionResult if raise_on_failure=False.
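For example, to collect results without raising (a sketch; df is a pandas or PySpark DataFrame):
    result = runner.run(
        df,
        raise_on_failure=False,
        context={"job_id": "123", "env": "prod"},
    )
    # result is a SuiteExecutionResult; inspect it instead of catching
    # DataFrameExpectationsSuiteFailure.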
- property selected_expectations_count: int#
Return the number of expectations that will run (after filtering).
- validate(func: Callable | None = None, *, allow_none: bool = False) → Callable[source]#
Decorator to validate the DataFrame returned by a function.
This decorator runs the expectations suite on the DataFrame returned by the decorated function. If validation fails, it raises DataFrameExpectationsSuiteFailure.
Example:
runner = suite.build()

@runner.validate
def load_data():
    return pd.read_csv("data.csv")

df = load_data()  # Automatically validated

# Allow None returns
@runner.validate(allow_none=True)
def maybe_load_data():
    if condition:
        return pd.read_csv("data.csv")
    return None
- Parameters:
func – Function that returns a DataFrame.
allow_none – If True, allows the function to return None without validation. If False (default), None will raise a ValueError.
- Returns:
Wrapped function that validates the returned DataFrame.