API Reference#

This section provides detailed documentation for the core infrastructure classes and modules in the DataFrame Expectations library. For user-facing expectation methods, see Expectation Gallery.

Core Infrastructure#

Base Expectation Classes#

class dataframe_expectations.core.expectation.DataFrameExpectation(tags: List[str] | None = None)[source]#

Bases: ABC

Base class for DataFrame expectations.

__init__(tags: List[str] | None = None)[source]#

Initialize the base expectation with optional tags.

Parameters:
  • tags – Optional tags as list of strings in “key:value” format. Example: [“priority:high”, “env:test”]

abstract get_description() → str[source]#

Returns a description of the expectation.

get_expectation_name() → str[source]#

Returns the class name as the expectation name.

get_tags() → TagSet[source]#

Returns the tags for this expectation.

classmethod infer_data_frame_type(data_frame: pandas.DataFrame | pyspark.sql.DataFrame) → DataFrameType[source]#

Infer the DataFrame type based on the provided DataFrame.

classmethod num_data_frame_rows(data_frame: pandas.DataFrame | pyspark.sql.DataFrame) → int[source]#

Count the number of rows in the DataFrame.

validate(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs)[source]#

Validate the DataFrame against the expectation.

abstract validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate a pandas DataFrame against the expectation.

abstract validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate a PySpark DataFrame against the expectation.
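A minimal sketch of a direct subclass, using only the methods documented above. The ExpectationNotEmpty class and its logic are illustrative, not part of the library:

from dataframe_expectations.core.expectation import DataFrameExpectation
from dataframe_expectations.result_message import (
    DataFrameExpectationFailureMessage,
    DataFrameExpectationSuccessMessage,
)

class ExpectationNotEmpty(DataFrameExpectation):
    """Hypothetical expectation: the DataFrame must contain at least one row."""

    def get_description(self) -> str:
        return "DataFrame contains at least one row"

    def _check(self, data_frame):
        # num_data_frame_rows and infer_data_frame_type are inherited helpers.
        if self.num_data_frame_rows(data_frame) > 0:
            return DataFrameExpectationSuccessMessage(self.get_expectation_name())
        return DataFrameExpectationFailureMessage(
            expectation_str=self.get_description(),
            data_frame_type=self.infer_data_frame_type(data_frame),
            message="DataFrame is empty",
        )

    def validate_pandas(self, data_frame, **kwargs):
        return self._check(data_frame)

    def validate_pyspark(self, data_frame, **kwargs):
        return self._check(data_frame)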

Column Expectations#

class dataframe_expectations.core.column_expectation.DataFrameColumnExpectation(expectation_name: str, column_name: str, fn_violations_pandas: Callable, fn_violations_pyspark: Callable, description: str, error_message: str, tags: List[str] | None = None)[source]#

Bases: DataFrameExpectation

Base class for DataFrame column expectations. This class validates a specific column in a DataFrame against a condition defined by the fn_violations_pandas and fn_violations_pyspark functions.

__init__(expectation_name: str, column_name: str, fn_violations_pandas: Callable, fn_violations_pyspark: Callable, description: str, error_message: str, tags: List[str] | None = None)[source]#

Template for implementing DataFrame column expectations, where a column value is tested against a condition. The conditions are defined by the fn_violations_pandas and fn_violations_pyspark functions.

Parameters:
  • expectation_name – The name of the expectation. This will be used during logging.

  • column_name – The name of the column to check.

  • fn_violations_pandas – Function to find violations in a pandas DataFrame.

  • fn_violations_pyspark – Function to find violations in a PySpark DataFrame.

  • description – A description of the expectation used in logging.

  • error_message – The error message to return if the expectation fails.

  • tags – Optional tags as list of strings in “key:value” format. Example: [“priority:high”, “env:test”]

get_description() → str[source]#

Returns a description of the expectation.

get_expectation_name() → str[source]#

Returns the expectation name.

row_validation(data_frame_type: DataFrameType, data_frame: pandas.DataFrame | pyspark.sql.DataFrame, fn_violations: Callable, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate the DataFrame against the expectation.

Parameters:
  • data_frame_type – The type of DataFrame (Pandas or PySpark).

  • data_frame – The DataFrame to validate.

  • fn_violations – The function to find violations.

Returns:

DataFrameExpectationResultMessage indicating success or failure.

validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate a pandas DataFrame against the expectation.

validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate a PySpark DataFrame against the expectation.
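A construction sketch for a hypothetical "non-negative" check. The violation functions are assumed to take the DataFrame and return the rows that violate the condition:

import pandas as pd
import pyspark.sql.functions as F

from dataframe_expectations.core.column_expectation import DataFrameColumnExpectation

non_negative = DataFrameColumnExpectation(
    expectation_name="ExpectationValueNonNegative",
    column_name="amount",
    fn_violations_pandas=lambda df: df[df["amount"] < 0],
    fn_violations_pyspark=lambda df: df.filter(F.col("amount") < 0),
    description="all values in 'amount' are >= 0",
    error_message="Column 'amount' contains negative values",
    tags=["priority:high"],
)

# validate() dispatches to the pandas or PySpark implementation automatically.
result = non_negative.validate(pd.DataFrame({"amount": [10, -3, 7]}))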

Aggregation Expectations#

class dataframe_expectations.core.aggregation_expectation.DataFrameAggregationExpectation(expectation_name: str, column_names: List[str], description: str, tags: List[str] | None = None)[source]#

Bases: DataFrameExpectation

Base class for DataFrame aggregation expectations. This class is designed to first aggregate data and then validate the aggregation results.

__init__(expectation_name: str, column_names: List[str], description: str, tags: List[str] | None = None)[source]#

Template for implementing DataFrame aggregation expectations, where data is first aggregated and then the aggregation results are validated.

Parameters:
  • expectation_name – The name of the expectation. This will be used during logging.

  • column_names – The list of column names to aggregate on.

  • description – A description of the expectation used in logging.

  • tags – Optional tags as list of strings in “key:value” format. Example: [“priority:high”, “env:test”]

abstract aggregate_and_validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Aggregate and validate a pandas DataFrame against the expectation.

Note: This method should NOT check for column existence - that’s handled automatically by the validate_pandas method.

abstract aggregate_and_validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Aggregate and validate a PySpark DataFrame against the expectation.

Note: This method should NOT check for column existence - that’s handled automatically by the validate_pyspark method.

get_description() → str[source]#

Returns a description of the expectation.

get_expectation_name() → str[source]#

Returns the expectation name.

validate_pandas(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate a pandas DataFrame against the expectation. Automatically checks column existence before calling the implementation.

validate_pyspark(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, **kwargs) → DataFrameExpectationResultMessage[source]#

Validate a PySpark DataFrame against the expectation. Automatically checks column existence before calling the implementation.
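A minimal subclass sketch. ExpectationMeanBetween is hypothetical; it assumes only the documented constructor signatures of the result-message classes and the inherited infer_data_frame_type helper:

import pyspark.sql.functions as F

from dataframe_expectations.core.aggregation_expectation import DataFrameAggregationExpectation
from dataframe_expectations.result_message import (
    DataFrameExpectationFailureMessage,
    DataFrameExpectationSuccessMessage,
)

class ExpectationMeanBetween(DataFrameAggregationExpectation):
    """Hypothetical: the mean of a column must fall within [low, high]."""

    def __init__(self, column_name: str, low: float, high: float, tags=None):
        super().__init__(
            expectation_name="ExpectationMeanBetween",
            column_names=[column_name],
            description=f"mean of '{column_name}' is within [{low}, {high}]",
            tags=tags,
        )
        self.column_name, self.low, self.high = column_name, low, high

    def _result(self, data_frame, mean):
        if mean is not None and self.low <= mean <= self.high:
            return DataFrameExpectationSuccessMessage(self.get_expectation_name())
        return DataFrameExpectationFailureMessage(
            expectation_str=self.get_description(),
            data_frame_type=self.infer_data_frame_type(data_frame),
            message=f"mean={mean} outside [{self.low}, {self.high}]",
        )

    def aggregate_and_validate_pandas(self, data_frame, **kwargs):
        # Column existence is already checked by validate_pandas.
        return self._result(data_frame, data_frame[self.column_name].mean())

    def aggregate_and_validate_pyspark(self, data_frame, **kwargs):
        mean = data_frame.agg(F.mean(self.column_name)).first()[0]
        return self._result(data_frame, mean)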

Expectation Registry#

class dataframe_expectations.registry.DataFrameExpectationRegistry[source]#

Bases: object

Registry for dataframe expectations.

classmethod clear_expectations()[source]#

Clear all registered expectations.

classmethod get_all_metadata() → Dict[str, ExpectationMetadata][source]#

Get metadata for all registered expectations.

Returns:

Dictionary mapping expectation names to their metadata.

classmethod get_expectation(expectation_name: str, **kwargs) → DataFrameExpectation[source]#

Get an expectation instance by name.

Note: This method is kept for backward compatibility with tests. The suite uses get_expectation_by_suite_method() for better performance.

Parameters:
  • expectation_name – The name of the expectation.

  • kwargs – Parameters to pass to the expectation factory function.

Returns:

An instance of DataFrameExpectation.

classmethod get_expectation_by_suite_method(suite_method_name: str, **kwargs) → DataFrameExpectation[source]#

Get an expectation instance by suite method name.

Parameters:
  • suite_method_name – The suite method name (e.g., ‘expect_value_greater_than’).

  • kwargs – Parameters to pass to the expectation factory function.

Returns:

An instance of DataFrameExpectation.

Raises:

ValueError – If suite method not found.

classmethod get_metadata(expectation_name: str) → ExpectationMetadata[source]#

Get metadata for a registered expectation.

Parameters:

expectation_name – The name of the expectation.

Returns:

Metadata for the expectation.

Raises:

ValueError – If expectation not found.

classmethod get_suite_method_mapping() → Dict[str, str][source]#

Get mapping of suite method names to expectation names.

Returns:

Dictionary mapping suite method names (e.g., ‘expect_value_greater_than’) to expectation names (e.g., ‘ExpectationValueGreaterThan’).

classmethod list_expectations() → list[source]#

List all registered expectation names.

Returns:

List of registered expectation names.

classmethod register(name: str, pydoc: str, category: ExpectationCategory, subcategory: ExpectationSubcategory, params_doc: Dict[str, str], suite_method_name: str | None = None)[source]#

Decorator to register an expectation factory function with metadata.

Parameters:
  • name – Expectation name (e.g., ‘ExpectationValueGreaterThan’). Required.

  • pydoc – Human-readable description of the expectation. Required.

  • category – Category from ExpectationCategory enum. Required.

  • subcategory – Subcategory from ExpectationSubcategory enum. Required.

  • params_doc – Documentation for each parameter. Required.

  • suite_method_name – Override for suite method name. If not provided, auto-generated from expectation name.

Returns:

Decorator function.

classmethod remove_expectation(expectation_name: str)[source]#

Remove an expectation from the registry.

Parameters:

expectation_name – The name of the expectation to remove.

Raises:

ValueError – If expectation not found.
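A usage sketch for the registry lookups. The suite method name and keyword arguments are assumptions based on the examples in this section:

from dataframe_expectations.registry import DataFrameExpectationRegistry

print(DataFrameExpectationRegistry.list_expectations())
print(DataFrameExpectationRegistry.get_suite_method_mapping())

# kwargs are forwarded to the registered factory function.
expectation = DataFrameExpectationRegistry.get_expectation_by_suite_method(
    "expect_value_greater_than", column_name="age", value=18
)
print(expectation.get_description())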

dataframe_expectations.registry.register_expectation(name: str, pydoc: str, category: ExpectationCategory, subcategory: ExpectationSubcategory, params_doc: Dict[str, str], suite_method_name: str | None = None)#

Decorator to register an expectation factory function with metadata.

Parameters:
  • name – Expectation name (e.g., ‘ExpectationValueGreaterThan’). Required.

  • pydoc – Human-readable description of the expectation. Required.

  • category – Category from ExpectationCategory enum. Required.

  • subcategory – Subcategory from ExpectationSubcategory enum. Required.

  • params_doc – Documentation for each parameter. Required.

  • suite_method_name – Override for suite method name. If not provided, auto-generated from expectation name.

Returns:

Decorator function.
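A hedged registration sketch. The factory, its parameters, and the enum members are illustrative assumptions; the import path of the category enums is not shown in this reference:

from dataframe_expectations.registry import register_expectation

# ExpectationCategory / ExpectationSubcategory must be imported from wherever
# the library defines them; the members used here are assumptions.
@register_expectation(
    name="ExpectationValueNonNegative",
    pydoc="Expect all values in a column to be non-negative.",
    category=ExpectationCategory.COLUMN,          # assumed enum member
    subcategory=ExpectationSubcategory.NUMERIC,   # assumed enum member
    params_doc={"column_name": "The column to check."},
    suite_method_name="expect_value_non_negative",
)
def make_non_negative(column_name: str, **kwargs):
    # Return a DataFrameExpectation instance, e.g. the
    # DataFrameColumnExpectation constructed in the column example above.
    ...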

Result Messages#

class dataframe_expectations.result_message.DataFrameExpectationFailureMessage(expectation_str: str, data_frame_type: DataFrameType, violations_data_frame: pandas.DataFrame | pyspark.sql.DataFrame | None = None, message: str | None = None, limit_violations: int = 5)[source]#

Bases: DataFrameExpectationResultMessage

__init__(expectation_str: str, data_frame_type: DataFrameType, violations_data_frame: pandas.DataFrame | pyspark.sql.DataFrame | None = None, message: str | None = None, limit_violations: int = 5)[source]#

get_violations_data_frame() → pandas.DataFrame | pyspark.sql.DataFrame | None[source]#

Get the DataFrame with violations.

class dataframe_expectations.result_message.DataFrameExpectationResultMessage[source]#

Bases: ABC

Base class for expectation result message.

dataframe_to_str(data_frame_type: DataFrameType, data_frame, rows: int) → str[source]#

Render the DataFrame as a string based on its type.

message: str = ''#

class dataframe_expectations.result_message.DataFrameExpectationSuccessMessage(expectation_name: str, message: str | None = None)[source]#

Bases: DataFrameExpectationResultMessage

__init__(expectation_name: str, message: str | None = None)[source]#

Initialize the expectation success message.
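A consumption sketch, assuming validate() returns the result message produced by the per-backend implementations (expectation and df stand in for the objects from the earlier examples):

from dataframe_expectations.result_message import DataFrameExpectationFailureMessage

result = expectation.validate(df)
if isinstance(result, DataFrameExpectationFailureMessage):
    print(result.message)                            # human-readable failure text
    violations = result.get_violations_data_frame()  # sample of violating rows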

Utilities#

dataframe_expectations.core.utils.requires_params(*required_params, types: Dict[str, Type | Tuple[Type, ...]] | None = None)[source]#

Decorator that validates required parameters and optionally checks their types.

Parameters:
  • required_params – Required parameter names

  • types – Optional dict mapping parameter names to expected types

Usage:

@requires_params("column_name", "value")
def func(**kwargs): ...

@requires_params("column_name", "value", types={"column_name": str, "value": int})
def func(**kwargs): ...
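A runnable sketch of the decorator in practice; make_expectation is a hypothetical factory:

from dataframe_expectations.core.utils import requires_params

@requires_params("column_name", "value", types={"column_name": str, "value": int})
def make_expectation(**kwargs):
    return (kwargs["column_name"], kwargs["value"])

make_expectation(column_name="age", value=18)  # passes validation
# make_expectation(column_name="age")          # rejected: 'value' is missing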

Expectation Suites#

class dataframe_expectations.suite.DataFrameExpectationsSuite(suite_name: str | None = None, violation_sample_limit: int = 5)[source]#

Bases: object

A builder for creating expectation suites for validating DataFrames.

Use this class to add expectations, then call build() to create an immutable runner that can execute the expectations on DataFrames.

Example:

suite = DataFrameExpectationsSuite(suite_name="user_validation")
suite.expect_value_greater_than(
    column_name="age",
    value=18,
    tags=["priority:high", "category:compliance"]
)
suite.expect_value_less_than(
    column_name="salary",
    value=100000,
    tags=["priority:medium", "category:budget"]
)
suite.expect_min_rows(
    min_rows=10,
    tags=["priority:low", "category:data_quality"]
)

# Build runner for all expectations (no filtering)
runner_all = suite.build()
runner_all.run(df)  # Runs all 3 expectations

# Build runner for high OR medium priority expectations (OR logic)
runner_any = suite.build(tags=["priority:high", "priority:medium"], tag_match_mode=TagMatchMode.ANY)
runner_any.run(df)  # Runs 2 expectations (age and salary checks)

# Build runner for expectations with both high priority AND compliance category (AND logic)
runner_and = suite.build(tags=["priority:high", "category:compliance"], tag_match_mode=TagMatchMode.ALL)
runner_and.run(df)  # Runs 1 expectation (age check - has both tags)

__init__(suite_name: str | None = None, violation_sample_limit: int = 5)[source]#

Initialize the expectation suite builder.

Parameters:
  • suite_name – Optional name for the suite (useful for logging/reporting).

  • violation_sample_limit – Max number of violation rows to include in results (default 5).

build(tags: List[str] | None = None, tag_match_mode: TagMatchMode | None = None) → DataFrameExpectationsSuiteRunner[source]#

Build an immutable runner from the current expectations.

This creates a snapshot of the current expectations in the suite. You can continue to add more expectations to this suite and build new runners without affecting previously built runners.

Parameters:
  • tags – Optional tag filters as list of strings in “key:value” format, e.g. [“priority:high”, “priority:medium”]. If None or empty, all expectations are included.

  • tag_match_mode – How to match tags: TagMatchMode.ANY (OR logic) includes expectations with ANY of the filter tags; TagMatchMode.ALL (AND logic) includes expectations with ALL of the filter tags. Required if tags are provided; must be None if tags are not provided.

Returns:

An immutable DataFrameExpectationsSuiteRunner instance.

Raises:

ValueError – If no expectations have been added, if tag_match_mode validation fails, or if no expectations match the tag filters.

exception dataframe_expectations.suite.DataFrameExpectationsSuiteFailure(total_expectations: int, failures: List[DataFrameExpectationFailureMessage], result: SuiteExecutionResult | None = None, *args)[source]#

Bases: Exception

Raised when one or more expectations in the suite fail.

__init__(total_expectations: int, failures: List[DataFrameExpectationFailureMessage], result: SuiteExecutionResult | None = None, *args)[source]#

class dataframe_expectations.suite.DataFrameExpectationsSuiteRunner(expectations: List[Any], suite_name: str | None = None, violation_sample_limit: int = 5, tags: List[str] | None = None, tag_match_mode: TagMatchMode | None = None)[source]#

Bases: object

Immutable runner for executing a fixed set of expectations. This class is created by DataFrameExpectationsSuite.build() and runs the expectations on provided DataFrames.

__init__(expectations: List[Any], suite_name: str | None = None, violation_sample_limit: int = 5, tags: List[str] | None = None, tag_match_mode: TagMatchMode | None = None)[source]#

Initialize the runner with a list of expectations and metadata.

Parameters:
  • expectations – List of expectation instances.

  • suite_name – Optional name for the suite.

  • violation_sample_limit – Max number of violation rows to include in results.

  • tags – Optional tag filters as list of strings in “key:value” format, e.g. [“priority:high”, “priority:medium”]. If None or empty, all expectations will run.

  • tag_match_mode – How to match tags: with TagMatchMode.ANY (OR logic) an expectation matches if it has ANY of the filter tags; with TagMatchMode.ALL (AND logic) it matches only if it has ALL of them. Required if tags are provided; must be None if tags are not provided.

Raises:

ValueError – If tag_match_mode is provided without tags, or if tags are provided without tag_match_mode, or if tag filters result in zero expectations to run.

property get_applied_tags: TagSet#

Return the applied tag filters for this runner.

list_all_expectations() → List[str][source]#

Return a list of all expectation descriptions before filtering.

Returns:

List of all expectation descriptions as strings in the format: “ExpectationName (description)”

list_selected_expectations() → List[str][source]#

Return a list of selected expectation descriptions (after filtering).

Returns:

List of selected expectation descriptions as strings in the format: “ExpectationName (description)”

run(data_frame: pandas.DataFrame | pyspark.sql.DataFrame, raise_on_failure: bool = True, context: Dict[str, Any] | None = None) → SuiteExecutionResult[source]#

Run all expectations on the provided DataFrame with PySpark caching optimization.

Parameters:
  • data_frame – The DataFrame to validate.

  • raise_on_failure – If True (default), raises DataFrameExpectationsSuiteFailure on any failures. If False, returns SuiteExecutionResult instead.

  • context – Optional runtime context metadata (e.g., {“job_id”: “123”, “env”: “prod”}).

Returns:

None if raise_on_failure=True and all pass, SuiteExecutionResult if raise_on_failure=False.
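Example (a sketch: df is any supported DataFrame, and the fields of SuiteExecutionResult are not documented in this section):

runner = suite.build()

# Default: raises DataFrameExpectationsSuiteFailure if any expectation fails.
runner.run(df)

# Collect results instead of raising; context is free-form metadata.
result = runner.run(
    df,
    raise_on_failure=False,
    context={"job_id": "123", "env": "prod"},
)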

property selected_expectations_count: int#

Return the number of expectations that will run (after filtering).

property total_expectations: int#

Return the total number of expectations before filtering.
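A short inspection sketch, reusing the suite and TagMatchMode from the class-level example above:

runner = suite.build(tags=["priority:high"], tag_match_mode=TagMatchMode.ANY)

print(runner.total_expectations)           # all expectations added to the suite
print(runner.selected_expectations_count)  # expectations matching the tag filter
print(runner.list_selected_expectations()) # descriptions of the selected ones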

validate(func: Callable | None = None, *, allow_none: bool = False) → Callable[source]#

Decorator to validate the DataFrame returned by a function.

This decorator runs the expectations suite on the DataFrame returned by the decorated function. If validation fails, it raises DataFrameExpectationsSuiteFailure.

Example:

runner = suite.build()

@runner.validate
def load_data():
    return pd.read_csv("data.csv")

df = load_data()  # Automatically validated

# Allow None returns
@runner.validate(allow_none=True)
def maybe_load_data():
    if condition:
        return pd.read_csv("data.csv")
    return None

Parameters:
  • func – Function that returns a DataFrame.

  • allow_none – If True, allows the function to return None without validation. If False (default), a None return value raises ValueError.

Returns:

Wrapped function that validates the returned DataFrame.