refineryframe package

Submodules

refineryframe.demo module

demo.py - Support module for refineryframe package that contains definitions of testing dataframes, etc.

refineryframe.detect_unexpected module

detect_unexpected.py - Data Quality Checking Module

This module contains functions to detect unexpected values in a pandas DataFrame, helping to identify potential data quality issues. The functions cover various aspects of data validation, including checking for missing values, unexpected data types, duplicates, incorrect date formats, out-of-range numeric values, and date values outside specified date ranges.

Functions:

  1. check_missing_types(dataframe, MISSING_TYPES, independent=True, throw_error, thresholds, logger):

  2. check_missing_values(dataframe, throw_error, thresholds, logger):

  3. check_inf_values(dataframe, independent=True, throw_error, thresholds, logger):

  4. check_date_format(dataframe, expected_date_format='%Y-%m-%d', independent=True, throw_error, thresholds, logger):

  5. check_duplicates(dataframe, subset=None, independent=True, throw_error, thresholds, logger):

  6. check_col_names_types(dataframe, types_dict_str, independent=True, throw_error, thresholds, logger):

  7. check_numeric_range(dataframe, numeric_cols=None, lower_bound=-float('inf'), upper_bound=float('inf'), independent=True, ignore_values=[], throw_error, thresholds, logger):

  8. check_date_range(dataframe, earliest_date='1900-08-25', latest_date='2100-01-01', independent=True, ignore_dates=[], throw_error, thresholds, logger):

  9. check_duplicate_col_names(dataframe, throw_error, logger):

  10. detect_unexpected_values(dataframe, MISSING_TYPES, unexpected_exceptions, unexpected_exceptions_error, unexpected_conditions, thresholds, ids_for_dedup, TEST_DUV_FLAGS_PATH, types_dict_str, expected_date_format, earliest_date, latest_date, numeric_lower_bound, numeric_upper_bound, print_score, logger) -> dict: …

Note:

  • Some functions use the logger parameter for logging warning messages instead of printing.

  • Users can specify exceptions for certain checks using the unexpected_exceptions dictionary.

  • Users can define additional conditions to check for unexpected values using the unexpected_conditions dictionary.

  • The thresholds parameter in the detect_unexpected_values function allows users to set threshold scores for different checks.

  • Each function returns relevant information about detected issues or scores.

refineryframe.detect_unexpected.check_col_names_types(dataframe: DataFrame, types_dict_str: dict, silent: bool = False, independent: bool = True, throw_error: bool = False, thresholds: dict = {'incorrect_dtypes_score': 100, 'missing_score': 100}, logger: Logger | None = None) -> dict

Checks if a given DataFrame has the same column names as keys in a provided dictionary and if those columns have the same data types as the corresponding values in the dictionary.

Parameters:

dataframe : pandas DataFrame

The DataFrame to be checked.

types_dict_str : dict or str

A dictionary with column names as keys and expected data types as values, or a string representation of such a dictionary.

silent : bool, optional

If True, suppress warning messages. Default is False.

independent : bool, optional

If True, return a Boolean indicating if checks passed. If False, return a dictionary containing scores and checks. Default is True.

throw_error : bool, optional

If True, raise a ValueError for failed checks. Default is False.

thresholds : dict, optional

A dictionary containing thresholds for scoring. Default is {'missing_score': 100, 'incorrect_dtypes_score': 100}.

logger : logging.Logger, optional

A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

dict or bool

If independent is True, return a Boolean indicating if checks passed. If independent is False, return a dictionary containing scores and checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_col_names_types
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_col_names_types(dataframe = data)

Raises:

ValueError

If throw_error is True and any checks fail.
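The comparison described above can be sketched without pandas. This is a minimal illustration, not refineryframe's implementation: `compare_names_and_types` is a hypothetical helper, and plain dicts of type names stand in for the DataFrame's dtypes and for `types_dict_str`.

```python
def compare_names_and_types(actual_types, expected_types):
    """Compare actual column types with expected ones (hypothetical helper)."""
    # Expected columns that are absent from the data.
    missing_columns = [c for c in expected_types if c not in actual_types]
    # Columns present in both whose type names differ: (actual, expected).
    incorrect_types = {
        c: (actual_types[c], expected_types[c])
        for c in expected_types
        if c in actual_types and actual_types[c] != expected_types[c]
    }
    return {
        "missing_columns": missing_columns,
        "incorrect_types": incorrect_types,
        "passed": not missing_columns and not incorrect_types,
    }


actual = {"id": "int64", "price": "object"}
expected = {"id": "int64", "price": "float64", "date": "datetime64[ns]"}
result = compare_names_and_types(actual, expected)
print(result)
```

The two result lists mirror the two threshold keys in the signature (`missing_score` and `incorrect_dtypes_score`); how the library turns such findings into scores is not covered by this sketch.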

refineryframe.detect_unexpected.check_date_format(dataframe: DataFrame, expected_date_format: str = '%Y-%m-%d', independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'date_format_score': 100}, logger: Logger | None = None) -> dict

Checks if the values in the datetime columns of the input DataFrame match the expected date format (by default '%Y-%m-%d', that is, YYYY-MM-DD).

Parameters:

dataframe : pandas DataFrame

The DataFrame to be checked for date format.

expected_date_format : str, optional

The expected date format. Default is '%Y-%m-%d'.

independent : bool, optional

If True, return a Boolean indicating if date format checks passed. If False, return a dictionary containing scores and checks. Default is True.

silent : bool, optional

If True, suppress warning messages. Default is False.

throw_error : bool, optional

If True, raise a ValueError for failed date format checks. Default is False.

thresholds : dict, optional

A dictionary containing thresholds for scoring date format checks. Default is {'date_format_score': 100}.

logger : logging.Logger, optional

A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

bool or dict

If independent is True, return a Boolean indicating if date format checks passed. If independent is False, return a dictionary containing scores and checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_date_format
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_date_format(dataframe = data)

Raises:

ValueError

If throw_error is True and date format checks fail.
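As a rough illustration of what matching an expected date format means, the sketch below validates strings with the standard library's `datetime.strptime`. The helper `all_match_date_format` is hypothetical, and a plain list stands in for a DataFrame column; the library's own check may work differently.

```python
from datetime import datetime


def all_match_date_format(values, expected_date_format="%Y-%m-%d"):
    """Return True if every non-missing string parses with the given format."""
    for value in values:
        if value is None:  # missing entries are not treated as format errors here
            continue
        try:
            datetime.strptime(value, expected_date_format)
        except ValueError:
            return False
    return True


good = all_match_date_format(["2023-01-05", None, "1999-12-31"])
bad = all_match_date_format(["2023-01-05", "05/01/2023"])
print(good, bad)
```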

refineryframe.detect_unexpected.check_date_range(dataframe: DataFrame, earliest_date: str = '1900-08-25', latest_date: str = '2100-01-01', independent: bool = True, silent: bool = False, ignore_dates: list = [], throw_error: bool = False, thresholds: dict = {'early_dates_score': 100, 'future_dates_score': 100}, logger: Logger | None = None) -> dict

Checks if date values are within expected date ranges in each column of a DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame to check for date values.

earliest_date : str, optional

The earliest date allowed in the DataFrame. Default is '1900-08-25'.

latest_date : str, optional

The latest date allowed in the DataFrame. Default is '2100-01-01'.

independent : bool, optional

If True, return a boolean indicating whether all checks passed. Default is True.

silent : bool, optional

If True, suppress log warnings. Default is False.

ignore_dates : list, optional

A list of dates to ignore when checking for dates outside the specified range. Default is an empty list.

throw_error : bool, optional

If True, raise an error if issues are found. Default is False.

thresholds : dict, optional

Dictionary containing thresholds for early_dates_score and future_dates_score. Default is {'early_dates_score': 100, 'future_dates_score': 100}.

logger : logging.Logger, optional

Logger object for log messages. If not provided, a new logger will be created.

Returns:

bool or dict

If independent is True, return a boolean indicating whether all checks passed. If independent is False, return a dictionary containing scores and checks information.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_date_range
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_date_range(dataframe = data)

Raises:

ValueError

If throw_error is True and date range checks fail.
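The interplay of `earliest_date`, `latest_date`, and `ignore_dates` can be sketched as follows. The helper `dates_outside_range` and its return keys are hypothetical, and a list of `datetime.date` objects stands in for a date column; this is an illustration under those assumptions, not the library's code.

```python
from datetime import date


def dates_outside_range(values, earliest_date="1900-08-25",
                        latest_date="2100-01-01", ignore_dates=()):
    """Split out dates before earliest_date or after latest_date."""
    earliest = date.fromisoformat(earliest_date)
    latest = date.fromisoformat(latest_date)
    ignored = {date.fromisoformat(d) for d in ignore_dates}
    # Ignored dates (for example a "not delivered" sentinel) are never flagged.
    checked = [v for v in values if v not in ignored]
    return {
        "early_dates": [v for v in checked if v < earliest],
        "future_dates": [v for v in checked if v > latest],
    }


column = [date(1850, 1, 9), date(1800, 5, 1), date(2020, 3, 4), date(2150, 1, 1)]
flagged = dates_outside_range(column, ignore_dates=["1850-01-09"])
print(flagged)
```

Here 1850-01-09 is ignored because it matches the default `date_not_delivered` value from `MISSING_TYPES`, so only the genuinely out-of-range dates are reported.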

refineryframe.detect_unexpected.check_duplicate_col_names(dataframe: DataFrame, throw_error: bool = False, logger: Logger | None = None) -> dict

Checks for duplicate column names in a pandas DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame to check for duplicate column names.

throw_error : bool, optional

If True, raise a ValueError when duplicate column names are found. If False, print a warning message and continue execution. Default is False.

logger : logging.Logger, optional

The logger object to use for logging warning and error messages. Default is the root logger.

Returns:

dict

A dictionary containing information about the duplicates:

'column_name_freq' : dict

A dictionary where keys are duplicate column names, and values are the number of occurrences.

'COLUMN_NAMES_DUPLICATES_TEST' : bool

True if duplicate column names are found, False otherwise.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_duplicate_col_names
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_duplicate_col_names(dataframe = data)

Raises:

ValueError

If throw_error is True and duplicate column names are found.
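The shape of the returned dictionary can be illustrated with a short standard-library sketch. `duplicate_col_names_report` is a hypothetical helper working on a plain list of column names; only the two dictionary keys are taken from the description above.

```python
from collections import Counter


def duplicate_col_names_report(columns):
    """Build a report of repeated column names (hypothetical helper)."""
    counts = Counter(columns)
    # Keep only the names that occur more than once.
    column_name_freq = {name: n for name, n in counts.items() if n > 1}
    return {
        "column_name_freq": column_name_freq,
        "COLUMN_NAMES_DUPLICATES_TEST": bool(column_name_freq),
    }


report = duplicate_col_names_report(["id", "price", "price", "date", "price"])
print(report)
```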

refineryframe.detect_unexpected.check_duplicates(dataframe: DataFrame, subset: list | None = None, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'key_dup_score': 100, 'row_dup_score': 100}, logger: Logger | None = None) -> dict

Checks for duplicates in a pandas DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame to check for duplicates.

subset : list of str or None, optional

A list of column names to consider when identifying duplicates. If not specified or None, all columns are used to identify duplicates.

independent : bool, optional

If True, return a Boolean indicating if duplicate checks passed. If False, return a dictionary containing scores and checks. Default is True.

silent : bool, optional

If True, suppress warning messages. Default is False.

throw_error : bool, optional

If True, raise a ValueError for failed duplicate checks. Default is False.

thresholds : dict, optional

A dictionary containing thresholds for scoring duplicate checks. Default is {'row_dup_score': 100, 'key_dup_score': 100}.

logger : logging.Logger or None, optional

A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

bool or dict

If independent is True, return a Boolean indicating if duplicate checks passed. If independent is False, return a dictionary containing scores and checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_duplicates
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_duplicates(dataframe = data)

Raises:

ValueError

If throw_error is True and duplicate checks fail.
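The difference between full-row duplicates and duplicates on a key `subset` can be sketched as below. `count_duplicates` is a hypothetical helper, and a list of dicts stands in for the DataFrame rows; relating the two counts to `row_dup_score` and `key_dup_score` is an assumption based on the threshold names.

```python
from collections import Counter


def count_duplicates(rows, subset=None):
    """Count surplus rows that repeat an earlier row (or an earlier key)."""
    if subset is None:
        # Compare whole rows: every column takes part in the key.
        keys = [tuple(sorted(row.items())) for row in rows]
    else:
        # Compare only the selected key columns.
        keys = [tuple(row[col] for col in subset) for row in rows]
    return sum(n - 1 for n in Counter(keys).values())


rows = [
    {"id": 1, "value": "a"},
    {"id": 1, "value": "a"},
    {"id": 1, "value": "b"},
    {"id": 2, "value": "c"},
]
row_dups = count_duplicates(rows)
key_dups = count_duplicates(rows, subset=["id"])
print(row_dups, key_dups)
```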

refineryframe.detect_unexpected.check_inf_values(dataframe: DataFrame, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'inf_score': 100}, logger: Logger | None = None) -> dict

Counts the infinite (inf) values in each column of a pandas DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame to count infinite values in.

independent : bool, optional

If True, consider only numeric columns when counting inf values. If False, count inf values in all columns. Default is True.

silent : bool, optional

If True, suppress warning messages. Default is False.

throw_error : bool, optional

If True, raise a ValueError for failed inf value checks. Default is False.

thresholds : dict, optional

A dictionary containing thresholds for scoring inf value checks. Default is {'inf_score': 100}.

logger : logging.Logger, optional

A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

dict

A dictionary containing scores and checks for inf value checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_inf_values
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_inf_values(dataframe = data)

Raises:

ValueError

If throw_error is True and inf value checks fail.
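Counting infinite values per column can be sketched with `math.isinf`. The helper `count_inf_values` below is hypothetical and uses a dict of lists in place of a DataFrame; it reports only columns where at least one infinite value was found.

```python
import math


def count_inf_values(columns):
    """Return {column name: number of +inf/-inf values} for affected columns."""
    counts = {}
    for name, values in columns.items():
        # Only floats can be infinite; strings and ints are skipped.
        n_inf = sum(1 for v in values if isinstance(v, float) and math.isinf(v))
        if n_inf:
            counts[name] = n_inf
    return counts


table = {
    "price": [1.5, float("inf"), -float("inf")],
    "qty": [1, 2, 3],
    "name": ["a", "b", "c"],
}
inf_counts = count_inf_values(table)
print(inf_counts)
```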

refineryframe.detect_unexpected.check_missing_types(dataframe: DataFrame, MISSING_TYPES: dict, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'cat_score': 100, 'date_score': 100, 'numeric_score': 100}, logger: Logger | None = None) -> dict

Checks for instances of missing types in each column of a DataFrame and logs warning messages for any found.

Parameters:

dataframe : pandas DataFrame

The DataFrame to search for missing values.

MISSING_TYPES : dict

A dictionary of missing types to search for. Keys represent the missing type, and values are the corresponding values to search for.

independent : bool, optional

If True, return a boolean indicating whether all checks passed. Default is True.

silent : bool, optional

If True, suppress log warnings. Default is False.

throw_error : bool, optional

If True, raise an error if issues are found. Default is False.

thresholds : dict, optional

Dictionary containing thresholds for numeric_score, date_score, and cat_score. Default is {'numeric_score': 100, 'date_score': 100, 'cat_score': 100}.

logger : logging.Logger, optional

Logger object for log messages. If not provided, a new logger will be created.

Returns:

bool or dict

If independent is True, return a boolean indicating whether all checks passed. If independent is False, return a dictionary containing scores and checks information.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_missing_types
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']
MISSING_TYPES = tiny_example['MISSING_TYPES']

check_missing_types(dataframe = data,
                    MISSING_TYPES = MISSING_TYPES)

Raises:

ValueError

If throw_error is True and missing type checks fail.
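To make the role of `MISSING_TYPES` concrete, the sketch below counts sentinel values per column. The dictionary values are the defaults shown in the `detect_unexpected_values` signature; `count_missing_types` itself is a hypothetical helper operating on a dict of lists, not the library function.

```python
MISSING_TYPES = {
    "character_not_delivered": "missing",
    "date_not_delivered": "1850-01-09",
    "numeric_not_delivered": -999,
}


def count_missing_types(columns, missing_types):
    """Return {(column, missing type): count} for every sentinel that occurs."""
    found = {}
    for name, values in columns.items():
        for missing_type, sentinel in missing_types.items():
            n = sum(1 for v in values if v == sentinel)
            if n:
                found[(name, missing_type)] = n
    return found


table = {
    "city": ["Riga", "missing", "missing"],
    "amount": [10, -999, 25],
    "day": ["2020-01-01", "2020-01-02", "2020-01-03"],
}
found = count_missing_types(table, MISSING_TYPES)
print(found)
```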

refineryframe.detect_unexpected.check_missing_values(dataframe: DataFrame, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'missing_values_score': 100}, logger: Logger | None = None) -> dict

Counts the number of NaN, None, and NaT values in each column of a pandas DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame to count missing values in.

independent : bool, optional

If True, consider only columns with missing values as defined by NaN, None, and NaT. If False, count missing values in all columns. Default is True.

silent : bool, optional

If True, suppress warning messages. Default is False.

throw_error : bool, optional

If True, raise a ValueError for failed missing value checks. Default is False.

thresholds : dict, optional

A dictionary containing thresholds for scoring missing value checks. Default is {'missing_values_score': 100}.

logger : logging.Logger, optional

A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

dict

A dictionary containing scores and checks for missing value checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_missing_values
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_missing_values(dataframe = data)

Raises:

ValueError

If throw_error is True and missing value checks fail.
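A pandas-free sketch of per-column missing-value counting is shown below. It covers only `None` and float NaN, since NaT is a pandas-specific value; `count_missing_values` is a hypothetical helper over a dict of lists and is not the library's implementation.

```python
import math


def count_missing_values(columns):
    """Return {column name: number of None/NaN entries}."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        name: sum(1 for v in values if is_missing(v))
        for name, values in columns.items()
    }


table = {
    "price": [1.0, float("nan"), None],
    "name": ["a", None, "c"],
    "qty": [1, 2, 3],
}
missing_counts = count_missing_values(table)
print(missing_counts)
```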

refineryframe.detect_unexpected.check_numeric_range(dataframe: DataFrame, numeric_cols: list | None = None, lower_bound: float = -inf, upper_bound: float = inf, independent: bool = True, silent: bool = False, ignore_values: list = [], throw_error: bool = False, thresholds: dict = {'low_numeric_score': 100, 'upper_numeric_score': 100}, logger: Logger | None = None) -> dict

Checks if numeric values are within expected ranges in each column of a DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame to check for numeric values.

numeric_cols : list of str, optional

A list of column names to consider. If None, all numeric columns are checked.

lower_bound : float, optional

The lower bound allowed for numeric values. Default is -infinity.

upper_bound : float, optional

The upper bound allowed for numeric values. Default is infinity.

independent : bool, optional

If True, return a boolean indicating whether all checks passed. Default is True.

silent : bool, optional

If True, suppress log warnings. Default is False.

ignore_values : list, optional

A list of values to ignore when checking for values outside the specified range. Default is an empty list.

throw_error : bool, optional

If True, raise an error if issues are found. Default is False.

thresholds : dict, optional

Dictionary containing thresholds for low_numeric_score and upper_numeric_score. Default is {'low_numeric_score': 100, 'upper_numeric_score': 100}.

logger : logging.Logger, optional

Logger object for log messages. If not provided, a new logger will be created.

Returns:

bool or dict

If independent is True, return a boolean indicating whether all checks passed. If independent is False, return a dictionary containing scores and checks information.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_numeric_range
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_numeric_range(dataframe = data)

Raises:

ValueError

If throw_error is True and numeric range checks fail.
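The effect of `lower_bound`, `upper_bound`, and `ignore_values` can be sketched as follows. `values_outside_bounds` and its return keys are hypothetical, and a plain list stands in for a numeric column.

```python
def values_outside_bounds(values, lower_bound=float("-inf"),
                          upper_bound=float("inf"), ignore_values=()):
    """Split out values below lower_bound or above upper_bound."""
    # Ignored values (for example a -999 sentinel) are excluded from the check.
    checked = [v for v in values if v not in ignore_values]
    return {
        "low": [v for v in checked if v < lower_bound],
        "high": [v for v in checked if v > upper_bound],
    }


column = [5, -999, -3, 120]
out_of_range = values_outside_bounds(
    column, lower_bound=0, upper_bound=100, ignore_values=[-999]
)
print(out_of_range)
```

Without `ignore_values=[-999]`, the sentinel would also be reported as a value below the lower bound.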

refineryframe.detect_unexpected.detect_unexpected_values(dataframe: DataFrame, MISSING_TYPES: dict = {'character_not_delivered': 'missing', 'date_not_delivered': '1850-01-09', 'numeric_not_delivered': -999}, unexpected_exceptions: dict = {'col_names_types': 'NONE', 'date_format': 'NONE', 'date_range': 'NONE', 'duplicates': 'NONE', 'inf_values': 'NONE', 'missing_types': 'NONE', 'missing_values': 'NONE', 'numeric_range': 'NONE'}, unexpected_exceptions_error={'col_name_duplicates': False, 'col_names_types': False, 'date_format': False, 'date_range': False, 'duplicates': False, 'inf_values': False, 'missing_types': False, 'missing_values': False, 'numeric_range': False}, unexpected_conditions: dict | None = None, thresholds: dict = {'ccnt_scores': {'incorrect_dtypes_score': 100, 'missing_score': 100}, 'cdf_scores': {'date_format_score': 100}, 'cdr_scores': {'early_dates_score': 100, 'future_dates_score': 100}, 'cmt_scores': {'cat_score': 100, 'date_score': 100, 'numeric_score': 100}, 'cmv_scores': {'missing_values_score': 100}, 'cnr_scores': {'low_numeric_score': 100, 'upper_numeric_score': 100}, 'dup_scores': {'key_dup_score': 100, 'row_dup_score': 100}, 'inf_scores': {'inf_score': 100}}, ids_for_dedup: list | None = None, TEST_DUV_FLAGS_PATH: str | None = None, types_dict_str: dict | None = None, expected_date_format: str = '%Y-%m-%d', earliest_date: str = '1900-08-25', latest_date: str = '2100-01-01', numeric_lower_bound: float = 0, numeric_upper_bound: float = inf, print_score: bool = True, logger: Logger | None = None) -> dict

Detects unexpected values in a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame):

The DataFrame to be checked.

MISSING_TYPES (dict):

Dictionary that maps each missing type (for example 'numeric_not_delivered') to the value that marks data of that type as missing.

unexpected_exceptions (dict):

Dictionary that lists column exceptions for each of the following checks: col_names_types, missing_values, missing_types, inf_values, date_format, duplicates, date_range, and numeric_range.

unexpected_exceptions_error (dict):

Dictionary indicating whether to throw errors for each type of unexpected exception.

unexpected_conditions (dict):

Dictionary containing additional conditions to check for unexpected values.

thresholds (dict):

Dictionary containing threshold scores for different checks.

ids_for_dedup (list):

List of columns to identify duplicates (default is None).

TEST_DUV_FLAGS_PATH (str):

Path for checking unexpected values (default is None).

types_dict_str (str):

String that describes the expected types of the columns (default is None).

expected_date_format (str):

The expected date format (default is ‘%Y-%m-%d’).

earliest_date (str):

The earliest acceptable date (default is “1900-08-25”).

latest_date (str):

The latest acceptable date (default is “2100-01-01”).

numeric_lower_bound (float):

The lowest acceptable value for numeric columns (default is 0).

numeric_upper_bound (float):

The highest acceptable value for numeric columns (default is infinity).

print_score (bool):

Whether to print the duv score (default is True).

logger (logging.Logger):

Logger object for logging messages (default is logging).

Returns:

dict:

duv_score (float): Number between 0 and 1 representing the share of passed tests.

check_scores (dict): Scores for each check.

unexpected_exceptions_scaned (dict): Unexpected exceptions based on detected unexpected values.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import detect_unexpected_values
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

detect_unexpected_values(dataframe = data)

Raises:

Exception:

If any errors occur during the detection process.
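The description of `duv_score` as the share of passed tests can be illustrated with a small calculation. The helper below is hypothetical and assumes every check counts equally; whether the library weights checks this way is not stated here.

```python
def duv_score(check_results):
    """Return the fraction of checks that passed (assumes equal weights)."""
    if not check_results:
        return 1.0  # assumption: no checks run means nothing failed
    passed = sum(1 for ok in check_results.values() if ok)
    return passed / len(check_results)


checks = {
    "missing_values": True,
    "inf_values": True,
    "date_range": False,
    "duplicates": True,
}
score = duv_score(checks)
print(score)
```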

refineryframe.other module

Module Name: other.py

This module contains various utility functions for logging, data manipulation, and data type handling.

Functions:
shoutOUT(output_type="dline", mess=None, dotline_length=50, logger: logging.Logger = logging):

Print a line of text with a specified length and format.

get_type_dict(dataframe: pd.DataFrame, explicit: bool = True, stringout: bool = False, logger: logging.Logger = logging) -> str:

Returns a string representation of a dictionary containing the data types of each column in the given pandas DataFrame.

set_types(dataframe: pd.DataFrame, types_dict_str: dict, replace_dict: dict = None, expected_date_format: str = '%Y-%m-%d', logger: logging.Logger = logging) -> pd.DataFrame:

Change the data types of the columns in the given DataFrame based on a dictionary of intended data types.

treat_unexpected_cond(df: pd.DataFrame, description: str, group: str, features: list, query: str, warning: bool, replace, logger: logging.Logger = logging) -> pd.DataFrame:

Replace unexpected values in a pandas DataFrame with replace values.

Dependencies:
  • logging

  • pandas as pd

  • re

Note:

Please refer to the docstrings of individual functions for detailed information and usage examples.

refineryframe.other.add_index_to_duplicate_columns(dataframe: DataFrame, column_name_freq: dict, logger: Logger | None = None) -> DataFrame

Adds an index to duplicate column names in a pandas DataFrame.

Parameters:

dataframe : pandas DataFrame

The DataFrame containing the duplicate columns.

column_name_freq : dict

A dictionary where keys are duplicate column names, and values are the number of occurrences.

Returns:

pandas DataFrame

The DataFrame with updated column names.
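A minimal sketch of indexing duplicate names is given below. The helper works on a plain list of column names, and the `name_(n)` suffix is a made-up naming scheme for illustration; the suffix the library actually produces may differ.

```python
def add_index_to_duplicates(columns, column_name_freq):
    """Append a running index to every name listed in column_name_freq."""
    seen = {}
    renamed = []
    for name in columns:
        if name in column_name_freq:
            seen[name] = seen.get(name, 0) + 1
            # Hypothetical suffix format, chosen only for this example.
            renamed.append(f"{name}_({seen[name]})")
        else:
            renamed.append(name)
    return renamed


new_columns = add_index_to_duplicates(["id", "price", "price"], {"price": 2})
print(new_columns)
```

The `column_name_freq` argument has the same shape as the 'column_name_freq' entry returned by `check_duplicate_col_names`.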

refineryframe.other.get_type_dict(dataframe: DataFrame, explicit: bool = True, stringout: bool = False, logger: Logger | None = None) -> dict

Returns a string representation of a dictionary containing the data types of each column in the given pandas DataFrame.

Numeric columns will have type 'numeric', date columns will have type 'date', character columns will have type 'category', and columns containing 'id' at the beginning or end of their name will have type 'index'.

Parameters:

dataframe : pandas DataFrame

The DataFrame to extract column data types from.

Returns:

str

A string representation of a dictionary containing the data types of each column in the given DataFrame. The keys are the column names and the values are the corresponding data types.
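The naming rule described above can be sketched on plain Python data. `infer_type_dict` is a hypothetical helper that looks at the first non-missing value of each column, and giving the 'id' rule precedence over the value-based rules is an assumption of this sketch.

```python
from datetime import date
from numbers import Number


def infer_type_dict(columns):
    """Map each column name to 'index', 'date', 'numeric', or 'category'."""
    type_dict = {}
    for name, values in columns.items():
        sample = next((v for v in values if v is not None), None)
        lowered = name.lower()
        if lowered.startswith("id") or lowered.endswith("id"):
            type_dict[name] = "index"
        elif isinstance(sample, date):  # also covers datetime objects
            type_dict[name] = "date"
        elif isinstance(sample, Number) and not isinstance(sample, bool):
            type_dict[name] = "numeric"
        else:
            type_dict[name] = "category"
    return type_dict


table = {
    "customer_id": [1, 2],
    "price": [9.5, 3.0],
    "signup": [date(2020, 1, 1), date(2021, 6, 30)],
    "city": ["Riga", "Oslo"],
}
types = infer_type_dict(table)
print(types)
```

A purely name-based rule like this also labels a column such as "paid" as 'index', which is worth keeping in mind when reviewing an inferred type dictionary.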

refineryframe.other.set_types(dataframe: DataFrame, types_dict_str: dict, replace_dict: dict | None = None, expected_date_format: str = '%Y-%m-%d', logger: Logger | None = None) -> DataFrame

Change the data types of the columns in the given DataFrame based on a dictionary of intended data types.

Args:
dataframe (pandas.DataFrame):

The DataFrame to change the data types of.

types_dict_str (dict):

A dictionary where the keys are the column names and the values are the intended data types for those columns.

replace_dict (dict, optional):

A dictionary containing replacement values for specific columns. Defaults to None.

expected_date_format (str, optional):

The expected date format for date columns. Defaults to ‘%Y-%m-%d’.

Returns:

pandas.DataFrame: The DataFrame with the changed data types.

Raises:

ValueError: If the keys in the dictionary do not match the columns in the DataFrame.

TypeError: If the data types cannot be changed successfully.

refineryframe.other.shoutOUT(output_type: str = 'dline', mess: str | None = None, dotline_length: int = 50, logger: Logger | None = None) -> None

Print a line of text with a specified length and format.

Args:
output_type (str):

The type of line to print. Valid values are “dline” (default), “line”, “pline”, “HEAD1”, “title”, “subtitle”, “subtitle2”, “subtitle3”, and “warning”.

mess (str):

The text to print out.

dotline_length (int):

The length of the line to print.

Returns:

None

Examples:

shoutOUT("HEAD1", mess="Header", dotline_length=50)
shoutOUT(output_type="dline", dotline_length=50)

refineryframe.other.treat_unexpected_cond(df: DataFrame, description: str, group: str, features: list, query: str, warning: bool, replace, logger: Logger | None = None) -> DataFrame

Replace unexpected values in a pandas DataFrame with replace values.

Parameters:

df (pandas DataFrame):

The DataFrame to be checked.

description (str):

Description of the unexpected condition being treated.

group (str):

Group identifier for the unexpected condition.

features (list):

List of column names or regex pattern for selecting columns.

query (str):

Query string for selecting rows based on the unexpected condition.

warning (str):

Warning message to be logged if unexpected condition is found.

replace (object):

Value to replace the unexpected values with.

Returns:

df (pandas DataFrame): The DataFrame with replaced unexpected values, if replace is not None.

refineryframe.refiner module

refineryframe Module

This module provides a Refiner class to encapsulate functions for data refinement and validation. The Refiner class is designed to work with pandas DataFrames and perform various checks and replacements for data preprocessing.

class refineryframe.refiner.Refiner(dataframe: DataFrame, replace_dict: dict | None = None, MISSING_TYPES: dict = {'character_not_delivered': 'missing', 'date_not_delivered': '1850-01-09', 'numeric_not_delivered': -999}, expected_date_format: str = '%Y-%m-%d', mess: str = 'INITIAL PREPROCESSING', shout_type: str = 'HEAD2', logger=logging, logger_name='Refiner', loggerLvl=20, dotline_length: int = 50, lower_bound=-inf, upper_bound=inf, earliest_date='1900-08-25', latest_date='2100-01-01', ids_for_dedup: list = 'ALL', unexpected_exceptions_duv: dict = {'col_names_types': 'NONE', 'date_format': 'NONE', 'date_range': 'NONE', 'duplicates': 'NONE', 'inf_values': 'NONE', 'missing_types': 'NONE', 'missing_values': 'NONE', 'numeric_range': 'NONE'}, unexpected_exceptions_ruv: dict = {'capitalization': 'NONE', 'date_range': 'NONE', 'irregular_values': 'NONE', 'numeric_range': 'NONE', 'unicode_character': 'NONE'}, unexpected_exceptions_error: dict = {'col_name_duplicates': False, 'col_names_types': False, 'date_format': False, 'date_range': False, 'duplicates': False, 'inf_values': False, 'missing_types': False, 'missing_values': False, 'numeric_range': False}, thresholds: dict = {'ccnt_scores': {'incorrect_dtypes_score': 100, 'missing_score': 100}, 'cdf_scores': {'date_format_score': 100}, 'cdr_scores': {'early_dates_score': 100, 'future_dates_score': 100}, 'cmt_scores': {'cat_score': 100, 'date_score': 100, 'numeric_score': 100}, 'cmv_scores': {'missing_values_score': 100}, 'cnr_scores': {'low_numeric_score': 100, 'upper_numeric_score': 100}, 'dup_scores': {'key_dup_score': 100, 'row_dup_score': 100}, 'inf_scores': {'inf_score': 100}}, unexpected_conditions=None, ignore_values=[], ignore_dates=[])

Bases: object

Class that encapsulates functions for data refining and validation.

Attributes:
dataframe (pd.DataFrame):

The input pandas DataFrame to be processed.

replace_dict (dict, optional):

A dictionary to define replacements for specific values in the DataFrame.

MISSING_TYPES (dict, optional):

Default values for missing types in different columns of the DataFrame.

expected_date_format (str, optional):

The expected date format for date columns in the DataFrame.

mess (str, optional):

A custom message used in the shout method for printing.

shout_type (str, optional):

The type of output for the shout method (e.g., ‘HEAD2’).

logger (logging.Logger, optional):

A custom logger object for logging messages.

logger_name (str, optional):

The name of the logger for the class instance.

loggerLvl (int, optional):

The logging level for the logger.

dotline_length (int, optional):

The length of the line to be printed in the shout method.

lower_bound (float, optional):

The lower bound for numeric range validation.

upper_bound (float, optional):

The upper bound for numeric range validation.

earliest_date (str, optional):

The earliest allowed date for date range validation.

latest_date (str, optional):

The latest allowed date for date range validation.

ids_for_dedup (list, optional):

A list of column names to be used for duplicate detection.

unexpected_exceptions_duv (dict, optional):

A dictionary of column exceptions for the checks run by detect_unexpected_values (duv).

unexpected_exceptions_ruv (dict, optional):

A dictionary of column exceptions for the replacements made by replace_unexpected_values (ruv).

unexpected_exceptions_error (dict, optional):

A dictionary that indicates, for each check, whether an error should be raised during detect_unexpected_values.

unexpected_conditions (dict or None, optional):

A dictionary of additional custom conditions to check for unexpected values.

ignore_values (list, optional):

A list of values to ignore during numeric range validation.

ignore_dates (list, optional):

A list of dates to ignore during date range validation.

Methods:

shout(mess=None): Prints a line of text with a specified length and format.

get_type_dict_from_dataframe(explicit=True, stringout=False): Returns a dictionary containing the data types of each column in the given pandas DataFrame.

set_type_dict(type_dict=None, explicit=True, stringout=False): Changes the data types of the columns in the DataFrame based on a dictionary of intended data types.

set_types(type_dict=None, replace_dict=None, expected_date_format=None): Changes the data types of the columns in the DataFrame based on a dictionary of intended data types.

get_refiner_settings(): Extracts values of parameters from the Refiner and saves them in a dictionary for later use.

set_refiner_settings(settings: dict): Updates input parameters with values from the provided settings dict.

check_duplicate_col_names(throw_error=None): Checks for duplicate column names in a pandas DataFrame.

add_index_to_duplicate_columns(column_names_freq: dict): Adds an index to duplicate column names in a pandas DataFrame.

check_missing_types(): Searches for instances of missing types in each column of the DataFrame.

check_missing_values(): Counts the number of NaN, None, and NaT values in each column of the DataFrame.

check_inf_values(): Counts the inf values in each column of the DataFrame.

check_date_format(): Checks if the values in the datetime columns have the expected 'YYYY-MM-DD' format.

check_duplicates(subset=None): Checks for duplicates in the DataFrame.

check_col_names_types(): Checks if the DataFrame has the same column names as the types_dict_str dictionary and those columns have the same types as items in the dictionary.

check_numeric_range(numeric_cols=None, lower_bound=None, upper_bound=None, ignore_values=None): Checks if numeric values are in expected ranges.

check_date_range(earliest_date=None, latest_date=None, ignore_dates=None): Checks if dates are in expected ranges.

detect_unexpected_values(MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, ids_for_dedup=None, TEST_DUV_FLAGS_PATH=None, types_dict_str=None, expected_date_format=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None, print_score=True): Detects unexpected values in the DataFrame.

get_unexpected_exceptions_scaned(dataframe=None): Returns unexpected_exceptions with appropriate settings for the values in the DataFrame.

replace_unexpected_values(MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, TEST_RUV_FLAGS_PATH=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None): Replaces unexpected values in the DataFrame with missing types based on a dictionary of unexpected exceptions.

add_index_to_duplicate_columns(column_name_freq=None) -> None

Adds an index to duplicate column names in a pandas DataFrame.
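The renaming idea can be sketched in plain Python. The helper name `index_duplicate_names` and the `name_N` suffix scheme are assumptions made for illustration; the source does not state the naming scheme the method actually uses.

```python
from collections import Counter


def index_duplicate_names(names):
    """Hypothetical helper: append a running index to names that repeat."""
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how many copies of each name were emitted so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            # Assumed suffix scheme: <name>_<running index starting at 1>
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


renamed = index_duplicate_names(["id", "value", "value", "date", "value"])
print(renamed)
```

Names that occur only once are left untouched, so applying the helper to an already unique list returns it unchanged.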

check_col_names_types() -> None

Checks that the dataframe's column names match the keys of a given dictionary and that each column's type matches the corresponding dictionary value.

check_date_format() -> None

Checks if the values in the datetime columns of the input dataframe have the expected ‘YYYY-MM-DD’ format.
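A format check of this kind can be sketched with the standard library alone. The helper `find_bad_date_formats` is hypothetical and works on a list of strings rather than on DataFrame columns; it is not the package's implementation.

```python
from datetime import datetime


def find_bad_date_formats(values, expected_date_format="%Y-%m-%d"):
    """Hypothetical helper: return the values that do not parse with the format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, expected_date_format)
        except ValueError:
            # Raised both for a wrong layout and for impossible dates.
            bad.append(value)
    return bad


bad_dates = find_bad_date_formats(["2021-02-03", "03/02/2021", "2021-13-01"])
print(bad_dates)
```

Note that `strptime` is lenient about zero padding, so a stricter check would need an additional pattern match.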

check_date_range(earliest_date=None, latest_date=None, ignore_dates=None) -> None

Checks if dates are in expected ranges.

check_duplicate_col_names(throw_error=None) -> None

Checks for duplicate column names in a pandas DataFrame.

check_duplicates(subset=None) -> None

Checks for duplicates in a pandas DataFrame.
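The role of `subset` can be illustrated with a small stand-alone sketch. Rows are modelled as plain dictionaries and `count_duplicate_rows` is a hypothetical helper, not the package's code: with no subset, whole rows are compared; with a subset, only the listed columns are.

```python
from collections import Counter


def count_duplicate_rows(rows, subset=None):
    """Hypothetical helper: count rows that repeat an earlier row."""
    keys = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        keys.append(tuple(row[column] for column in columns))
    counts = Counter(keys)
    # Every occurrence beyond the first one counts as a duplicate.
    return sum(n - 1 for n in counts.values() if n > 1)


rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Oslo"},
    {"id": 1, "city": "Oslo"},
]
full_row_duplicates = count_duplicate_rows(rows)
city_duplicates = count_duplicate_rows(rows, subset=["city"])
print(full_row_duplicates, city_duplicates)
```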

check_inf_values() -> None

Counts the inf values in each column of a pandas DataFrame.

check_missing_types() -> None

Takes a DataFrame and a dictionary of missing types as input, and searches for any instances of these missing types in each column of the DataFrame.

If any instances are found, a warning message is logged containing the column name, the missing value, and the count of missing values found.
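As a rough illustration of the search described above, the sketch below counts sentinel values per column in plain Python. The `MISSING_TYPES` values are the defaults shown in the replace_unexpected_values signature; the helper `count_missing_types` and the dict-of-lists stand-in for a DataFrame are assumptions.

```python
MISSING_TYPES = {
    "character_not_delivered": "missing",
    "date_not_delivered": "1850-01-09",
    "numeric_not_delivered": -999,
}


def count_missing_types(columns, missing_types):
    """Hypothetical helper: return {(column, missing_value): count}."""
    found = {}
    for name, values in columns.items():
        for sentinel in missing_types.values():
            count = sum(1 for value in values if value == sentinel)
            if count:
                found[(name, sentinel)] = count
    return found


columns = {
    "name": ["Ann", "missing", "missing"],
    "age": [34, -999, 51],
    "joined": ["2020-05-01", "1850-01-09", "2021-01-01"],
}
found = count_missing_types(columns, MISSING_TYPES)
for (name, sentinel), count in found.items():
    # Mirrors the three pieces of information named in the description.
    print(f"Column {name}: ({sentinel}) : {count}")
```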

check_missing_values() -> None

Counts the number of NaN, None, and NaT values in each column of a pandas DataFrame.
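A minimal stand-alone sketch of such a count is shown below. It covers only `None` and float NaN, because NaT is a pandas-specific value; `count_missing_values` is a hypothetical helper operating on a dict of lists.

```python
import math


def count_missing_values(columns):
    """Hypothetical helper: count None and NaN entries in each column."""
    counts = {}
    for name, values in columns.items():
        counts[name] = sum(
            1
            for value in values
            if value is None or (isinstance(value, float) and math.isnan(value))
        )
    return counts


missing_counts = count_missing_values(
    {
        "age": [34, None, float("nan")],
        "city": ["Oslo", None, "Riga"],
        "score": [0.5, 0.7, 0.9],
    }
)
print(missing_counts)
```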

check_numeric_range(numeric_cols: list | None = None, lower_bound=None, upper_bound=None, ignore_values=None) -> None

Checks if numeric values are in expected ranges.
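How the four parameters could interact is sketched below in plain Python. The helper `count_out_of_range` is hypothetical, and treating `ignore_values` as values exempt from the bounds (for example a missing-value sentinel such as -999) is an assumption.

```python
import math


def count_out_of_range(columns, numeric_cols=None, lower_bound=-math.inf,
                       upper_bound=math.inf, ignore_values=None):
    """Hypothetical helper: count values outside [lower_bound, upper_bound]."""
    ignore = set(ignore_values or [])
    names = numeric_cols if numeric_cols is not None else list(columns)
    counts = {}
    for name in names:
        bad = 0
        for value in columns[name]:
            if not isinstance(value, (int, float)) or isinstance(value, bool):
                continue  # skip non-numeric entries
            if value in ignore or math.isnan(value):
                continue  # skip exempt values and NaN
            if not lower_bound <= value <= upper_bound:
                bad += 1
        if bad:
            counts[name] = bad
    return counts


columns = {"age": [34, -999, 151, 27], "score": [0.2, 1.4, -0.1, 0.9]}
all_cols = count_out_of_range(columns, lower_bound=0, upper_bound=120,
                              ignore_values=[-999])
age_only = count_out_of_range(columns, numeric_cols=["age"], lower_bound=0,
                              upper_bound=120, ignore_values=[-999])
print(all_cols, age_only)
```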

detect_unexpected_values(dataframe=None, MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, ids_for_dedup=None, TEST_DUV_FLAGS_PATH=None, types_dict_str=None, expected_date_format=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None, thresholds=None, print_score=True) -> None

Detects unexpected values in a pandas DataFrame.

get_refiner_settings() -> dict

Extracts the values of parameters from the Refiner and saves them in a dictionary for later use.

get_type_dict_from_dataframe(explicit=True, stringout=False) -> dict

Returns a dictionary or string representation of a dictionary containing the data types of each column in the given pandas DataFrame.

Numeric columns will have type ‘numeric’, date columns will have type ‘date’, character columns will have type ‘category’, and columns containing ‘id’ at the beginning or end of their name will have type ‘index’.
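The classification rule above can be sketched without pandas. The helper `sketch_type_dict` is hypothetical and inspects Python values in a dict of lists instead of DataFrame dtypes; giving the name-based 'index' rule precedence over the value-based rules is an assumption.

```python
from datetime import date, datetime


def sketch_type_dict(columns):
    """Hypothetical helper: classify columns given as {name: list_of_values}."""
    type_dict = {}
    for name, values in columns.items():
        present = [value for value in values if value is not None]
        lowered = name.lower()
        if lowered.startswith("id") or lowered.endswith("id"):
            # Assumption: the name rule wins over the value-based rules.
            type_dict[name] = "index"
        elif present and all(isinstance(v, (date, datetime)) for v in present):
            type_dict[name] = "date"
        elif present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            type_dict[name] = "numeric"
        else:
            type_dict[name] = "category"
    return type_dict


type_dict = sketch_type_dict(
    {
        "customer_id": [1, 2, 3],
        "amount": [10.5, 20.0, None],
        "signup_date": [date(2020, 1, 1), date(2021, 6, 30)],
        "country": ["LT", "US", "DE"],
    }
)
print(type_dict)
```

A purely name-based rule also catches columns such as "paid", which ends in "id", so the result is worth reviewing before it is passed on to set_type_dict.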

get_unexpected_exceptions_scaned(dataframe=None) -> dict

Returns unexpected_exceptions with settings appropriate for the values in the dataframe.

initialize_logger()

Initialize a logger for the class instance based on the specified logging level and logger name.

replace_unexpected_values(MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, TEST_RUV_FLAGS_PATH=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None) -> None

Replaces unexpected values in a pandas DataFrame with missing types.

set_refiner_settings(settings: dict) -> None

Updates input parameters with values from the provided settings dict.

set_type_dict(type_dict=None, explicit=True, stringout=False) -> None

Changes the data types of the columns in the given DataFrame based on a dictionary of intended data types.

set_types(type_dict=None, replace_dict=None, expected_date_format=None)

Changes the data types of the columns in the given DataFrame based on a dictionary of intended data types.

set_updated_dataframe(dataframe: DataFrame) -> None

Updates the dataframe stored inside the Refiner class. Useful when the dataframe is manipulated between steps.

shout(mess=None) -> None

Prints a line of text with a specified length and format.

refineryframe.replace_unexpected module

replace_unexpected.py - Data Replacement Module

This module contains a function to replace unexpected values in a pandas DataFrame with specified missing types. It covers various aspects of data validation, including replacing missing values, out-of-range numeric values, date values outside specified date ranges, and handling character-related issues such as capitalization and Unicode characters.

refineryframe.replace_unexpected.replace_unexpected_values(dataframe: DataFrame, MISSING_TYPES: dict = {'character_not_delivered': 'missing', 'date_not_delivered': '1850-01-09', 'numeric_not_delivered': -999}, unexpected_exceptions: dict = {'capitalization': 'NONE', 'date_range': 'NONE', 'irregular_values': 'NONE', 'numeric_range': 'NONE', 'unicode_character': 'NONE'}, unexpected_conditions=None, TEST_RUV_FLAGS_PATH: str | None = None, earliest_date: str = '1900-08-25', latest_date: str = '2100-01-01', numeric_lower_bound: float = 0, numeric_upper_bound: float = inf, logger: Logger | None = None) -> dict

Replace unexpected values in a pandas DataFrame with missing types.

Parameters:

dataframe (pandas DataFrame):

The DataFrame to be checked.

MISSING_TYPES (dict):

Dictionary that maps each missing type (character_not_delivered, date_not_delivered, numeric_not_delivered) to the value that represents it.

unexpected_exceptions (dict):

Dictionary that lists column exceptions for each of the following checks: capitalization, date_range, irregular_values, numeric_range, and unicode_character.

TEST_RUV_FLAGS_PATH (str):

Path for checking unexpected values (default is None).

earliest_date (str):

The earliest acceptable date (default is “1900-08-25”).

latest_date (str):

The latest acceptable date (default is “2100-01-01”).

numeric_lower_bound (float):

The lowest acceptable value for numeric columns (default is 0).

numeric_upper_bound (float):

The highest acceptable value for numeric columns (default is infinity).

Returns:

ruv_score: data quality score, a number between 0 and 1.
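The replacement step for range violations can be sketched in plain Python using the default values from the signature above. The helper `replace_outside` is hypothetical and is not the package's implementation; the score is left out because the source does not define how it is computed.

```python
import math

# Defaults copied from the documented signature.
MISSING_TYPES = {
    "character_not_delivered": "missing",
    "date_not_delivered": "1850-01-09",
    "numeric_not_delivered": -999,
}


def replace_outside(values, lower, upper, missing_value):
    """Hypothetical helper: swap values outside [lower, upper] for missing_value."""
    return [value if lower <= value <= upper else missing_value for value in values]


# Numeric column checked against numeric_lower_bound=0, numeric_upper_bound=inf.
amounts = replace_outside(
    [12.0, -5.0, 40.0], 0, math.inf, MISSING_TYPES["numeric_not_delivered"]
)

# ISO-formatted date strings compare correctly as plain strings.
dates = replace_outside(
    ["1899-12-31", "2020-01-01"],
    "1900-08-25",
    "2100-01-01",
    MISSING_TYPES["date_not_delivered"],
)
print(amounts, dates)
```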

Module contents