refineryframe package

Submodules

refineryframe.demo module

demo.py - Support module for the refineryframe package. It contains definitions of test DataFrames and related objects used in the examples.

refineryframe.detect_unexpected module

detect_unexpected.py - Data Quality Checking Module

This module contains functions to detect unexpected values in a pandas DataFrame, helping to identify potential data quality issues. The functions cover various aspects of data validation, including checking for missing values, unexpected data types, duplicates, incorrect date formats, out-of-range numeric values, and date values outside specified date ranges.

Functions:

check_missing_types(dataframe, MISSING_TYPES, independent=True, throw_error, thresholds, logger):
…
check_missing_values(dataframe, throw_error, thresholds, logger):
…
check_inf_values(dataframe, independent=True, throw_error, thresholds, logger):
…
check_date_format(dataframe, expected_date_format='%Y-%m-%d', independent=True, throw_error, thresholds, logger):
…
check_duplicates(dataframe, subset=None, independent=True, throw_error, thresholds, logger):
…
check_col_names_types(dataframe, types_dict_str, independent=True, throw_error, thresholds, logger):
…
check_numeric_range(dataframe, numeric_cols=None, lower_bound=-float('inf'), upper_bound=float('inf'), independent=True, ignore_values=[], throw_error, thresholds, logger):
…
check_date_range(dataframe, earliest_date='1900-08-25', latest_date='2100-01-01', independent=True, ignore_dates=[], throw_error, thresholds, logger):
…
check_duplicate_col_names(dataframe, throw_error, logger):
…
detect_unexpected_values(dataframe, MISSING_TYPES, unexpected_exceptions, unexpected_exceptions_error, unexpected_conditions, thresholds, ids_for_dedup, TEST_DUV_FLAGS_PATH, types_dict_str, expected_date_format, earliest_date, latest_date, numeric_lower_bound, numeric_upper_bound, print_score, logger) -> dict: …

Note:

Some functions use the logger parameter for logging warning messages instead of printing.
Users can specify exceptions for certain checks using the unexpected_exceptions dictionary.
Users can define additional conditions to check for unexpected values using the unexpected_conditions dictionary.
The thresholds parameter in the detect_unexpected_values function allows users to set threshold scores for different checks.
Each function returns relevant information about detected issues or scores.

refineryframe.detect_unexpected.check_col_names_types(dataframe: DataFrame, types_dict_str: dict, silent: bool = False, independent: bool = True, throw_error: bool = False, thresholds: dict = {'incorrect_dtypes_score': 100, 'missing_score': 100}, logger: Logger | None = None) → dict

Checks if a given DataFrame has the same column names as keys in a provided dictionary and if those columns have the same data types as the corresponding values in the dictionary.

Parameters:

dataframe (pandas DataFrame): The DataFrame to be checked.
types_dict_str (dict or str): A dictionary with column names as keys and expected data types as values, or a string representation of such a dictionary.
silent (bool, optional): If True, suppress warning messages. Default is False.
independent (bool, optional): If True, return a Boolean indicating if checks passed. If False, return a dictionary containing scores and checks. Default is True.
throw_error (bool, optional): If True, raise a ValueError for failed checks. Default is False.
thresholds (dict, optional): A dictionary containing thresholds for scoring. Default is {'missing_score': 100, 'incorrect_dtypes_score': 100}.
logger (logging.Logger, optional): A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

dict or bool: If independent is True, return a Boolean indicating if checks passed. If independent is False, return a dictionary containing scores and checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_col_names_types
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_col_names_types(dataframe = data)

Raises:

ValueError: If throw_error is True and any checks fail.
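
The comparison described above can be pictured with a small dependency-free sketch. It is not the library's implementation: `compare_names_and_types` and the keys of its report are hypothetical names chosen for illustration, and the dtype names are plain strings.

```python
def compare_names_and_types(actual_types, expected_types):
    """Compare two {column: dtype-name} mappings (hypothetical helper)."""
    # Columns that are expected but absent from the data.
    missing = sorted(set(expected_types) - set(actual_types))
    # Columns present in the data but not listed in the expected mapping.
    unexpected = sorted(set(actual_types) - set(expected_types))
    # Columns present in both mappings whose dtype names differ.
    wrong_type = {
        col: {"expected": expected_types[col], "actual": actual_types[col]}
        for col in expected_types
        if col in actual_types and actual_types[col] != expected_types[col]
    }
    passed = not (missing or unexpected or wrong_type)
    return {
        "passed": passed,
        "missing": missing,
        "unexpected": unexpected,
        "wrong_type": wrong_type,
    }


# With pandas, the first mapping could be built as
# {col: str(dtype) for col, dtype in df.dtypes.items()}.
actual = {"id": "int64", "price": "float64", "day": "object"}
expected = {"id": "int64", "price": "float64", "day": "datetime64[ns]", "qty": "int64"}
report = compare_names_and_types(actual, expected)
print(report)
```

Here `qty` is reported as missing and `day` as having the wrong type, so the overall check does not pass.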

refineryframe.detect_unexpected.check_date_format(dataframe: DataFrame, expected_date_format: str = '%Y-%m-%d', independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'date_format_score': 100}, logger: Logger | None = None) → dict

Checks if the values in the datetime columns of the input DataFrame match the expected date format (by default '%Y-%m-%d', that is, YYYY-MM-DD).

Parameters:

dataframe (pandas DataFrame): The DataFrame to be checked for date format.
expected_date_format (str, optional): The expected date format. Default is '%Y-%m-%d'.
independent (bool, optional): If True, return a Boolean indicating if date format checks passed. If False, return a dictionary containing scores and checks. Default is True.
silent (bool, optional): If True, suppress warning messages. Default is False.
throw_error (bool, optional): If True, raise a ValueError for failed date format checks. Default is False.
thresholds (dict, optional): A dictionary containing thresholds for scoring date format checks. Default is {'date_format_score': 100}.
logger (logging.Logger, optional): A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

bool or dict: If independent is True, return a Boolean indicating if date format checks passed. If independent is False, return a dictionary containing scores and checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_date_format
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_date_format(dataframe = data)

Raises:

ValueError: If throw_error is True and date format checks fail.
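
As a rough illustration of what a format check of this kind involves, the sketch below validates date strings with the standard library. `matches_date_format` is a hypothetical helper, not part of refineryframe, and the library may validate dates differently.

```python
from datetime import datetime


def matches_date_format(value, expected_date_format="%Y-%m-%d"):
    """Return True if `value` parses with the given strptime format."""
    try:
        datetime.strptime(value, expected_date_format)
    except ValueError:
        # Raised for a wrong layout and for impossible dates alike.
        return False
    return True


values = ["2023-01-31", "31/01/2023", "2023-02-30"]
bad = [v for v in values if not matches_date_format(v)]
print(bad)
```

Both the wrongly laid out value and the impossible date (30 February) end up in `bad`.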

refineryframe.detect_unexpected.check_date_range(dataframe: DataFrame, earliest_date: str = '1900-08-25', latest_date: str = '2100-01-01', independent: bool = True, silent: bool = False, ignore_dates: list = [], throw_error: bool = False, thresholds: dict = {'early_dates_score': 100, 'future_dates_score': 100}, logger: Logger | None = None) → dict

Checks if date values are within expected date ranges in each column of a DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to check for date values.
earliest_date (str, optional): The earliest date allowed in the DataFrame. Default is '1900-08-25'.
latest_date (str, optional): The latest date allowed in the DataFrame. Default is '2100-01-01'.
independent (bool, optional): If True, return a boolean indicating whether all checks passed. Default is True.
silent (bool, optional): If True, suppress log warnings. Default is False.
ignore_dates (list, optional): A list of dates to ignore when checking for dates outside the specified range. Default is an empty list.
throw_error (bool, optional): If True, raise an error if issues are found. Default is False.
thresholds (dict, optional): Dictionary containing thresholds for early_dates_score and future_dates_score. Default is {'early_dates_score': 100, 'future_dates_score': 100}.
logger (logging.Logger, optional): Logger object for log messages. If not provided, a new logger will be created.

Returns:

bool or dict: If independent is True, return a boolean indicating whether all checks passed. If independent is False, return a dictionary containing scores and checks information.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_date_range
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_date_range(dataframe = data)

Raises:

ValueError: If throw_error is True and date range checks fail.
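
The interplay of `earliest_date`, `latest_date` and `ignore_dates` can be sketched without pandas. The helper below is hypothetical, and treating the bounds as inclusive is an assumption; the library's exact rule is not stated here.

```python
from datetime import date


def dates_outside_range(values, earliest_date="1900-08-25",
                        latest_date="2100-01-01", ignore_dates=()):
    """Split ISO date strings into those before and after the allowed range."""
    lo = date.fromisoformat(earliest_date)
    hi = date.fromisoformat(latest_date)
    ignored = {date.fromisoformat(d) for d in ignore_dates}
    early, future = [], []
    for value in values:
        d = date.fromisoformat(value)
        if d in ignored:
            continue  # e.g. a placeholder date that marks missing data
        if d < lo:
            early.append(value)
        elif d > hi:
            future.append(value)
    return {"early": early, "future": future}


values = ["1850-01-09", "1899-12-31", "2024-05-01", "2150-01-01"]
result = dates_outside_range(values, ignore_dates=["1850-01-09"])
print(result)
```

The placeholder date `1850-01-09` is skipped because it is listed in `ignore_dates`, so only the two real out-of-range values are reported.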

refineryframe.detect_unexpected.check_duplicate_col_names(dataframe: DataFrame, throw_error: bool = False, logger: Logger | None = None) → dict

Checks for duplicate column names in a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to check for duplicate column names.
throw_error (bool, optional): If True, raise a ValueError when duplicate column names are found. If False, print a warning message and continue execution. Default is False.
logger (logging.Logger, optional): The logger object to use for logging warning and error messages. Default is the root logger.

Returns:

dict: A dictionary containing information about the duplicates, with the following keys:

'column_name_freq' (dict): A dictionary where keys are duplicate column names, and values are the number of occurrences.
'COLUMN_NAMES_DUPLICATES_TEST' (bool): True if duplicate column names are found, False otherwise.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_duplicate_col_names
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_duplicate_col_names(dataframe = data)

Raises:

ValueError: If throw_error is True and duplicate column names are found.
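
The shape of the returned dictionary can be reproduced with a few lines of standard-library code. This is only a sketch of the idea: `find_duplicate_col_names` is a hypothetical helper that works on a plain list of names (with pandas, `list(df.columns)`).

```python
from collections import Counter


def find_duplicate_col_names(columns):
    """Count repeated column names and report them in the documented layout."""
    counts = Counter(columns)
    # Keep only names that occur more than once.
    column_name_freq = {name: n for name, n in counts.items() if n > 1}
    return {
        "column_name_freq": column_name_freq,
        # True when duplicates are found, as described in the Returns section.
        "COLUMN_NAMES_DUPLICATES_TEST": bool(column_name_freq),
    }


result = find_duplicate_col_names(["id", "price", "price", "date", "id", "id"])
print(result)
```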

refineryframe.detect_unexpected.check_duplicates(dataframe: DataFrame, subset: list | None = None, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'key_dup_score': 100, 'row_dup_score': 100}, logger: Logger | None = None) → dict

Checks for duplicates in a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to check for duplicates.
subset (list of str or None, optional): A list of column names to consider when identifying duplicates. If not specified or None, all columns are used to identify duplicates.
independent (bool, optional): If True, return a Boolean indicating if duplicate checks passed. If False, return a dictionary containing scores and checks. Default is True.
silent (bool, optional): If True, suppress warning messages. Default is False.
throw_error (bool, optional): If True, raise a ValueError for failed duplicate checks. Default is False.
thresholds (dict, optional): A dictionary containing thresholds for scoring duplicate checks. Default is {'row_dup_score': 100, 'key_dup_score': 100}.
logger (logging.Logger or None, optional): A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

bool or dict: If independent is True, return a Boolean indicating if duplicate checks passed. If independent is False, return a dictionary containing scores and checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_duplicates
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_duplicates(dataframe = data)

Raises:

ValueError: If throw_error is True and duplicate checks fail.
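
The difference between full-row duplicates and duplicates on a key subset can be seen directly in pandas. The snippet is a sketch using only `DataFrame.duplicated`. Reading `row_dup_score` and `key_dup_score` as relating to these two counts is an assumption based on the threshold names.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "value": [10, 10, 20, 30]})

# Rows that repeat an earlier row across all columns.
row_dups = int(df.duplicated().sum())

# Rows that repeat an earlier row on the key column(s) only.
key_dups = int(df.duplicated(subset=["id"]).sum())

print(row_dups, key_dups)
```

Only the second row is a full duplicate, while two rows repeat an `id` that has already been seen.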

refineryframe.detect_unexpected.check_inf_values(dataframe: DataFrame, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'inf_score': 100}, logger: Logger | None = None) → dict

Counts the infinite (inf) values in each column of a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to count infinite values in.
independent (bool, optional): If True, consider only numeric columns when counting inf values. If False, count inf values in all columns. Default is True.
silent (bool, optional): If True, suppress warning messages. Default is False.
throw_error (bool, optional): If True, raise a ValueError for failed inf value checks. Default is False.
thresholds (dict, optional): A dictionary containing thresholds for scoring inf value checks. Default is {'inf_score': 100}.
logger (logging.Logger, optional): A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

dict: A dictionary containing scores and checks for inf value checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_inf_values
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_inf_values(dataframe = data)

Raises:

ValueError: If throw_error is True and inf value checks fail.
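
Counting infinite values per column can be sketched with the standard library alone. `count_inf_values` below is a hypothetical helper operating on a plain dict of lists rather than a DataFrame, so it only illustrates the counting step.

```python
import math


def count_inf_values(columns):
    """Count +inf and -inf per column in a {name: list of values} mapping."""
    counts = {}
    for name, values in columns.items():
        # Only floats can be infinite; strings such as "inf" are not counted.
        counts[name] = sum(
            1 for v in values if isinstance(v, float) and math.isinf(v)
        )
    return counts


columns = {
    "ratio": [0.5, float("inf"), float("-inf")],
    "qty": [1, 2, 3],
    "label": ["a", "inf", "b"],
}
counts = count_inf_values(columns)
print(counts)
```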

refineryframe.detect_unexpected.check_missing_types(dataframe: DataFrame, MISSING_TYPES: dict, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'cat_score': 100, 'date_score': 100, 'numeric_score': 100}, logger: Logger | None = None) → dict

Checks for instances of missing types in each column of a DataFrame and logs warning messages for any found.

Parameters:

dataframe (pandas DataFrame): The DataFrame to search for missing values.
MISSING_TYPES (dict): A dictionary of missing types to search for. Keys represent the missing type, and values are the corresponding values to search for.
independent (bool, optional): If True, return a boolean indicating whether all checks passed. Default is True.
silent (bool, optional): If True, suppress log warnings. Default is False.
throw_error (bool, optional): If True, raise an error if issues are found. Default is False.
thresholds (dict, optional): Dictionary containing thresholds for numeric_score, date_score, and cat_score. Default is {'numeric_score': 100, 'date_score': 100, 'cat_score': 100}.
logger (logging.Logger, optional): Logger object for log messages. If not provided, a new logger will be created.

Returns:

bool or dict: If independent is True, return a boolean indicating whether all checks passed. If independent is False, return a dictionary containing scores and checks information.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_missing_types
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']
MISSING_TYPES = tiny_example['MISSING_TYPES']

check_missing_types(dataframe = data,
                    MISSING_TYPES = MISSING_TYPES)

Raises:

ValueError: If throw_error is True and missing type checks fail.
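
To make the idea of "missing types" concrete, the sketch below searches plain Python columns for the placeholder values that appear as the default `MISSING_TYPES` elsewhere in this reference. `count_missing_types` is a hypothetical helper and not the library's implementation.

```python
MISSING_TYPES = {
    "date_not_delivered": "1850-01-09",
    "numeric_not_delivered": -999,
    "character_not_delivered": "missing",
}


def count_missing_types(columns, missing_types):
    """Count placeholder occurrences per (column, missing type) pair."""
    found = {}
    for col, values in columns.items():
        for label, placeholder in missing_types.items():
            # Compare like with like so -999 never matches a string column.
            n = sum(
                1 for v in values
                if type(v) is type(placeholder) and v == placeholder
            )
            if n:
                found[(col, label)] = n
    return found


columns = {
    "amount": [10, -999, -999],
    "city": ["Riga", "missing", "Oslo"],
    "day": ["2024-01-01", "1850-01-09", "2024-01-03"],
}
found = count_missing_types(columns, MISSING_TYPES)
print(found)
```

Each entry names the column, the kind of placeholder found and how often it occurs, which mirrors the information the warning messages are described as carrying.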

refineryframe.detect_unexpected.check_missing_values(dataframe: DataFrame, independent: bool = True, silent: bool = False, throw_error: bool = False, thresholds: dict = {'missing_values_score': 100}, logger: Logger | None = None) → dict

Counts the number of NaN, None, and NaT values in each column of a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to count missing values in.
independent (bool, optional): If True, consider only columns with missing values as defined by NaN, None, and NaT. If False, count missing values in all columns. Default is True.
silent (bool, optional): If True, suppress warning messages. Default is False.
throw_error (bool, optional): If True, raise a ValueError for failed missing value checks. Default is False.
thresholds (dict, optional): A dictionary containing thresholds for scoring missing value checks. Default is {'missing_values_score': 100}.
logger (logging.Logger, optional): A logger instance for logging messages. If not provided, a new logger will be created.

Returns:

dict: A dictionary containing scores and checks for missing value checks.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_missing_values
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_missing_values(dataframe = data)

Raises:

ValueError: If throw_error is True and missing value checks fail.
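
In pandas, NaN, None and NaT are all reported by `DataFrame.isna`, so a per-column count like the one described above can be sketched as follows. This shows the underlying pandas idiom only; how refineryframe turns such counts into scores is not covered.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [1.0, float("nan"), 3.0],                  # NaN
    "city": ["Riga", None, "Oslo"],                      # None
    "day": pd.to_datetime(["2024-01-01", None, None]),   # NaT
})

# isna() flags NaN, None and NaT alike; sum() counts them per column.
missing_counts = {col: int(n) for col, n in df.isna().sum().items()}
print(missing_counts)
```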

refineryframe.detect_unexpected.check_numeric_range(dataframe: DataFrame, numeric_cols: list | None = None, lower_bound: float = -inf, upper_bound: float = inf, independent: bool = True, silent: bool = False, ignore_values: list = [], throw_error: bool = False, thresholds: dict = {'low_numeric_score': 100, 'upper_numeric_score': 100}, logger: Logger | None = None) → dict

Checks if numeric values are within expected ranges in each column of a DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to check for numeric values.
numeric_cols (list of str, optional): A list of column names to consider. If None, all numeric columns are checked.
lower_bound (float, optional): The lower bound allowed for numeric values. Default is -infinity.
upper_bound (float, optional): The upper bound allowed for numeric values. Default is infinity.
independent (bool, optional): If True, return a boolean indicating whether all checks passed. Default is True.
silent (bool, optional): If True, suppress log warnings. Default is False.
ignore_values (list, optional): A list of values to ignore when checking for values outside the specified range. Default is an empty list.
throw_error (bool, optional): If True, raise an error if issues are found. Default is False.
thresholds (dict, optional): Dictionary containing thresholds for low_numeric_score and upper_numeric_score. Default is {'low_numeric_score': 100, 'upper_numeric_score': 100}.
logger (logging.Logger, optional): Logger object for log messages. If not provided, a new logger will be created.

Returns:

bool or dict: If independent is True, return a boolean indicating whether all checks passed. If independent is False, return a dictionary containing scores and checks information.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import check_numeric_range
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

check_numeric_range(dataframe = data)

Raises:

ValueError: If throw_error is True and numeric range checks fail.
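
The role of `lower_bound`, `upper_bound` and `ignore_values` can be sketched on a plain list of numbers. `values_outside_range` is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
import math


def values_outside_range(values, lower_bound=-math.inf,
                         upper_bound=math.inf, ignore_values=()):
    """Return the values below and above the allowed range."""
    # Ignored values (e.g. a numeric missing-data placeholder) are skipped.
    checked = [v for v in values if v not in ignore_values]
    return {
        "below_lower_bound": [v for v in checked if v < lower_bound],
        "above_upper_bound": [v for v in checked if v > upper_bound],
    }


result = values_outside_range(
    [-999, -5, 0, 42, 1000],
    lower_bound=0,
    upper_bound=100,
    ignore_values=[-999],
)
print(result)
```

Without `ignore_values`, the placeholder `-999` would also be reported as below the lower bound.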

refineryframe.detect_unexpected.detect_unexpected_values(dataframe: DataFrame, MISSING_TYPES: dict = {'character_not_delivered': 'missing', 'date_not_delivered': '1850-01-09', 'numeric_not_delivered': -999}, unexpected_exceptions: dict = {'col_names_types': 'NONE', 'date_format': 'NONE', 'date_range': 'NONE', 'duplicates': 'NONE', 'inf_values': 'NONE', 'missing_types': 'NONE', 'missing_values': 'NONE', 'numeric_range': 'NONE'}, unexpected_exceptions_error={'col_name_duplicates': False, 'col_names_types': False, 'date_format': False, 'date_range': False, 'duplicates': False, 'inf_values': False, 'missing_types': False, 'missing_values': False, 'numeric_range': False}, unexpected_conditions: dict | None = None, thresholds: dict = {'ccnt_scores': {'incorrect_dtypes_score': 100, 'missing_score': 100}, 'cdf_scores': {'date_format_score': 100}, 'cdr_scores': {'early_dates_score': 100, 'future_dates_score': 100}, 'cmt_scores': {'cat_score': 100, 'date_score': 100, 'numeric_score': 100}, 'cmv_scores': {'missing_values_score': 100}, 'cnr_scores': {'low_numeric_score': 100, 'upper_numeric_score': 100}, 'dup_scores': {'key_dup_score': 100, 'row_dup_score': 100}, 'inf_scores': {'inf_score': 100}}, ids_for_dedup: list | None = None, TEST_DUV_FLAGS_PATH: str | None = None, types_dict_str: dict | None = None, expected_date_format: str = '%Y-%m-%d', earliest_date: str = '1900-08-25', latest_date: str = '2100-01-01', numeric_lower_bound: float = 0, numeric_upper_bound: float = inf, print_score: bool = True, logger: Logger | None = None) → dict

Detects unexpected values in a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame to be checked.
MISSING_TYPES (dict): Dictionary that maps each missing type (for example 'numeric_not_delivered') to the placeholder value that marks it.
unexpected_exceptions (dict): Dictionary that lists column exceptions for each of the following checks: col_names_types, missing_values, missing_types, inf_values, date_format, duplicates, date_range, and numeric_range.
unexpected_exceptions_error (dict): Dictionary indicating whether to throw errors for each type of unexpected exception.
unexpected_conditions (dict): Dictionary containing additional conditions to check for unexpected values.
thresholds (dict): Dictionary containing threshold scores for different checks.
ids_for_dedup (list): List of columns to identify duplicates (default is None).
TEST_DUV_FLAGS_PATH (str): Path for checking unexpected values (default is None).
types_dict_str (str): String that describes the expected types of the columns (default is None).
expected_date_format (str): The expected date format (default is '%Y-%m-%d').
earliest_date (str): The earliest acceptable date (default is '1900-08-25').
latest_date (str): The latest acceptable date (default is '2100-01-01').
numeric_lower_bound (float): The lowest acceptable value for numeric columns (default is 0).
numeric_upper_bound (float): The highest acceptable value for numeric columns (default is infinity).
print_score (bool): Whether to print the duv score (default is True).
logger (logging.Logger): Logger object for logging messages (default is logging).

Returns:

dict: A dictionary with the following keys:

duv_score (float): Number between 0 and 1 representing the share of passed tests.
check_scores (dict): Scores for each check.
unexpected_exceptions_scaned (dict): Unexpected exceptions based on detected unexpected values.

Examples:

Example usage and expected outputs.

from refineryframe.detect_unexpected import detect_unexpected_values
from refineryframe.demo import tiny_example

data = tiny_example['dataframe']

detect_unexpected_values(dataframe = data)

Raises:

Exception: If any errors occur during the detection process.
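
The duv_score is described as a number between 0 and 1 giving the share of passed tests. A minimal sketch of that arithmetic follows. The function name, the shape of the input and the value returned for an empty input are assumptions for illustration, not the library's internals.

```python
def sketch_duv_score(check_results):
    """Share of passed checks in a {check name: passed?} mapping."""
    if not check_results:
        return 1.0  # assumption: nothing checked means nothing failed
    passed = sum(1 for ok in check_results.values() if ok)
    return passed / len(check_results)


checks = {
    "missing_values": True,
    "inf_values": True,
    "date_format": False,
    "duplicates": True,
}
score = sketch_duv_score(checks)
print(score)
```

With three of four checks passing, the score is 0.75.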

refineryframe.other module

Module Name: other.py

This module contains various utility functions for logging, data manipulation, and data type handling.

Functions:

shoutOUT(output_type="dline", mess=None, dotline_length=50, logger: logging.Logger = logging):
Print a line of text with a specified length and format.
get_type_dict(dataframe: pd.DataFrame, explicit: bool = True, stringout: bool = False, logger: logging.Logger = logging) -> str:
Returns a string representation of a dictionary containing the data types of each column in the given pandas DataFrame.
set_types(dataframe: pd.DataFrame, types_dict_str: dict, replace_dict: dict = None, expected_date_format: str = '%Y-%m-%d', logger: logging.Logger = logging) -> pd.DataFrame:
Change the data types of the columns in the given DataFrame based on a dictionary of intended data types.
treat_unexpected_cond(df: pd.DataFrame, description: str, group: str, features: list, query: str, warning: bool, replace, logger: logging.Logger = logging) -> pd.DataFrame:
Replace unexpected values in a pandas DataFrame with replace values.

Dependencies:

logging
pandas as pd
re

Note:

Please refer to the docstrings of individual functions for detailed information and usage examples.

refineryframe.other.add_index_to_duplicate_columns(dataframe: DataFrame, column_name_freq: dict, logger: Logger | None = None) → DataFrame

Adds an index to duplicate column names in a pandas DataFrame.

Parameters:

dataframe (pandas DataFrame): The DataFrame containing the duplicate columns.
column_name_freq (dict): A dictionary where keys are duplicate column names, and values are the number of occurrences.

Returns:

pandas DataFrame: The DataFrame with updated column names.
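
One way to picture the renaming is the sketch below, which works on a plain list of column names. The `name_1`, `name_2` suffix scheme and the helper name are assumptions; the suffix format the library actually produces may differ.

```python
def add_index_to_duplicates(columns, column_name_freq):
    """Append a running index to every occurrence of a duplicated name."""
    seen = {}
    renamed = []
    for name in columns:
        if name in column_name_freq:
            seen[name] = seen.get(name, 0) + 1
            renamed.append(f"{name}_{seen[name]}")  # hypothetical suffix format
        else:
            renamed.append(name)  # unique names are left untouched
    return renamed


new_names = add_index_to_duplicates(
    ["id", "price", "price", "date", "id"],
    {"id": 2, "price": 2},
)
print(new_names)
```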

refineryframe.other.get_type_dict(dataframe: DataFrame, explicit: bool = True, stringout: bool = False, logger: Logger | None = None) → dict

Returns a string representation of a dictionary containing the data types of each column in the given pandas DataFrame.

Numeric columns will have type ‘numeric’, date columns will have type ‘date’, character columns will have type ‘category’, and columns containing ‘id’ at the beginning or end of their name will have type ‘index’.

Parameters:

dataframe (pandas DataFrame): The DataFrame to extract column data types from.

Returns:

str: A string representation of a dictionary containing the data types of each column in the given DataFrame. The keys are the column names and the values are the corresponding data types.
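
The labelling rules stated above ('numeric', 'date', 'category', 'index') can be sketched with pandas type predicates. `sketch_type_dict` is a hypothetical helper. Checking the 'id' rule first and matching it case-insensitively are assumptions about precedence that the description does not settle.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def sketch_type_dict(df):
    """Label each column as 'index', 'date', 'numeric' or 'category'."""
    types = {}
    for col in df.columns:
        name = str(col).lower()
        if name.startswith("id") or name.endswith("id"):
            types[col] = "index"      # 'id' at the start or end of the name
        elif is_datetime64_any_dtype(df[col]):
            types[col] = "date"
        elif is_numeric_dtype(df[col]):
            types[col] = "numeric"
        else:
            types[col] = "category"   # everything else is treated as character data
    return types


df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [9.5, 12.0],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "city": ["Riga", "Oslo"],
})
type_dict = sketch_type_dict(df)
print(str(type_dict))  # a string representation of the dictionary
```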

refineryframe.other.set_types(dataframe: DataFrame, types_dict_str: dict, replace_dict: dict | None = None, expected_date_format: str = '%Y-%m-%d', logger: Logger | None = None) → DataFrame

Change the data types of the columns in the given DataFrame based on a dictionary of intended data types.

Args:

dataframe (pandas.DataFrame): The DataFrame to change the data types of.
types_dict_str (dict): A dictionary where the keys are the column names and the values are the intended data types for those columns.
replace_dict (dict, optional): A dictionary containing replacement values for specific columns. Defaults to None.
expected_date_format (str, optional): The expected date format for date columns. Defaults to '%Y-%m-%d'.

Returns:

pandas.DataFrame: The DataFrame with the changed data types.

Raises:

ValueError: If the keys in the dictionary do not match the columns in the DataFrame.
TypeError: If the data types cannot be changed successfully.

refineryframe.other.shoutOUT(output_type: str = 'dline', mess: str | None = None, dotline_length: int = 50, logger: Logger | None = None) → None

Print a line of text with a specified length and format.

Args:

output_type (str): The type of line to print. Valid values are "dline" (default), "line", "pline", "HEAD1", "title", "subtitle", "subtitle2", "subtitle3", and "warning".
mess (str): The text to print out.
dotline_length (int): The length of the line to print.

Returns:

None

Examples:

from refineryframe.other import shoutOUT

shoutOUT("HEAD1", mess="Header", dotline_length=50)
shoutOUT(output_type="dline", dotline_length=50)

refineryframe.other.treat_unexpected_cond(df: DataFrame, description: str, group: str, features: list, query: str, warning: bool, replace, logger: Logger | None = None) → DataFrame

Replace unexpected values in a pandas DataFrame with replace values.

Parameters:

df (pandas DataFrame): The DataFrame to be checked.
description (str): Description of the unexpected condition being treated.
group (str): Group identifier for the unexpected condition.
features (list): List of column names or regex pattern for selecting columns.
query (str): Query string for selecting rows based on the unexpected condition.
warning (str): Warning message to be logged if unexpected condition is found.
replace (object): Value to replace the unexpected values with.

Returns:

df (pandas DataFrame): The DataFrame with replaced unexpected values, if replace is not None.
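
The combination of a row `query`, a list of `features` and a `replace` value can be sketched with plain pandas. This is an illustration of the idea only, using `DataFrame.query` and `.loc`. The example data and condition are made up, and the library's own handling of regex patterns, grouping and logging is not reproduced.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -1, 150, 28],
    "height": [180, 175, 160, 170],
})

query = "age < 0 or age > 120"   # rows that meet the unexpected condition
features = ["age"]               # columns in which values are replaced
replace = -999                   # replacement value; None would mean "report only"

flagged = df.query(query).index
if len(flagged) > 0 and replace is not None:
    df.loc[flagged, features] = replace

print(df)
```

Only the `age` values in the two flagged rows change; `height` is left as it was.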

refineryframe.refiner module

refineryframe Module

This module provides a Refiner class to encapsulate functions for data refinement and validation. The Refiner class is designed to work with pandas DataFrames and perform various checks and replacements for data preprocessing.

class refineryframe.refiner.Refiner(dataframe: DataFrame, replace_dict: dict | None = None, MISSING_TYPES: dict = {'character_not_delivered': 'missing', 'date_not_delivered': '1850-01-09', 'numeric_not_delivered': -999}, expected_date_format: str = '%Y-%m-%d', mess: str = 'INITIAL PREPROCESSING', shout_type: str = 'HEAD2', logger=logging, logger_name='Refiner', loggerLvl=20, dotline_length: int = 50, lower_bound=-inf, upper_bound=inf, earliest_date='1900-08-25', latest_date='2100-01-01', ids_for_dedup: list = 'ALL', unexpected_exceptions_duv: dict = {'col_names_types': 'NONE', 'date_format': 'NONE', 'date_range': 'NONE', 'duplicates': 'NONE', 'inf_values': 'NONE', 'missing_types': 'NONE', 'missing_values': 'NONE', 'numeric_range': 'NONE'}, unexpected_exceptions_ruv: dict = {'capitalization': 'NONE', 'date_range': 'NONE', 'irregular_values': 'NONE', 'numeric_range': 'NONE', 'unicode_character': 'NONE'}, unexpected_exceptions_error: dict = {'col_name_duplicates': False, 'col_names_types': False, 'date_format': False, 'date_range': False, 'duplicates': False, 'inf_values': False, 'missing_types': False, 'missing_values': False, 'numeric_range': False}, thresholds: dict = {'ccnt_scores': {'incorrect_dtypes_score': 100, 'missing_score': 100}, 'cdf_scores': {'date_format_score': 100}, 'cdr_scores': {'early_dates_score': 100, 'future_dates_score': 100}, 'cmt_scores': {'cat_score': 100, 'date_score': 100, 'numeric_score': 100}, 'cmv_scores': {'missing_values_score': 100}, 'cnr_scores': {'low_numeric_score': 100, 'upper_numeric_score': 100}, 'dup_scores': {'key_dup_score': 100, 'row_dup_score': 100}, 'inf_scores': {'inf_score': 100}}, unexpected_conditions=None, ignore_values=[], ignore_dates=[])

Bases: object

Class that encapsulates functions for data refining and validation.

Attributes:

dataframe (pd.DataFrame): The input pandas DataFrame to be processed.
replace_dict (dict, optional): A dictionary to define replacements for specific values in the DataFrame.
MISSING_TYPES (dict, optional): Default values for missing types in different columns of the DataFrame.
expected_date_format (str, optional): The expected date format for date columns in the DataFrame.
mess (str, optional): A custom message used in the shout method for printing.
shout_type (str, optional): The type of output for the shout method (e.g., 'HEAD2').
logger (logging.Logger, optional): A custom logger object for logging messages.
logger_name (str, optional): The name of the logger for the class instance.
loggerLvl (int, optional): The logging level for the logger.
dotline_length (int, optional): The length of the line to be printed in the shout method.
lower_bound (float, optional): The lower bound for numeric range validation.
upper_bound (float, optional): The upper bound for numeric range validation.
earliest_date (str, optional): The earliest allowed date for date range validation.
latest_date (str, optional): The latest allowed date for date range validation.
ids_for_dedup (list, optional): A list of column names to be used for duplicate detection.
unexpected_exceptions_duv (dict, optional): A dictionary of unexpected exceptions for data value validation.
unexpected_exceptions_ruv (dict, optional): A dictionary of unexpected exceptions for data replacement validation.
unexpected_exceptions_error (dict, optional): A dictionary that indicates if error should be raised during duv.
unexpected_conditions (None or callable, optional): A callable function for custom unexpected conditions.
ignore_values (list, optional): A list of values to ignore during numeric range validation.
ignore_dates (list, optional): A list of dates to ignore during date range validation.

Methods:

shout(mess=None): Prints a line of text with a specified length and format.
get_type_dict_from_dataframe(explicit=True, stringout=False): Returns a dictionary containing the data types of each column in the given pandas DataFrame.
set_type_dict(type_dict=None, explicit=True, stringout=False): Changes the data types of the columns in the DataFrame based on a dictionary of intended data types.
set_types(type_dict=None, replace_dict=None, expected_date_format=None): Changes the data types of the columns in the DataFrame based on a dictionary of intended data types.
get_refiner_settings(): Extracts values of parameters from the Refiner and saves them in a dictionary for later use.
set_refiner_settings(settings: dict): Updates input parameters with values from the provided settings dict.
check_duplicate_col_names(throw_error=None): Checks for duplicate column names in a pandas DataFrame.
add_index_to_duplicate_columns(column_name_freq: dict): Adds an index to duplicate column names in a pandas DataFrame.
check_missing_types(): Searches for instances of missing types in each column of the DataFrame.
check_missing_values(): Counts the number of NaN, None, and NaT values in each column of the DataFrame.
check_inf_values(): Counts the inf values in each column of the DataFrame.
check_date_format(): Checks if the values in the datetime columns have the expected 'YYYY-MM-DD' format.
check_duplicates(subset=None): Checks for duplicates in the DataFrame.
check_col_names_types(): Checks if the DataFrame has the same column names as the types_dict_str dictionary and those columns have the same types as items in the dictionary.
check_numeric_range(numeric_cols=None, lower_bound=None, upper_bound=None, ignore_values=None): Checks if numeric values are in expected ranges.
check_date_range(earliest_date=None, latest_date=None, ignore_dates=None): Checks if dates are in expected ranges.
detect_unexpected_values(MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, ids_for_dedup=None, TEST_DUV_FLAGS_PATH=None, types_dict_str=None, expected_date_format=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None, print_score=True): Detects unexpected values in the DataFrame.
get_unexpected_exceptions_scaned(dataframe=None): Returns unexpected_exceptions with appropriate settings for the values in the DataFrame.
replace_unexpected_values(MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, TEST_RUV_FLAGS_PATH=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None): Replaces unexpected values in the DataFrame with missing types based on a dictionary of unexpected exceptions.

add_index_to_duplicate_columns(column_name_freq=None) → None: Adds an index to duplicate column names in a pandas DataFrame.
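
As an illustration of what indexing duplicate column names can look like, here is a minimal sketch in plain Python. The helper name `index_duplicate_names` and the `name_N` suffix scheme are assumptions made for this example; the actual naming used by refineryframe may differ.

```python
from collections import Counter


def index_duplicate_names(names):
    """Return a copy of `names` where every repeated name gets a numeric suffix.

    Hypothetical helper: the `_1`, `_2`, ... suffix scheme is an assumption,
    not necessarily what refineryframe produces.
    """
    freq = Counter(names)  # how often each name occurs overall
    seen = Counter()       # how many copies of each name were emitted so far
    result = []
    for name in names:
        if freq[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


renamed = index_duplicate_names(["id", "value", "value", "date", "value"])
print(renamed)
```

With a DataFrame, the result could be assigned back via `df.columns = index_duplicate_names(list(df.columns))`. Note that this sketch does not guard against a suffixed name colliding with a column that already exists.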

check_col_names_types() → None: Checks if a given dataframe has the same column names as keys in a given dictionary and those columns have the same types as items in the dictionary.

check_date_format() → None: Checks if the values in the datetime columns of the input dataframe have the expected ‘YYYY-MM-DD’ format.
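
The idea behind a date format check can be sketched as follows. This is not the package's implementation: the helper names are hypothetical, and the sketch assumes the dates are stored as strings and that the caller names the date columns explicitly.

```python
from datetime import datetime

import pandas as pd


def matches_date_format(value, expected_date_format="%Y-%m-%d"):
    """Return True if `value` parses under the expected format."""
    try:
        datetime.strptime(str(value), expected_date_format)
    except ValueError:
        return False
    return True


def count_bad_dates(df, date_cols, expected_date_format="%Y-%m-%d"):
    """Count values per column that do not parse; missing values are skipped."""
    counts = {}
    for col in date_cols:
        values = df[col].dropna()
        counts[col] = sum(
            not matches_date_format(v, expected_date_format) for v in values
        )
    return counts


df = pd.DataFrame({"signup": ["2020-01-31", "31/01/2020", None, "2020-13-01"]})
bad = count_bad_dates(df, ["signup"])
print(bad)
```

`strptime` is lenient about zero padding, so a stricter check would need an additional pattern test.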

check_date_range(earliest_date=None, latest_date=None, ignore_dates=None) → None: Checks if dates are in expected ranges.

check_duplicate_col_names(throw_error=None) → None: Checks for duplicate column names in a pandas DataFrame.

check_duplicates(subset=None) → None: Checks for duplicates in a pandas DataFrame.

check_inf_values() → None: Counts the inf values in each column of a pandas DataFrame.
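
Counting infinite values per column can be approximated with standard pandas calls. The function name `count_inf_values` below is hypothetical and the sketch only inspects numeric columns, which is an assumption about the intended scope.

```python
import numpy as np
import pandas as pd


def count_inf_values(df):
    """Return {column: count} for numeric columns containing +inf or -inf."""
    numeric = df.select_dtypes(include="number")
    counts = numeric.isin([np.inf, -np.inf]).sum()
    # Keep only columns where at least one infinite value was found.
    return {col: int(n) for col, n in counts.items() if n > 0}


df = pd.DataFrame(
    {
        "a": [1.0, np.inf, -np.inf],
        "b": [1, 2, 3],
        "c": ["x", "y", "z"],
    }
)
inf_counts = count_inf_values(df)
print(inf_counts)
```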

check_missing_types() → None

Takes a DataFrame and a dictionary of missing types as input, and searches for any instances of these missing types in each column of the DataFrame.

If any instances are found, a warning message is logged containing the column name, the missing value, and the count of missing values found.
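
The search described above can be sketched as a loop over columns and missing-type values. The function name `count_missing_types` and the message layout are assumptions for illustration; the dictionary values are the defaults shown in the replace_unexpected_values signature in this reference.

```python
import logging

import pandas as pd

# Default placeholder values, as listed in the replace_unexpected_values signature.
MISSING_TYPES = {
    "character_not_delivered": "missing",
    "date_not_delivered": "1850-01-09",
    "numeric_not_delivered": -999,
}


def count_missing_types(df, missing_types, logger):
    """Hypothetical helper: count placeholder values per column and log each hit."""
    found = {}
    for col in df.columns:
        for label, value in missing_types.items():
            count = int((df[col] == value).sum())
            if count > 0:
                found[(col, label)] = count
                logger.warning("Column %s: (%s) : %d", col, value, count)
    return found


logger = logging.getLogger("missing_types_sketch")
df = pd.DataFrame(
    {
        "amount": [10, -999, -999],
        "city": ["Riga", "missing", "Oslo"],
        "signup": ["2020-01-01", "1850-01-09", "2021-05-05"],
    }
)
found = count_missing_types(df, MISSING_TYPES, logger)
```

Here the date column holds strings; comparing real datetime columns against the placeholder is left out of the sketch.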

check_missing_values() → None: Counts the number of NaN, None, and NaT values in each column of a pandas DataFrame.

check_numeric_range(numeric_cols: list | None = None, lower_bound=None, upper_bound=None, ignore_values=None) → None: Checks if numeric values are in expected ranges.
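
A range check with ignored values might look roughly like the sketch below. The helper `out_of_range_counts` is hypothetical; its parameters mirror the names in the signature above, but the exact semantics in refineryframe (for example how bounds are treated at the edges) are assumed here, with both bounds taken as inclusive.

```python
import pandas as pd


def out_of_range_counts(
    df,
    numeric_cols=None,
    lower_bound=float("-inf"),
    upper_bound=float("inf"),
    ignore_values=None,
):
    """Count values outside [lower_bound, upper_bound] per numeric column."""
    ignore_values = ignore_values or []
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include="number").columns
    counts = {}
    for col in numeric_cols:
        # Drop values that should not be judged, such as missing-type placeholders.
        values = df[col][~df[col].isin(ignore_values)]
        outside = (values < lower_bound) | (values > upper_bound)
        counts[col] = int(outside.sum())
    return counts


df = pd.DataFrame({"age": [25, -999, 140, -3], "name": list("abcd")})
range_counts = out_of_range_counts(
    df, lower_bound=0, upper_bound=120, ignore_values=[-999]
)
print(range_counts)
```

NaN values compare as False on both sides, so they are not counted as out of range in this sketch.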

detect_unexpected_values(dataframe=None, MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, ids_for_dedup=None, TEST_DUV_FLAGS_PATH=None, types_dict_str=None, expected_date_format=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None, thresholds=None, print_score=True) → None: Detects unexpected values in a pandas DataFrame.

get_refiner_settings() → dict: Extracts values of parameters from the Refiner and saves them in a dictionary for later use.

get_type_dict_from_dataframe(explicit=True, stringout=False) → dict

Returns a dictionary or string representation of a dictionary containing the data types of each column in the given pandas DataFrame.

Numeric columns will have type ‘numeric’, date columns will have type ‘date’, character columns will have type ‘category’, and columns containing ‘id’ at the beginning or end of their name will have type ‘index’.
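
The naming rules above can be sketched with the pandas type helpers. The function `sketch_type_dict` is hypothetical, and two details are assumptions: the 'id' rule is applied first and case-insensitively, and anything that is neither numeric nor datetime is treated as 'category'.

```python
import pandas as pd
from pandas.api import types as ptypes


def sketch_type_dict(df):
    """Map each column to 'index', 'date', 'numeric' or 'category'."""
    type_dict = {}
    for col in df.columns:
        name = str(col).lower()
        if name.startswith("id") or name.endswith("id"):
            # Assumption: the 'id' naming rule wins over the dtype-based rules.
            type_dict[col] = "index"
        elif ptypes.is_datetime64_any_dtype(df[col]):
            type_dict[col] = "date"
        elif ptypes.is_numeric_dtype(df[col]):
            type_dict[col] = "numeric"
        else:
            type_dict[col] = "category"
    return type_dict


df = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "amount": [1.5, 2.5],
        "signup": pd.to_datetime(["2020-01-01", "2021-01-01"]),
        "city": ["Riga", "Oslo"],
    }
)
type_dict = sketch_type_dict(df)
print(type_dict)
```

A purely name-based 'id' rule also matches columns such as "paid", so real data may need an explicit override.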

get_unexpected_exceptions_scaned(dataframe=None) → dict: Returns unexpected_exceptions with appropriate settings for the values in the dataframe.

initialize_logger(): Initializes a logger for the class instance based on the specified logging level and logger name.

replace_unexpected_values(MISSING_TYPES=None, unexpected_exceptions=None, unexpected_conditions=None, TEST_RUV_FLAGS_PATH=None, earliest_date=None, latest_date=None, numeric_lower_bound=None, numeric_upper_bound=None) → None: Replaces unexpected values in a pandas DataFrame with missing types.

set_refiner_settings(settings: dict) → None: Updates input parameters with values from the provided settings dict.

set_type_dict(type_dict=None, explicit=True, stringout=False) → None: Changes the data types of the columns in the given DataFrame based on a dictionary of intended data types.

set_types(type_dict=None, replace_dict=None, expected_date_format=None): Changes the data types of the columns in the given DataFrame based on a dictionary of intended data types.

set_updated_dataframe(dataframe: DataFrame) → None: Updates the dataframe inside the Refiner class. Useful when the dataframe is manipulated between steps.

shout(mess=None) → None: Prints a line of text with a specified length and format.

refineryframe.replace_unexpected module

replace_unexpected.py - Data Replacement Module

This module contains a function to replace unexpected values in a pandas DataFrame with specified missing types. It covers various aspects of data validation, including replacing missing values, out-of-range numeric values, date values outside specified date ranges, and handling character-related issues such as capitalization and Unicode characters.

refineryframe.replace_unexpected.replace_unexpected_values(dataframe: DataFrame, MISSING_TYPES: dict = {'character_not_delivered': 'missing', 'date_not_delivered': '1850-01-09', 'numeric_not_delivered': -999}, unexpected_exceptions: dict = {'capitalization': 'NONE', 'date_range': 'NONE', 'irregular_values': 'NONE', 'numeric_range': 'NONE', 'unicode_character': 'NONE'}, unexpected_conditions=None, TEST_RUV_FLAGS_PATH: str | None = None, earliest_date: str = '1900-08-25', latest_date: str = '2100-01-01', numeric_lower_bound: float = 0, numeric_upper_bound: float = inf, logger: Logger | None = None) → dict

Replace unexpected values in a pandas DataFrame with missing types.

Parameters:

dataframe (pandas DataFrame): The DataFrame to be checked.
MISSING_TYPES (dict): Dictionary that maps each missing-type label (for example 'numeric_not_delivered') to the value treated as missing for that type.
unexpected_exceptions (dict): Dictionary that lists column exceptions for each of the following checks: capitalization, unicode_character, numeric_range, date_range, and irregular_values.
TEST_RUV_FLAGS_PATH (str): Path for checking unexpected values (default is None).
earliest_date (str): The earliest acceptable date (default is “1900-08-25”).
latest_date (str): The latest acceptable date (default is “2100-01-01”).
numeric_lower_bound (float): The lowest acceptable value for numeric columns (default is 0).
numeric_upper_bound (float): The highest acceptable value for numeric columns (default is infinity).
Returns: ruv_score, a number between 0 and 1 that represents the data quality score.
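
To make the numeric part of the replacement concrete, here is a minimal sketch that swaps out-of-range numeric values for the 'numeric_not_delivered' placeholder. The helper `replace_out_of_range` is hypothetical and covers only this one rule; the real function also handles dates, capitalization, Unicode characters, and exceptions, none of which are modeled here.

```python
import pandas as pd

# Defaults taken from the signature documented above.
MISSING_TYPES = {
    "character_not_delivered": "missing",
    "date_not_delivered": "1850-01-09",
    "numeric_not_delivered": -999,
}


def replace_out_of_range(
    df,
    numeric_lower_bound=0,
    numeric_upper_bound=float("inf"),
    missing_types=MISSING_TYPES,
):
    """Return a copy of `df` with out-of-range numeric values replaced."""
    cleaned = df.copy()
    placeholder = missing_types["numeric_not_delivered"]
    for col in cleaned.select_dtypes(include="number").columns:
        outside = (cleaned[col] < numeric_lower_bound) | (
            cleaned[col] > numeric_upper_bound
        )
        cleaned.loc[outside, col] = placeholder
    return cleaned


df = pd.DataFrame({"amount": [10, -5, 300], "city": ["Riga", "Oslo", "Bern"]})
cleaned = replace_out_of_range(df, numeric_lower_bound=0, numeric_upper_bound=100)
print(cleaned["amount"].tolist())
```

The input frame is left untouched because the sketch works on a copy.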
