EDA Toolkit 0.0.3

@lshpaner lshpaner released this 02 Aug 20:27
· 66 commits to main since this release

Changelog

[0.0.3] - 2024-08-02

  • Stable release:
    • Updated logo size, fixed citation title, and some minor readme cleanup:
      • added additional section for documentation, cleaned up verbiage, moved acknowledgements section before licensing and support

[0.0.2] - 2024-08-01

  • First stable release:
    • No new updates to codebase; just minimal documentation updates to readme and setup.py files
    • Added logo, badges, and Zenodo-certified citation to readme

[0.0.1rc0] - 2024-08-01

  • No new updates to codebase; just minimal documentation updates to readme and setup.py files

[0.0.1b0] - 2024-08-01

New scatter_fit_plot() and additional updates

  • Added new scatter_fit_plot(), removed unused data_types(), added comment section headers

Added xlim and ylim inputs in kde_distributions()

  • Added xlim and ylim inputs to allow users to customize axis limits in kde_distributions()

Added xlim and ylim params to stacked_crosstab_plot()

  • Added xlim and ylim input parameters to stacked_crosstab_plot() to give users more flexibility on controlling axes limits

Added x and y limits to box_violin_plot()

  • Changed function name from metrics_box_violin() to box_violin_plot()
  • Added xlim and ylim inputs to control the x- and y-axis limits of box_violin_plot() (formerly metrics_box_violin())

Added ability to remove stacks from plots, plot all or one at a time

Key Changes

  1. plot_type Parameter:
    This parameter allows the user to choose between "regular", "normalized", or "both" plot types.

  2. remove_stacks Parameter:
    This parameter, when set to True, generates a regular bar plot using only the col parameter instead of a stacked bar plot. It only works when plot_type is set to "regular". If remove_stacks is set to True while plot_type is anything other than "regular", the function will raise an exception.

Explanation of Changes:

  • plot_type Parameter:

    • This parameter provides flexibility to the user, allowing them to specify the type of plot to generate. The options are:
      • "regular": Generates a standard bar plot.
      • "normalized": Generates a normalized bar plot.
      • "both": Generates both regular and normalized bar plots.
  • remove_stacks Parameter:

    • This parameter, when set to True, will generate a regular bar plot using only the col parameter. It effectively removes the stacking of the bars. This parameter is only applicable when plot_type is set to "regular". If used with any other plot_type, an exception will be raised to ensure proper usage.

These changes enhance the flexibility and functionality of the stacked_crosstab_plot function, allowing for more customizable and specific plot generation based on user requirements.
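
The interaction between plot_type and remove_stacks described above can be sketched as a small validation step. This is an illustrative sketch only, not the library's code: the helper name check_plot_options is hypothetical, and the use of ValueError is an assumption (the changelog says only that an exception is raised).

```python
VALID_PLOT_TYPES = ("regular", "normalized", "both")


def check_plot_options(plot_type, remove_stacks=False):
    """Hypothetical helper: validate the options and list the plots to draw."""
    if plot_type not in VALID_PLOT_TYPES:
        raise ValueError(
            f"plot_type must be one of {VALID_PLOT_TYPES}, got {plot_type!r}"
        )
    # remove_stacks only makes sense for the regular (non-normalized) plot.
    if remove_stacks and plot_type != "regular":
        raise ValueError("remove_stacks=True requires plot_type='regular'")
    # "both" expands to the two individual plot types.
    if plot_type == "both":
        return ["regular", "normalized"]
    return [plot_type]
```

Under these assumptions, check_plot_options("regular", remove_stacks=True) is accepted, while check_plot_options("both", remove_stacks=True) raises.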

[0.0.1b0] - 2024-07-31

Refined kde_distributions()

Key Changes

  1. Alpha Transparency for Histogram Fill:

    • Added a fill_alpha parameter to control the transparency of the histogram bars' fill color.
    • The default value is 0.6. An exception is raised if fill=False and fill_alpha is specified.
  2. Custom Font Sizes:

    • Introduced label_fontsize and tick_fontsize parameters to allow control over the font size of axis labels and tick marks independently.
  3. Scientific Notation Toggle:

    • Added a disable_sci_notation parameter to enable or disable scientific notation on axes.
  4. Improved Error Handling:

    • Added validation for the stat parameter to ensure that only valid options are accepted.
    • Added checks to ensure proper usage of fill_alpha and hist_edgecolor when fill is set to False.
  5. General Enhancements:

    • Updated the function's docstring to reflect the new parameters and provide comprehensive guidance on its usage.
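
The checks on stat, fill_alpha, and hist_edgecolor could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the helper name check_hist_options is hypothetical, the default edge color is assumed, "specified" is approximated as "differs from the default", and the set of stat values is taken from the 2024-07-30 entry of this changelog.

```python
VALID_STATS = {
    "count", "density", "frequency", "probability", "proportion", "percent",
}
DEFAULT_FILL_ALPHA = 0.6       # default stated in the changelog
DEFAULT_EDGECOLOR = "#000000"  # assumed default, not confirmed by the changelog


def check_hist_options(
    stat="density",
    fill=True,
    fill_alpha=DEFAULT_FILL_ALPHA,
    hist_edgecolor=DEFAULT_EDGECOLOR,
):
    """Hypothetical helper: reject option combinations that cannot be honored."""
    if stat.lower() not in VALID_STATS:
        raise ValueError(
            f"Invalid stat {stat!r}; expected one of {sorted(VALID_STATS)}"
        )
    # With no fill there is nothing for the alpha value to apply to.
    if not fill and fill_alpha != DEFAULT_FILL_ALPHA:
        raise ValueError("fill_alpha cannot be set when fill=False")
    # Per the changelog, edge color changes are also rejected when fill=False.
    if not fill and hist_edgecolor != DEFAULT_EDGECOLOR:
        raise ValueError("hist_edgecolor cannot be changed when fill=False")
    return True
```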

[0.0.1b0] - 2024-07-30

Enhance kde_distributions Function

Added Parameters

  1. grid_figsize and single_figsize:

    • Control the size of the overall grid figure and individual figures separately.
  2. hist_color and kde_color:

    • Allow customization of histogram and KDE plot colors.
  3. hist_edgecolor:

    • Allows customization of the histogram bar edges.
  4. hue:

    • Allows grouping data by a column.
  5. fill:

    • Controls whether to fill the histogram bars with color.
  6. y_axis_label:

    • Customizable y-axis label.
  7. log_scale_vars:

    • Specifies which variables to plot on a log scale.
  8. bins and binwidth:

    • Control the number and width of bins.
  9. stat:

    • Allows different statistics for the histogram (count, density, frequency, probability, proportion, percent).

Improvements

  1. Validation and Error Handling:

    • Checks for invalid log_scale_vars and throws a ValueError if any are found.
    • Throws a ValueError if hist_edgecolor is changed while fill is set to False.
    • Issues a PerformanceWarning if both bins and binwidth are specified, warning of potential performance impacts.
  2. Customizable y-axis label:

    • Allows users to specify custom y-axis labels.
  3. Warning for KDE with Count:

    • Issues a warning if KDE is used with stat='count', as it may produce misleading plots.
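
A rough sketch of the validation and warning behavior listed above, using only the standard library. The helper name check_kde_inputs and its signature are hypothetical, and UserWarning stands in for the PerformanceWarning named in the changelog, since the changelog does not say where that class is defined.

```python
import warnings


def check_kde_inputs(columns, log_scale_vars=None, bins=None,
                     binwidth=None, kde=True, stat="density"):
    """Hypothetical helper: validate inputs and emit the documented warnings."""
    # Every log-scaled variable must be one of the plotted columns.
    invalid = [v for v in (log_scale_vars or []) if v not in columns]
    if invalid:
        raise ValueError(f"Invalid log_scale_vars: {invalid}")
    # Specifying both binning controls is allowed, but flagged.
    if bins is not None and binwidth is not None:
        warnings.warn(
            "Both bins and binwidth were specified; this may affect performance.",
            UserWarning,
        )
    # A KDE curve drawn over raw counts can be misleading.
    if kde and stat == "count":
        warnings.warn(
            "KDE with stat='count' may produce misleading plots.",
            UserWarning,
        )
```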

Updated add_ids to ensure unique ids and idx check

This pull request updates the add_ids() function to enhance its functionality by:

  • Ensuring that each generated ID starts with a non-zero digit.
  • Adding a check to verify that the DataFrame index is unique.
  • Printing a warning message if duplicate index entries are found.

These changes improve the robustness of the function, ensuring that the IDs generated are always unique and valid, and provide necessary feedback when the DataFrame index is not unique.

Check for Unique Indices:

  • Before generating IDs, the function now checks if the DataFrame index is unique.
  • If duplicates are found, a warning is printed along with the list of duplicate index entries.

Generate Non-Zero Starting IDs:

  • The ID generation process is updated to ensure that the first digit of each ID is always non-zero.

Ensure Unique IDs:

  • A set is used to store the generated IDs, ensuring all IDs are unique before adding them to the DataFrame.
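
The three steps above can be sketched without pandas as follows. This is not the add_ids() implementation: the function names, the seed argument, and the capacity check are illustrative assumptions, and a plain list stands in for the DataFrame index.

```python
import random
from collections import Counter


def find_duplicate_index_entries(index_values):
    """Return index values that occur more than once, in first-seen order."""
    counts = Counter(index_values)
    return [value for value, n in counts.items() if n > 1]


def generate_unique_ids(n, num_digits=9, seed=None):
    """Generate n distinct numeric ID strings whose first digit is non-zero."""
    if n > 9 * 10 ** (num_digits - 1):
        raise ValueError("n exceeds the number of distinct IDs available")
    rng = random.Random(seed)
    ids = set()  # the set silently discards any repeated draw
    while len(ids) < n:
        first = rng.choice("123456789")  # never start with zero
        rest = "".join(rng.choice("0123456789") for _ in range(num_digits - 1))
        ids.add(first + rest)
    return list(ids)


duplicates = find_duplicate_index_entries([0, 1, 1, 2])
if duplicates:
    print(f"Warning: duplicate index entries found: {duplicates}")
```

Drawing the first digit from 1 to 9 keeps every ID at exactly num_digits characters even if it is later cast to an integer.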

Fix int conversion for numeric cols, reset decimal_places=0

This PR fixes the integer conversion issue for numeric columns when decimal_places=0 in the save_dataframes_to_excel function. Additionally, it resets decimal_places to 0 as the default value.

Changes include:

  • Convert only numeric columns to integers when decimal_places=0.
  • Reset decimal_places default value to 0.

This ensures correct formatting and avoids errors during conversion.

Contingency Table Updates

  1. Error Handling for Columns:

    • Added a check to ensure at least one column is specified.
    • Updated the function to accept a single column as a string or multiple columns as a list.
    • Updated the function to raise a ValueError if no columns are provided or if cols is not correctly specified.
  2. Function Parameters:

    • Changed the parameters from col1 and col2 to a single parameter cols which can be either a string or a list.
  3. Error Handling for SortBy:

    • Renamed SortBy to sort_by to standardize nomenclature.
    • Added a check to ensure sort_by is either 0 or 1.
    • Updated the function to raise a ValueError if sort_by is not 0 or 1.
  4. Sorting Logic:

    • Updated the sorting logic to handle the new cols parameter structure.
  5. Handling Categorical Data:

    • Modified the code to convert categorical columns to strings to avoid issues with fillna("").
  6. Handling Missing Values:

    • Added df = df.fillna('') to fill NA values within the function to account for missing data.
  7. Improved Function Documentation:

    • Updated the function documentation to reflect the new parameters and error handling.
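
The argument handling described in items 1 to 3 might be sketched like this. The helper name normalize_contingency_args and the default sort_by=0 are assumptions made for illustration; only the cols and sort_by parameter names and the ValueError behavior come from the changelog.

```python
def normalize_contingency_args(cols=None, sort_by=0):
    """Hypothetical helper: coerce cols to a list and validate sort_by."""
    # Accept a single column name as a string.
    if isinstance(cols, str):
        cols = [cols]
    # Anything else must be a non-empty list of non-empty strings.
    if (
        not isinstance(cols, list)
        or not cols
        or not all(isinstance(c, str) and c for c in cols)
    ):
        raise ValueError("cols must be a column name or a non-empty list of names")
    if sort_by not in (0, 1):
        raise ValueError("sort_by must be 0 or 1")
    return cols, sort_by
```

With this shape, the rest of the function only ever deals with a list, whether the caller passed "age" or ["age", "sex"].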

[0.0.1b0] - 2024-07-29

Contingency Table Updates

Added fillna('') to the output so that null values come through, removed the 'All' column name from the output, added sort options 0 and 1, and updated the docstring documentation. Tested successfully on Python 3.7.3.

Updated datetime imports to accommodate different Python versions

Compatibility Enhancement:

  1. Added a version check for Python 3.7 and above.
    • Conditional import of datetime to handle different Python versions:

        import sys

        # Bind the name "datetime" to the class on 3.7+, to the module otherwise
        if sys.version_info >= (3, 7):
            from datetime import datetime
        else:
            import datetime

  2. In dataframe_columns():

        start_time = (datetime.now() if sys.version_info >= (3, 7)
                      else datetime.datetime.now())

        stop_time = (datetime.now() if sys.version_info >= (3, 7)
                     else datetime.datetime.now())

Changed set_as_index to False as default for add_ids()

  1. Parameter Customization:
    • Updated parameter set_as_index to allow users to choose whether to set the new ID column as the index. Defaults to False.

Enhance add_ids function: Parameter customization and improved default column name

  1. Parameter Customization:
    • Added a new parameter set_as_index to allow users to choose whether to set the new ID column as the index. Defaults to True.

Added num_digits, changed id_colname in add_ids

  1. Parameter Customization:

    • Introduced a new parameter num_digits to allow users to specify the number of digits for the unique IDs.
  2. Improved Default Column Name:

    • Changed the default column name from Patient_ID to ID to be more generic and widely understood.
    • Updated the parameter name from column_name to id_colname for better clarity and consistency.

[0.0.1b0] - 2024-07-28

Refactored metrics_box_violin()

  1. Parameter Renaming and Addition:

    • Added rotate_plot parameter to allow pivoting plots.
    • Added save_plots parameter to control saving options with values "all", "individual", and "grid".
    • Added show_plot parameter to control plot display with values "individual", "grid", and "both".
    • Added grid_figsize parameter to control the figure size of grid plots.
    • Added xlabel_rot to allow rotation of the x-axis labels.
    • Renamed save_individual to save_plots.
    • Renamed x_label_rotation to xlabel_rot.
  2. Parameter Validation:

    • Added validation to ensure show_plot is one of "individual", "grid", or "both".
    • Added validation to ensure save_plots is one of None, "all", "individual", or "grid".
    • Added validation to ensure save_plots requires image_path_png or image_path_svg.
    • Added validation to ensure rotate_plot is a Boolean value.
    • Added validation to ensure individual_figsize is a tuple or list of two numbers.
    • Added validation to ensure grid_figsize is a tuple or list of two numbers if specified.
  3. Default Values:

    • Added default values for individual_figsize and grid_figsize.
    • Set default grid figure size to (5 * n_cols, 5 * n_rows) if grid_figsize is not specified.
  4. Plot Display and Saving:

    • Adjusted code to handle new parameters for plot display and saving.
    • Implemented logic to pivot plots based on rotate_plot parameter.
    • Used xlabel_rot to rotate x-axis labels.
    • Incorporated custom figure sizes for individual and grid plots.
  5. File Saving:

    • Updated file naming to include plot_type in filenames.
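
The validation rules and the default grid size listed above can be sketched as a single check. The helper name check_box_violin_options, the n_rows and n_cols arguments, and the (6, 4) default for individual_figsize are assumptions for illustration; the allowed values and the (5 * n_cols, 5 * n_rows) default come from the changelog.

```python
from numbers import Real


def _is_size_pair(value):
    """True if value is a tuple or list of exactly two (non-boolean) numbers."""
    return (
        isinstance(value, (tuple, list))
        and len(value) == 2
        and all(isinstance(v, Real) and not isinstance(v, bool) for v in value)
    )


def check_box_violin_options(show_plot="both", save_plots=None,
                             image_path_png=None, image_path_svg=None,
                             rotate_plot=False, individual_figsize=(6, 4),
                             grid_figsize=None, n_rows=1, n_cols=1):
    """Hypothetical helper: validate options and resolve the grid figure size."""
    if show_plot not in ("individual", "grid", "both"):
        raise ValueError("show_plot must be 'individual', 'grid', or 'both'")
    if save_plots not in (None, "all", "individual", "grid"):
        raise ValueError("save_plots must be None, 'all', 'individual', or 'grid'")
    # Saving needs somewhere to write to.
    if save_plots is not None and not (image_path_png or image_path_svg):
        raise ValueError("save_plots requires image_path_png or image_path_svg")
    if not isinstance(rotate_plot, bool):
        raise ValueError("rotate_plot must be a Boolean value")
    if not _is_size_pair(individual_figsize):
        raise ValueError("individual_figsize must be a pair of numbers")
    if grid_figsize is None:
        # Default documented in the changelog: 5 units per grid cell.
        grid_figsize = (5 * n_cols, 5 * n_rows)
    elif not _is_size_pair(grid_figsize):
        raise ValueError("grid_figsize must be a pair of numbers")
    return tuple(grid_figsize)
```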

[0.0.1b0] - 2024-07-24

  • First working version of EDA Toolkit.
  • Included the following functions:
    • ensure_directory: Ensures the existence of a directory, creating it if necessary.
    • add_ids: Adds a column of unique 9-digit IDs to the dataframe.
    • strip_trailing_period: Removes trailing periods from floats in a specified column.
    • parse_date_with_rule: Parses and standardizes date strings based on day/month/year or month/day/year formats.
    • data_types: Provides a report on data types, null values, and their percentages for dataframe columns.
    • dataframe_columns: Analyzes dataframe columns for data type, nulls, and unique value counts.
    • summarize_all_combinations: Generates summary tables for all possible combinations of specified variables.
    • save_dataframes_to_excel: Saves multiple DataFrames to separate sheets in an Excel file with customized formatting.
    • contingency_table: Creates contingency tables with sorting options.
    • highlight_columns: Highlights specific columns in a DataFrame with a specified background color.
    • kde_distributions: Generates KDE or histogram distribution plots for specified columns.
    • stacked_crosstab_plot: Generates stacked bar plots and crosstabs for specified columns.
    • plot_filtered_dataframes: Filters dataframes based on conditions and generates plots and crosstabs.
    • metrics_box_violin: Creates and saves individual boxplots or violin plots for given metrics and comparisons.

[0.0.1a0] - 2024-07-24

  • Initial alpha release of EDA Toolkit.
  • Note: This release was mistakenly published without the __init__.py and main.py files placed inside a subdirectory. As a result, pip install succeeded but imports failed. This issue will be addressed in subsequent releases.