Releases: lshpaner/eda_toolkit

EDA Toolkit 0.0.5

04 Aug 20:02

Ensure Consistent Font Size and Text Wrapping Across Plot Elements

Description

This PR addresses inconsistencies in font sizes and text wrapping across various plot elements in the stacked_crosstab_plot function. The following updates have been implemented to ensure uniformity and improve the readability of plots:

  1. Title Font Size and Text Wrapping:

    • Added a text_wrap parameter to control the wrapping of plot titles.
    • Ensured that title font sizes are consistent with axis label font sizes by explicitly setting the font size using ax.set_title() after plot generation.
  2. Legend Font Size Consistency:

    • Incorporated label_fontsize into the legend font size by directly setting the font size of the legend text using plt.setp(legend.get_texts(), fontsize=label_fontsize).
    • This ensures that the legend labels are consistent with the title and axis labels.
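
As a rough illustration of the title-wrapping step, the sketch below wraps a long title with Python's standard textwrap module. The helper name wrap_title and the sample title are hypothetical, and text_wrap is assumed here to be a maximum line width in characters; the actual logic inside stacked_crosstab_plot may differ.

```python
import textwrap


def wrap_title(title, text_wrap=50):
    """Break a title into lines of at most `text_wrap` characters.

    Hypothetical helper for illustration only; not part of eda_toolkit.
    """
    return "\n".join(textwrap.wrap(title, width=text_wrap))


# Sample title (made up for this sketch)
wrapped = wrap_title("Outcome by Age Group and Education Level", text_wrap=20)
print(wrapped)
```

The wrapped string could then be passed to ax.set_title(wrapped, fontsize=label_fontsize), with the legend text sized through plt.setp(legend.get_texts(), fontsize=label_fontsize) as described above.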

Testing

  • Verified that titles now wrap correctly and match the specified label_fontsize.
  • Confirmed that legend text scales according to label_fontsize, ensuring consistent font sizes across all plot elements.

Outcome

These changes improve the visual consistency of plots generated by the stacked_crosstab_plot function, making the plots more professional and easier to read. This PR should be reviewed and merged to standardize font sizing and text presentation across the codebase.

EDA Toolkit 0.0.4

02 Aug 21:30

Changelog

[0.0.4] - 2024-08-02

  • Stable release:
    • No new updates to codebase
    • Updated project description variable in setup.py to re-emphasize key elements of library
    • minor readme cleanup:
      • added icons for sections that did not have them

[0.0.3] - 2024-08-02

  • Stable release:
    • Updated logo size, fixed citation title, and some minor readme cleanup:
      • added additional section for documentation, cleaned up verbiage, moved acknowledgements section before licensing and support

[0.0.2] - 2024-08-01

  • First stable release:
    • No new updates to codebase; just minimal documentation updates to readme and setup.py files
    • Added logo, badges, and Zenodo-certified citation to readme

[0.0.1rc0] - 2024-08-01

  • No new updates to codebase; just minimal documentation updates to readme and setup.py files

[0.0.1b0] - 2024-08-01

New scatter_fit_plot() and additional updates

  • Added new scatter_fit_plot(), removed unused data_types(), added comment section headers

Added xlim and ylim inputs in kde_distributions()

  • Added xlim and ylim inputs to allow users to customize axis limits in kde_distributions()

Added xlim and ylim params to stacked_crosstab_plot()

  • Added xlim and ylim input parameters to stacked_crosstab_plot() to give users more flexibility on controlling axes limits

Added x and y limits to box_violin_plot()

  • Changed function name from metrics_box_violin() to box_violin_plot()
  • Added xlim and ylim inputs to control the x- and y-axis limits of box_violin_plot() (formerly known as metrics_box_violin)

Added ability to remove stacks from plots, plot all or one at a time

Key Changes

  1. plot_type Parameter:
    This parameter allows the user to choose between "regular", "normalized", or "both" plot types.

  2. remove_stacks Parameter:
    This parameter, when set to True, generates a regular bar plot using only the col parameter instead of a stacked bar plot. It only works when plot_type is set to "regular". If remove_stacks is set to True while plot_type is anything other than "regular", the function will raise an exception.

Explanation of Changes:

  • plot_type Parameter:

    • This parameter provides flexibility to the user, allowing them to specify the type of plot to generate. The options are:
      • "regular": Generates a standard bar plot.
      • "normalized": Generates a normalized bar plot.
      • "both": Generates both regular and normalized bar plots.
  • remove_stacks Parameter:

    • This parameter, when set to True, will generate a regular bar plot using only the col parameter. It effectively removes the stacking of the bars. This parameter is only applicable when plot_type is set to "regular". If used with any other plot_type, an exception will be raised to ensure proper usage.

These changes enhance the flexibility and functionality of the stacked_crosstab_plot function, allowing for more customizable and specific plot generation based on user requirements.
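
The interaction between plot_type and remove_stacks can be sketched as a small validation step. This is a minimal sketch, not the library's code: the helper name check_plot_options, the default values, and the choice of ValueError are assumptions (the notes above only say that an exception is raised).

```python
VALID_PLOT_TYPES = ("regular", "normalized", "both")


def check_plot_options(plot_type="both", remove_stacks=False):
    """Validate the plot_type / remove_stacks combination.

    Hypothetical helper; defaults and ValueError are assumptions.
    """
    if plot_type not in VALID_PLOT_TYPES:
        raise ValueError(f"plot_type must be one of {VALID_PLOT_TYPES}")
    # remove_stacks is only meaningful for the regular (non-normalized) plot
    if remove_stacks and plot_type != "regular":
        raise ValueError("remove_stacks=True requires plot_type='regular'")
    return plot_type, remove_stacks


print(check_plot_options("regular", remove_stacks=True))

try:
    check_plot_options("normalized", remove_stacks=True)
except ValueError as err:
    print(f"Rejected: {err}")
```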

[0.0.1b0] - 2024-07-31

Refined kde_distributions()

Key Changes

  1. Alpha Transparency for Histogram Fill:

    • Added a fill_alpha parameter to control the transparency of the histogram bars' fill color.
    • The default value is 0.6. An exception is raised if fill=False and fill_alpha is specified.
  2. Custom Font Sizes:

    • Introduced label_fontsize and tick_fontsize parameters to allow control over the font size of axis labels and tick marks independently.
  3. Scientific Notation Toggle:

    • Added a disable_sci_notation parameter to enable or disable scientific notation on axes.
  4. Improved Error Handling:

    • Added validation for the stat parameter to ensure that only valid options are accepted.
    • Added checks to ensure proper usage of fill_alpha and hist_edgecolor when fill is set to False.
  5. General Enhancements:

    • Updated the function's docstring to reflect the new parameters and provide comprehensive guidance on its usage.
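
A minimal sketch of the stat and fill_alpha checks described above. The helper name check_hist_options, the default stat, and the use of ValueError are assumptions. Treating any fill_alpha other than the 0.6 default as "specified" is also an assumption made for this sketch, since the notes do not say how the library detects it.

```python
# Statistic names as listed in the 2024-07-30 entry of this changelog
VALID_STATS = ("count", "density", "frequency",
               "probability", "proportion", "percent")
DEFAULT_FILL_ALPHA = 0.6


def check_hist_options(stat="density", fill=True, fill_alpha=DEFAULT_FILL_ALPHA):
    """Hypothetical validation helper; not the eda_toolkit implementation."""
    if stat.lower() not in VALID_STATS:
        raise ValueError(f"stat must be one of {VALID_STATS}, got {stat!r}")
    # Assumption: a non-default fill_alpha counts as "specified"
    if not fill and fill_alpha != DEFAULT_FILL_ALPHA:
        raise ValueError("fill_alpha cannot be set when fill=False")
    return True


print(check_hist_options(stat="percent", fill=True, fill_alpha=0.3))

try:
    check_hist_options(fill=False, fill_alpha=0.3)
except ValueError as err:
    print(f"Rejected: {err}")
```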

[0.0.1b0] - 2024-07-30

Enhance kde_distributions Function

Added Parameters

  1. grid_figsize and single_figsize:

    • Control the size of the overall grid figure and individual figures separately.
  2. hist_color and kde_color:

    • Allow customization of histogram and KDE plot colors.
  3. hist_edgecolor:

    • Allows customization of the histogram bar edges.
  4. hue:

    • Allows grouping data by a column.
  5. fill:

    • Controls whether to fill the histogram bars with color.
  6. y_axis_label:

    • Customizable y-axis label.
  7. log_scale_vars:

    • Specifies which variables to apply log scale.
  8. bins and binwidth:

    • Control the number and width of bins.
  9. stat:

    • Allows different statistics for the histogram (count, density, frequency, probability, proportion, percent).

Improvements

  1. Validation and Error Handling:

    • Checks for invalid log_scale_vars and throws a ValueError if any are found.
    • Throws a ValueError if edgecolor is changed while fill is set to False.
    • Issues a PerformanceWarning if both bins and binwidth are specified, warning of potential performance impacts.
  2. Customizable y-axis label:

    • Allows users to specify custom y-axis labels.
  3. Warning for KDE with Count:

    • Issues a warning if KDE is used with stat='count', as it may produce misleading plots.
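
The warnings and errors listed above could be wired up roughly as follows. This is a sketch under stated assumptions: the PerformanceWarning class defined here is a local stand-in, "invalid" log_scale_vars is taken to mean names that are not among the plotted columns, and the helper name, defaults, and sample column names are hypothetical.

```python
import warnings


class PerformanceWarning(UserWarning):
    """Local stand-in warning class for this sketch (assumption)."""


def check_bin_and_log_options(columns, log_scale_vars=None, bins=None,
                              binwidth=None, kde=True, stat="density"):
    """Hypothetical helper mirroring the checks described in the notes."""
    invalid = [v for v in (log_scale_vars or []) if v not in columns]
    if invalid:
        raise ValueError(f"Invalid log_scale_vars: {invalid}")
    if bins is not None and binwidth is not None:
        warnings.warn(
            "Both bins and binwidth were given; this may affect performance.",
            PerformanceWarning,
        )
    if kde and stat == "count":
        warnings.warn(
            "KDE with stat='count' may produce misleading plots.",
            UserWarning,
        )


# Capture the warnings so they can be inspected instead of printed to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_bin_and_log_options(["age", "income"], log_scale_vars=["income"],
                              bins=10, binwidth=5, kde=True, stat="count")

for w in caught:
    print(w.category.__name__, "-", w.message)
```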

Updated add_ids to ensure unique ids and idx check

This pull request updates the add_ids() function to enhance its functionality by:

  • Ensuring that each generated ID starts with a non-zero digit.
  • Adding a check to verify that the DataFrame index is unique.
  • Printing a warning message if duplicate index entries are found.

These changes improve the robustness of the function, ensuring that the IDs generated are always unique and valid, and provide necessary feedback when the DataFrame index is not unique.

Check for Unique Indices:

  • Before generating IDs, the function now checks if the DataFrame index is unique.
  • If duplicates are found, a warning is printed along with the list of duplicate index entries.

Generate Non-Zero Starting IDs:

  • The ID generation process is updated to ensure that the first digit of each ID is always non-zero.

Ensure Unique IDs:

  • A set is used to store the generated IDs, ensuring all IDs are unique before adding them to the DataFrame.
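
The three steps above can be sketched in plain Python. This is an illustration rather than the add_ids() source: the helper names and the seed argument are hypothetical, and a plain list stands in for the DataFrame index. num_digits follows the parameter of the same name mentioned in the 0.0.1c release notes.

```python
import random
from collections import Counter


def find_duplicate_index(index):
    """Return the index labels that occur more than once."""
    counts = Counter(index)
    return [label for label, n in counts.items() if n > 1]


def generate_unique_ids(n, num_digits=9, seed=None):
    """Generate n unique digit strings whose first digit is never 0."""
    if n > 9 * 10 ** (num_digits - 1):
        raise ValueError("not enough distinct IDs for the requested length")
    rng = random.Random(seed)
    seen = set()   # guarantees uniqueness
    ids = []       # preserves generation order
    while len(ids) < n:
        first = rng.choice("123456789")  # non-zero leading digit
        rest = "".join(rng.choice("0123456789") for _ in range(num_digits - 1))
        candidate = first + rest
        if candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids


index = [0, 1, 1, 2]  # stand-in for a DataFrame index with a duplicate
dupes = find_duplicate_index(index)
if dupes:
    print(f"Warning: duplicate index entries found: {dupes}")

ids = generate_unique_ids(len(index), num_digits=6, seed=42)
print(ids)
```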

Fix int conversion for numeric cols, reset decimal_places=0

This PR fixes the integer conversion issue for numeric columns when decimal_places=0 in the save_dataframes_to_excel function. Additionally, it resets decimal_places to 0 as the default value.

Changes include:

  • Convert only numeric columns to integers when decimal_places=0.
  • Reset decimal_places default value to 0.

This ensures correct formatting and avoids errors during conversion.
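
To illustrate the "numeric columns only" idea without depending on pandas, here is a plain-Python sketch in which a dict of lists stands in for a DataFrame. The helper name and sample data are hypothetical, and rounding before the integer conversion is an assumption; the actual save_dataframes_to_excel code may behave differently.

```python
def format_numeric_columns(table, decimal_places=0):
    """Round numeric columns; convert them to int when decimal_places=0.

    `table` maps column names to lists of values (stand-in for a DataFrame).
    Non-numeric columns are passed through unchanged.
    """
    out = {}
    for name, values in table.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if not is_numeric:
            out[name] = list(values)          # leave text columns alone
        elif decimal_places == 0:
            out[name] = [int(round(v)) for v in values]
        else:
            out[name] = [round(v, decimal_places) for v in values]
    return out


table = {"name": ["a", "b"], "score": [1.2, 3.7]}
print(format_numeric_columns(table, decimal_places=0))
```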

Contingency Table Updates

  1. Error Handling for Columns:

    • Added a check to ensure at least one column is specified.
    • Updated the function to accept a single column as a string or multiple columns as a list.
    • Updated the function to raise a ValueError if no columns are provided or if cols is not correctly specified.
  2. Function Parameters:

    • Changed the parameters from col1 and col2 to a single parameter cols which can be either a string or a list.
  3. Error Handling for SortBy:

    • Renamed SortBy to sort_by to standardize nomenclature.
    • Added a check to ensure sort_by is either 0 or 1.
    • Updated the function to raise a ValueError if sort_by is not 0 or 1.
  4. Sorting Logic:

    • Updated the sorting logic to handle the new cols parameter structure.
  5. Handling Categorical Data:

    • Modified the code to convert categorical columns to strings to avoid issues with fillna("").
  6. Handling Missing Values:

    • Added df = df.fillna('') to fill NA values within the function to account for missing data.
  7. Improved Function Documentation:

    • Updated the function documentation to reflect the new parameters and error handling.
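
The argument handling described in items 1 to 3 might look roughly like this. It is a minimal sketch: the helper name normalize_contingency_args and the sample column names are hypothetical, and only the validation is shown, not the table construction.

```python
def normalize_contingency_args(cols=None, sort_by=0):
    """Accept a column name or a list of names; validate sort_by.

    Hypothetical helper for illustration; not the eda_toolkit source.
    """
    if isinstance(cols, str) and cols:
        cols = [cols]  # a single column name becomes a one-item list
    if not isinstance(cols, list) or not cols:
        raise ValueError(
            "cols must be a column name or a non-empty list of column names"
        )
    if sort_by not in (0, 1):
        raise ValueError("sort_by must be 0 or 1")
    return cols, sort_by


print(normalize_contingency_args("age_group"))
print(normalize_contingency_args(["age_group", "sex"], sort_by=1))

try:
    normalize_contingency_args(["age_group"], sort_by=2)
except ValueError as err:
    print(f"Rejected: {err}")
```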

[0.0.1b0] - 2024-07-29

Contingency Table Updates

Added fillna('') to the output so that null values come through, removed the 'All' column name from the output, added sort options 0 and 1, and updated the docstring documentation. Tested successfully on Python 3.7.3.

Updated datetime Imports to accommodate different Python versions

Compatibility Enhancement:

  1. Added a version check for Python 3.7 and above.
    • Conditional import of datetime to handle different Python versions.
if sys.version_info >= (3, 7):
    from datetime import ...

EDA Toolkit 0.0.3

02 Aug 20:27

Changelog

[0.0.3] - 2024-08-02

  • Stable release:
    • Updated logo size, fixed citation title, and some minor readme cleanup:
      • added additional section for documentation, cleaned up verbiage, moved acknowledgements section before licensing and support

[0.0.2] - 2024-08-01

  • First stable release:
    • No new updates to codebase; just minimal documentation updates to readme and setup.py files
    • Added logo, badges, and Zenodo-certified citation to readme

[0.0.1rc0] - 2024-08-01

  • No new updates to codebase; just minimal documentation updates to readme and setup.py files

[0.0.1b0] - 2024-08-01

New scatter_fit_plot() and additional updates

  • Added new scatter_fit_plot(), removed unused data_types(), added comment section headers

Added xlim and ylim inputs in kde_distributions()

  • Added xlim and ylim inputs to allow users to customize axis limits in kde_distributions()

Added xlim and ylim params to stacked_crosstab_plot()

  • Added xlim and ylim input parameters to stacked_crosstab_plot() to give users more flexibility on controlling axes limits

Added x and y limits to box_violin_plot()

  • Changed function name from metrics_box_violin() to box_violin_plot()
  • Added xlim and ylim inputs to control the x- and y-axis limits of box_violin_plot() (formerly known as metrics_box_violin)

Added ability to remove stacks from plots, plot all or one at a time

Key Changes

  1. plot_type Parameter:
    This parameter allows the user to choose between "regular", "normalized", or "both" plot types.

  2. remove_stacks Parameter:
    This parameter, when set to True, generates a regular bar plot using only the col parameter instead of a stacked bar plot. It only works when plot_type is set to "regular". If remove_stacks is set to True while plot_type is anything other than "regular", the function will raise an exception.

Explanation of Changes:

  • plot_type Parameter:

    • This parameter provides flexibility to the user, allowing them to specify the type of plot to generate. The options are:
      • "regular": Generates a standard bar plot.
      • "normalized": Generates a normalized bar plot.
      • "both": Generates both regular and normalized bar plots.
  • remove_stacks Parameter:

    • This parameter, when set to True, will generate a regular bar plot using only the col parameter. It effectively removes the stacking of the bars. This parameter is only applicable when plot_type is set to "regular". If used with any other plot_type, an exception will be raised to ensure proper usage.

These changes enhance the flexibility and functionality of the stacked_crosstab_plot function, allowing for more customizable and specific plot generation based on user requirements.

[0.0.1b0] - 2024-07-31

Refined kde_distributions()

Key Changes

  1. Alpha Transparency for Histogram Fill:

    • Added a fill_alpha parameter to control the transparency of the histogram bars' fill color.
    • The default value is 0.6. An exception is raised if fill=False and fill_alpha is specified.
  2. Custom Font Sizes:

    • Introduced label_fontsize and tick_fontsize parameters to allow control over the font size of axis labels and tick marks independently.
  3. Scientific Notation Toggle:

    • Added a disable_sci_notation parameter to enable or disable scientific notation on axes.
  4. Improved Error Handling:

    • Added validation for the stat parameter to ensure that only valid options are accepted.
    • Added checks to ensure proper usage of fill_alpha and hist_edgecolor when fill is set to False.
  5. General Enhancements:

    • Updated the function's docstring to reflect the new parameters and provide comprehensive guidance on its usage.

[0.0.1b0] - 2024-07-30

Enhance kde_distributions Function

Added Parameters

  1. grid_figsize and single_figsize:

    • Control the size of the overall grid figure and individual figures separately.
  2. hist_color and kde_color:

    • Allow customization of histogram and KDE plot colors.
  3. hist_edgecolor:

    • Allows customization of the histogram bar edges.
  4. hue:

    • Allows grouping data by a column.
  5. fill:

    • Controls whether to fill the histogram bars with color.
  6. y_axis_label:

    • Customizable y-axis label.
  7. log_scale_vars:

    • Specifies which variables to apply log scale.
  8. bins and binwidth:

    • Control the number and width of bins.
  9. stat:

    • Allows different statistics for the histogram (count, density, frequency, probability, proportion, percent).

Improvements

  1. Validation and Error Handling:

    • Checks for invalid log_scale_vars and throws a ValueError if any are found.
    • Throws a ValueError if edgecolor is changed while fill is set to False.
    • Issues a PerformanceWarning if both bins and binwidth are specified, warning of potential performance impacts.
  2. Customizable y-axis label:

    • Allows users to specify custom y-axis labels.
  3. Warning for KDE with Count:

    • Issues a warning if KDE is used with stat='count', as it may produce misleading plots.

Updated add_ids to ensure unique ids and idx check

This pull request updates the add_ids() function to enhance its functionality by:

  • Ensuring that each generated ID starts with a non-zero digit.
  • Adding a check to verify that the DataFrame index is unique.
  • Printing a warning message if duplicate index entries are found.

These changes improve the robustness of the function, ensuring that the IDs generated are always unique and valid, and provide necessary feedback when the DataFrame index is not unique.

Check for Unique Indices:

  • Before generating IDs, the function now checks if the DataFrame index is unique.
  • If duplicates are found, a warning is printed along with the list of duplicate index entries.

Generate Non-Zero Starting IDs:

  • The ID generation process is updated to ensure that the first digit of each ID is always non-zero.

Ensure Unique IDs:

  • A set is used to store the generated IDs, ensuring all IDs are unique before adding them to the DataFrame.

Fix int conversion for numeric cols, reset decimal_places=0

This PR fixes the integer conversion issue for numeric columns when decimal_places=0 in the save_dataframes_to_excel function. Additionally, it resets decimal_places to 0 as the default value.

Changes include:

  • Convert only numeric columns to integers when decimal_places=0.
  • Reset decimal_places default value to 0.

This ensures correct formatting and avoids errors during conversion.

Contingency Table Updates

  1. Error Handling for Columns:

    • Added a check to ensure at least one column is specified.
    • Updated the function to accept a single column as a string or multiple columns as a list.
    • Updated the function to raise a ValueError if no columns are provided or if cols is not correctly specified.
  2. Function Parameters:

    • Changed the parameters from col1 and col2 to a single parameter cols which can be either a string or a list.
  3. Error Handling for SortBy:

    • Renamed SortBy to sort_by to standardize nomenclature.
    • Added a check to ensure sort_by is either 0 or 1.
    • Updated the function to raise a ValueError if sort_by is not 0 or 1.
  4. Sorting Logic:

    • Updated the sorting logic to handle the new cols parameter structure.
  5. Handling Categorical Data:

    • Modified the code to convert categorical columns to strings to avoid issues with fillna("").
  6. Handling Missing Values:

    • Added df = df.fillna('') to fill NA values within the function to account for missing data.
  7. Improved Function Documentation:

    • Updated the function documentation to reflect the new parameters and error handling.

[0.0.1b0] - 2024-07-29

Contingency Table Updates

Added fillna('') to the output so that null values come through, removed the 'All' column name from the output, added sort options 0 and 1, and updated the docstring documentation. Tested successfully on Python 3.7.3.

Updated datetime Imports to accommodate different Python versions

Compatibility Enhancement:

  1. Added a version check for Python 3.7 and above.
    • Conditional import of datetime to handle different Python versions.
if sys.version_info >= (3, 7):
    from datetime import datetime
else:
    import datetime
  2. In dataframe_columns():
start_time = (datetime.now() if sys.version_info >= (3, 7)
              else datetime.datetime.now())

stop_time = (datetime.now() if sys.version_info >= (3, 7)
      ...
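
The fragments above are truncated in these notes. A self-contained sketch of the same pattern is shown below; the current_time helper and the summed range are hypothetical stand-ins for the timing code inside dataframe_columns(), which is not reproduced here.

```python
import sys

# Conditional import, as described in the compatibility note above
if sys.version_info >= (3, 7):
    from datetime import datetime
else:
    import datetime


def current_time():
    """Return the current time regardless of which import branch ran."""
    return (datetime.now() if sys.version_info >= (3, 7)
            else datetime.datetime.now())


start_time = current_time()
total = sum(range(100000))  # placeholder workload for this sketch
stop_time = current_time()

print("Elapsed:", stop_time - start_time)
```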

eda_toolkit 0.0.2

02 Aug 05:31

First Stable Release - EDA Toolkit 0.0.2

  • added zenodo citation to PyPI
  • added logo to PyPI
  • added acknowledgements and references to PyPI

eda_toolkit 0.0.1c

02 Aug 04:47
Pre-release

Release Notes

Changes from main_old.py to main.py

1. Imports and Dependencies:

  • Removed:
    • import datetime.
  • Added:
    • import matplotlib.ticker as mticker for formatting.
    • import sys for conditional imports.
    • Conditional import of datetime depending on Python version (sys.version_info).

2. Function Changes:

  • Function add_ids:

    • Parameter Modifications:
      • column_name renamed to id_colname.
      • New parameters num_digits to specify ID digit length, and set_as_index to optionally set the new ID column as the index.
    • Functionality Update:
      • Allows setting the new ID column as an index and customizes the number of digits in generated IDs.
    • Documentation Update: Expanded to describe new parameters and functionality.
  • Function scatter_plots_grid:

    • New Functionality:
      • Added options for customizing axis limits (xlim, ylim).
      • Enhanced options for saving plots as PNG or SVG.
      • Added support for showing or hiding plots based on user preference (show_plot parameter).
      • Improved legend management with the option to show or remove the legend.
      • Introduced parameters to control tick labels and axis label font sizes.
  • Plot Saving and Displaying Enhancements:

    • The code for saving and displaying plots has been refactored:
      • Saving individual plots: Can now be conditionally saved based on user input (save_individual).
      • Saving and displaying grids: Improved support for saving grids of plots as a single image and displaying them conditionally.
  • Additional Enhancements:

    • Introduced new function for fitting and plotting best-fit lines in scatter plots.

3. New Functions:

  • add_best_fit:
    • Description: Adds a best-fit line to a scatter plot with customizable line style and color.
    • Parameters: ax, x_data, y_data, line_style, line_color.
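
As a sketch of the fitting step only, the code below computes a straight-line least-squares fit in plain Python. Treating the best-fit line as a degree-1 least-squares fit is an assumption, and best_fit_coefficients is a hypothetical helper, not the add_best_fit function itself.

```python
def best_fit_coefficients(x_data, y_data):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x_data)
    if n != len(y_data) or n < 2:
        raise ValueError("x_data and y_data need the same length (at least 2)")
    mean_x = sum(x_data) / n
    mean_y = sum(y_data) / n
    sxx = sum((x - mean_x) ** 2 for x in x_data)
    if sxx == 0:
        raise ValueError("x_data must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_data, y_data))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Sample points lying on y = 2x + 1 (made up for this sketch)
x_data = [0, 1, 2, 3]
y_data = [1, 3, 5, 7]
slope, intercept = best_fit_coefficients(x_data, y_data)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

With a Matplotlib axis in hand, the fitted line could then be drawn with something like ax.plot(x_data, [slope * x + intercept for x in x_data], linestyle=line_style, color=line_color).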

4. Code Structural Changes:

  • Comment and Documentation Updates:

    • Improved and expanded docstrings and inline comments.
    • Added detailed explanations for new parameters and functionalities.
  • Refactoring:

    • Plotting-related code was reorganized for better readability and maintainability.

5. Bug Fixes and Enhancements:

  • Plotting Enhancements:

    • Improved handling of axis visibility and edge cases.
    • Enhanced plot saving mechanisms to ensure correct file naming and prevent overwriting.
  • Error Handling:

    • Improved error handling and warnings related to file saving and plotting.

0.0.1b

25 Jul 04:05
Pre-release

Initial release via GitHub after manual release of 0.0.1a