Releases: lshpaner/eda_toolkit
EDA Toolkit 0.0.5
Ensure Consistent Font Size and Text Wrapping Across Plot Elements
Description
This PR addresses inconsistencies in font sizes and text wrapping across various plot elements in the stacked_crosstab_plot function. The following updates have been implemented to ensure uniformity and improve the readability of plots:
- Title Font Size and Text Wrapping:
  - Added a text_wrap parameter to control the wrapping of plot titles.
  - Ensured that title font sizes are consistent with axis label font sizes by explicitly setting the font size using ax.set_title() after plot generation.
- Legend Font Size Consistency:
  - Incorporated label_fontsize into the legend font size by directly setting the font size of the legend text using plt.setp(legend.get_texts(), fontsize=label_fontsize).
  - This ensures that the legend labels are consistent with the title and axis labels.
Testing
- Verified that titles now wrap correctly and match the specified label_fontsize.
- Confirmed that legend text scales according to label_fontsize, ensuring consistent font sizes across all plot elements.
Outcome
These changes improve the visual consistency of plots generated by the stacked_crosstab_plot function, making the plots more professional and easier to read. This PR should be reviewed and merged to standardize font sizing and text presentation across the codebase.
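The title-wrapping step might look roughly like the sketch below. It assumes that text_wrap is a maximum line width in characters and that wrapping is done with the standard-library textwrap module; the helper name wrap_title and the default of 50 are hypothetical and not taken from eda_toolkit.

```python
import textwrap


def wrap_title(title, text_wrap=50):
    """Break a long plot title into lines of at most `text_wrap` characters.

    Hypothetical helper: the name and the default width are assumptions.
    """
    # textwrap.fill joins the wrapped lines with newline characters,
    # which matplotlib renders as a multi-line title.
    return textwrap.fill(title, width=text_wrap)


wrapped = wrap_title(
    "Distribution of outcome by age group and region", text_wrap=20
)
print(wrapped)
```

The wrapped string could then be passed to ax.set_title(wrapped, fontsize=label_fontsize) so that the title uses the same size as the axis labels.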
EDA Toolkit 0.0.4
Changelog
[0.0.4] - 2024-08-02
- Stable release:
  - No new updates to codebase
  - Updated project description variable in setup.py to re-emphasize key elements of library
  - Minor readme cleanup:
    - Added icons for sections that did not have them
[0.0.3] - 2024-08-02
- Stable release:
  - Updated logo size, fixed citation title, and some minor readme cleanup:
    - Added additional section for documentation, cleaned up verbiage, moved acknowledgements section before licensing and support
[0.0.2] - 2024-08-01
- First stable release:
  - No new updates to codebase; just minimal documentation updates to readme and setup.py files
  - Added logo, badges, and Zenodo-certified citation to readme
[0.0.1rc0] - 2024-08-01
- No new updates to codebase; just minimal documentation updates to readme and setup.py files
[0.0.1b0] - 2024-08-01
New scatter_fit_plot() and additional updates
- Added new scatter_fit_plot(), removed unused data_types(), added comment section headers
Added xlim and ylim inputs in kde_distributions()
- Added xlim and ylim inputs to allow user to customize axes limits in kde_distributions()
Added xlim and ylim params to stacked_crosstab_plot()
- Added xlim and ylim input parameters to stacked_crosstab_plot() to give users more flexibility on controlling axes limits
Added x and y limits to box_violin_plot()
- Changed function name from metrics_box_violin() to box_violin_plot()
- Added xlim and ylim inputs to control x and y-axis limits of box_violin_plot() (formerly known as metrics_box_violin)
Added ability to remove stacks from plots, plot all or one at a time
Key Changes
- plot_type Parameter: This parameter allows the user to choose between "regular", "normalized", or "both" plot types.
- remove_stacks Parameter: This parameter, when set to True, generates a regular bar plot using only the col parameter instead of a stacked bar plot. It only works when plot_type is set to "regular". If remove_stacks is set to True while plot_type is anything other than "regular", the function will raise an exception.
Explanation of Changes:
- plot_type Parameter:
  - This parameter provides flexibility to the user, allowing them to specify the type of plot to generate. The options are:
    - "regular": Generates a standard bar plot.
    - "normalized": Generates a normalized bar plot.
    - "both": Generates both regular and normalized bar plots.
- remove_stacks Parameter:
  - This parameter, when set to True, will generate a regular bar plot using only the col parameter. It effectively removes the stacking of the bars. This parameter is only applicable when plot_type is set to "regular". If used with any other plot_type, an exception will be raised to ensure proper usage.
These changes enhance the flexibility and functionality of the stacked_crosstab_plot function, allowing for more customizable and specific plot generation based on user requirements.
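The interaction between plot_type and remove_stacks can be sketched as a small validation step. This is an illustration only: the helper name check_plot_options, the default values, the use of ValueError as the exception type, and the extra check for unknown plot_type values are assumptions, not the library's actual code.

```python
VALID_PLOT_TYPES = ("regular", "normalized", "both")


def check_plot_options(plot_type="both", remove_stacks=False):
    """Validate plot_type / remove_stacks and list the plots to draw.

    Hypothetical helper; exception type and defaults are assumptions.
    """
    if plot_type not in VALID_PLOT_TYPES:
        raise ValueError(
            f"plot_type must be one of {VALID_PLOT_TYPES}, got {plot_type!r}"
        )
    # Unstacked bars only make sense for the regular (count) plot.
    if remove_stacks and plot_type != "regular":
        raise ValueError("remove_stacks=True requires plot_type='regular'")
    # "both" expands to the two individual plot kinds.
    if plot_type == "both":
        return ["regular", "normalized"]
    return [plot_type]


print(check_plot_options("both"))
print(check_plot_options("regular", remove_stacks=True))
```

Validating the combination up front means an invalid request fails before any figure is created.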
[0.0.1b0] - 2024-07-31
Refined kde_distributions()
Key Changes
- Alpha Transparency for Histogram Fill:
  - Added a fill_alpha parameter to control the transparency of the histogram bars' fill color.
  - The default value is 0.6. An exception is raised if fill=False and fill_alpha is specified.
- Custom Font Sizes:
  - Introduced label_fontsize and tick_fontsize parameters to allow control over the font size of axis labels and tick marks independently.
- Scientific Notation Toggle:
  - Added a disable_sci_notation parameter to enable or disable scientific notation on axes.
- Improved Error Handling:
  - Added validation for the stat parameter to ensure that only valid options are accepted.
  - Added checks to ensure proper usage of fill_alpha and hist_edgecolor when fill is set to False.
- General Enhancements:
  - Updated the function's docstring to reflect the new parameters and provide comprehensive guidance on its usage.
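A rough sketch of the stat and fill_alpha checks follows. The list of valid stat values is the one given in the 2024-07-30 entry of this changelog; the helper name resolve_hist_options, the default stat, the use of None to detect whether fill_alpha was specified, and ValueError as the exception type are all assumptions made for illustration.

```python
VALID_STATS = {
    "count", "density", "frequency", "probability", "proportion", "percent",
}
DEFAULT_FILL_ALPHA = 0.6  # default value stated in the changelog


def resolve_hist_options(stat="density", fill=True, fill_alpha=None):
    """Validate histogram options and return the values to plot with.

    Hypothetical helper: fill_alpha=None stands for "not specified".
    """
    stat = stat.lower()
    if stat not in VALID_STATS:
        raise ValueError(
            f"Invalid stat {stat!r}; expected one of {sorted(VALID_STATS)}"
        )
    # A fill transparency is meaningless when the bars are not filled.
    if not fill and fill_alpha is not None:
        raise ValueError("fill_alpha cannot be specified when fill=False")
    alpha = DEFAULT_FILL_ALPHA if fill_alpha is None else fill_alpha
    return {"stat": stat, "fill": fill, "alpha": alpha if fill else None}


print(resolve_hist_options(stat="Percent", fill_alpha=0.3))
```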
[0.0.1b0] - 2024-07-30
Enhance kde_distributions Function
Added Parameters
- grid_figsize and single_figsize:
  - Control the size of the overall grid figure and individual figures separately.
- hist_color and kde_color:
  - Allow customization of histogram and KDE plot colors.
- hist_edgecolor:
  - Allows customization of the histogram bar edges.
- hue:
  - Allows grouping data by a column.
- fill:
  - Controls whether to fill the histogram bars with color.
- y_axis_label:
  - Customizable y-axis label.
- log_scale_vars:
  - Specifies which variables to apply log scale.
- bins and binwidth:
  - Control the number and width of bins.
- stat:
  - Allows different statistics for the histogram (count, density, frequency, probability, proportion, percent).
Improvements
- Validation and Error Handling:
  - Checks for invalid log_scale_vars and throws a ValueError if any are found.
  - Throws a ValueError if edgecolor is changed while fill is set to False.
  - Issues a PerformanceWarning if both bins and binwidth are specified, warning of potential performance impacts.
- Customizable y-axis label:
  - Allows users to specify custom y-axis labels.
- Warning for KDE with Count:
  - Issues a warning if KDE is used with stat='count', as it may produce misleading plots.
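The two warnings described above could be issued along the following lines. This sketch uses the built-in UserWarning so that it runs with the standard library alone, whereas the changelog names PerformanceWarning for the bins/binwidth case; the helper name, the parameter defaults, and the message texts are assumptions.

```python
import warnings


def warn_on_histogram_options(bins=None, binwidth=None, kde=True,
                              stat="density"):
    """Emit warnings for option combinations that are allowed but risky.

    Hypothetical helper; returns the messages so callers can inspect them.
    """
    messages = []
    if bins is not None and binwidth is not None:
        messages.append(
            "Both bins and binwidth were specified; this may affect "
            "performance."
        )
    if kde and stat == "count":
        messages.append(
            "KDE combined with stat='count' may produce misleading plots."
        )
    for message in messages:
        # stacklevel=2 points the warning at the caller's line.
        warnings.warn(message, UserWarning, stacklevel=2)
    return messages


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_histogram_options(bins=10, binwidth=0.5, kde=True, stat="count")
print(len(caught))
```

Warnings rather than exceptions fit these cases because the plot can still be drawn; the user is only told that the result may be slow or hard to interpret.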
Updated add_ids to ensure unique ids and idx check
This pull request updates the add_ids() function to enhance its functionality by:
- Ensuring that each generated ID starts with a non-zero digit.
- Adding a check to verify that the DataFrame index is unique.
- Printing a warning message if duplicate index entries are found.
These changes improve the robustness of the function, ensuring that the IDs generated are always unique and valid, and provide necessary feedback when the DataFrame index is not unique.
Check for Unique Indices:
- Before generating IDs, the function now checks if the DataFrame index is unique.
- If duplicates are found, a warning is printed along with the list of duplicate index entries.
Generate Non-Zero Starting IDs:
- The ID generation process is updated to ensure that the first digit of each ID is always non-zero.
Ensure Unique IDs:
- A set is used to store the generated IDs, ensuring all IDs are unique before adding them to the DataFrame.
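The ID-generation rules above (non-zero first digit, uniqueness enforced through a set, and a duplicate-index check) can be sketched without pandas. The function names, the seed argument, and the default of nine digits are assumptions for illustration; the real add_ids() operates on a DataFrame.

```python
import random


def generate_unique_ids(n, num_digits=9, seed=None):
    """Return n distinct ID strings whose first digit is never zero."""
    if n > 9 * 10 ** (num_digits - 1):
        raise ValueError("Not enough distinct IDs for the requested length")
    rng = random.Random(seed)
    ids = set()  # a set silently discards any repeated ID
    while len(ids) < n:
        first = rng.choice("123456789")  # non-zero leading digit
        rest = "".join(rng.choice("0123456789")
                       for _ in range(num_digits - 1))
        ids.add(first + rest)
    return list(ids)


def find_duplicate_index_entries(index):
    """Return index values that occur more than once, in first-repeat order."""
    seen, duplicates = set(), []
    for value in index:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


duplicates = find_duplicate_index_entries([1, 2, 2, 3, 1])
if duplicates:
    print("Warning: duplicate index entries found:", duplicates)
print(generate_unique_ids(3, num_digits=5, seed=0))
```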
Fix int conversion for numeric cols, reset decimal_places=0
This PR fixes the integer conversion issue for numeric columns when decimal_places=0 in the save_dataframes_to_excel function. Additionally, it resets decimal_places to 0 as the default value.
Changes include:
- Convert only numeric columns to integers when decimal_places=0.
- Reset decimal_places default value to 0.
This ensures correct formatting and avoids errors during conversion.
Contingency Table Updates
- Error Handling for Columns:
  - Added a check to ensure at least one column is specified.
  - Updated the function to accept a single column as a string or multiple columns as a list.
  - Updated the function to raise a ValueError if no columns are provided or if cols is not correctly specified.
- Function Parameters:
  - Changed the parameters from col1 and col2 to a single parameter cols which can be either a string or a list.
- Error Handling for SortBy:
  - Renamed SortBy to sort_by to standardize nomenclature.
  - Added a check to ensure sort_by is either 0 or 1.
  - Updated the function to raise a ValueError if sort_by is not 0 or 1.
- Sorting Logic:
  - Updated the sorting logic to handle the new cols parameter structure.
- Handling Categorical Data:
  - Modified the code to convert categorical columns to strings to avoid issues with fillna("").
- Handling Missing Values:
  - Added df = df.fillna('') to fill NA values within the function to account for missing data.
- Improved Function Documentation:
  - Updated the function documentation to reflect the new parameters and error handling.
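The argument checks listed above might be implemented roughly as follows. The helper name normalize_contingency_args and the error messages are hypothetical, and the sketch covers only the validation of cols and sort_by, not the table construction itself.

```python
def normalize_contingency_args(cols=None, sort_by=0):
    """Validate cols and sort_by; always return cols as a list.

    Hypothetical helper illustrating the checks, not the library's code.
    """
    # Rejects None, an empty string, and an empty list alike.
    if not cols:
        raise ValueError("At least one column must be specified.")
    if isinstance(cols, str):
        cols = [cols]  # a single column name becomes a one-item list
    elif not isinstance(cols, list):
        raise ValueError("cols must be a string or a list of column names.")
    if sort_by not in (0, 1):
        raise ValueError("sort_by must be 0 or 1.")
    return cols, sort_by


print(normalize_contingency_args("age"))
print(normalize_contingency_args(["age", "sex"], sort_by=1))
```

Normalizing a single string to a list lets the rest of the function treat both call styles the same way.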
[0.0.1b0] - 2024-07-29
Contingency Table Updates
- Added fillna('') to output so that null values come through, removed 'All' col name from output, added sort options 0 and 1, and updated docstring documentation. Tested successfully on Python 3.7.3.
Updated datetime imports to accommodate different Python versions
Compatibility Enhancement:
- Added a version check for Python 3.7 and above.
  - Conditional import of datetime to handle different Python versions.
if sys.version_info >= (3, 7):
    from datetime import datetime
else:
    import datetime
- In dataframe_columns():
start_time = (datetime.now() if sys.version_info >= (3, 7)
              else datetime.datetime.now())
stop_time = (datetime.now() if sys.version_info >= (3, 7)
...
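A self-contained version of the same pattern is sketched below for reference. The helper current_time() and the timed workload are hypothetical; the sketch only shows one way to hide the version branch behind a single function so that call sites do not repeat the conditional.

```python
import sys

if sys.version_info >= (3, 7):
    from datetime import datetime
else:
    import datetime


def current_time():
    """Return the current time regardless of which import branch ran."""
    if sys.version_info >= (3, 7):
        return datetime.now()           # name bound to the datetime class
    return datetime.datetime.now()      # name bound to the datetime module


start_time = current_time()
total = sum(range(1000))  # stand-in for the work being timed
stop_time = current_time()
elapsed = stop_time - start_time
print("Elapsed seconds:", elapsed.total_seconds())
```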
eda_toolkit 0.0.2
First Stable Release - EDA Toolkit 0.0.2
- Added Zenodo citation to PyPI
- Added logo to PyPI
- Added acknowledgements and references to PyPI
eda_toolkit 0.0.1c
Release Notes
Changes from main_old.py to main.py
1. Imports and Dependencies:
- Removed: import datetime.
- Added:
  - import matplotlib.ticker as mticker for formatting.
  - import sys for conditional imports.
  - Conditional import of datetime depending on Python version (sys.version_info).
2. Function Changes:
- Function add_ids:
  - Parameter Modifications:
    - column_name renamed to id_colname.
    - New parameters num_digits to specify ID digit length, and set_as_index to optionally set the new ID column as the index.
  - Functionality Update:
    - Allows setting the new ID column as an index and customizes the number of digits in generated IDs.
  - Documentation Update: Expanded to describe new parameters and functionality.
- Function scatter_plots_grid:
  - New Functionality:
    - Added options for customizing axis limits (xlim, ylim).
    - Enhanced options for saving plots as PNG or SVG.
    - Added support for showing or hiding plots based on user preference (show_plot parameter).
    - Improved legend management with the option to show or remove the legend.
    - Introduced parameters to control tick labels and axis label font sizes.
- Plot Saving and Displaying Enhancements:
  - The code for saving and displaying plots has been refactored:
    - Saving individual plots: Can now be conditionally saved based on user input (save_individual).
    - Saving and displaying grids: Improved support for saving grids of plots as a single image and displaying them conditionally.
- Additional Enhancements:
  - Introduced new function for fitting and plotting best-fit lines in scatter plots.
3. New Functions:
- add_best_fit:
  - Description: Adds a best-fit line to a scatter plot with customizable line style and color.
  - Parameters: ax, x_data, y_data, line_style, line_color.
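The release notes do not say how add_best_fit computes the line. As an illustration only, the sketch below fits a straight line by ordinary least squares in plain Python; the helper name best_fit_line is hypothetical and it covers only the coefficient calculation, not the drawing on ax.

```python
def best_fit_line(x_data, y_data):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(x_data)
    if n != len(y_data) or n < 2:
        raise ValueError("x_data and y_data need equal length of at least 2")
    mean_x = sum(x_data) / n
    mean_y = sum(y_data) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in x_data)
    if sxx == 0:
        raise ValueError("x_data must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_data, y_data))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = best_fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

With the coefficients in hand, the line could be drawn on an existing axes object with a call such as ax.plot(xs, [slope * x + intercept for x in xs], linestyle=line_style, color=line_color).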
4. Code Structural Changes:
- Comment and Documentation Updates:
  - Improved and expanded docstrings and inline comments.
  - Added detailed explanations for new parameters and functionalities.
- Refactoring:
  - Plotting-related code was reorganized for better readability and maintainability.
5. Bug Fixes and Enhancements:
- Plotting Enhancements:
  - Improved handling of axis visibility and edge cases.
  - Enhanced plot saving mechanisms to ensure correct file naming and prevent overwriting.
- Error Handling:
  - Improved error handling and warnings related to file saving and plotting.