EDA Toolkit 0.0.3
Changelog
[0.0.3] - 2024-08-02
- Stable release:
  - Updated logo size, fixed citation title, and did some minor readme cleanup:
    - Added a documentation section, cleaned up wording, and moved the acknowledgements section before licensing and support
[0.0.2] - 2024-08-01
- First stable release:
  - No new updates to codebase; just minimal documentation updates to readme and setup.py files
  - Added logo, badges, and Zenodo-certified citation to readme
[0.0.1rc0] - 2024-08-01
- No new updates to codebase; just minimal documentation updates to readme and setup.py files
[0.0.1b0] - 2024-08-01
New scatter_fit_plot() and additional updates
- Added new scatter_fit_plot(), removed unused data_types(), added comment section headers
Added xlim and ylim inputs in kde_distributions()
- Added xlim and ylim inputs to allow users to customize the axes limits in kde_distributions()
Added xlim and ylim params to stacked_crosstab_plot()
- Added xlim and ylim input parameters to stacked_crosstab_plot() to give users more flexibility in controlling axes limits
Added x and y limits to box_violin_plot()
- Changed function name from metrics_box_violin() to box_violin_plot()
- Added xlim and ylim inputs to control the x- and y-axis limits of box_violin_plot() (formerly known as metrics_box_violin)
Added ability to remove stacks from plots, plot all or one at a time
Key Changes
- plot_type Parameter: This parameter allows the user to choose between "regular", "normalized", or "both" plot types.
- remove_stacks Parameter: This parameter, when set to True, generates a regular bar plot using only the col parameter instead of a stacked bar plot. It only works when plot_type is set to "regular". If remove_stacks is set to True while plot_type is anything other than "regular", the function will raise an exception.
Explanation of Changes:
- plot_type Parameter:
  - This parameter provides flexibility to the user, allowing them to specify the type of plot to generate. The options are:
    - "regular": Generates a standard bar plot.
    - "normalized": Generates a normalized bar plot.
    - "both": Generates both regular and normalized bar plots.
- remove_stacks Parameter:
  - This parameter, when set to True, will generate a regular bar plot using only the col parameter. It effectively removes the stacking of the bars. This parameter is only applicable when plot_type is set to "regular". If used with any other plot_type, an exception will be raised to ensure proper usage.
These changes enhance the flexibility and functionality of the stacked_crosstab_plot function, allowing for more customizable and specific plot generation based on user requirements.
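The interaction between plot_type and remove_stacks can be sketched roughly as follows. This is an illustration only, not the library's code: the helper name resolve_plot_options, the "unstacked" label, and the choice of ValueError are assumptions, since the changelog only says that an exception is raised.

```python
VALID_PLOT_TYPES = ("regular", "normalized", "both")


def resolve_plot_options(plot_type, remove_stacks=False):
    """Validate the two options and list the plots that would be drawn.

    Hypothetical helper for illustration; not part of EDA Toolkit.
    """
    if plot_type not in VALID_PLOT_TYPES:
        raise ValueError(
            f"plot_type must be one of {VALID_PLOT_TYPES}, got {plot_type!r}"
        )
    # remove_stacks is only meaningful for the regular (non-normalized) plot.
    # ValueError is an assumption; the changelog only says "an exception".
    if remove_stacks and plot_type != "regular":
        raise ValueError(
            "remove_stacks=True is only supported when plot_type='regular'"
        )
    if remove_stacks:
        return ["unstacked"]  # plain bar plot built from the col parameter only
    if plot_type == "both":
        return ["regular", "normalized"]
    return [plot_type]
```

Rejecting the invalid combination up front, rather than silently ignoring remove_stacks, matches the behaviour described above.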
[0.0.1b0] - 2024-07-31
Refined kde_distributions()
Key Changes
- Alpha Transparency for Histogram Fill:
  - Added a fill_alpha parameter to control the transparency of the histogram bars' fill color.
  - The default value is 0.6. An exception is raised if fill=False and fill_alpha is specified.
- Custom Font Sizes:
  - Introduced label_fontsize and tick_fontsize parameters to allow control over the font size of axis labels and tick marks independently.
- Scientific Notation Toggle:
  - Added a disable_sci_notation parameter to enable or disable scientific notation on axes.
- Improved Error Handling:
  - Added validation for the stat parameter to ensure that only valid options are accepted.
  - Added checks to ensure proper usage of fill_alpha and hist_edgecolor when fill is set to False.
- General Enhancements:
  - Updated the function's docstring to reflect the new parameters and provide comprehensive guidance on its usage.
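A minimal sketch of the stat and fill checks described above, under stated assumptions: the helper name check_hist_options is hypothetical, None is used here to mean "not specified" (the library may detect this differently), and ValueError stands in for the unspecified exception type. The list of stat values is taken from the 2024-07-30 entry of this changelog.

```python
VALID_STATS = (
    "count", "density", "frequency", "probability", "proportion", "percent",
)
DEFAULT_FILL_ALPHA = 0.6  # default stated in the changelog


def check_hist_options(stat, fill=True, fill_alpha=None, hist_edgecolor=None):
    """Validate histogram options and return the fill alpha to use.

    Hypothetical helper for illustration; returns None when bars are unfilled.
    """
    if stat not in VALID_STATS:
        raise ValueError(f"stat must be one of {VALID_STATS}, got {stat!r}")
    if not fill:
        # Neither option has any visible effect on unfilled bars,
        # so reject them instead of ignoring them.
        if fill_alpha is not None:
            raise ValueError("fill_alpha cannot be used when fill=False")
        if hist_edgecolor is not None:
            raise ValueError("hist_edgecolor cannot be used when fill=False")
        return None
    return DEFAULT_FILL_ALPHA if fill_alpha is None else fill_alpha
```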
[0.0.1b0] - 2024-07-30
Enhance kde_distributions Function
Added Parameters
- grid_figsize and single_figsize:
  - Control the size of the overall grid figure and individual figures separately.
- hist_color and kde_color:
  - Allow customization of histogram and KDE plot colors.
- hist_edgecolor:
  - Allows customization of the histogram bar edges.
- hue:
  - Allows grouping data by a column.
- fill:
  - Controls whether to fill the histogram bars with color.
- y_axis_label:
  - Customizable y-axis label.
- log_scale_vars:
  - Specifies which variables to apply log scale.
- bins and binwidth:
  - Control the number and width of bins.
- stat:
  - Allows different statistics for the histogram (count, density, frequency, probability, proportion, percent).
Improvements
- Validation and Error Handling:
  - Checks for invalid log_scale_vars and throws a ValueError if any are found.
  - Throws a ValueError if edgecolor is changed while fill is set to False.
  - Issues a PerformanceWarning if both bins and binwidth are specified, warning of potential performance impacts.
- Customizable y-axis label:
  - Allows users to specify custom y-axis labels.
- Warning for KDE with Count:
  - Issues a warning if KDE is used with stat='count', as it may produce misleading plots.
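The checks listed under Improvements might look something like the sketch below. Several details are assumptions: the helper name check_kde_inputs is hypothetical, "invalid" log_scale_vars is taken to mean names that are not among the plotted columns, None stands for an unspecified bins or binwidth, and PerformanceWarning is defined locally as a stand-in because the changelog does not say where that warning class comes from.

```python
import warnings


class PerformanceWarning(UserWarning):
    """Local stand-in warning category (assumption, see lead-in)."""


def check_kde_inputs(columns, log_scale_vars=None, bins=None,
                     binwidth=None, stat="density", kde=True):
    """Raise or warn about problematic option combinations.

    Hypothetical helper for illustration; not part of EDA Toolkit.
    """
    # Assumed meaning of "invalid": names that are not plotted columns.
    invalid = [var for var in (log_scale_vars or []) if var not in columns]
    if invalid:
        raise ValueError(f"Invalid log_scale_vars: {invalid}")
    if bins is not None and binwidth is not None:
        warnings.warn(
            "Both bins and binwidth were specified; this may affect "
            "performance.",
            PerformanceWarning,
        )
    if kde and stat == "count":
        warnings.warn(
            "KDE combined with stat='count' may produce misleading plots.",
            UserWarning,
        )
```

Warnings are used for combinations that still produce a plot, while a ValueError is reserved for input that cannot be honoured at all.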
Updated add_ids to ensure unique IDs and add an index check
This pull request updates the add_ids() function to enhance its functionality by:
- Ensuring that each generated ID starts with a non-zero digit.
- Adding a check to verify that the DataFrame index is unique.
- Printing a warning message if duplicate index entries are found.
These changes improve the robustness of the function, ensuring that the IDs generated are always unique and valid, and provide necessary feedback when the DataFrame index is not unique.
Check for Unique Indices:
- Before generating IDs, the function now checks if the DataFrame index is unique.
- If duplicates are found, a warning is printed along with the list of duplicate index entries.
Generate Non-Zero Starting IDs:
- The ID generation process is updated to ensure that the first digit of each ID is always non-zero.
Ensure Unique IDs:
- A set is used to store the generated IDs, ensuring all IDs are unique before adding them to the DataFrame.
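The three steps above can be sketched with the standard library alone. This is not the add_ids() implementation: the helper names, the seed argument, and the use of plain lists in place of a DataFrame index are assumptions made to keep the example self-contained.

```python
import random


def find_duplicate_index(index_values):
    """Return index entries that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for value in index_values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def generate_unique_ids(n, num_digits=9, seed=None):
    """Generate n distinct numeric ID strings whose first digit is 1-9."""
    if n > 9 * 10 ** (num_digits - 1):
        raise ValueError("not enough distinct IDs for the requested num_digits")
    rng = random.Random(seed)  # seed is a hypothetical convenience argument
    ids = set()  # the set discards any repeated draw
    while len(ids) < n:
        first = rng.choice("123456789")  # never a leading zero
        rest = "".join(rng.choice("0123456789") for _ in range(num_digits - 1))
        ids.add(first + rest)
    return list(ids)


index = [0, 1, 1, 2]
duplicates = find_duplicate_index(index)
if duplicates:
    print(f"Warning: duplicate index entries found: {duplicates}")
new_ids = generate_unique_ids(len(index), seed=42)
```

Drawing the first digit separately avoids leading zeros without rejecting and redrawing whole IDs.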
Fix int conversion for numeric cols, reset decimal_places=0
This PR fixes the integer conversion issue for numeric columns when decimal_places=0 in the save_dataframes_to_excel function. Additionally, it resets decimal_places to 0 as the default value.
Changes include:
- Convert only numeric columns to integers when decimal_places=0.
- Reset decimal_places default value to 0.
This ensures correct formatting and avoids errors during conversion.
Contingency Table Updates
- Error Handling for Columns:
  - Added a check to ensure at least one column is specified.
  - Updated the function to accept a single column as a string or multiple columns as a list.
  - Updated the function to raise a ValueError if no columns are provided or if cols is not correctly specified.
- Function Parameters:
  - Changed the parameters from col1 and col2 to a single parameter cols which can be either a string or a list.
- Error Handling for SortBy:
  - Renamed SortBy to sort_by to standardize nomenclature.
  - Added a check to ensure sort_by is either 0 or 1.
  - Updated the function to raise a ValueError if sort_by is not 0 or 1.
- Sorting Logic:
  - Updated the sorting logic to handle the new cols parameter structure.
- Handling Categorical Data:
  - Modified the code to convert categorical columns to strings to avoid issues with fillna("").
- Handling Missing Values:
  - Added df = df.fillna('') to fill NA values within the function to account for missing data.
- Improved Function Documentation:
  - Updated the function documentation to reflect the new parameters and error handling.
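The argument handling described for cols and sort_by could be approximated as below. The helper name normalize_contingency_args and the default sort_by=0 are assumptions for illustration; only the accepted values and the ValueError come from the entry above.

```python
def normalize_contingency_args(cols=None, sort_by=0):
    """Return cols as a non-empty list and a validated sort_by flag.

    Hypothetical helper for illustration; not part of EDA Toolkit.
    """
    # Accept a single column name as shorthand for a one-element list.
    if isinstance(cols, str):
        cols = [cols] if cols else []
    if not isinstance(cols, list) or not cols:
        raise ValueError(
            "cols must be a column name or a non-empty list of column names"
        )
    if sort_by not in (0, 1):
        raise ValueError("sort_by must be 0 or 1")
    return cols, sort_by
```

Normalizing to a list early means the rest of the function only has to handle one shape of input.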
[0.0.1b0] - 2024-07-29
Contingency Table Updates
- Added fillna('') to the output so that null values come through, removed the 'All' column name from the output, added sort options 0 and 1, and updated the docstring documentation. Tested successfully on Python 3.7.3.
Updated datetime imports to accommodate different Python versions
Compatibility Enhancement:
- Added a version check for Python 3.7 and above.
  - Conditional import of datetime to handle different Python versions.
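    import sys

    # Python 3.7+: import the datetime class directly
    if sys.version_info >= (3, 7):
        from datetime import datetime
    # Older versions: import the module and call datetime.datetime.now()
    else:
        import datetime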
- In dataframe_columns():
    start_time = (datetime.now() if sys.version_info >= (3, 7)
                  else datetime.datetime.now())
    stop_time = (datetime.now() if sys.version_info >= (3, 7)
                 else datetime.datetime.now())
Changed set_as_index to False as default for add_ids()
- Parameter Customization:
  - Updated parameter set_as_index to allow users to choose whether to set the new ID column as the index. Defaults to False.
Enhance add_ids function: Parameter customization and improved default column name
- Parameter Customization:
  - Added a new parameter set_as_index to allow users to choose whether to set the new ID column as the index. Defaults to True.
Added num_digits, changed id_colname in add_ids
- Parameter Customization:
  - Introduced a new parameter num_digits to allow users to specify the number of digits for the unique IDs.
- Improved Default Column Name:
  - Changed the default column name from Patient_ID to ID to be more generic and widely understood.
  - Updated the parameter name from column_name to id_colname for better clarity and consistency.
[0.0.1b0] - 2024-07-28
Refactored metrics_box_violin()
- Parameter Renaming and Addition:
  - Added rotate_plot parameter to allow pivoting plots.
  - Added save_plots parameter to control saving options with values "all", "individual", and "grid".
  - Added show_plot parameter to control plot display with values "individual", "grid", and "both".
  - Added grid_figsize parameter to control the figure size of grid plots.
  - Added xlabel_rot to allow rotating of the x-axis text.
    - Renamed save_individual to save_plots.
    - Renamed x_label_rotation to xlabel_rot.
- Parameter Validation:
  - Added validation to ensure show_plot is one of "individual", "grid", or "both".
  - Added validation to ensure save_plots is one of None, "all", "individual", or "grid".
  - Added validation to ensure image_path_png or image_path_svg is provided when save_plots is set.
  - Added validation to ensure rotate_plot is a Boolean value.
  - Added validation to ensure individual_figsize is a tuple or list of two numbers.
  - Added validation to ensure grid_figsize is a tuple or list of two numbers if specified.
- Default Values:
  - Added default values for individual_figsize and grid_figsize.
  - Set default grid figure size to (5 * n_cols, 5 * n_rows) if grid_figsize is not specified.
- Plot Display and Saving:
  - Adjusted code to handle new parameters for plot display and saving.
  - Implemented logic to pivot plots based on rotate_plot parameter.
  - Used xlabel_rot to rotate x-axis labels.
  - Incorporated custom figure sizes for individual and grid plots.
- File Saving:
  - Updated file naming to include plot_type in filenames.
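The validation rules and the default grid size listed above can be illustrated with a small sketch. It is not the function's actual code: the helper names, the use of ValueError throughout, and the (6, 4) default for individual_figsize are assumptions; the accepted option values and the (5 * n_cols, 5 * n_rows) fallback come from the entry itself.

```python
from numbers import Real


def _is_size_pair(value):
    """True for a tuple or list of exactly two real numbers (bools excluded)."""
    return (
        isinstance(value, (tuple, list))
        and len(value) == 2
        and all(isinstance(v, Real) and not isinstance(v, bool) for v in value)
    )


def validate_plot_options(
    show_plot,
    save_plots=None,
    image_path_png=None,
    image_path_svg=None,
    rotate_plot=False,
    individual_figsize=(6, 4),  # assumed default, not taken from the library
    grid_figsize=None,
    n_rows=1,
    n_cols=1,
):
    """Check the options and return the grid figure size that would be used."""
    if show_plot not in ("individual", "grid", "both"):
        raise ValueError('show_plot must be "individual", "grid", or "both"')
    if save_plots not in (None, "all", "individual", "grid"):
        raise ValueError('save_plots must be None, "all", "individual", or "grid"')
    if save_plots is not None and not (image_path_png or image_path_svg):
        raise ValueError("save_plots requires image_path_png or image_path_svg")
    if not isinstance(rotate_plot, bool):
        raise ValueError("rotate_plot must be a Boolean value")
    if not _is_size_pair(individual_figsize):
        raise ValueError("individual_figsize must be a tuple or list of two numbers")
    if grid_figsize is None:
        # Fallback described in the changelog: 5 units per grid cell.
        return (5 * n_cols, 5 * n_rows)
    if not _is_size_pair(grid_figsize):
        raise ValueError("grid_figsize must be a tuple or list of two numbers")
    return tuple(grid_figsize)
```

For example, a 2-row by 3-column grid with no grid_figsize given resolves to a figure size of (15, 10).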
[0.0.1b0] - 2024-07-24
- First working version of EDA Toolkit.
- Included the following functions:
  - ensure_directory: Ensures the existence of a directory, creating it if necessary.
  - add_ids: Adds a column of unique 9-digit IDs to the dataframe.
  - strip_trailing_period: Removes trailing periods from floats in a specified column.
  - parse_date_with_rule: Parses and standardizes date strings based on day/month/year or month/day/year formats.
  - data_types: Provides a report on data types, null values, and their percentages for dataframe columns.
  - dataframe_columns: Analyzes dataframe columns for data type, nulls, and unique value counts.
  - summarize_all_combinations: Generates summary tables for all possible combinations of specified variables.
  - save_dataframes_to_excel: Saves multiple DataFrames to separate sheets in an Excel file with customized formatting.
  - contingency_table: Creates contingency tables with sorting options.
  - highlight_columns: Highlights specific columns in a DataFrame with a specified background color.
  - kde_distributions: Generates KDE or histogram distribution plots for specified columns.
  - stacked_crosstab_plot: Generates stacked bar plots and crosstabs for specified columns.
  - plot_filtered_dataframes: Filters dataframes based on conditions and generates plots and crosstabs.
  - metrics_box_violin: Creates and saves individual boxplots or violin plots for given metrics and comparisons.
[0.0.1a0] - 2024-07-24
- Initial alpha release of EDA Toolkit.
- Note: This release was mistakenly published without placing the __init__.py and main.py files inside a subdirectory. As a result, pip install succeeded but imports failed. This issue will be addressed in subsequent releases.