The DualDataDistributionAnalysis repository contains a Python script that analyzes and visualizes the distributions of two datasets, labeled 'Data 1' and 'Data 2'. The script uses pandas for data manipulation and Matplotlib and Seaborn for plotting, and produces a series of plots that summarize the statistical properties of each dataset.
The repository includes code to generate:
- Boxplots: These show the median, quartiles, and outliers for each dataset, providing a quick visual summary of the distributions.
- Violin Plots: These combine a box plot with a kernel density estimate, showing how densely the data is concentrated at each value.
- Histograms: These illustrate the frequency distributions of the datasets, allowing for the observation of data groupings and patterns.
- Kernel Density Estimation (KDE) Plots: These estimate each dataset's probability density as a smooth, continuous curve, in effect a smoothed histogram.
- Cumulative Distribution Function (CDF) Plots: These show, for each value, the proportion of data points at or below it.
- Swarm Plots: These plot each individual data point and are useful for showing data clustering and spotting outliers without any data binning.
Each visualization is saved as an SVG file, a vector format that scales without loss of quality, and all plots are displayed inline for immediate review.
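Saving a figure as SVG can be done with Matplotlib's `Figure.savefig`. The sketch below is illustrative only: the `save_svg` helper, the `histogram_example` file name, and the temporary output directory are assumptions, and the repository's script may name and place its files differently.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt


def save_svg(fig, name, out_dir):
    """Write `fig` to <out_dir>/<name>.svg and return the path (hypothetical helper)."""
    path = Path(out_dir) / f"{name}.svg"
    # bbox_inches="tight" trims surrounding whitespace from the saved file.
    fig.savefig(path, format="svg", bbox_inches="tight")
    return path


# Small stand-in figure (assumption): any Matplotlib or Seaborn figure works here.
fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3], bins=3)
ax.set_title("Histogram")

out_dir = tempfile.mkdtemp()  # temporary directory used only for this example
svg_path = save_svg(fig, "histogram_example", out_dir)

# In a Jupyter notebook the figure also renders inline at the end of the cell;
# in a plain script, plt.show() opens it in a window instead.
```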
This suite of visualizations supports exploratory data analysis (EDA), quality control, and comparison of data from two different conditions or sources. It is intended for data analysts, scientists, and statisticians who need to understand and present their data distributions.