This Jupyter notebook contains the code and analysis for an Exploratory Data Analysis (EDA) of a student dataset. The dataset includes the following variables:
- Name: name of the student
- Age: age of the student
- Gender: gender of the student
- Hours_Studied: number of hours studied
- Physics_Marks: marks obtained in Physics
- Chemistry_Marks: marks obtained in Chemistry
- Maths_Marks: marks obtained in Mathematics
- Has_Part_Time_Job: whether the student has a part-time job
Let's start by importing the required libraries and reading the dataset into a Pandas DataFrame.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('student_ml_dataset.csv')
Next, we'll perform any necessary data cleaning and preprocessing, such as handling missing values and removing outliers. We start by counting the missing values in each column.
print(df.isnull().sum())
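The cleaning steps mentioned above (handling missing values and removing outliers) are not shown in the notebook. A minimal sketch of one possible approach follows, assuming median imputation and the 1.5 × IQR rule. Both are illustrative choices rather than the notebook's actual method, and the `sample` frame is a made-up stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the student dataset; the values are invented for illustration.
sample = pd.DataFrame({
    "Hours_Studied": [2.0, 4.0, np.nan, 5.0, 3.0, 40.0],
    "Maths_Marks": [55.0, 70.0, 65.0, np.nan, 60.0, 72.0],
})

# Fill missing numeric values with the column median (an assumed strategy).
cleaned = sample.fillna(sample.median(numeric_only=True))

# Keep only rows whose Hours_Studied lies within 1.5 * IQR of the quartiles.
q1 = cleaned["Hours_Studied"].quantile(0.25)
q3 = cleaned["Hours_Studied"].quantile(0.75)
iqr = q3 - q1
in_range = cleaned["Hours_Studied"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[in_range].reset_index(drop=True)

print(cleaned)
```

Whether to impute or drop rows, and which columns to screen for outliers, depends on the data; the median and IQR rule are simply common defaults.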
- Univariate Analysis
In this section, we'll analyze individual variables to uncover their distributions, patterns, and outliers. We can use techniques such as summary statistics, histograms, bar graphs, scatter plots, and box plots.
print(df.describe())
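# df.hist draws one subplot per numeric column, so label every subplot
# and put the title on the figure rather than on the last subplot only.
axes = df.hist(figsize=(10, 10), bins=50, color='red', alpha=0.6, rwidth=0.9)
for ax in axes.ravel():
    ax.set_xlabel('Values', fontsize=10, fontweight='bold')
    ax.set_ylabel('Frequency', fontsize=10, fontweight='bold')
plt.suptitle('Histogram', fontsize=15, fontweight='bold')
plt.show()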
# Use the DataFrame loaded above (df) and the column name as listed (Maths_Marks).
sns.boxplot(data=df, x='Gender', y='Maths_Marks')
plt.xlabel('Gender')
plt.ylabel('Marks')
plt.title('Maths Marks Distribution by Gender')
plt.show()
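Bar graphs are named among the techniques for this section but are not demonstrated. A minimal sketch for a categorical column such as Has_Part_Time_Job might look like the following. The `sample` frame and its "Yes"/"No" values are assumptions; the real column may be encoded differently.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample; in the notebook, df from the CSV would be used instead.
sample = pd.DataFrame({"Has_Part_Time_Job": ["Yes", "No", "No", "Yes", "No"]})

# Count how many students fall into each category.
job_counts = sample["Has_Part_Time_Job"].value_counts()

# Draw one bar per category.
ax = job_counts.plot(kind="bar", rot=0)
ax.set_xlabel("Has part-time job")
ax.set_ylabel("Number of students")
ax.set_title("Students by Part-Time Job Status")
plt.show()
```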
- Bivariate Analysis
Next, we'll explore the relationships between pairs of variables. We can use techniques such as scatter plots and correlation analysis to understand how one variable changes with another.
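# Note: 'IQ' does not appear in the variable list at the top of this notebook;
# this cell assumes the CSV contains such a column.
sns.scatterplot(data=df, x='IQ', y='Physics_Marks')
plt.xlabel('IQ')
plt.ylabel('Physics Marks')
plt.title('Relationship between IQ and Physics Marks')
plt.show()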
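Correlation analysis is mentioned in this section but not shown. As a rough sketch, pairwise Pearson correlations can be computed with `DataFrame.corr`. The `sample` frame below is invented for illustration and only borrows column names from the variable list.

```python
import pandas as pd

# Hypothetical sample using column names from the variable list; the values are made up.
sample = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Physics_Marks": [52, 58, 61, 70, 74],
    "Chemistry_Marks": [60, 55, 66, 63, 71],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr(method="pearson")
print(corr.round(2))
```

Passing the resulting matrix to `sns.heatmap(corr, annot=True)` would display it as an annotated heatmap.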
- Multivariate Analysis
Lastly, we'll explore the interactions and patterns among several variables at once. Visualizations such as a scatter matrix or a parallel coordinates plot help reveal these more complex relationships.
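# Column names follow the variable list at the top; 'IQ' is not in that list
# and is assumed to exist in the CSV.
pd.plotting.scatter_matrix(
    df[['Physics_Marks', 'Chemistry_Marks', 'Maths_Marks', 'IQ']],
    figsize=(10, 10),
)
plt.show()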
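The parallel coordinates plot mentioned above is not demonstrated. A minimal sketch using `pd.plotting.parallel_coordinates` is given below. The `sample` frame, its "F"/"M" labels, and the choice of Gender as the grouping column are all assumptions made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample; the values and the "F"/"M" labels are invented.
sample = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Physics_Marks": [72, 65, 80, 58],
    "Chemistry_Marks": [68, 70, 77, 61],
    "Maths_Marks": [75, 62, 85, 66],
})

# One line per student across the three subject axes, coloured by Gender.
ax = pd.plotting.parallel_coordinates(sample, class_column="Gender", colormap="viridis")
ax.set_ylabel("Marks")
ax.set_title("Marks per Subject by Gender")
plt.show()
```

This works best when the plotted columns share a scale, as the three marks columns are assumed to here; columns on very different scales would usually be normalized first.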