This repository hosts a data science project that analyzes passenger survival aboard the RMS Titanic. The analysis explores the factors associated with survival and uses logistic regression to predict outcomes. The dataset used in this project is sourced from Kaggle.
- Objective: Identify missing data in the dataset.
- Method: Used a heatmap to visualize missing values.
- Plot:
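
A missing-value heatmap like the one described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the small DataFrame is made-up toy data, and the column names (`Survived`, `Age`, `Cabin`) are assumed to match the Kaggle file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots in a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for the Kaggle data: column names are assumed, values are made up.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, "E46"],
})

# Boolean mask that is True wherever a value is missing.
missing_mask = df.isnull()

# Per-column count of missing values, a numeric companion to the plot.
missing_counts = missing_mask.sum()

# Each highlighted cell in the heatmap marks one missing value.
ax = sns.heatmap(missing_mask, cbar=False, yticklabels=False, cmap="viridis")
ax.set_title("Missing values by column")
plt.close()
```

With the real dataset, the toy DataFrame would be replaced by the loaded CSV, and `plt.show()` or `plt.savefig(...)` would take the place of `plt.close()`.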
- Objective: Explore the survival count based on gender.
- Method: Generated a count plot comparing survival counts between genders.
- Plot:
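
One way to produce a survival count plot split by gender is sketched below. The data is a made-up toy sample, and the `Survived` and `Sex` column names are assumptions based on the Kaggle file layout.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots in a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the Kaggle data: column names are assumed, values are made up.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "male"],
})

# One bar per (Survived, Sex) combination.
ax = sns.countplot(data=df, x="Survived", hue="Sex")
ax.set_title("Survival count by gender")
plt.close()

# The same counts as a table, which is easier to check than bar heights.
counts = pd.crosstab(df["Survived"], df["Sex"])
```

The crosstab holds the numbers behind the bars, so it is a convenient way to sanity-check what the plot shows.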
- Objective: Analyze how passenger class affects survival rates.
- Method: Generated a count plot showing survival counts for each passenger class.
- Plot:
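
The class comparison can be sketched in the same way. Because a count plot shows raw counts rather than rates, this example also computes the survival rate per class. The data is made up, and the `Pclass` and `Survived` column names are assumed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots in a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the Kaggle data: column names are assumed, values are made up.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Bars grouped by class and colored by survival outcome.
ax = sns.countplot(data=df, x="Pclass", hue="Survived")
ax.set_title("Survival count by passenger class")
plt.close()

# Mean of a 0/1 column is the share of survivors in each class.
survival_rate = df.groupby("Pclass")["Survived"].mean()
```

Reading the rate alongside the counts avoids mistaking a large class for a class with a high survival rate.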
- Objective: Observe the age distribution of the passengers.
- Method: Created a distribution plot for the age variable.
- Plot:
- Objective: Predict survival from variables such as age, sex, and passenger class.
- Method: Trained a logistic regression model on the preprocessed data.
- Results: The model achieved an accuracy of about 0.798 (0.797752808988764). A confusion matrix documents the evaluation in more detail.
- Improvement: I plan to return to this project and implement a Gradient Boosting model. Gradient Boosting is an ensemble technique that combines many weak learners into a stronger predictive model, which may improve accuracy further.
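
The modeling step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the data is randomly generated, the labeling rule is invented so the toy labels are learnable, and the feature set, split ratio, and `max_iter` value are all arbitrary choices. It will not reproduce the accuracy reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data: column names are assumed.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 70, size=n),
})

# Invented rule plus noise, purely so the toy labels carry some signal.
signal = (df["Sex"] == "female").astype(int) * 2 - df["Pclass"] * 0.5
df["Survived"] = (signal + rng.normal(0, 1, size=n) > 0).astype(int)

# Encode the categorical Sex column as a single 0/1 indicator.
X = pd.get_dummies(df[["Pclass", "Sex", "Age"]], columns=["Sex"], drop_first=True, dtype=int)
y = df["Survived"]

# Hold out part of the data for evaluation (ratio chosen arbitrarily here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy is the share of correct predictions; the confusion matrix breaks
# those predictions down by actual class (rows) and predicted class (columns).
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
```

For the planned improvement, scikit-learn's `GradientBoostingClassifier` (in `sklearn.ensemble`) exposes the same `fit` and `predict` methods, so it could be swapped in for `LogisticRegression` without changing the rest of this sketch.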
Ensure you have Python installed, then set up a virtual environment and install the required packages:
pip install pandas numpy matplotlib seaborn scikit-learn
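
After installation, a quick check like the one below can confirm that the packages are importable from the active environment. The helper name `missing_packages` is hypothetical and introduced only for this sketch. Note that `scikit-learn` is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the packages in the pip command above.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("All packages found" if not missing else f"Missing: {missing}")
```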