MONATSZAHL - MONTH NUMBER: Category
AUSPRAEGUNG - SPECIFICATION: Accident Type
JAHR - YEAR
MONAT - MONTH
WERT - VALUE: Number of Accidents
NOTE: Some rows in MONAT contain the value Summe, which denotes the total number of accidents across all months of that year.
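A minimal sketch of how the yearly total rows could be separated from the monthly rows before any analysis, assuming the data is loaded into a pandas DataFrame with the column names listed above. The helper name `drop_yearly_totals` and the sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd


def drop_yearly_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the monthly rows, i.e. rows whose MONAT is not 'Summe'."""
    is_total = df["MONAT"].astype(str) == "Summe"
    return df[~is_total].copy()


# Made-up rows for illustration; the month labels are not taken from the real dataset.
sample = pd.DataFrame(
    {
        "JAHR": [2020, 2020, 2020],
        "MONAT": ["01", "02", "Summe"],
        "WERT": [10, 12, 22],
    }
)
monthly = drop_yearly_totals(sample)
print(monthly)
```

Keeping the totals out of the monthly series avoids counting each year twice when aggregating or plotting.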
Alkoholunfälle - Alcohol Accident
Verkehrsunfälle - Traffic Accident
Fluchtunfälle - Escape Accident
insgesamt - total
Verletzte und Getötete - Injured and Killed
mit Personenschäden - with Personal injury
NOTE: The number of accidents for insgesamt is not equal to the sum of the number of accidents of the other 2 categories.
Inferences from Box Plot:-
- The spread of the distribution differs for every Category (increasing order: Alcohol < Escape < Traffic).
- The means increase in the same order.
Inferences from Strip Plot:-
- Escape and Traffic Accidents form 2 clusters.
Need to look in detail at which factor/variable is causing this clustering.
Inferences:-
- The spread of the distribution is similar for Injured & Killed and Personal Injury. The Total distribution shows more variation.
- Increasing Order of Means: Injured & Killed < Personal Injury < Total
Inferences:-
- Total accident-type has visibly separate clusters denoting each Category.
- For Injured & Killed, the Alcohol and Escape Accident clusters overlap.
- Personal Injury has only one cluster: the Traffic Accident Category.
Inference:-
- Stable Increasing trend for more than half a year (first half)
- Irregular decreasing trend after July (7)
A regression plot that avoids overfitting might help here.
Inferences:-
- Regression Plot of order 2 shows a parabolic shape peaking around July as we expected.
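The order-2 regression above can be illustrated with a plain quadratic fit. This is only a sketch: the monthly values below are invented for illustration and are not the project's data, and `numpy.polyfit` stands in for whatever plotting routine produced the original regression plot.

```python
import numpy as np

months = np.arange(1, 13)
# Hypothetical monthly accident counts, made up for illustration only.
values = np.array([30, 33, 38, 42, 47, 50, 52, 49, 45, 40, 35, 31], dtype=float)

# Fit value = a * month**2 + b * month + c (a polynomial of order 2).
a, b, c = np.polyfit(months, values, deg=2)

# For a downward-opening parabola (a < 0) the vertex is the fitted peak.
peak_month = -b / (2 * a)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}, peak near month {peak_month:.1f}")
```

A low polynomial order keeps the curve smooth, which is one simple way to limit overfitting to month-to-month noise.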
NOTE: Only putting one plot here for reference
NOTE: Putting output for only one Time-Series for reference (All of them had the same inference)
Inferences:-
- None of the Time-Series are stationary, since the p-value is > 0.05.
- All of them become stationary after differencing.
Now we can check for cointegrating relationships. (Note: a set of non-stationary series is said to be cointegrated if there exists at least one linear combination of them that is stationary.)
Inferences:-
- Since trace statistic > critical value for all rows in the summary, we can reject the null hypothesis.
Thus, cointegration relationships exist.
NOTE: Putting output for only one Time-Series for reference (All of them had the same inference)
Inferences:-
- ACF tails off, PACF cuts off.
- On average, the Auto-Regression model should use a window of roughly 10-15 lags.
Beyond that range, the correlation keeps weakening as the lag increases.
NOTE: Putting output for only one Time-Series for reference
Inferences:-
- Trend with Alcohol Accidents is generally increasing.
- Trend for all other types of accidents is first increasing, then decreasing and finally increasing again.
For training, used data till 2020 (included)
For testing, used 2021 data.
Root Mean Square Error used as evaluation metric.
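A minimal sketch of this split and metric, assuming a pandas DataFrame with the JAHR and WERT columns described earlier. The function names `split_by_year` and `rmse` are hypothetical, and the sample rows are invented.

```python
import numpy as np
import pandas as pd


def split_by_year(df: pd.DataFrame, last_train_year: int = 2020, test_year: int = 2021):
    """Train on all rows up to and including last_train_year, test on test_year."""
    train = df[df["JAHR"] <= last_train_year]
    test = df[df["JAHR"] == test_year]
    return train, test


def rmse(y_true, y_pred) -> float:
    """Root mean square error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up rows for illustration only.
frame = pd.DataFrame({"JAHR": [2019, 2020, 2021, 2021], "WERT": [5, 7, 6, 8]})
train, test = split_by_year(frame)
print(len(train), len(test), rmse(test["WERT"], [6, 6]))
```

Splitting by year rather than at random keeps the test period strictly after the training period, which matters for time-series evaluation.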
Used the auto_arima function to get the p, d and q values.
NOTE: Putting output for only one Time-Series for reference
Then adjusted the parameters based on the ACF and PACF plot results.
RMSE Values on Testing Data
MONATSZAHL | AUSPRAEGUNG | RMSE |
---|---|---|
Alkoholunfälle | insgesamt | 10.749665459844213 |
Alkoholunfälle | Verletzte und Getötete | 8.55673846281317 |
Fluchtunfälle | insgesamt | 119.85712032903943 |
Fluchtunfälle | Verletzte und Getötete | 16.330702906031284 |
Verkehrsunfälle | insgesamt | 364.02320587667765 |
Verkehrsunfälle | mit Personenschäden | 63.00209713497466 |
Verkehrsunfälle | Verletzte und Getötete | 86.55810641566794 |
Each prediction is based on the previous 15 years of data.
Epochs set to 20.
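Basing each prediction on a fixed stretch of history amounts to turning the series into (window, next value) pairs. The sketch below shows that windowing step only. It assumes the look-back is counted in consecutive time steps, which the notes do not spell out, and `make_windows` is a hypothetical helper rather than the project's own code.

```python
import numpy as np


def make_windows(series, lookback: int = 15):
    """Build inputs of `lookback` consecutive values and the value that follows each."""
    series = np.asarray(series, dtype=float)
    if len(series) <= lookback:
        raise ValueError("series must be longer than the look-back window")
    X = np.stack([series[i : i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y


# Toy series 0..19 so the windows are easy to check by eye.
X, y = make_windows(np.arange(20), lookback=15)
print(X.shape, y)
```

Each row of `X` is one input window and the matching entry of `y` is the value the model is asked to predict.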
NOTE: Putting output plot for only one Time-Series for reference
RMSE Values on Testing Data
MONATSZAHL | AUSPRAEGUNG | RMSE |
---|---|---|
Alkoholunfälle | insgesamt | 11.485624201168354 |
Alkoholunfälle | Verletzte und Getötete | 8.362521074095543 |
Fluchtunfälle | insgesamt | 189.65707940004074 |
Fluchtunfälle | Verletzte und Getötete | 16.14229145295978 |
Verkehrsunfälle | insgesamt | 573.9587146559661 |
Verkehrsunfälle | mit Personenschäden | 158.0322659153656 |
Verkehrsunfälle | Verletzte und Getötete | 99.53329218256376 |
Each prediction is based on the previous 5, 10 or 15 years of data.
NOTE: Putting output plot for only one Time-Series for reference
RMSE Values on Testing Data
MONATSZAHL | AUSPRAEGUNG | RMSE |
---|---|---|
Alkoholunfälle | insgesamt | 9.522 |
Alkoholunfälle | Verletzte und Getötete | 7.308 |
Fluchtunfälle | insgesamt | 112.555 |
Fluchtunfälle | Verletzte und Getötete | 13.784 |
Verkehrsunfälle | insgesamt | 501.732 |
Verkehrsunfälle | mit Personenschäden | 54.346 |
Verkehrsunfälle | Verletzte und Getötete | 76.001 |
Tried to create a joint model for all the time-series. However, the results were extremely poor.
My hypothesis is that more data would be needed to train this model well.
RMSE for combined model: 1280.9237133896029
Individual Output
After looking at the RMSE scores of the models on the test data as well as the prediction plots vs the expected plots, I decided to use XGBoost as my final model for prediction.