DPS AI Challenge: Accident Value Prediction

Translations

Sample Data

Columns

MONATSZAHL - MONTH NUMBER: Category
AUSPRAEGUNG - SPECIFICATION: Accident Type
JAHR - YEAR
MONAT - MONTH
WERT - VALUE: Number of Accidents

NOTE: Some rows in MONAT contain the value Summe, which denotes the total number of accidents across all months of that year.
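Because the Summe rows are yearly aggregates, they would presumably need to be filtered out before building monthly time series. A minimal sketch of that step, using plain Python dictionaries; the sample rows, the MONAT month format, and the helper name `drop_yearly_totals` are assumptions made for illustration and are not taken from the project code:

```python
# Hypothetical sample rows; the WERT values and the MONAT month format
# are made up for illustration and do not come from the real dataset.
rows = [
    {"MONATSZAHL": "Alkoholunfälle", "AUSPRAEGUNG": "insgesamt",
     "JAHR": 2020, "MONAT": "Summe", "WERT": 30.0},
    {"MONATSZAHL": "Alkoholunfälle", "AUSPRAEGUNG": "insgesamt",
     "JAHR": 2020, "MONAT": "202001", "WERT": 10.0},
    {"MONATSZAHL": "Alkoholunfälle", "AUSPRAEGUNG": "insgesamt",
     "JAHR": 2020, "MONAT": "202002", "WERT": 20.0},
]


def drop_yearly_totals(rows):
    """Return only the monthly rows, skipping the yearly 'Summe' aggregates."""
    return [row for row in rows if row["MONAT"] != "Summe"]


monthly_rows = drop_yearly_totals(rows)
```

With a pandas DataFrame the same filter would be a boolean mask on the MONAT column.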

Categorical Variables

MONATSZAHL

Alkoholunfälle - Alcohol Accident
Verkehrsunfälle - Traffic Accident
Fluchtunfälle - Escape (Hit-and-Run) Accident

AUSPRAEGUNG

insgesamt - total
Verletzte und Getötete - Injured and Killed
mit Personenschäden - with Personal injury

NOTE: The number of accidents for insgesamt is not equal to the sum of the accident counts of the other two categories.

Exploratory Data Analysis

Category-wise Count Distribution

Correlation Heatmap

Category-wise WERT Distribution

MONATSZAHL

Inferences from Box Plot:-

The spread of the distribution differs for every Category (increasing order: Alcohol < Escape < Traffic)
The means increase in the same order

Inferences from Strip Plot:-

Escape and Traffic Accidents form 2 clusters.
Need to investigate which factor/variable is causing this clustering

AUSPRAEGUNG

Inferences:-

The spread is similar for Injured & Killed and Personal Injury; Total shows more variation.
Increasing Order of Means: Injured & Killed < Personal Injury < Total

Inferences:-

Total accident-type has visibly separate clusters denoting each Category.
For Injured & Killed, the Alcohol and Escape Accident clusters overlap.
Personal Injury has only one cluster: Traffic Accident Category.

Time-Series Analysis

Month-wise Distribution

Inference:-

Stable increasing trend for a little over the first half of the year
Irregular decreasing trend after July (month 7)
A low-order regression plot might help here, since it summarizes the overall shape without overfitting

Inferences:-

Regression Plot of order 2 shows a parabolic shape peaking around July as we expected.
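To make the order-2 regression concrete, here is a small sketch that fits y = a*x^2 + b*x + c by least squares and reads the peak month off the fitted parabola. The monthly values are synthetic (an exact parabola peaking at month 7), and the function name `fit_quadratic` is hypothetical; the original plot was most likely produced with a plotting library rather than code like this.

```python
def det3(q):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
            - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
            + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    # s[k] = sum of x^k, t[k] = sum of y * x^k
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    matrix = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    denom = det3(matrix)
    coeffs = []
    # Cramer's rule: replace one column at a time with the right-hand side.
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / denom)
    return tuple(coeffs)


# Synthetic data (NOT from the dataset): an exact parabola peaking in July.
months = list(range(1, 13))
values = [-(m - 7) ** 2 + 100 for m in months]

a, b, c = fit_quadratic(months, values)
peak_month = -b / (2 * a)  # vertex of the fitted parabola
```

A negative `a` means the fitted curve opens downward, so the vertex is a maximum.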

Time-Series Line plot for every Category & Accident-Type pair

NOTE: Only putting one plot here for reference

Check if the Time-Series are stationary (Using Augmented Dickey-Fuller Test)

NOTE: Putting output for only one Time-Series for reference (All of them had the same inference)

Inferences:-

None of the Time-Series are stationary, since the p-value is > 0.05 (the null hypothesis of a unit root cannot be rejected)
They all have stationary differences
Now we can check for cointegrating relationships (Note: a set of non-stationary series is said to be cointegrated if there exists at least one linear combination of them that is stationary)
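The differencing mentioned above can be sketched in a few lines. This only illustrates the transformation; the stationarity test itself would typically come from a library (for example the `adfuller` function in statsmodels), and the input series here is synthetic.

```python
def difference(series, lag=1):
    """Return the lag-`lag` differences: series[i] - series[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Synthetic series with a steady upward trend (not stationary in the mean).
trending = [10, 12, 14, 16, 18]

# After first differencing, the trend is gone and the values are constant.
first_diff = difference(trending)
```

The differenced series is one element shorter than the original for `lag=1`, which matters when aligning it with dates.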

Check for Cointegration (Using Johansen Test)

Inferences:-

Since trace statistic > critical value for all rows in the summary, we can reject the null hypothesis.
Thus, cointegration relationships exist.

Auto-Correlation and Partial Auto-Correlation Function Plots

NOTE: Putting output for only one Time-Series for reference (All of them had the same inference)

Inferences:-

ACF tails off, PACF cuts off.
On average, our Auto-Regression model should use a window of roughly 10-15 lags.
Beyond that range, the correlation keeps weakening as the lags increase.
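For readers who want to see what the ACF plot measures, the sketch below computes the sample autocorrelation at each lag by hand. The series is synthetic with a 12-step period, chosen only so that the seasonal lag stands out; the helper name `autocorrelation` is hypothetical.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (inclusive)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i - lag] for i in range(lag, n)) / denom
        for lag in range(max_lag + 1)
    ]


# Synthetic seasonal series (NOT the accident data): a repeating 12-step ramp.
seasonal = [i % 12 for i in range(48)]
acf = autocorrelation(seasonal, max_lag=12)
```

Here the value at lag 12 is strongly positive while the value at lag 6 is negative, which is the kind of pattern one looks for when choosing how many lags to feed a model.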

Trend & Seasonality Check

NOTE: Putting output for only one Time-Series for reference

Inferences:-

Trend with Alcohol Accidents is generally increasing.
Trend for all other types of accidents is first increasing, then decreasing and finally increasing again.

Modelling

For training, used data up to and including 2020.
For testing, used 2021 data.
Root Mean Square Error used as evaluation metric.
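The split and the metric described above can be sketched as follows. The function names, the default year arguments, and the row layout (dictionaries with a JAHR key) are assumptions for illustration; the actual project code may organize this differently.

```python
import math


def split_by_year(rows, last_train_year=2020, test_year=2021):
    """Split rows into train (<= last_train_year) and test (== test_year)."""
    train = [row for row in rows if row["JAHR"] <= last_train_year]
    test = [row for row in rows if row["JAHR"] == test_year]
    return train, test


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical rows, only to show the split; WERT values are made up.
example_rows = [
    {"JAHR": 2019, "WERT": 1.0},
    {"JAHR": 2020, "WERT": 2.0},
    {"JAHR": 2021, "WERT": 3.0},
]
train_rows, test_rows = split_by_year(example_rows)
example_error = rmse([0.0, 0.0], [3.0, 4.0])  # sqrt((9 + 16) / 2)
```

Because RMSE is in the same units as WERT, scores are only comparable within one series, not across series with very different accident counts.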

ARIMA

Used auto_arima function to get p, d and q values.
NOTE: Putting output for only one Time-Series for reference

Then, changed parameters on the basis of the ACF and PACF plot results.

RMSE Values on Testing Data

MONATSZAHL	AUSPRAEGUNG	RMSE
Alkoholunfälle	insgesamt	10.749665459844213
Alkoholunfälle	Verletzte und Getötete	8.55673846281317
Fluchtunfälle	insgesamt	119.85712032903943
Fluchtunfälle	Verletzte und Getötete	16.330702906031284
Verkehrsunfälle	insgesamt	364.02320587667765
Verkehrsunfälle	mit Personenschäden	63.00209713497466
Verkehrsunfälle	Verletzte und Getötete	86.55810641566794

LSTM

Each prediction is based on the previous 15 years of data.
Epochs set to 20.
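Predicting from a fixed span of past values usually means turning the series into (window, next value) pairs before training. A minimal sketch of that windowing step, assuming a plain list as input; the helper name `make_windows` is hypothetical, and the tiny window here is only for readability.

```python
def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value after it."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets


# Toy series, only to show the shape of the output.
toy_series = [1, 2, 3, 4, 5]
toy_inputs, toy_targets = make_windows(toy_series, window=3)
```

A series of length n yields n - window training pairs, so long windows on short series leave few examples to learn from.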

NOTE: Putting output plot for only one Time-Series for reference

RMSE Values on Testing Data

MONATSZAHL	AUSPRAEGUNG	RMSE
Alkoholunfälle	insgesamt	11.485624201168354
Alkoholunfälle	Verletzte und Getötete	8.362521074095543
Fluchtunfälle	insgesamt	189.65707940004074
Fluchtunfälle	Verletzte und Getötete	16.14229145295978
Verkehrsunfälle	insgesamt	573.9587146559661
Verkehrsunfälle	mit Personenschäden	158.0322659153656
Verkehrsunfälle	Verletzte und Getötete	99.53329218256376

XGBoost

Each prediction is based on the previous 5, 10 or 15 years of data.

NOTE: Putting output plot for only one Time-Series for reference

RMSE Values on Testing Data

MONATSZAHL	AUSPRAEGUNG	RMSE
Alkoholunfälle	insgesamt	9.522
Alkoholunfälle	Verletzte und Getötete	7.308
Fluchtunfälle	insgesamt	112.555
Fluchtunfälle	Verletzte und Getötete	13.784
Verkehrsunfälle	insgesamt	501.732
Verkehrsunfälle	mit Personenschäden	54.346
Verkehrsunfälle	Verletzte und Getötete	76.001

DeepAR

Tried to create a joint model for all the time-series. However, the results were extremely poor.
My hypothesis is that more data would be needed to train this model well.

RMSE for combined model: 1280.9237133896029

Individual Output

Conclusion

After looking at the RMSE scores of the models on the test data as well as the prediction plots vs the expected plots, I decided to use XGBoost as my final model for prediction.
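The RMSE part of this comparison can be reproduced from the three per-series tables above. The sketch below copies those values (ARIMA and LSTM rounded to three decimals) and counts, for each model, how many series it scores lowest on; the variable names are hypothetical, and DeepAR is left out because only a combined score is reported for it.

```python
# RMSE values copied from the tables above (rounded to three decimals).
series_keys = [
    ("Alkoholunfälle", "insgesamt"),
    ("Alkoholunfälle", "Verletzte und Getötete"),
    ("Fluchtunfälle", "insgesamt"),
    ("Fluchtunfälle", "Verletzte und Getötete"),
    ("Verkehrsunfälle", "insgesamt"),
    ("Verkehrsunfälle", "mit Personenschäden"),
    ("Verkehrsunfälle", "Verletzte und Getötete"),
]
rmse_values = {
    "ARIMA": [10.750, 8.557, 119.857, 16.331, 364.023, 63.002, 86.558],
    "LSTM": [11.486, 8.363, 189.657, 16.142, 573.959, 158.032, 99.533],
    "XGBoost": [9.522, 7.308, 112.555, 13.784, 501.732, 54.346, 76.001],
}

# For every series, find the model with the lowest test RMSE.
best_model = {}
for index, key in enumerate(series_keys):
    best_model[key] = min(rmse_values, key=lambda m: rmse_values[m][index])

# Count how many series each model wins.
wins = {model: 0 for model in rmse_values}
for model in best_model.values():
    wins[model] += 1
```

On these numbers XGBoost has the lowest RMSE on six of the seven series, and ARIMA is lower on Verkehrsunfälle / insgesamt.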
