diff --git a/hols/dataextraction/client/client.md b/hols/dataextraction/client/client.md
index ea5bd47..2e93968 100644
--- a/hols/dataextraction/client/client.md
+++ b/hols/dataextraction/client/client.md
@@ -65,16 +65,16 @@ You can use two Python scripts, depending on what you want to do (testing / prod
 
 Remember to run the code you choose in your local machine, as we need to make HTTP requests through **localhost**. This means that, in the computer where you're playing, you must have a Python environment configured and be able to run the abovementioned code.
 
-If you'd rather just see an example of the data returned, [check the contents of this file](https://static.developer.riotgames.com/docs/lol/liveclientdata_sample.json). You can observe the kind of information we can access from a player.
+1. If you'd rather just see an example of the data returned, [check the contents of this file](https://static.developer.riotgames.com/docs/lol/liveclientdata_sample.json). You can observe the kind of information we can access from a player.
 
-In [this file](https://github.com/oracle-devrel/leagueoflegends-optimizer/blob/livelabs/src/aux_files/example_live_client.txt), you can find a a sample JSON returned by the game.
+2. In [this file](https://github.com/oracle-devrel/leagueoflegends-optimizer/blob/livelabs/src/aux_files/example_live_client.txt), you can find a sample JSON returned by the game.
 
-If you're using your own ML model after training it, and you have successfully joined a game, you should see some stats and recommendations from the ML model start to appear when you run the file:
+3. If you're using your own ML model after training it, and you have successfully joined a game, you should see some stats and recommendations from the ML model start to appear when you run the file:
 
-![ML model recommendations](./images/model_output.png)
-> **Note**: at the beginning of the match, and until you have at least one kill and one death, it will just tell you that you're missing some data to start calculating your current performance (as 0 divided by any natural number is 0, and I didn't want to depress you telling you that you have a terrible amount of kills at the beginning.)
+   ![ML model recommendations](./images/model_output.png)
+   > **Note**: at the beginning of the match, and until you have at least one kill and one death, it will just tell you that you're missing some data to start calculating your current performance (as 0 divided by any natural number is 0, and I didn't want to depress you by telling you that you have a terrible kill count at the beginning).
 
-Also note it's recommended to run `run_new_live_model.py` with a Python version equal to the one that you used during the **Model Building** phase. Otherwise you will run into warnings, or even errors if the Python versions are too different.
+   Also note that it's recommended to run `run_new_live_model.py` with the same Python version you used during the **Model Building** phase. Otherwise, you will run into warnings, or even errors if the Python versions are too different.
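+
+   If you'd rather poke at the Live Client Data API yourself first, the sketch below polls the local endpoint the game exposes while you're in a match. This is a minimal example rather than the workshop's script: the URL and port come from Riot's Live Client Data docs, and `verify=False` is needed because the game serves a self-signed certificate.
+
+   ```python
+   # Minimal sketch: poll the game's local Live Client Data API while in a match.
+   import time
+
+   import requests
+   import urllib3
+
+   # The game serves a self-signed certificate, so skip TLS verification.
+   urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+
+   URL = "https://127.0.0.1:2999/liveclientdata/allgamedata"
+
+   while True:
+       try:
+           data = requests.get(URL, verify=False, timeout=5).json()
+           # activePlayer holds your own live stats (level, gold, etc.)
+           player = data["activePlayer"]
+           print(player["summonerName"], "level", player["level"])
+       except requests.exceptions.ConnectionError:
+           print("No active game found, retrying...")
+       time.sleep(10)
+   ```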
 ## Conclusions
 
@@ -97,4 +97,4 @@ A recap of what we've learned:
 
 * **Author** - Nacho Martinez, Data Science Advocate @ DevRel
 * **Contributors** - Victor Martin, Product Strategy Director
-* **Last Updated By/Date** - July 24th, 2023
+* **Last Updated By/Date** - July 26th, 2023
diff --git a/hols/dataextraction/creatingmodel/creatingmodel.md b/hols/dataextraction/creatingmodel/creatingmodel.md
index 13225a2..006aa7b 100644
--- a/hols/dataextraction/creatingmodel/creatingmodel.md
+++ b/hols/dataextraction/creatingmodel/creatingmodel.md
@@ -68,115 +68,114 @@ In this lab, you will learn how to create Machine Learning models with the data
 
 3. Now, with our Python dependencies installed and our repository and notebook ready, we're ready to run it from the first cell. Make sure to select the correct Kernel (the one that you have configured and that has all Python dependencies installed within it) from the Kernel dropdown menu:
 
-![selecting kernel](./images/select_kernel.PNG)
+   ![selecting kernel](./images/select_kernel.PNG)
 
 ## Task 2: The Data Structure
 
-From our dataset, we can observe an example of the data structure we're going to use to build our model:
+1. From our dataset, we can observe an example of the data structure we're going to use to build our model:
 
-![example data structure](./images/structure_2023.webp)
+   ![example data structure](./images/structure_2023.webp)
+   It is important to remember that structuring and manipulating data in the data science process takes about 80 to 90% of the time, according to expert sources (image courtesy of [“2020 State of Data Science: Moving From Hype Toward Maturity.”](https://www.anaconda.com/state-of-data-science-2020)), and we shouldn't be discouraged when spending most of our time processing and manipulating data structures. The ML algorithm is the easy part if you've identified the correct data structure and adapted it to the structure ML algorithms and pipelines expect.
 
-It is important to remember that structuring and manipulating data in the data science process takes about 80 to 90% of the time, according to expert sources (image courtesy of [“2020 State of Data Science: Moving From Hype Toward Maturity.”](https://www.anaconda.com/state-of-data-science-2020)), and we shouldn't be discouraged when spending most of our time processing and manipulating data structures. The ML algorithm is the easy part if you've correctly identified the correct data structure and adapted it to the structure ML algorithms and pipelines expect.
-
-![Breakdown of effort to train model](../../../images/lab1-anaconda_1.png?raw=true)
+   ![Breakdown of effort to train model](../../../images/lab1-anaconda_1.png?raw=true)
 
 ## Task 3: Load Data / Generate Dataset
 
-First, we load the model and train-test split it.
+1. First, we load the dataset and train-test split it.
 
-To perform ML properly, we need to take the dataset we're going to work with, and split it into two:
+   To perform ML properly, we need to take the dataset we're going to work with and split it into two:
 
-* A **training** dataset, from which our ML model will learn to make predictions.
-* A **testing** dataset, from which our ML model will validate the predictions it makes, and check how accurate it was compared to the truth.
+   * A **training** dataset, from which our ML model will learn to make predictions.
+   * A **testing** dataset, with which our ML model will validate the predictions it makes, checking how accurate they are compared to the truth.
-In ML, it's very typical to find values around 80% train / 20% test proportions, as it provides enough data for the model to be trained, and enough data to check the accuracy of the model without having too much / too little data in either of the datasets.
+   In ML, it's very typical to find proportions around 80% train / 20% test, as this gives enough data to train the model and enough data to check its accuracy, without having too much / too little data in either of the datasets.
 
-After this split, we divide the whole dataset into two separate files, one containing training data (85% of the original dataset) and testing data (15%).
+2. We then divide the whole dataset into two separate files: one containing the training data (85% of the original dataset) and one containing the testing data (15%).
 
-![reading dataset](images/read_dataset.png)
+   ![reading dataset](images/read_dataset.png)
 
-Then, we begin with simple data exploration of our initial dataset. Histograms are particularly useful to find the distribution of one (or many!) variables in our dataset and see if they follow any known statistical distribution.
+3. Then, we begin with simple data exploration of our initial dataset. Histograms are particularly useful for finding the distribution of one (or many!) variables in our dataset and seeing if they follow any known statistical distribution.
 
-![histogram example](images/histogram_example.PNG)
+   ![histogram example](images/histogram_example.PNG)
 
-It's also good to look at our new variables `f1...f3` and their minimum, average, maximum:
+4. It's also good to look at our new variables `f1...f3` and their minimum, average, and maximum values:
 
-![describe](images/minmax_f.PNG)
+   ![describe](images/minmax_f.PNG)
 
-This will also help us determine what to return to the user when they're playing a game in the end: the closer they are to the maximum, the better they will have performed, and we also need to adjust that accordingly.
+   This will also help us determine what to return to the user while they're playing a game: the closer they are to the maximum, the better they will have performed, and we'll need to scale what we report back accordingly.
 
-We're also interested in other variables' histograms, especially the ones around people getting multiple kills in a row, number of wards, big jungle objectives, match durations... Basically statistics that I find personally interesting after finishing a League match myself.
+5. We're also interested in other variables' histograms, especially the ones around people getting multiple kills in a row, the number of wards placed, big jungle objectives, match durations... Basically, statistics that I personally find interesting after finishing a League match myself.
 
-![all_histograms](images/histograms.webp)
+   ![all_histograms](images/histograms.webp)
 
-After getting a rough idea of what our dataset and some of our variables contain, it's time to tell the ML model which variables we want as input and which ones as output.
+6. After getting a rough idea of what our dataset and some of our variables contain, it's time to tell the ML model which variables we want as input and which ones as output.
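+
+   Before we do, here's what the loading, splitting, and exploration steps above boil down to in code - a minimal sketch, assuming the `performance_report.csv` generated in the previous lab sits next to the notebook:
+
+   ```python
+   # Minimal sketch of the load / split / explore steps above.
+   import pandas as pd
+
+   df = pd.read_csv("performance_report.csv")
+
+   # 85/15 train/test split, written out as two separate files (step 2).
+   train_df = df.sample(frac=0.85, random_state=42)
+   test_df = df.drop(train_df.index)
+   train_df.to_csv("train.csv", index=False)
+   test_df.to_csv("test.csv", index=False)
+
+   # Quick exploration: distributions and summary statistics (steps 3 and 4).
+   df.hist(column=["f1", "f2", "f3"], bins=50)
+   print(df[["f1", "f2", "f3"]].describe())
+   ```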
-For this example (in the notebook, we create several models), we'll first drop those columns we don't want to use as inputs or outputs - in this first model, we don't want to use any `f1...f5` variables in our dataset, as we're going to create a model with League's original data to begin with:
+   For this example (in the notebook, we create several models), we'll first drop the columns we don't want to use as inputs or outputs - in this first model, we don't want to use any `f1...f5` variables in our dataset, as we're going to create a model with League's original data to begin with:
 
-![dropping columns](images/dropping_columns.PNG)
+   ![dropping columns](images/dropping_columns.PNG)
 
-After we create our `TabularDataset()` object (which extends a pandas dataframe and therefore has most of panda's functions available), we're ready to start training.
+   After we create our `TabularDataset()` object (which extends a pandas dataframe and therefore has most of pandas' functions available), we're ready to start training.
 
 ## Task 4: Model Training
 
-Now that we've seen the shape of our dataset and we have the variable we want to predict (in this case, calculated_player_performance), we train as many models as possible for 10 minutes. We can instantiate a `TabularPredictor()` which takes most of the difficulties of usually writing this kind of code out of the equation:
+1. Now that we've seen the shape of our dataset and we have the variable we want to predict (in this case, `calculated_player_performance`), we train as many models as possible for 10 minutes. We can instantiate a `TabularPredictor()`, which takes most of the difficulty of writing this kind of code by hand out of the equation:
 
-![training simplified](images/training_simplified.PNG)
+   ![training simplified](images/training_simplified.PNG)
 
-We need to specify that this problem is a regression problem (we're predicting numerical, continuous values (not integers)) and we specify which variable it is we're trying to predict through the `label` parameter.
+   We need to specify that this is a regression problem (we're predicting continuous numerical values, not discrete classes) and indicate which variable we're trying to predict through the `label` parameter.
 
-The preset is a pre-configuration that restraints the amount of iterations, models, and time dedicated to train each model, to achieve some "quality" defined as a preset.
-> **Note**: find all available presets [here](https://auto.gluon.ai/0.5.2/tutorials/tabular_prediction/tabular-quickstart.html#presets).
+   The preset is a pre-configuration that restricts the number of iterations, the models tried, and the time dedicated to training each model, in order to achieve a given "quality" level.
+   > **Note**: find all available presets [here](https://auto.gluon.ai/0.5.2/tutorials/tabular_prediction/tabular-quickstart.html#presets).
 
-After our training is done (about 10 minutes), we can display some results:
+   After our training is done (about 10 minutes), we can display some results.
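+
+   To recap the whole training step in one place, here's a hedged sketch of what the screenshot above shows (the values are illustrative, and preset names vary slightly between AutoGluon versions):
+
+   ```python
+   # Hedged sketch of the AutoGluon training call (illustrative values).
+   from autogluon.tabular import TabularDataset, TabularPredictor
+
+   train_data = TabularDataset("train.csv")
+
+   predictor = TabularPredictor(
+       label="calculated_player_performance",  # the target variable
+       problem_type="regression",              # continuous target values
+       eval_metric="root_mean_squared_error",
+   ).fit(
+       train_data,
+       time_limit=600,          # train as many models as possible in 10 minutes
+       presets="best_quality",  # one of the preset configurations linked above
+   )
+   ```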
-First, we display a leaderboard of the best trained models ordered by decreasing RMSE. If you're not familiar with this concept, don't worry, we'll revisit all metrics right below, in our Model Testing task. This will help us see which models perform better against the target variable that we specified before:
+2. First, we display a leaderboard of the best trained models, ordered by increasing RMSE (best model first). If you're not familiar with this concept, don't worry: we'll revisit all metrics right below, in our Model Testing task. This will help us see which models perform better against the target variable we specified before:
 
-![leaderboard](images/leaderboard.PNG)
+   ![leaderboard](images/leaderboard.PNG)
 
-Note that our Level 2 Weighted Ensemble has the lowest RMSE of all: we'll probably want to use this model.
+   Note that our Level 2 Weighted Ensemble has the lowest RMSE of all: we'll probably want to use this model.
 
-![example of an ensemble model in computer vision](./images/example_ensemble.png)
-> **Note**: this is an example of an weighted ensemble model, in which decisions are taken using a technique called **bagging**: every model makes a prediction, and the best models will weigh more upon the final decision.
+   ![example of an ensemble model in computer vision](./images/example_ensemble.png)
+   > **Note**: this is an example of a weighted ensemble model, in which decisions are made using a technique called **bagging**: every model makes a prediction, and the better models weigh more in the final decision.
 
 ## Task 5: Model Testing
 
-After training is done, we need to check whether the training we did was actually useful or a waste of time. To achieve this, we make use of some metrics, which depend on the type of problem we're dealing with.
+1. After training is done, we need to check whether the training we did was actually useful or a waste of time. To achieve this, we make use of some metrics, which depend on the type of problem we're dealing with.
 
-For example, in a binary classification problem (where we're trying to predict if something is either 0 or 1), I typically use **accuracy, precision, recall and f1-score** as standard evaluation metrics:
+   For example, in a binary classification problem (where we're trying to predict whether something is either 0 or 1), I typically use **accuracy, precision, recall and f1-score** as standard evaluation metrics:
 
-![classification metrics](./images/classification_metrics.png)
-> **Note**: an example on how each one of these 4 metrics are calculated just by looking at the Confusion Matrix.
+   ![classification metrics](./images/classification_metrics.png)
+   > **Note**: an example of how each one of these 4 metrics is calculated just by looking at the Confusion Matrix.
 
-However, as we're dealing with a regression problem, the most popular metrics are: the MSE, MAE, RMSE, R-Squared and variants of these coefficients.
+   However, as we're dealing with a regression problem, the most popular metrics are MSE, MAE, RMSE, R-Squared, and variants of these coefficients.
 
-The MSE, MAE, RMSE, and R-Squared metrics are mainly used to evaluate the prediction error rates and model performance in regression analysis.
+   The MSE, MAE, RMSE, and R-Squared metrics are mainly used to evaluate prediction error rates and model performance in regression analysis.
 
-* MAE (Mean absolute error) represents the difference between the original and predicted values extracted by averaged the absolute difference over the data set.
-* MSE (Mean Squared Error) represents the difference between the original and predicted values extracted by squared the average difference over the data set.
-* RMSE (Root Mean Squared Error) is the error rate by the square root of MSE.
-* R-squared (Coefficient of determination) represents the coefficient of how well the values fit compared to the original values.
- The value from 0 to 1 interpreted as percentages. The higher the value is, the better the model is.
-* The Pearson correlation coefficient is a descriptive statistic, meaning that it summarizes the characteristics of a dataset
- Specifically, it describes the strength and direction of the linear relationship between two quantitative variables.
 
+   * MAE (Mean Absolute Error) is the average of the absolute differences between the predicted and original values over the dataset.
+   * MSE (Mean Squared Error) is the average of the squared differences between the predicted and original values over the dataset.
+   * RMSE (Root Mean Squared Error) is the square root of the MSE, which expresses the error in the same units as the target variable.
+   * R-squared (coefficient of determination) represents how well the predicted values fit the original values.
+     It ranges from 0 to 1 and can be read as a percentage: the higher the value, the better the model.
+   * The Pearson correlation coefficient is a descriptive statistic, meaning that it summarizes the characteristics of a dataset.
+     Specifically, it describes the strength and direction of the linear relationship between two quantitative variables.
 
-Note that, in our code, we need to use our **testing dataset** as the one to validate our metrics (we'll have the testing data to check against).
+2. Note that, in our code, we need to use our **testing dataset** to validate our metrics (we'll have the testing data to check against).
 
-![metrics 1](./images/metrics_1.PNG)
+   ![metrics 1](./images/metrics_1.PNG)
 
-We're also able to extract the feature importance of our model. This is an awesome calculation provided to us automatically by the AutoML library.
-Feature importance is an index that represents the measure of contribution of each feature in our ML model.
-> **Note**: it's important to note that feature importance depends on the model and dataset used, and different algorithms may assign different importance values to the same set of features.
+3. We're also able to extract the feature importances of our model. This is an awesome calculation provided to us automatically by the AutoML library.
+   Feature importance is an index that represents the measure of each feature's contribution to our ML model.
+   > **Note**: feature importance depends on the model and dataset used, and different algorithms may assign different importance values to the same set of features.
 
-For more advanced Machine Learning practicioners, there's a caveat I need to make here about certain types of regularization (like *L1*/*Lasso* regularization) - a technique that's used to often prevent overfitting and improve the generalized performance of a model -: it can force some coefficients to become zero, rendering those coefficients' variables useless in a model.
+   For more advanced Machine Learning practitioners, there's a caveat I need to make here about certain types of regularization (like *L1*/*Lasso* regularization) - a technique often used to prevent overfitting and improve the generalization performance of a model: it can force some coefficients to become zero, rendering those coefficients' variables useless in the model.
 
-![metrics 2](./images/metrics_2.PNG)
-> **Note**: if I have two variables with importances N and M, the first variable will have an importance N/M times higher than the second variable, and viceversa.
+   ![metrics 2](./images/metrics_2.PNG)
+   > **Note**: if I have two variables with importances N and M, the first variable's importance is N/M times that of the second variable, and vice versa.
 
-This means that our model takes `deaths, assists, kills` as the three most important variables, and the fourth most important variable is the game duration.
+   This means that our model takes `deaths, assists, kills` as the three most important variables, with game duration as the fourth most important one.
 
-After creating this function and invoking it, we will obtain a resulting CSV file or dataframe object. We'll use this new object to create our model.
+   After creating and invoking this function, we obtain a resulting CSV file or dataframe object. We'll use this new object to create our model.
 
 ## Task 6: Creating Extra Models
 
@@ -184,58 +183,58 @@ The rest of the notebook is similar to the process we've followed until now, wit
 
 ### Win Prediction Model (2nd Model)
 
-The second model we create is a winning predictor (a binary classifier that tries to predict whether the player won or lost, based on all input variables). We specify that this is a binary classification problem this way:
+1. The second model we create is a win predictor (a binary classifier that tries to predict whether the player won or lost, based on all input variables). We specify that this is a binary classification problem this way:
 
-![binary classification](./images/binary_classification.PNG)
+   ![binary classification](./images/binary_classification.PNG)
 
-The results from this model are very promising, reaching up to 99.37% accuracy:
+2. The results from this model are very promising, reaching up to 99.37% accuracy:
 
-![2nd importances](./images/2nd_importances.PNG)
-> **Note**: if you're planning to run inference (deploy your model and make predictions) on a low-end computer, you might be better off with the Light Gradient Boosted Model, as it's prediction times are about 100 times faster than our L2 Weighted Ensemble.
+   ![2nd importances](./images/2nd_importances.PNG)
+   > **Note**: if you're planning to run inference (deploy your model and make predictions) on a low-end computer, you might be better off with the Light Gradient Boosted Model, as its prediction times are about 100 times faster than our L2 Weighted Ensemble's.
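+
+   For completeness, the binary specification above is only a small change from the regression call - a hedged sketch with illustrative values:
+
+   ```python
+   # Hedged sketch of the win-prediction training step (illustrative values).
+   from autogluon.tabular import TabularDataset, TabularPredictor
+
+   train_data = TabularDataset("train.csv")
+   test_data = TabularDataset("test.csv")
+
+   predictor = TabularPredictor(
+       label="win",            # target: did the player win the match?
+       problem_type="binary",  # binary classification instead of regression
+       eval_metric="accuracy",
+   ).fit(train_data, time_limit=600)
+
+   # Validate against the held-out testing dataset.
+   print(predictor.evaluate(test_data))
+   print(predictor.feature_importance(test_data))
+   ```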
 ### Live Client API Compatible Model
 
-And now that we have one model for each, we attempt to create a model only using our `f1...f3` variables as discussed during the workshop. We call this model _Live Client API Compatible Model_ as it utilizes as much temporal data as possible from the API return object.
+1. And now that we have one model for each of those targets, we attempt to create a model using only our `f1...f3` variables, as discussed during the workshop. We call this model the _Live Client API Compatible Model_, as it uses as much temporal data as possible from the API return object.
 
-These variables were calculated like this:
+   These variables were calculated like this:
 
-* Kills + assists / gameTime ==> kills + assists ratio ==> `f2`
-* Deaths / gameTime ==> death ratio ==> `f1`
-* xp / gameTime ==> xp per min ==> `f3`
+   * (Kills + assists) / gameTime ==> kills + assists ratio ==> `f2`
+   * Deaths / gameTime ==> death ratio ==> `f1`
+   * xp / gameTime ==> xp per min ==> `f3`
 
-In our dataset, we also had two other variables that I was hoping I could also calculate with Live Client API data, but these variables weren't possible to accurately calculate:
+   In our dataset, we also had two other variables that I was hoping to calculate with Live Client API data as well, but these couldn't be calculated accurately:
 
-* `f4`, which represented the total amount of damage per minute, wasn't present in the Live Client API in any field
-* `f5`, which represented the total amount of gold per minute, wasn't either. You can only extract the **current** amount of gold, which doesn't add any real value to the model.
+   * `f4`, which represented the total amount of damage per minute, wasn't present in any field of the Live Client API.
+   * `f5`, which represented the total amount of gold per minute, wasn't either. You can only extract the **current** amount of gold, which doesn't add any real value to the model.
 
-So, the idea now is to create a model that, given f1, f2 and f3, and the champion name, is **able to predict any player's performance**.
+   So, the idea now is to create a model that, given `f1`, `f2`, and `f3`, plus the champion name, is **able to predict any player's performance**.
 
-![3rd data manipulation](./images/3rd_manipulation.PNG)
+   ![3rd data manipulation](./images/3rd_manipulation.PNG)
 
-![3rd fit](./images/3rd_fit.PNG)
+   ![3rd fit](./images/3rd_fit.PNG)
 
-![3rd leaderboard](./images/3rd_leaderboard.PNG)
-> **Note**: the RMSE in this third experiment, compared to the first model (both models predict the same target variable `calculated_player_performance`) is higher, which I expected, since we're using only 4 input variables for this model instead of 100+. However, as our leaderboard indicates, all these models are able to properly **infer** a player's performance, even if the RMSE is a bit more elevated.
+   ![3rd leaderboard](./images/3rd_leaderboard.PNG)
+   > **Note**: the RMSE in this third experiment, compared to the first model (both predict the same target variable, `calculated_player_performance`), is higher, which I expected, since we're using only 4 input variables for this model instead of 100+. However, as our leaderboard indicates, all these models are able to properly **infer** a player's performance, even if the RMSE is a bit higher.
 
-![3rd leaderboard](./images/3rd_importances.PNG)
+   ![3rd importances](./images/3rd_importances.PNG)
 
-Just as an interesting observation, our model's importance is mostly based around `f1` and `f2`, being `f3` about 5-8 times less important than the other two.
+   Just as an interesting observation, our model's importance is mostly concentrated in `f1` and `f2`, with `f3` about 5-8 times less important than the other two.
 
 ## Task 7: Downloading Models
 
-If you want to use these models in your computer, while you play League, you will need to *zip* all generated models into a file and download it to your computer.
+If you want to use these models on your computer while you play League, you will need to _zip_ all generated models into a file and download it to your computer.
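+
+Once you've bundled and downloaded them (the two steps below), loading a predictor back on your own machine is short - a hedged sketch, assuming your local AutoGluon version matches the one used for training and the zip was extracted next to your script:
+
+```python
+# Hedged sketch: load a downloaded predictor on your own machine.
+import pandas as pd
+from autogluon.tabular import TabularPredictor
+
+# One of the model directories bundled by the zip command below.
+predictor = TabularPredictor.load("live_model_1/")
+
+# Predict a player's performance from Live Client-compatible features
+# (column names and values here are illustrative; match your training schema).
+sample = pd.DataFrame([{"f1": 0.1, "f2": 0.6, "f3": 25.0}])
+print(predictor.predict(sample))
+```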
-In a terminal, you can run the following command to bundle all directories into one file: +1. In a terminal, you can run the following command to bundle all directories into one file: -```bash - -zip -r all_models.zip /home/datascience/leagueoflegends-optimizer/notebooks/live_model_1/ /home/datascience/leagueoflegends-optimizer/notebooks/player_performance_models /home/datascience/leagueoflegends-optimizer/notebooks/winner_models/ - -``` + ```bash + + zip -r all_models.zip /home/datascience/leagueoflegends-optimizer/notebooks/live_model_1/ /home/datascience/leagueoflegends-optimizer/notebooks/player_performance_models /home/datascience/leagueoflegends-optimizer/notebooks/winner_models/ + + ``` -Then go to the file explorer and download the selected file: +2. Go to the file explorer and download the selected file: -![download all models](./images/download_models.PNG) + ![download all models](./images/download_models.PNG) Now, we have successfully completed the Data Science process! We have: @@ -251,4 +250,4 @@ You may now [proceed to the next lab](#next). * **Author** - Nacho Martinez, Data Science Advocate @ DevRel * **Contributors** - Victor Martin, Product Strategy Director -* **Last Updated By/Date** - July 24th, 2023 +* **Last Updated By/Date** - July 26th, 2023 diff --git a/hols/dataextraction/optimizer/optimizer.md b/hols/dataextraction/optimizer/optimizer.md index 20b607e..d150430 100644 --- a/hols/dataextraction/optimizer/optimizer.md +++ b/hols/dataextraction/optimizer/optimizer.md @@ -40,13 +40,13 @@ There are some things to consider. In League of Legends, and since there are sev The more you repeat this process, the more data your dataset will have. If you want to use my dataset, [check out this Kaggle dataset](https://www.kaggle.com/datasets/jasperan/league-of-legends-optimizer-dataset?select=sqlite_report_performance.csv) or refer to the Infrastructure lab to download it (step 6) if you haven't already. -0. Before executing anything, we need to create a local sqlite3 database by running: +1. Before executing anything, we need to create a local sqlite3 database by running: ```bash λ python src/init_db.py ``` -1. To extract player data, we can run: +2. To extract player data, we can run: ```bash λ python src/cloudshell_league.py @@ -57,7 +57,7 @@ The more you repeat this process, the more data your dataset will have. If you w This execution option will iteratively look for League of Legends leaderboards in every region in the world, and insert these players' information into our database. If the user has already been inserted, it will prevent re-insertion. -2. To extract previously played matches' IDs from our pool of players in the database, we can do this: +3. To extract previously played matches' IDs from our pool of players in the database, we can do this: ```bash λ python src/cloudshell_league.py --mode="match_list" @@ -71,28 +71,30 @@ The more you repeat this process, the more data your dataset will have. If you w ## Task 2: Process Player Performance -```bash -python src/cloudshell_process_performance.py +1. Let's process each player's performance: -# it will then start extracting individual player matches' info and processing their performance. -``` + ```bash + python src/cloudshell_process_performance.py + + # it will then start extracting individual player matches' info and processing their performance. 
+   ```
 
 ![process player performance result](images/output_player_performance.gif)
 
 ## Task 3: Creating Final Dataset
 
-Now that we have loads of players' performances calculated, we just have to pass this to a `csv` format.
+1. Now that we have loads of players' performances calculated, we just have to export them to `csv` format.
 
-```bash
-python src/read_data.py
+   ```bash
+   python src/read_data.py
 
-# this script will generate 3 csv files:
+   # this script will generate 3 csv files:
 
-# - performance_report.csv, with the processed data ready for ML
-# - player_report.csv, with various player information (Masters+)
-# - match_report.csv, with every player's extracted matches.
-```
+   # - performance_report.csv, with the processed data ready for ML
+   # - player_report.csv, with various player information (Masters+)
+   # - match_report.csv, with every player's extracted matches.
+   ```
 
 From `performance_report.csv`, we'll be able to create our Machine Learning pipeline in the next chapter.
 
@@ -102,4 +104,4 @@ You may now [proceed to the next lab](#next).
 
 * **Author** - Nacho Martinez, Data Science Advocate @ DevRel
 * **Contributors** - Victor Martin, Product Strategy Director
-* **Last Updated By/Date** - July 24th, 2023
+* **Last Updated By/Date** - July 26th, 2023
diff --git a/hols/dataextraction/the_problem/problem.md b/hols/dataextraction/the_problem/problem.md
index 89cbf93..9c95e3d 100644
--- a/hols/dataextraction/the_problem/problem.md
+++ b/hols/dataextraction/the_problem/problem.md
@@ -108,59 +108,60 @@ When problems like these arise, we need to work around these incosistencies and
 
 ## Task 3: Calculating Player's Performance
 
-Now that we have a harmonized dataset, we're ready to calculate a player's performance. But how do we begin? For that, I like to use an AutoML tool called [mljar-supervised](https://github.com/mljar/mljar-supervised), that allows me to easily perform some automatic analysis for the dataset to predict the `win` variable (already provided by the API and present in our dataset). I can launch an experiment like this:
+1. Now that we have a harmonized dataset, we're ready to calculate a player's performance. But how do we begin? For that, I like to use an AutoML tool called [mljar-supervised](https://github.com/mljar/mljar-supervised), which lets me easily run some automatic analysis on the dataset to predict the `win` variable (already provided by the API and present in our dataset).
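+
+   In code, launching such an experiment boils down to a few lines - a hedged sketch using mljar-supervised's documented interface (the exact parameters I used are in the screenshot below):
+
+   ```python
+   # Hedged sketch of an mljar-supervised "Explain" experiment.
+   import pandas as pd
+   from supervised.automl import AutoML
+
+   df = pd.read_csv("performance_report.csv")
+   X = df.drop(columns=["win"])
+   y = df["win"]
+
+   # "Explain" mode trains a handful of models and produces the
+   # reports and visualizations discussed below.
+   automl = AutoML(mode="Explain")
+   automl.fit(X, y)
+   ```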
+   In the notebook, launching it looked like this:
 
-![mljar output](./images/mljar-output.PNG)
-> **Note**: Check out more information about the parameters I've used [here.](https://supervised.mljar.com/)
+   ![mljar output](./images/mljar-output.PNG)
 
-This generated a lot of visualizations for me, that gave me an idea of what's necessary to accurately predict the `win` (and `calculated_player_performance`) variable:
+   > **Note**: Check out more information about the parameters I've used [here](https://supervised.mljar.com/).
 
-* For example, in my generated FastAI Neural Network Model (one of the models with the highest accuracy), I got to see the most important variables:
+   This generated a lot of visualizations for me, which gave me an idea of what's necessary to accurately predict the `win` (and `calculated_player_performance`) variable:
 
-![mljar output](./images/nn_importance.png)
+   * For example, in my generated FastAI Neural Network Model (one of the models with the highest accuracy), I got to see the most important variables:
 
-* It's also important, if taking decisions to see whether this model's performance is good or not. In our case,
+     ![nn importance](./images/nn_importance.png)
 
-![learning curves from nn_model](./images/learning_curves.png)
+   * It's also important, when making decisions, to check whether the model's performance is good or not. In our case:
 
-We can see that the loss of our ML model is low enough for our model to have taken the correct approach to predict the target variable. We can confirm that the model is telling us the most important variables by checking other models' predictions as well.
+     ![learning curves from nn_model](./images/learning_curves.png)
 
-As we can see, the model is able to deduce whether we're going to win or not by just looking at four or five weighted variables. By comparing these stats to what we already have in the Live Client API, we'll determine which variables we can use from that data structure to arrive at the conclusion.
+   We can see that the loss of our ML model is low enough to conclude it took the correct approach to predicting the target variable. We can confirm that the model is pointing us to the most important variables by checking other models' predictions as well.
 
-Considering that we're working with time-dependent data, from the variables mentioned above, we can extract the same statistics (deaths, kills)... per minute. This will introduce the time dimension into our dataset:
+   As we can see, the model is able to deduce whether we're going to win or not by just looking at four or five weighted variables. By comparing these stats to what we already have in the Live Client API, we'll determine which variables we can use from that data structure to arrive at the same conclusion.
 
-* deaths/min
-* champLevel/min
-* assists/min
-* kills/min
-* duration (which is inferred into the above 4 variables already by adding it in the denominator as a factor of the variables).
+   Considering that we're working with time-dependent data, from the variables mentioned above we can extract the same statistics (deaths, kills...) per minute. This will introduce the time dimension into our dataset:
 
-From these variables, and for each one of our matches, our Data Extraction pipeline is robust enough so that, any time you download a new match using our repository, all these additional variables will be calculated for us. More specifically, if you look at the dataset, you will see some variables called `f1...f5` which represent:
+   * deaths/min
+   * champLevel/min
+   * assists/min
+   * kills/min
+   * duration (already folded into the four variables above, since it appears in their denominators).
 
-* f1: `deaths_per_min` (deaths/min),
-* f2: `k_a_per_min` (kills+assists/min),
-* f3: `level_per_min` (xp/min),
-* f4: `total_damage_per_min` (**NOT** present in Live Client API yet -> not used),
-* f5: `gold_per_min` (**NOT** present in Live Client API yet -> not used),
+   From these variables, and for each one of our matches, our Data Extraction pipeline is robust enough that, any time you download a new match using our repository, all these additional variables will be calculated for us. More specifically, if you look at the dataset, you will see some variables called `f1...f5`, which represent:
 
-According to [this Medium post](https://maddcog.medium.com/measure-league-of-legends-performance-with-this-game-grade-778c2fe832cb), the optimal game grade / player performance is calculated with this formula:
+   * f1: `deaths_per_min` (deaths/min)
+   * f2: `k_a_per_min` (kills+assists/min)
+   * f3: `level_per_min` (xp/min)
+   * f4: `total_damage_per_min` (**NOT** present in the Live Client API yet -> not used)
+   * f5: `gold_per_min` (**NOT** present in the Live Client API yet -> not used)
 
-```bash
-
-Game Grade = 0.336 — (1.437 x Deaths per min) + (0.000117 x gold per min) + (0.443 x K_A per min) + (0.264 x Level per min) + (0.000013 x TD per min)
-
-```
+   According to [this Medium post](https://maddcog.medium.com/measure-league-of-legends-performance-with-this-game-grade-778c2fe832cb), the optimal game grade / player performance is calculated with this formula:
 
-> **Note**: a game grade closer to 1 means the player had a ‘winning’ performance, while a grade closer to 0 equated to a ‘losing performance’.
+   ```text
+   Game Grade = 0.336 - (1.437 x Deaths per min) + (0.000117 x gold per min) + (0.443 x K_A per min) + (0.264 x Level per min) + (0.000013 x TD per min)
+   ```
 
-This can also be updated with our models, by taking the standardized coefficients for each one of these variables' importances, and create our formula.
+   > **Note**: a game grade closer to 1 means the player had a ‘winning’ performance, while a grade closer to 0 equates to a ‘losing’ performance.
 
-Adding creep score per minute didn't offer them any improvement to the model so I chose to ignore it as well. However, only using Diamond matches in their training dataset increased the accuracy by 3% in the end. This is good for us as we've only considered Masters+ games in our training dataset with the hopes of reducing variability in our data.
+   This can also be updated with our models, by taking the standardized coefficients for each one of these variables' importances and creating our own formula.
 
-As a conclusion, there is no noticeable improvement by adding variables (eg. creep score) or making the model more specific. Therefore, the simpler, generic model is what we'll aim for. So, we'll take the abovementioned variables (only three out of the five are present in the Live Client API) and build a new model from it:
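+
+   As a quick worked example of the formula (a sketch - the numbers are made up for illustration):
+
+   ```python
+   # Worked example of the game grade formula (illustrative numbers).
+   def game_grade(deaths_pm, gold_pm, k_a_pm, level_pm, td_pm):
+       return (0.336 - 1.437 * deaths_pm + 0.000117 * gold_pm
+               + 0.443 * k_a_pm + 0.264 * level_pm + 0.000013 * td_pm)
+
+   # A hypothetical 30-minute game: 3 deaths, 12000 gold, 15 kills+assists,
+   # level 16, and 25000 total damage dealt.
+   print(game_grade(3 / 30, 12000 / 30, 15 / 30, 16 / 30, 25000 / 30))
+   # 0.336 - 0.144 + 0.047 + 0.222 + 0.141 + 0.011 ~= 0.61 -> "winning" side
+   ```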
+   Adding creep score per minute didn't offer them any improvement to the model, so I chose to ignore it as well. However, only using Diamond matches in their training dataset increased the accuracy by 3% in the end. This is good for us, as we've only considered Masters+ games in our training dataset, in the hope of reducing variability in our data.
 
-* Input variables: deaths/min, kills+assists/min, xp/min.
-* Output variables: model 1 will predict the `win` variable and model 2 will predict `calculated_player_performance` for any given player.
+   As a conclusion, there is no noticeable improvement from adding variables (e.g. creep score) or making the model more specific. Therefore, the simpler, generic model is what we'll aim for. So, we'll take the abovementioned variables (only three out of the five are present in the Live Client API) and build a new model from them:
+
+   * Input variables: deaths/min, kills+assists/min, xp/min.
+   * Output variables: model 1 will predict the `win` variable and model 2 will predict `calculated_player_performance` for any given player.
 
 ## Task 4: Conclusion
 
@@ -170,7 +171,7 @@ Now that we have things clear:
 
 * Inputs / outputs of each model
 * Expected RMSE, accuracy for each one of the models
 
-And the fact that we have some additional model explainability thanks to `mljar-supervised`, **NOW** we're ready to begin building our models / a pipeline in OCI Data Science.
+And given that we also have some additional model explainability thanks to `mljar-supervised`, **NOW** we're ready to begin building our models / a pipeline in OCI Data Science.
 
 In order to build these models, we will also use AutoML, but a different tool. The tool you choose, in the end, must be parametrizable enough so that, if I'm unhappy with what's provided by default (like default hyperparameters), I still have enough control over the implementation of the AutoML library to be able to modify them as I see fit.
 
diff --git a/hols/workshops/dataextraction/manifest.json b/hols/workshops/dataextraction/manifest.json
index 068d176..53fddef 100644
--- a/hols/workshops/dataextraction/manifest.json
+++ b/hols/workshops/dataextraction/manifest.json
@@ -38,13 +38,13 @@
             "type": "dbcs"
         },
         {
-            "title": "Lab 4: Create a Model",
+            "title": "Lab 5: Create a Model",
             "description": "Create Model",
             "filename": "../../dataextraction/creatingmodel/creatingmodel.md",
             "type": "dbcs"
         },
         {
-            "title": "Lab 5: Client",
+            "title": "Lab 6: Client",
             "description": "Client API.",
             "filename": "../../dataextraction/client/client.md",
             "type": "dbcs"