uploading before the cluster dies - not expected to work
nstrug committed Jun 27, 2024
1 parent 04d1c37 commit 6eac31c
Showing 6 changed files with 1,521 additions and 240 deletions.
156 changes: 76 additions & 80 deletions 1-prep_and_gather_data.ipynb
@@ -49,8 +49,7 @@
"outputs": [],
"source": [
"!pip install s3fs\n",
"# Install more modules that you need here\n",
"!pip install seaborn"
"# Install more modules that you need here\n"
]
},
{
@@ -63,9 +62,7 @@
"outputs": [],
"source": [
"import pandas\n",
"# Import more modules and classes that you need here - REMEMBER TO RERUN THE CELL AFTER MODIFYING!\n",
"import os\n",
"import seaborn"
"# Import more modules and classes that you need here - REMEMBER TO RERUN THE CELL AFTER MODIFYING!\n"
]
},
{
@@ -117,9 +114,7 @@
"source": [
"AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']\n",
"AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']\n",
"# Add variable assignments for AWS_S3_ENDPOINT and AWS_S3_Bucket below.\n",
"AWS_S3_ENDPOINT = os.environ['AWS_S3_ENDPOINT']\n",
"AWS_S3_BUCKET = os.environ['AWS_S3_BUCKET']\n"
"# Add variable assignments for AWS_S3_ENDPOINT and AWS_S3_Bucket below.\n"
]
},
{
@@ -150,7 +145,6 @@
"id": "6e94b018-a5c8-4814-8619-0eece4e5d246",
"metadata": {},
"source": [
"## Exploratory data analysis <a class=\"anchor\" id=\"third-bullet\"></a>\n",
"Have a look in the Minio UI and you will see that you have two datafiles in your bucket, called winequality-red.csv and winequality-white.csv. Let's set up some code to pull these from the storage into memory so that we can start some statistical exploration and visualisation. We will use the Pandas module to do this."
]
},
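For reference, a minimal sketch of the kind of code the exercise above is asking for, assuming the environment variables defined earlier in the notebook and that the two CSV files sit at the top level of the bucket (the delimiter may need adjusting depending on how the files were produced):

import os
import pandas

# Sketch only: credentials and endpoint come from the environment variables set above.
storage_options = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    "client_kwargs": {"endpoint_url": os.environ["AWS_S3_ENDPOINT"]},
}
bucket = os.environ["AWS_S3_BUCKET"]

# Pull both data files straight into pandas DataFrames via s3fs.
red_wine = pandas.read_csv(f"s3://{bucket}/winequality-red.csv", storage_options=storage_options)
white_wine = pandas.read_csv(f"s3://{bucket}/winequality-white.csv", storage_options=storage_options)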
@@ -226,15 +220,14 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "7dac8273-016a-4771-9026-0b8b77f44bf7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# add a command in this cell to inspect our red wine data\n",
"red_wine.head(5)"
"# add a command in this cell to inspect our red wine data"
]
},
{
@@ -285,17 +278,23 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "fa3e4777-76c2-4687-a8da-6e10c18fe536",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Write your code here\n",
"# data = insert your method call here\n",
"data = transformdata(red_wine,white_wine)\n",
"data.head(5)"
"# data = insert your method call here"
]
},
{
"cell_type": "markdown",
"id": "0cd7ec53-4f24-4582-a45c-bb7a1c49039d",
"metadata": {},
"source": [
"## Exploratory data analysis <a class=\"anchor\" id=\"third-bullet\"></a>"
]
},
{
@@ -330,7 +329,25 @@
},
"outputs": [],
"source": [
"seaborn.displot(data.quality, kde=False)"
"seaborn.displot(data=data[\"quality\"])"
]
},
{
"cell_type": "markdown",
"id": "fd6f7611-560c-4179-bca6-2274e0ca4ae7",
"metadata": {},
"source": [
"We can also compare multiple features in a single graph:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6dfc88e-04a1-4a1e-b0c2-50be55a2f442",
"metadata": {},
"outputs": [],
"source": [
"seaborn.displot(data=data[[\"residual_sugar\",\"fixed_acidity\"]])"
]
},
{
@@ -350,37 +367,26 @@
},
"outputs": [],
"source": [
"def settarget(data):\n",
" high_quality = (data.quality >= 7).astype(int)\n",
" data.quality = high_quality\n",
" return data\n",
"data[\"high_quality\"] = (data.quality >= 7) # modify to return an int\n",
"data.tail(5)\n",
"\n",
"data = settarget(data)\n",
"data.tail(5)"
"# Add code below to plot the new boolean quality feature on a histogram"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "661daf61-7b01-4dcd-9f26-30a0dfa3c51a",
"metadata": {
"tags": []
},
"outputs": [],
"cell_type": "markdown",
"id": "ecf9d485-0983-4c9b-bf03-afcd6d7e91d3",
"metadata": {},
"source": [
"import seaborn as sns\n",
"sns.displot(data.quality, kde=False)"
"The kind of models that we will be using don't handle booleans, so modify your code above so that high_quality is an integer rather than a boolean."
]
},
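One possible way to complete the exercise cell above, as a sketch rather than the official solution: cast the flag to an integer and plot its distribution.

# Store the target as an integer (0/1) rather than a boolean.
data["high_quality"] = (data.quality >= 7).astype(int)

# Plot the new quality feature on a histogram.
seaborn.displot(data=data["high_quality"])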
{
"cell_type": "code",
"execution_count": null,
"id": "e4fec16b-9490-46a1-9f98-fe917796f219",
"cell_type": "markdown",
"id": "d618acce-b1c9-4b8a-990d-060ff16093fe",
"metadata": {},
"outputs": [],
"source": [
"## median, upper and lower quartile, IQR\n",
"## histogram for distribution"
"Let's see if there is a quantitative difference between high and low quality wines:"
]
},
{
@@ -392,41 +398,20 @@
},
"outputs": [],
"source": [
"dims = (3, 4)\n",
"\n",
"f, axes = plt.subplots(dims[0], dims[1], figsize=(25, 15))\n",
"axis_i, axis_j = 0, 0\n",
"for col in data.columns:\n",
" if col == 'is_red' or col == 'quality':\n",
" continue # Box plots cannot be used on indicator variables\n",
" sns.boxplot(x=data['quality'], y=data[col], ax=axes[axis_i, axis_j])\n",
" axis_j += 1\n",
" if axis_j == dims[1]:\n",
" axis_i += 1\n",
" axis_j = 0"
" if col in [\"is_red\", \"quality\", \"high_quality\"]:\n",
" continue # Box plots cannot be used on indicator variables\n",
" seaborn.boxplot(x=data['high_quality'], y=data[col])\n",
" matplotlib.pyplot.show()\n",
" "
]
},
{
"cell_type": "markdown",
"id": "206b8310-cd76-4012-ba6f-f2f621cd3fde",
"metadata": {},
"source": [
"Check missing value"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dee4b83a-a069-4ed8-a9c5-935e45539cd3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"## scenarios for missing data - decision for the missing data\n",
"## if alcohol is not an indicator, delete that record\n",
"\n",
"## what are we going to do with the outliers? are they real outliers?"
"Finally, let's check if we have any missing values."
]
},
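One simple way to do that check, as a sketch: count the null entries per column.

# Count missing values in each column; all zeros means there is nothing to impute or drop.
data.isna().sum()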
{
@@ -447,7 +432,7 @@
"metadata": {},
"source": [
"## Prepare dataset for training model <a class=\"anchor\" id=\"forth-bullet\"></a>\n",
"Split the input data into 3 sets:\n",
"We are going to split the input data into 3 sets:\n",
"\n",
"- Train (60% of the dataset used to train the model)\n",
"- Validation (20% of the dataset used to tune the hyperparameters)\n",
@@ -464,14 +449,14 @@
"outputs": [],
"source": [
"def get_trainingdata(data):\n",
" X = data.drop([\"quality\"], axis=1)\n",
" y = data.quality\n",
" X = data.drop([\"high_quality\"], axis=1)\n",
" y = data.high_quality\n",
"\n",
" # Split out the training data\n",
" X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)\n",
" X_train, X_rem, y_train, y_rem = sklearn.model_selection.train_test_split(X, y, train_size=0.6, random_state=123)\n",
"\n",
" # Split the remaining data equally into validation and test\n",
" X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)\n",
" X_val, X_test, y_val, y_test = sklearn.model_selection.train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)\n",
" return (X_train,X_val,X_test,y_train,y_val,y_test)"
]
},
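The cell that calls this helper is collapsed in the diff; presumably it unpacks the splits along these lines (variable names assumed):

X_train, X_val, X_test, y_train, y_val, y_test = get_trainingdata(data)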
@@ -489,22 +474,33 @@
},
{
"cell_type": "markdown",
"id": "cfe73322-99ac-4e6b-8ed6-c475d418e108",
"metadata": {
"tags": []
},
"id": "13189921-64a0-482f-ae36-d7538218facd",
"metadata": {},
"source": [
"## Build a baseline model (random forest classifier) <a class=\"anchor\" id=\"fifth-bullet\"></a>\n",
"Build a simple classifier using scikit-learn. Use MLflow to keep track of the model accuracy. You can read about Classification - ROC and AUC here if you want \n",
"https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"
"Use the info methods to explore the training, testing and validation datasets."
]
},
{
"cell_type": "markdown",
"id": "4ea89452-6ec8-43cc-923c-de443450ff35",
"cell_type": "code",
"execution_count": null,
"id": "76e9bff9-c73a-4803-a387-fbbe17b12930",
"metadata": {},
"outputs": [],
"source": [
"# Use this cell to explotre the training, validation, and test datasets"
]
},
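A sketch of the kind of exploration the exercise asks for, assuming the split variables returned by get_trainingdata:

# Inspect column dtypes, non-null counts and memory usage for each split.
X_train.info()
X_val.info()
X_test.info()

# Confirm the 60/20/20 split proportions.
print(len(X_train), len(X_val), len(X_test))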
{
"cell_type": "markdown",
"id": "cfe73322-99ac-4e6b-8ed6-c475d418e108",
"metadata": {
"tags": []
},
"source": [
"Enable MLflow autologging"
"## Build a baseline model <a class=\"anchor\" id=\"fifth-bullet\"></a>\n",
"Let's use a random forest classifier as a baseline model for our wine quality predictor. This isn't necessarily the fastest model, but is easy to understand, and fast to train, so it's good to use as a baseline. You can learn more about the random forest algorithm here: https://en.wikipedia.org/wiki/Random_forest\n",
"\n",
"We are going to use MLFlow to determine our model's accuracy. This generates two metrics, ROC and AUC, which will help us determine the accuracy of the model, read more about ROC and AUC here: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"
]
},
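The training cells themselves are collapsed below; purely as an illustration, a baseline along the lines the text describes might look like this (the hyperparameters shown are assumptions, not the lab's actual values):

import mlflow
import mlflow.sklearn
import sklearn.ensemble
import sklearn.metrics

# Automatically log parameters, metrics and the fitted model to MLflow.
mlflow.sklearn.autolog()

with mlflow.start_run():
    model = sklearn.ensemble.RandomForestClassifier(n_estimators=100, random_state=123)
    model.fit(X_train, y_train)

    # Evaluate on the validation split with the area under the ROC curve.
    val_auc = sklearn.metrics.roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", val_auc)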
{