uploading before the cluster dies - not expected to work
nstrug committed Jun 27, 2024
1 parent 04d1c37 commit 6eac31c
Showing 6 changed files with 1,521 additions and 240 deletions.
156 changes: 76 additions & 80 deletions 1-prep_and_gather_data.ipynb
@@ -49,8 +49,7 @@
"outputs": [],
"source": [
"!pip install s3fs\n",
"# Install more modules that you need here\n",
"!pip install seaborn"
"# Install more modules that you need here\n"
]
},
{
@@ -63,9 +62,7 @@
"outputs": [],
"source": [
"import pandas\n",
"# Import more modules and classes that you need here - REMEMBER TO RERUN THE CELL AFTER MODIFYING!\n",
"import os\n",
"import seaborn"
"# Import more modules and classes that you need here - REMEMBER TO RERUN THE CELL AFTER MODIFYING!\n"
]
},
{
@@ -117,9 +114,7 @@
"source": [
"AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']\n",
"AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']\n",
"# Add variable assignments for AWS_S3_ENDPOINT and AWS_S3_Bucket below.\n",
"AWS_S3_ENDPOINT = os.environ['AWS_S3_ENDPOINT']\n",
"AWS_S3_BUCKET = os.environ['AWS_S3_BUCKET']\n"
"# Add variable assignments for AWS_S3_ENDPOINT and AWS_S3_Bucket below.\n"
]
},
{
@@ -150,7 +145,6 @@
"id": "6e94b018-a5c8-4814-8619-0eece4e5d246",
"metadata": {},
"source": [
"## Exploratory data analysis <a class=\"anchor\" id=\"third-bullet\"></a>\n",
"Have a look in the Minio UI and you will see that you have two datafiles in your bucket, called winequality-red.csv and winequality-white.csv. Let's set up some code to pull these from the storage into memory so that we can start some statistical exploration and visualisation. We will use the Pandas module to do this."
]
},
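For reference, a minimal sketch of the kind of code the exercise above is asking for, assuming the environment variables defined earlier in the notebook and that the two CSV files sit at the top level of the bucket (the delimiter may need adjusting depending on how the files were produced):

import os
import pandas

# Sketch only: credentials and endpoint come from the environment variables set above.
storage_options = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    "client_kwargs": {"endpoint_url": os.environ["AWS_S3_ENDPOINT"]},
}
bucket = os.environ["AWS_S3_BUCKET"]

# Pull both data files straight into pandas DataFrames via s3fs.
red_wine = pandas.read_csv(f"s3://{bucket}/winequality-red.csv", storage_options=storage_options)
white_wine = pandas.read_csv(f"s3://{bucket}/winequality-white.csv", storage_options=storage_options)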
@@ -226,15 +220,14 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "7dac8273-016a-4771-9026-0b8b77f44bf7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# add a command in this cell to inspect our red wine data\n",
"red_wine.head(5)"
"# add a command in this cell to inspect our red wine data"
]
},
{
@@ -285,17 +278,23 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "fa3e4777-76c2-4687-a8da-6e10c18fe536",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Write your code here\n",
"# data = insert your method call here\n",
"data = transformdata(red_wine,white_wine)\n",
"data.head(5)"
"# data = insert your method call here"
]
},
{
"cell_type": "markdown",
"id": "0cd7ec53-4f24-4582-a45c-bb7a1c49039d",
"metadata": {},
"source": [
"## Exploratory data analysis <a class=\"anchor\" id=\"third-bullet\"></a>"
]
},
{
@@ -330,7 +329,25 @@
},
"outputs": [],
"source": [
"seaborn.displot(data.quality, kde=False)"
"seaborn.displot(data=data[\"quality\"])"
]
},
{
"cell_type": "markdown",
"id": "fd6f7611-560c-4179-bca6-2274e0ca4ae7",
"metadata": {},
"source": [
"We can also compare multiple features in a single graph:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6dfc88e-04a1-4a1e-b0c2-50be55a2f442",
"metadata": {},
"outputs": [],
"source": [
"seaborn.displot(data=data[[\"residual_sugar\",\"fixed_acidity\"]])"
]
},
{
@@ -350,37 +367,26 @@
},
"outputs": [],
"source": [
"def settarget(data):\n",
" high_quality = (data.quality >= 7).astype(int)\n",
" data.quality = high_quality\n",
" return data\n",
"data[\"high_quality\"] = (data.quality >= 7) # modify to return an int\n",
"data.tail(5)\n",
"\n",
"data = settarget(data)\n",
"data.tail(5)"
"# Add code below to plot the new boolean quality feature on a histogram"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "661daf61-7b01-4dcd-9f26-30a0dfa3c51a",
"metadata": {
"tags": []
},
"outputs": [],
"cell_type": "markdown",
"id": "ecf9d485-0983-4c9b-bf03-afcd6d7e91d3",
"metadata": {},
"source": [
"import seaborn as sns\n",
"sns.displot(data.quality, kde=False)"
"The kind of models that we will be using don't handle booleans, so modify your code above so that high_quality is an integer rather than a boolean."
]
},
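One possible way to complete the exercise cell above, as a sketch rather than the official solution: cast the flag to an integer and plot its distribution.

# Store the target as an integer (0/1) rather than a boolean.
data["high_quality"] = (data.quality >= 7).astype(int)

# Plot the new quality feature on a histogram.
seaborn.displot(data=data["high_quality"])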
{
"cell_type": "code",
"execution_count": null,
"id": "e4fec16b-9490-46a1-9f98-fe917796f219",
"cell_type": "markdown",
"id": "d618acce-b1c9-4b8a-990d-060ff16093fe",
"metadata": {},
"outputs": [],
"source": [
"## median, upper and lower quartile, IQR\n",
"## histogram for distribution"
"Let's see if there is a quantitative difference between high and low quality wines:"
]
},
{
@@ -392,41 +398,20 @@
},
"outputs": [],
"source": [
"dims = (3, 4)\n",
"\n",
"f, axes = plt.subplots(dims[0], dims[1], figsize=(25, 15))\n",
"axis_i, axis_j = 0, 0\n",
"for col in data.columns:\n",
" if col == 'is_red' or col == 'quality':\n",
" continue # Box plots cannot be used on indicator variables\n",
" sns.boxplot(x=data['quality'], y=data[col], ax=axes[axis_i, axis_j])\n",
" axis_j += 1\n",
" if axis_j == dims[1]:\n",
" axis_i += 1\n",
" axis_j = 0"
" if col in [\"is_red\", \"quality\", \"high_quality\"]:\n",
" continue # Box plots cannot be used on indicator variables\n",
" seaborn.boxplot(x=data['high_quality'], y=data[col])\n",
" matplotlib.pyplot.show()\n",
" "
]
},
{
"cell_type": "markdown",
"id": "206b8310-cd76-4012-ba6f-f2f621cd3fde",
"metadata": {},
"source": [
"Check missing value"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dee4b83a-a069-4ed8-a9c5-935e45539cd3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"## scenarios for missing data - decision for the missing data\n",
"## if alcohol is not an indicator, delete that record\n",
"\n",
"## what are we going to do with the outliers? are they real outliers?"
"Finally, let's check if we have any missing values."
]
},
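One simple way to do that check, as a sketch: count the null entries per column.

# Count missing values in each column; all zeros means there is nothing to impute or drop.
data.isna().sum()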
{
@@ -447,7 +432,7 @@
"metadata": {},
"source": [
"## Prepare dataset for training model <a class=\"anchor\" id=\"forth-bullet\"></a>\n",
"Split the input data into 3 sets:\n",
"We are going to split the input data into 3 sets:\n",
"\n",
"- Train (60% of the dataset used to train the model)\n",
"- Validation (20% of the dataset used to tune the hyperparameters)\n",
@@ -464,14 +449,14 @@
"outputs": [],
"source": [
"def get_trainingdata(data):\n",
" X = data.drop([\"quality\"], axis=1)\n",
" y = data.quality\n",
" X = data.drop([\"high_quality\"], axis=1)\n",
" y = data.high_quality\n",
"\n",
" # Split out the training data\n",
" X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)\n",
" X_train, X_rem, y_train, y_rem = sklearn.model_selection.train_test_split(X, y, train_size=0.6, random_state=123)\n",
"\n",
" # Split the remaining data equally into validation and test\n",
" X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)\n",
" X_val, X_test, y_val, y_test = sklearn.model_selection.train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)\n",
" return (X_train,X_val,X_test,y_train,y_val,y_test)"
]
},
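The cell that calls this helper is collapsed in the diff; presumably it unpacks the splits along these lines (variable names assumed):

X_train, X_val, X_test, y_train, y_val, y_test = get_trainingdata(data)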
@@ -489,22 +474,33 @@
},
{
"cell_type": "markdown",
"id": "cfe73322-99ac-4e6b-8ed6-c475d418e108",
"metadata": {
"tags": []
},
"id": "13189921-64a0-482f-ae36-d7538218facd",
"metadata": {},
"source": [
"## Build a baseline model (random forest classifier) <a class=\"anchor\" id=\"fifth-bullet\"></a>\n",
"Build a simple classifier using scikit-learn. Use MLflow to keep track of the model accuracy. You can read about Classification - ROC and AUC here if you want \n",
"https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"
"Use the info methods to explore the training, testing and validation datasets."
]
},
{
"cell_type": "markdown",
"id": "4ea89452-6ec8-43cc-923c-de443450ff35",
"cell_type": "code",
"execution_count": null,
"id": "76e9bff9-c73a-4803-a387-fbbe17b12930",
"metadata": {},
"outputs": [],
"source": [
"# Use this cell to explotre the training, validation, and test datasets"
]
},
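A sketch of the kind of exploration the exercise asks for, assuming the split variables returned by get_trainingdata:

# Inspect column dtypes, non-null counts and memory usage for each split.
X_train.info()
X_val.info()
X_test.info()

# Confirm the 60/20/20 split proportions.
print(len(X_train), len(X_val), len(X_test))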
{
"cell_type": "markdown",
"id": "cfe73322-99ac-4e6b-8ed6-c475d418e108",
"metadata": {
"tags": []
},
"source": [
"Enable MLflow autologging"
"## Build a baseline model <a class=\"anchor\" id=\"fifth-bullet\"></a>\n",
"Let's use a random forest classifier as a baseline model for our wine quality predictor. This isn't necessarily the fastest model, but is easy to understand, and fast to train, so it's good to use as a baseline. You can learn more about the random forest algorithm here: https://en.wikipedia.org/wiki/Random_forest\n",
"\n",
"We are going to use MLFlow to determine our model's accuracy. This generates two metrics, ROC and AUC, which will help us determine the accuracy of the model, read more about ROC and AUC here: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"
]
},
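The training cells themselves are collapsed below; purely as an illustration, a baseline along the lines the text describes might look like this (the hyperparameters shown are assumptions, not the lab's actual values):

import mlflow
import mlflow.sklearn
import sklearn.ensemble
import sklearn.metrics

# Automatically log parameters, metrics and the fitted model to MLflow.
mlflow.sklearn.autolog()

with mlflow.start_run():
    model = sklearn.ensemble.RandomForestClassifier(n_estimators=100, random_state=123)
    model.fit(X_train, y_train)

    # Evaluate on the validation split with the area under the ROC curve.
    val_auc = sklearn.metrics.roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", val_auc)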
{