diff --git a/1-prep_and_gather_data.ipynb b/1-prep_and_gather_data.ipynb
new file mode 100644
index 0000000..8a017a6
--- /dev/null
+++ b/1-prep_and_gather_data.ipynb
@@ -0,0 +1,822 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "73878580-3986-4fc3-81b6-f598e0783b4f",
+ "metadata": {},
+ "source": [
+ "# Demo project - Wine quality prediction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53536e98-387e-4457-9d03-5a1d435837a6",
+ "metadata": {},
+ "source": [
+ "## Contents:\n",
+ "* [Import packages](#first-bullet)\n",
+ "* [Load Data](#second-bullet)\n",
+ "* [Exploratory data analysis](#third-bullet)\n",
+ "* [Prepare dataset for training model](#forth-bullet)\n",
+ "* [Build a baseline model](#fifth-bullet)\n",
+ "* [Experiment with a new model](#sixth-bullet)\n",
+ "* [Predict](#seventh-bullet)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff902f5d-7bc1-49c9-8994-d4c3e1cff7e8",
+ "metadata": {},
+ "source": [
+ "## Import packages "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d73dd05b-5b58-4587-a1df-02a9c96d1996",
+ "metadata": {},
+ "source": [
+ "We will need to install and import packages as we develop our notebook. We've created a couple of starter cells for you but you will need to add more as you work through the notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e3744f2b-4d07-4f10-84f0-1daa05cb8573",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install s3fs\n",
+ "# Install more modules that you need here\n",
+ "!pip install seaborn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa3a0921-36d0-4b26-8378-cd5adab57fb8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import pandas\n",
+ "# Import more modules and classes that you need here - REMEMBER TO RERUN THE CELL AFTER MODIFYING!\n",
+ "import os\n",
+ "import seaborn"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c5ac7837-ae31-4b73-9fd0-e8d8dddbafb6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Load Data "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7e0d066b-1b2f-47ac-a9a3-f01d62f22969",
+ "metadata": {},
+ "source": [
+ "You have access to MinIO-based S3 storage where your datasets are available and where you will eventually push models. This storage is defined by a 'Data Connection' in your Data Science Project, and you can read its settings from environment variables. Run the following shell command to list the relevant environment variables:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "780f041d-f811-4927-b039-13eed4dd151e",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!env | grep AWS"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "edbe5e0f-9470-4228-88be-43fb588ede9d",
+ "metadata": {},
+ "source": [
+ "You will need to assign these to Python variables before you can use them in code cells. We've started you off with some code below, but you'll also need variables for the endpoint and the bucket. Remember to import any modules you need in the import cell at the top of the notebook, then re-run that cell before running the cell below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2aa0f94-5b37-4873-9467-44e6703af9c8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']\n",
+ "AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']\n",
+ "# Add variable assignments for AWS_S3_ENDPOINT and AWS_S3_BUCKET below.\n",
+ "AWS_S3_ENDPOINT = os.environ['AWS_S3_ENDPOINT']\n",
+ "AWS_S3_BUCKET = os.environ['AWS_S3_BUCKET']\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d0fdc153-bb03-4e42-8df2-01de01b4f47d",
+ "metadata": {},
+ "source": [
+ "Check that your variables have been correctly set:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2dfbd5d1-2f81-4432-be43-fc6101d97ea3",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "print(\"AWS_ACCESS_KEY_ID is \" + AWS_ACCESS_KEY_ID)\n",
+ "print(\"AWS_SECRET_ACCESS_KEY is \" + AWS_SECRET_ACCESS_KEY)\n",
+ "print(\"AWS_S3_ENDPOINT is \" + AWS_S3_ENDPOINT)\n",
+ "print(\"AWS_S3_BUCKET is \" + AWS_S3_BUCKET)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e94b018-a5c8-4814-8619-0eece4e5d246",
+ "metadata": {},
+ "source": [
+ "## Exploratory data analysis \n",
+ "Have a look in the MinIO UI and you will see two data files in your bucket: winequality-red.csv and winequality-white.csv. Let's set up some code to pull these from storage into memory so that we can start some statistical exploration and visualisation. We will use the pandas module to do this."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0da01133-3435-48e2-b67e-184d25a3270f",
+ "metadata": {},
+ "source": [
+ "First we define a function, read_data(), which uses pandas.read_csv() to read CSV files directly from S3 storage. Note how we pass our S3 credentials to it through storage_options. Because this cell only defines the function, executing it won't read any data yet."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3131849f-2412-4468-93b2-b21e73f91aa7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def read_data(datasrc):\n",
+ " data = pandas.read_csv(\n",
+ " \"s3://\" + AWS_S3_BUCKET + \"/\" + datasrc, sep=';',\n",
+ " storage_options={\n",
+ " \"key\": AWS_ACCESS_KEY_ID,\n",
+ " \"secret\": AWS_SECRET_ACCESS_KEY,\n",
+ " \"endpoint_url\": AWS_S3_ENDPOINT,\n",
+ " }\n",
+ " )\n",
+ " return data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d96f7e23-18f1-4749-987c-1b0a1ee1f84e",
+ "metadata": {},
+ "source": [
+ "Let's try reading our two CSV files into memory now."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0ab7588-f839-4325-a4cb-22486498884d",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "white_wine = read_data('winequality-white.csv')\n",
+ "red_wine = read_data('winequality-red.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "46ce3744-1bcd-4571-9613-d1b9a0ba024a",
+ "metadata": {},
+ "source": [
+ "Let's have a look at our white wine data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "00ee462f-ff94-44ba-b68a-53d2211e8939",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "white_wine.head(5)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7dac8273-016a-4771-9026-0b8b77f44bf7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# add a command in this cell to inspect our red wine data\n",
+ "red_wine.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9197769d-d945-491a-abdc-3939a8b3a3a5",
+ "metadata": {},
+ "source": [
+ "We would like to analyse our red and white wine datasets together, so it makes sense to merge them into one. But how will we then tell the red wines from the white ones? We simply add another feature, called 'is_red', which is essentially a Boolean indicating whether the wine is red or 'not red', i.e. white.\n",
+ "\n",
+ "(Extra credit for anyone who can point out what might be problematic about this approach!)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbeed573-c132-4e8b-b1bb-e261812faa71",
+ "metadata": {},
+ "source": [
+ "Let's define a function that adds our additional feature to each dataset and then concatenates them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6af49215-b766-4ca4-a140-6e21f8d7ecb7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def transformdata(red_wine,white_wine):\n",
+ " red_wine['is_red'] = 1\n",
+ " white_wine['is_red'] = 0\n",
+ " data = pandas.concat([red_wine, white_wine], axis=0)\n",
+ " # let's get rid of those annoying spaces in our column names\n",
+ " data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)\n",
+ " return data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "27a5d750-0307-43ca-abf5-1562fc8191f2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Now, invoke your function and show the first 5 rows of the merged data below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa3e4777-76c2-4687-a8da-6e10c18fe536",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Write your code here\n",
+ "# data = insert your method call here\n",
+ "data = transformdata(red_wine,white_wine)\n",
+ "data.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd056fae-7cf1-4e54-99d9-0f19c014534b",
+ "metadata": {},
+ "source": [
+ "SLIDES TO DISCUSS EXPLORATORY STATS COVERING:\n",
+ "- visualisation basics\n",
+ "- mean, median, deviation, skew\n",
+ "- quartiles and outliers\n",
+ "- correlation\n",
+ "\n",
+ "\n",
+ "Let's visualise our data using the seaborn module. Remember that you may need to install and/or import the module in the cells at the beginning of the notebook (and re-run them). Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. You can read more about it here: https://seaborn.pydata.org/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4effa559-432c-4130-b41d-2e3f11ec0465",
+ "metadata": {},
+ "source": [
+ "The next cell plots a histogram of the quality of our wine. Experiment with plotting different features of the dataset, e.g. alcohol content, pH etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b4e1d074-f5fb-41b1-afc7-1150550db01f",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "seaborn.displot(data.quality, kde=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "494dfe1e-15a4-4a41-adda-f4700bac5835",
+ "metadata": {},
+ "source": [
+ "Let's simplify things by converting quality from a numeric score to a simple Boolean: a wine is high quality (a score of 7 or above) or it is not. This quality feature will be the target that our model learns to predict."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5c06e500-1871-4c71-816a-b898cb8633d1",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def settarget(data):\n",
+ " high_quality = (data.quality >= 7).astype(int)\n",
+ " data.quality = high_quality\n",
+ " return data\n",
+ "\n",
+ "data = settarget(data)\n",
+ "data.tail(5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "661daf61-7b01-4dcd-9f26-30a0dfa3c51a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import seaborn as sns\n",
+ "sns.displot(data.quality, kde=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e4fec16b-9490-46a1-9f98-fe917796f219",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## median, upper and lower quartile, IQR\n",
+ "## histogram for distribution"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2f886882-1fd1-4633-af8b-602dc90d369a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "dims = (3, 4)\n",
+ "\n",
+ "f, axes = plt.subplots(dims[0], dims[1], figsize=(25, 15))\n",
+ "axis_i, axis_j = 0, 0\n",
+ "for col in data.columns:\n",
+ "    if col == 'is_red' or col == 'quality':\n",
+ "        continue # Box plots cannot be used on indicator variables\n",
+ "    sns.boxplot(x=data['quality'], y=data[col], ax=axes[axis_i, axis_j])\n",
+ "    axis_j += 1\n",
+ "    if axis_j == dims[1]:\n",
+ "        axis_i += 1\n",
+ "        axis_j = 0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "206b8310-cd76-4012-ba6f-f2f621cd3fde",
+ "metadata": {},
+ "source": [
+ "Check for missing values"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dee4b83a-a069-4ed8-a9c5-935e45539cd3",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "## scenarios for missing data - decision for the missing data\n",
+ "## if alcohol is not an indicator, delete that record\n",
+ "\n",
+ "## what are we going to do with the outliers? are they real outliers?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "31857042-ba20-42b5-9739-6a017f6b1951",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "data.isna().any()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b89fa04-ee75-4782-ac57-b2f5f2f45239",
+ "metadata": {},
+ "source": [
+ "## Prepare dataset for training model \n",
+ "Split the input data into 3 sets:\n",
+ "\n",
+ "- Train (60% of the dataset used to train the model)\n",
+ "- Validation (20% of the dataset used to tune the hyperparameters)\n",
+ "- Test (20% of the dataset used to report the true performance of the model on an unseen dataset)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "757d77d2-d54e-4a2b-8b8d-1eba01ba61ce",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def get_trainingdata(data):\n",
+ " X = data.drop([\"quality\"], axis=1)\n",
+ " y = data.quality\n",
+ "\n",
+ " # Split out the training data\n",
+ " X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)\n",
+ "\n",
+ " # Split the remaining data equally into validation and test\n",
+ " X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)\n",
+ " return (X_train,X_val,X_test,y_train,y_val,y_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "61d6a7b2-974c-4c6c-a31f-3df48760c805",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "(X_train,X_val,X_test,y_train,y_val,y_test) = get_trainingdata(data)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cfe73322-99ac-4e6b-8ed6-c475d418e108",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Build a baseline model (random forest classifier) \n",
+ "Build a simple classifier using scikit-learn. Use MLflow to keep track of the model's performance, measured here by AUC. If you would like to read more about ROC and AUC for classification, see:\n",
+ "https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4ea89452-6ec8-43cc-923c-de443450ff35",
+ "metadata": {},
+ "source": [
+ "Enable MLflow autologging"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4b53a514-9eab-491d-9237-80448e4cea20",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "experiment_name = \"WineQuality\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6628b383-14a4-493e-9add-e6076adf6ad5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# point MLflow at the tracking server and select the experiment (set_experiment creates it if it does not exist)\n",
+ "mlflow.set_tracking_uri(\"http://mlflow:5500\")\n",
+ "mlflow.set_experiment(experiment_name)\n",
+ "\n",
+ "# enable autologging\n",
+ "mlflow.sklearn.autolog(log_input_examples=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4f9e8cf7-0458-49d1-9dfa-a3061bbc00d4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def log_featureimportance(model):\n",
+ " tmpdir = tempfile.mkdtemp()\n",
+ " filepath = os.path.join(tmpdir, 'feature_importance.json')\n",
+ " feature_importances = pandas.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])\n",
+ " feature_importances.sort_values('importance', ascending=False).to_json(filepath)\n",
+ " mlflow.log_artifact(filepath)\n",
+ " return"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "01bdefde-fc6a-4e7d-8c68-649528667fd4",
+ "metadata": {},
+ "source": [
+ "Train random forest"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f2832df-baa7-4125-8ef3-681517dbe8b0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class SklearnModelWrapper(mlflow.pyfunc.PythonModel):\n",
+ "    def __init__(self, model):\n",
+ "        self.model = model\n",
+ "\n",
+ "    def predict(self, context, model_input):\n",
+ "        return self.model.predict_proba(model_input)[:,1]\n",
+ "\n",
+ "def train_randomforest(X_train,y_train,X_test,y_test):\n",
+ "\n",
+ "    with mlflow.start_run(run_name='untuned_random_forest'):\n",
+ "        n_estimators = 10\n",
+ "        model = RandomForestClassifier(n_estimators=n_estimators, random_state=np.random.RandomState(123))\n",
+ "        model.fit(X_train, y_train)\n",
+ "\n",
+ "        predictions_test = model.predict_proba(X_test)[:,1]\n",
+ "        auc_score = roc_auc_score(y_test, predictions_test)\n",
+ "        mlflow.log_param('n_estimators', n_estimators) # log the parameter and metric of interest\n",
+ "        mlflow.log_metric('auc', auc_score)\n",
+ "        wrappedModel = SklearnModelWrapper(model)\n",
+ "\n",
+ "        signature = infer_signature(X_train, wrappedModel.predict(None, X_train))\n",
+ "\n",
+ "        conda_env = _mlflow_conda_env(\n",
+ "            additional_conda_deps=None,\n",
+ "            additional_pip_deps=[\"cloudpickle=={}\".format(cloudpickle.__version__), \"scikit-learn=={}\".format(sklearn.__version__)],\n",
+ "            additional_conda_channels=None,\n",
+ "        )\n",
+ "        mlflow.pyfunc.log_model(\"random_forest_model\", python_model=wrappedModel, conda_env=conda_env, signature=signature)\n",
+ "        log_featureimportance(model)\n",
+ "    return model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d8957943-d0dc-4803-aab5-6c6ceb8ba34d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = train_randomforest(X_train,y_train,X_test,y_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c8cdc21-a1df-429b-8be3-3e4d1ea5ff2d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sanity-check: This should match the AUC logged by MLflow\n",
+ "print(f'AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:,1])}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6eba2f19-7755-412e-bc67-2e586817582c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sanity-check: This should match the feature importance logged by MLflow\n",
+ "feature_importances = pandas.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])\n",
+ "feature_importances.sort_values('importance', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4206c08-5021-43dd-a11d-e7e7f687ebef",
+ "metadata": {},
+ "source": [
+ "## Experiment with a new model (xgboost) \n",
+ "Use the xgboost library to train a more accurate model. Run hyperparameter tuning to train multiple models. As before, the code tracks the performance of each parameter configuration with MLflow."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "84cf68d6-86a4-4daa-927b-b4dc53f8cc9d",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "search_space = {\n",
+ " 'max_depth': scope.int(hp.quniform('max_depth', 50, 100, 10)),\n",
+ " 'learning_rate': hp.loguniform('learning_rate', -3, 0),\n",
+ " 'reg_alpha': hp.loguniform('reg_alpha', -5, -1),\n",
+ " 'reg_lambda': hp.loguniform('reg_lambda', -6, -1),\n",
+ " 'min_child_weight': hp.loguniform('min_child_weight', -1, 3),\n",
+ " 'objective': 'binary:logistic',\n",
+ " 'seed': 123, # Set a seed for deterministic training\n",
+ "}\n",
+ "\n",
+ "def train_model(params):\n",
+ "\n",
+ "    mlflow.xgboost.autolog()\n",
+ "    with mlflow.start_run(nested=True):\n",
+ "        train = xgb.DMatrix(data=X_train, label=y_train)\n",
+ "        validation = xgb.DMatrix(data=X_val, label=y_val)\n",
+ "\n",
+ "        booster = xgb.train(params=params, dtrain=train, num_boost_round=100,\n",
+ "                            evals=[(validation, \"validation\")], early_stopping_rounds=50)\n",
+ "        validation_predictions = booster.predict(validation)\n",
+ "        auc_score = roc_auc_score(y_val, validation_predictions)\n",
+ "        mlflow.log_metric('auc', auc_score) # log the metric of interest\n",
+ "\n",
+ "        signature = infer_signature(X_train, booster.predict(train))\n",
+ "        mlflow.xgboost.log_model(booster, \"model\", signature=signature)\n",
+ "\n",
+ "        return {'status': STATUS_OK, 'loss': -1*auc_score, 'booster': booster.attributes()}\n",
+ "\n",
+ "with mlflow.start_run(run_name='xgboost_models'):\n",
+ " best_params = fmin(\n",
+ " fn=train_model,\n",
+ " space=search_space,\n",
+ " algo=tpe.suggest,\n",
+ " max_evals=10,\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12fb65af-5545-4563-a70a-ad62d11e6615",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "best_run = mlflow.search_runs(order_by=['metrics.auc DESC']).iloc[0]\n",
+ "best_run_id = best_run[\"run_id\"]\n",
+ "print(f'AUC of Best Run: {best_run[\"metrics.auc\"]}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2df971e0-1748-489c-b36d-dd481c211a0d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "best_run_id"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f57f0bc-ec52-463c-9241-0bf897465b1b",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Predict "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "db412034-daab-4942-b3ae-0c5410a3e5a5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# model = mlflow.pyfunc.load_model(f\"models:/TestModelD/production\")\n",
+ "model = mlflow.pyfunc.load_model(\"runs:/\" + best_run_id + \"/model\")\n",
+ "\n",
+ "test_predictions = model.predict(X_test)\n",
+ "print(f'AUC: {roc_auc_score(y_test, test_predictions)}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b5593581-c563-4c1a-aa80-f10d77f53209",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "class_labels = ['not high quality', 'high quality']\n",
+ "test_predictions = np.where(test_predictions>0.5, 1, 0)\n",
+ "print(classification_report(y_test, test_predictions, target_names=class_labels))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0ebe7ec-a7e2-49bc-88bb-2b5ea79f3807",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cm = confusion_matrix(y_test, test_predictions)\n",
+ "disp = ConfusionMatrixDisplay(confusion_matrix=cm)\n",
+ "\n",
+ "disp.plot()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d9f40cc8-9a79-4521-9783-6a3aa0b0127a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# register the best model\n",
+ "new_model_version = mlflow.register_model(f\"runs:/{best_run_id}/model\", \"WineQuality\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d6d78d2a-563d-41de-a2a9-1dbddbdcb3cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # Promote the new model version to Production\n",
+ "# client.transition_model_version_stage(\n",
+ "# name=\"TestModelD\",\n",
+ "# version=new_model_version.version,\n",
+ "# stage=\"Production\"\n",
+ "# )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d112aed5-fe65-47b3-aff3-899a96010bdf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # clean up models\n",
+ "# from mlflow.tracking import MlflowClient\n",
+ "# client = MlflowClient()\n",
+ "# client.delete_registered_model(name=\"WineQuality\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3.9",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.16"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/1-prep_and_gather_data_instructor.ipynb b/1-prep_and_gather_data_instructor.ipynb
new file mode 100644
index 0000000..f72a9e5
--- /dev/null
+++ b/1-prep_and_gather_data_instructor.ipynb
@@ -0,0 +1,771 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "73878580-3986-4fc3-81b6-f598e0783b4f",
+ "metadata": {},
+ "source": [
+ "# Demo project - Wine quality prediction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53536e98-387e-4457-9d03-5a1d435837a6",
+ "metadata": {},
+ "source": [
+ "## Contents:\n",
+ "* [Import packages](#first-bullet)\n",
+ "* [Load Data](#second-bullet)\n",
+ "* [Exploratory data analysis](#third-bullet)\n",
+ "* [Prepare dataset for training model](#forth-bullet)\n",
+ "* [Build a baseline model](#fifth-bullet)\n",
+ "* [Experiment with a new model](#sixth-bullet)\n",
+ "* [Predict](#seventh-bullet)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff902f5d-7bc1-49c9-8994-d4c3e1cff7e8",
+ "metadata": {},
+ "source": [
+ "## Import packages "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d73dd05b-5b58-4587-a1df-02a9c96d1996",
+ "metadata": {},
+ "source": [
+ "In addition to the s3fs package, we will need to import hyperopt, cloudpickle, mlflow, and xgboost.\n",
+ "Modify the following cell to make this happen."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e3744f2b-4d07-4f10-84f0-1daa05cb8573",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install s3fs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa3a0921-36d0-4b26-8378-cd5adab57fb8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import boto3\n",
+ "import pandas as pd\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c5ac7837-ae31-4b73-9fd0-e8d8dddbafb6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Load Data "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0fb050a-b46a-4de4-8fdd-0fe8e3b61d3e",
+ "metadata": {},
+ "source": [
+ "Assumption: the bucket already exists and \"winequality-red.csv\" & \"winequality-white.csv\" have been uploaded to it.\n",
+ "Read the data from the object store.\n",
+ "Connect to the object store and instantiate a client object using a boto3 session:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2aa0f94-5b37-4873-9467-44e6703af9c8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']\n",
+ "AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']\n",
+ "AWS_S3_ENDPOINT = os.environ['AWS_S3_ENDPOINT']\n",
+ "AWS_S3_BUCKET = os.environ['AWS_S3_BUCKET']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "47ae3876-523d-47f1-b895-e26bfb65977f",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "s3conn = boto3.Session(aws_access_key_id=AWS_ACCESS_KEY_ID,\n",
+ " aws_secret_access_key=AWS_SECRET_ACCESS_KEY)\n",
+ "s3_client = s3conn.client('s3', endpoint_url=AWS_S3_ENDPOINT, verify=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d5bb280-9473-46a0-b575-dfa64f9872cf",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "s3_client.list_objects(Bucket=AWS_S3_BUCKET)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7cd65395-3135-45a0-8d7a-227ff2a05840",
+ "metadata": {},
+ "source": [
+ "Using the s3_client, retrieve data from the object store:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9207c9a6-bcc4-42fc-abf1-ff84c431dbbd",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "objectname = \"winequality-red.csv\"\n",
+ "file_addr = \"data/winequality-red.csv\"\n",
+ "response = s3_client.download_file(AWS_S3_BUCKET, objectname, file_addr)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e472d265-819b-4881-9fd4-b2fda3933179",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "objectname = \"winequality-white.csv\"\n",
+ "file_addr = \"data/winequality-white.csv\"\n",
+ "response = s3_client.download_file(AWS_S3_BUCKET, objectname, file_addr)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e94b018-a5c8-4814-8619-0eece4e5d246",
+ "metadata": {},
+ "source": [
+ "## Exploratory data analysis "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3131849f-2412-4468-93b2-b21e73f91aa7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import s3fs\n",
+ "def read_data(datasrc):\n",
+ " data = pd.read_csv(\n",
+ " \"s3://\" + AWS_S3_BUCKET + \"/\" + datasrc, sep=';',\n",
+ " storage_options={\n",
+ " \"key\": AWS_ACCESS_KEY_ID,\n",
+ " \"secret\": AWS_SECRET_ACCESS_KEY,\n",
+ " \"endpoint_url\": AWS_S3_ENDPOINT,\n",
+ " }\n",
+ " )\n",
+ " return data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0ab7588-f839-4325-a4cb-22486498884d",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "## the is_red feature marks which rows are red wine and which are white once the datasets are concatenated"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6af49215-b766-4ca4-a140-6e21f8d7ecb7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def transformdata(red_wine,white_wine):\n",
+ " red_wine['is_red'] = 1\n",
+ " white_wine['is_red'] = 0\n",
+ " data = pd.concat([red_wine, white_wine], axis=0)\n",
+ " data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)\n",
+ " return data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa3e4777-76c2-4687-a8da-6e10c18fe536",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "white_wine = read_data('winequality-white.csv')\n",
+ "red_wine = read_data('winequality-red.csv')\n",
+ "data = transformdata(red_wine, white_wine)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a8620947-ab99-43cb-9a56-128f9ff03fc5",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "data.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd056fae-7cf1-4e54-99d9-0f19c014534b",
+ "metadata": {},
+ "source": [
+ "Visualize data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cad8a920-4a11-4a5f-b60e-6eaf9d882bb2",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install seaborn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b4e1d074-f5fb-41b1-afc7-1150550db01f",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import seaborn as sns\n",
+ "sns.displot(data.quality, kde=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ee8a4a2e-3f5d-4699-a278-21b2c7c8a17f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## set type boolean"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5c06e500-1871-4c71-816a-b898cb8633d1",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def settarget(data):\n",
+ " high_quality = (data.quality >= 7).astype(int)\n",
+ " data.quality = high_quality\n",
+ " return data\n",
+ "\n",
+ "data = settarget(data)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "661daf61-7b01-4dcd-9f26-30a0dfa3c51a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import seaborn as sns\n",
+ "sns.displot(data.quality, kde=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e4fec16b-9490-46a1-9f98-fe917796f219",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## median, upper and lower quartile, IQR\n",
+ "## histogram for distribution"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2f886882-1fd1-4633-af8b-602dc90d369a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "dims = (3, 4)\n",
+ "\n",
+ "f, axes = plt.subplots(dims[0], dims[1], figsize=(25, 15))\n",
+ "axis_i, axis_j = 0, 0\n",
+ "for col in data.columns:\n",
+ "    if col == 'is_red' or col == 'quality':\n",
+ "        continue # Box plots cannot be used on indicator variables\n",
+ "    sns.boxplot(x=data['quality'], y=data[col], ax=axes[axis_i, axis_j])\n",
+ "    axis_j += 1\n",
+ "    if axis_j == dims[1]:\n",
+ "        axis_i += 1\n",
+ "        axis_j = 0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "206b8310-cd76-4012-ba6f-f2f621cd3fde",
+ "metadata": {},
+ "source": [
+ "Check for missing values"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dee4b83a-a069-4ed8-a9c5-935e45539cd3",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "## scenarios for missing data - decision for the missing data\n",
+ "## if alcohol is not an indicator, delete that record\n",
+ "\n",
+ "## what are we going to do with the outliers? are they real outliers?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "31857042-ba20-42b5-9739-6a017f6b1951",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "data.isna().any()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b89fa04-ee75-4782-ac57-b2f5f2f45239",
+ "metadata": {},
+ "source": [
+ "## Prepare dataset for training model \n",
+ "Split the input data into 3 sets:\n",
+ "\n",
+ "- Train (60% of the dataset used to train the model)\n",
+ "- Validation (20% of the dataset used to tune the hyperparameters)\n",
+ "- Test (20% of the dataset used to report the true performance of the model on an unseen dataset)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "757d77d2-d54e-4a2b-8b8d-1eba01ba61ce",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def get_trainingdata(data):\n",
+ " X = data.drop([\"quality\"], axis=1)\n",
+ " y = data.quality\n",
+ "\n",
+ " # Split out the training data\n",
+ " X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)\n",
+ "\n",
+ " # Split the remaining data equally into validation and test\n",
+ " X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)\n",
+ " return (X_train,X_val,X_test,y_train,y_val,y_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "61d6a7b2-974c-4c6c-a31f-3df48760c805",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "(X_train,X_val,X_test,y_train,y_val,y_test) = get_trainingdata(data)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cfe73322-99ac-4e6b-8ed6-c475d418e108",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Build a baseline model (random forest classifier) \n",
+ "Build a simple classifier using scikit-learn and use MLflow to track its performance. The model is scored with ROC AUC; for background on ROC curves and AUC, see \n",
+ "https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4ea89452-6ec8-43cc-923c-de443450ff35",
+ "metadata": {},
+ "source": [
+ "Enable MLflow autologging"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4b53a514-9eab-491d-9237-80448e4cea20",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "experiment_name = \"WineQuality\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6628b383-14a4-493e-9add-e6076adf6ad5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Point MLflow at the tracking server; set_experiment creates the experiment if it does not exist yet\n",
+ "mlflow.set_tracking_uri(\"http://mlflow:5500\")\n",
+ "mlflow.set_experiment(experiment_name)\n",
+ "\n",
+ "# enable autologging\n",
+ "mlflow.sklearn.autolog(log_input_examples=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4f9e8cf7-0458-49d1-9dfa-a3061bbc00d4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def log_featureimportance(model):\n",
+ " tmpdir = tempfile.mkdtemp()\n",
+ " filepath = os.path.join(tmpdir, 'feature_importance.json')\n",
+ " feature_importances = pd.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])\n",
+ " feature_importances.sort_values('importance', ascending=False).to_json(filepath)\n",
+ " mlflow.log_artifact(filepath)\n",
+ " return"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "01bdefde-fc6a-4e7d-8c68-649528667fd4",
+ "metadata": {},
+ "source": [
+ "Train random forest"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f2832df-baa7-4125-8ef3-681517dbe8b0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class SklearnModelWrapper(mlflow.pyfunc.PythonModel):\n",
+ "    def __init__(self, model):\n",
+ "        self.model = model\n",
+ "\n",
+ "    def predict(self, context, model_input):\n",
+ "        return self.model.predict_proba(model_input)[:,1]\n",
+ "\n",
+ "def train_randomforest(X_train,y_train,X_test,y_test):\n",
+ "\n",
+ "    with mlflow.start_run(run_name='untuned_random_forest'):\n",
+ "        n_estimators = 10\n",
+ "        model = RandomForestClassifier(n_estimators=n_estimators, random_state=np.random.RandomState(123))\n",
+ "        model.fit(X_train, y_train)\n",
+ "\n",
+ "        predictions_test = model.predict_proba(X_test)[:,1]\n",
+ "        auc_score = roc_auc_score(y_test, predictions_test)\n",
+ "        mlflow.log_param('n_estimators', n_estimators)  # log the parameter of interest\n",
+ "        mlflow.log_metric('auc', auc_score)\n",
+ "        wrappedModel = SklearnModelWrapper(model)\n",
+ "\n",
+ "        signature = infer_signature(X_train, wrappedModel.predict(None, X_train))\n",
+ "\n",
+ "        conda_env = _mlflow_conda_env(\n",
+ "            additional_conda_deps=None,\n",
+ "            additional_pip_deps=[\"cloudpickle=={}\".format(cloudpickle.__version__), \"scikit-learn=={}\".format(sklearn.__version__)],\n",
+ "            additional_conda_channels=None,\n",
+ "        )\n",
+ "        mlflow.pyfunc.log_model(\"random_forest_model\", python_model=wrappedModel, conda_env=conda_env, signature=signature)\n",
+ "        log_featureimportance(model)\n",
+ "    return model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d8957943-d0dc-4803-aab5-6c6ceb8ba34d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = train_randomforest(X_train,y_train,X_test,y_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c8cdc21-a1df-429b-8be3-3e4d1ea5ff2d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sanity-check: This should match the AUC logged by MLflow\n",
+ "print(f'AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:,1])}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6eba2f19-7755-412e-bc67-2e586817582c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sanity-check: This should match the feature importance logged by MLflow\n",
+ "feature_importances = pd.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])\n",
+ "feature_importances.sort_values('importance', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4206c08-5021-43dd-a11d-e7e7f687ebef",
+ "metadata": {},
+ "source": [
+ "## Experiment with a new model (xgboost) \n",
+ "Use the xgboost library to train a more accurate model. Run hyperparameter tuning to train multiple models. As before, the code tracks the performance of each parameter configuration with MLflow."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "84cf68d6-86a4-4daa-927b-b4dc53f8cc9d",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "search_space = {\n",
+ "    'max_depth': scope.int(hp.quniform('max_depth', 50, 100, 10)),\n",
+ "    'learning_rate': hp.loguniform('learning_rate', -3, 0),\n",
+ "    'reg_alpha': hp.loguniform('reg_alpha', -5, -1),\n",
+ "    'reg_lambda': hp.loguniform('reg_lambda', -6, -1),\n",
+ "    'min_child_weight': hp.loguniform('min_child_weight', -1, 3),\n",
+ "    'objective': 'binary:logistic',\n",
+ "    'seed': 123, # Set a seed for deterministic training\n",
+ "}\n",
+ "\n",
+ "def train_model(params):\n",
+ "\n",
+ "    mlflow.xgboost.autolog()\n",
+ "    with mlflow.start_run(nested=True):\n",
+ "        train = xgb.DMatrix(data=X_train, label=y_train)\n",
+ "        validation = xgb.DMatrix(data=X_val, label=y_val)\n",
+ "\n",
+ "        booster = xgb.train(params=params, dtrain=train, num_boost_round=100,\n",
+ "                            evals=[(validation, \"validation\")], early_stopping_rounds=50)\n",
+ "        validation_predictions = booster.predict(validation)\n",
+ "        auc_score = roc_auc_score(y_val, validation_predictions)\n",
+ "        mlflow.log_metric('auc', auc_score)  # log the validation AUC for this trial\n",
+ "\n",
+ "        signature = infer_signature(X_train, booster.predict(train))\n",
+ "        mlflow.xgboost.log_model(booster, \"model\", signature=signature)\n",
+ "\n",
+ "        return {'status': STATUS_OK, 'loss': -1*auc_score, 'booster': booster.attributes()}\n",
+ "\n",
+ "with mlflow.start_run(run_name='xgboost_models'):\n",
+ "    best_params = fmin(\n",
+ "        fn=train_model,\n",
+ "        space=search_space,\n",
+ "        algo=tpe.suggest,\n",
+ "        max_evals=10,\n",
+ "    )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12fb65af-5545-4563-a70a-ad62d11e6615",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "best_run = mlflow.search_runs(order_by=['metrics.auc DESC']).iloc[0]\n",
+ "best_run_id = best_run[\"run_id\"]\n",
+ "print(f'AUC of Best Run: {best_run[\"metrics.auc\"]}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2df971e0-1748-489c-b36d-dd481c211a0d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "best_run_id"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f57f0bc-ec52-463c-9241-0bf897465b1b",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Predict "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "db412034-daab-4942-b3ae-0c5410a3e5a5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# model = mlflow.pyfunc.load_model(\"models:/WineQuality/production\")\n",
+ "model = mlflow.pyfunc.load_model(\"runs:/\" + best_run_id + \"/model\")\n",
+ "\n",
+ "test_predictions = model.predict(X_test)\n",
+ "print(f'AUC: {roc_auc_score(y_test, test_predictions)}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b5593581-c563-4c1a-aa80-f10d77f53209",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "class_labels = ['low quality', 'high quality']  # target is the binarised quality (0/1), not wine colour\n",
+ "test_predictions = np.where(test_predictions>0.5, 1, 0)\n",
+ "print(classification_report(y_test, test_predictions, target_names=class_labels))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0ebe7ec-a7e2-49bc-88bb-2b5ea79f3807",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cm = confusion_matrix(y_test, test_predictions)\n",
+ "disp = ConfusionMatrixDisplay(confusion_matrix=cm)\n",
+ "\n",
+ "disp.plot()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d9f40cc8-9a79-4521-9783-6a3aa0b0127a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# register the best model\n",
+ "new_model_version = mlflow.register_model(f\"runs:/{best_run_id}/model\", \"WineQuality\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d6d78d2a-563d-41de-a2a9-1dbddbdcb3cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # Promote the new model version to Production\n",
+ "# client.transition_model_version_stage(\n",
+ "#     name=\"WineQuality\",\n",
+ "#     version=new_model_version.version,\n",
+ "#     stage=\"Production\"\n",
+ "# )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d112aed5-fe65-47b3-aff3-899a96010bdf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # clean up models\n",
+ "# from mlflow.tracking import MlflowClient\n",
+ "# client = MlflowClient()\n",
+ "# client.delete_registered_model(name=\"WineQuality\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3.9",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.16"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/demo_project.ipynb b/demo_project.ipynb
index d6698e2..7eace38 100644
--- a/demo_project.ipynb
+++ b/demo_project.ipynb
@@ -36,209 +36,25 @@
"id": "d73dd05b-5b58-4587-a1df-02a9c96d1996",
"metadata": {},
"source": [
- "Before import packages, install packages as is required by requirements.txt
\n",
+ "Before importing packages, install any that are required.\n",
"Any pypi packages can be installed
"
]
},
{
"cell_type": "code",
- "execution_count": 44,
+ "execution_count": null,
"id": "e3744f2b-4d07-4f10-84f0-1daa05cb8573",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Collecting s3fs\n",
- " Downloading s3fs-2024.6.0-py3-none-any.whl (29 kB)\n",
- "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /opt/app-root/lib/python3.9/site-packages (from s3fs) (3.9.5)\n",
- "Collecting aiobotocore<3.0.0,>=2.5.4\n",
- " Downloading aiobotocore-2.13.0-py3-none-any.whl (76 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.6/76.6 kB\u001b[0m \u001b[31m9.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting fsspec==2024.6.0.*\n",
- " Downloading fsspec-2024.6.0-py3-none-any.whl (176 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m176.9/176.9 kB\u001b[0m \u001b[31m128.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: botocore<1.34.107,>=1.34.70 in /opt/app-root/lib/python3.9/site-packages (from aiobotocore<3.0.0,>=2.5.4->s3fs) (1.34.89)\n",
- "Collecting aioitertools<1.0.0,>=0.5.1\n",
- " Downloading aioitertools-0.11.0-py3-none-any.whl (23 kB)\n",
- "Requirement already satisfied: wrapt<2.0.0,>=1.10.10 in /opt/app-root/lib/python3.9/site-packages (from aiobotocore<3.0.0,>=2.5.4->s3fs) (1.16.0)\n",
- "Requirement already satisfied: attrs>=17.3.0 in /opt/app-root/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (23.2.0)\n",
- "Requirement already satisfied: aiosignal>=1.1.2 in /opt/app-root/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (1.3.1)\n",
- "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/app-root/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (6.0.5)\n",
- "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/app-root/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (1.9.4)\n",
- "Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/app-root/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (4.0.3)\n",
- "Requirement already satisfied: frozenlist>=1.1.1 in /opt/app-root/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (1.4.1)\n",
- "Requirement already satisfied: typing_extensions>=4.0 in /opt/app-root/lib/python3.9/site-packages (from aioitertools<1.0.0,>=0.5.1->aiobotocore<3.0.0,>=2.5.4->s3fs) (4.11.0)\n",
- "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/app-root/lib/python3.9/site-packages (from botocore<1.34.107,>=1.34.70->aiobotocore<3.0.0,>=2.5.4->s3fs) (1.0.1)\n",
- "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/app-root/lib/python3.9/site-packages (from botocore<1.34.107,>=1.34.70->aiobotocore<3.0.0,>=2.5.4->s3fs) (1.26.18)\n",
- "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/app-root/lib/python3.9/site-packages (from botocore<1.34.107,>=1.34.70->aiobotocore<3.0.0,>=2.5.4->s3fs) (2.9.0.post0)\n",
- "Requirement already satisfied: idna>=2.0 in /opt/app-root/lib/python3.9/site-packages (from yarl<2.0,>=1.0->aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (3.7)\n",
- "Requirement already satisfied: six>=1.5 in /opt/app-root/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.34.107,>=1.34.70->aiobotocore<3.0.0,>=2.5.4->s3fs) (1.16.0)\n",
- "Installing collected packages: fsspec, aioitertools, aiobotocore, s3fs\n",
- " Attempting uninstall: fsspec\n",
- " Found existing installation: fsspec 2024.3.1\n",
- " Uninstalling fsspec-2024.3.1:\n",
- " Successfully uninstalled fsspec-2024.3.1\n",
- "Successfully installed aiobotocore-2.13.0 aioitertools-0.11.0 fsspec-2024.6.0 s3fs-2024.6.0\n",
- "\n",
- "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
- "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
- ]
- }
- ],
- "source": [
- "!pip install s3fs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "37a5e193-27e4-4fc2-b315-72fbaf745bd1",
- "metadata": {
- "tags": []
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Collecting hyperopt\n",
- " Downloading hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m18.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
- "\u001b[?25hCollecting cloudpickle\n",
- " Downloading cloudpickle-3.0.0-py3-none-any.whl (20 kB)\n",
- "Collecting mlflow\n",
- " Downloading mlflow-2.13.2-py3-none-any.whl (25.0 MB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m25.0/25.0 MB\u001b[0m \u001b[31m100.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
- "\u001b[?25hCollecting xgboost\n",
- " Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m297.1/297.1 MB\u001b[0m \u001b[31m81.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: tqdm in /opt/app-root/lib/python3.9/site-packages (from hyperopt) (4.66.2)\n",
- "Collecting py4j\n",
- " Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m200.5/200.5 kB\u001b[0m \u001b[31m153.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: numpy in /opt/app-root/lib/python3.9/site-packages (from hyperopt) (1.26.4)\n",
- "Requirement already satisfied: six in /opt/app-root/lib/python3.9/site-packages (from hyperopt) (1.16.0)\n",
- "Requirement already satisfied: networkx>=2.2 in /opt/app-root/lib/python3.9/site-packages (from hyperopt) (3.2.1)\n",
- "Requirement already satisfied: scipy in /opt/app-root/lib/python3.9/site-packages (from hyperopt) (1.12.0)\n",
- "Collecting future\n",
- " Downloading future-1.0.0-py3-none-any.whl (491 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m491.3/491.3 kB\u001b[0m \u001b[31m168.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: docker<8,>=4.0.0 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (7.0.0)\n",
- "Collecting querystring-parser<2\n",
- " Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)\n",
- "Collecting alembic!=1.10.0,<2\n",
- " Downloading alembic-1.13.1-py3-none-any.whl (233 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m233.4/233.4 kB\u001b[0m \u001b[31m99.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: Jinja2<4,>=2.11 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (3.1.3)\n",
- "Requirement already satisfied: entrypoints<1 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (0.4)\n",
- "Requirement already satisfied: scikit-learn<2 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (1.4.2)\n",
- "Collecting sqlalchemy<3,>=1.4.0\n",
- " Downloading SQLAlchemy-2.0.30-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m120.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: cachetools<6,>=5.0.0 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (5.3.3)\n",
- "Requirement already satisfied: protobuf<5,>=3.12.0 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (4.25.3)\n",
- "Collecting opentelemetry-sdk<3,>=1.0.0\n",
- " Downloading opentelemetry_sdk-1.25.0-py3-none-any.whl (107 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m107.0/107.0 kB\u001b[0m \u001b[31m116.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: click<9,>=7.0 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (8.1.7)\n",
- "Requirement already satisfied: gitpython<4,>=3.1.9 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (3.1.43)\n",
- "Collecting Flask<4\n",
- " Downloading flask-3.0.3-py3-none-any.whl (101 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m101.7/101.7 kB\u001b[0m \u001b[31m180.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: pytz<2025 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (2024.1)\n",
- "Collecting graphene<4\n",
- " Downloading graphene-3.3-py2.py3-none-any.whl (128 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m128.2/128.2 kB\u001b[0m \u001b[31m140.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: pyyaml<7,>=5.1 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (6.0.1)\n",
- "Collecting pyarrow<16,>=4.0.0\n",
- " Downloading pyarrow-15.0.2-cp39-cp39-manylinux_2_28_x86_64.whl (38.3 MB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.3/38.3 MB\u001b[0m \u001b[31m113.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
- "\u001b[?25hCollecting sqlparse<1,>=0.4.0\n",
- " Downloading sqlparse-0.5.0-py3-none-any.whl (43 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.0/44.0 kB\u001b[0m \u001b[31m128.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting opentelemetry-api<3,>=1.0.0\n",
- " Downloading opentelemetry_api-1.25.0-py3-none-any.whl (59 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m59.9/59.9 kB\u001b[0m \u001b[31m89.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting markdown<4,>=3.3\n",
- " Downloading Markdown-3.6-py3-none-any.whl (105 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m105.4/105.4 kB\u001b[0m \u001b[31m100.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: matplotlib<4 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (3.8.4)\n",
- "Requirement already satisfied: pandas<3 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (2.2.2)\n",
- "Requirement already satisfied: packaging<25 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (24.0)\n",
- "Requirement already satisfied: importlib-metadata!=4.7.0,<8,>=3.7.0 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (7.1.0)\n",
- "Requirement already satisfied: requests<3,>=2.17.3 in /opt/app-root/lib/python3.9/site-packages (from mlflow) (2.31.0)\n",
- "Collecting gunicorn<23\n",
- " Downloading gunicorn-22.0.0-py3-none-any.whl (84 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.4/84.4 kB\u001b[0m \u001b[31m152.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting Mako\n",
- " Downloading Mako-1.3.5-py3-none-any.whl (78 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.6/78.6 kB\u001b[0m \u001b[31m103.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: typing-extensions>=4 in /opt/app-root/lib/python3.9/site-packages (from alembic!=1.10.0,<2->mlflow) (4.11.0)\n",
- "Requirement already satisfied: urllib3>=1.26.0 in /opt/app-root/lib/python3.9/site-packages (from docker<8,>=4.0.0->mlflow) (1.26.18)\n",
- "Collecting Werkzeug>=3.0.0\n",
- " Downloading werkzeug-3.0.3-py3-none-any.whl (227 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.3/227.3 kB\u001b[0m \u001b[31m177.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting itsdangerous>=2.1.2\n",
- " Downloading itsdangerous-2.2.0-py3-none-any.whl (16 kB)\n",
- "Collecting blinker>=1.6.2\n",
- " Downloading blinker-1.8.2-py3-none-any.whl (9.5 kB)\n",
- "Requirement already satisfied: gitdb<5,>=4.0.1 in /opt/app-root/lib/python3.9/site-packages (from gitpython<4,>=3.1.9->mlflow) (4.0.11)\n",
- "Collecting graphql-core<3.3,>=3.1\n",
- " Downloading graphql_core-3.2.3-py3-none-any.whl (202 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m202.9/202.9 kB\u001b[0m \u001b[31m184.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting aniso8601<10,>=8\n",
- " Downloading aniso8601-9.0.1-py2.py3-none-any.whl (52 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m52.8/52.8 kB\u001b[0m \u001b[31m151.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hCollecting graphql-relay<3.3,>=3.1\n",
- " Downloading graphql_relay-3.2.0-py3-none-any.whl (16 kB)\n",
- "Requirement already satisfied: zipp>=0.5 in /opt/app-root/lib/python3.9/site-packages (from importlib-metadata!=4.7.0,<8,>=3.7.0->mlflow) (3.18.1)\n",
- "Requirement already satisfied: MarkupSafe>=2.0 in /opt/app-root/lib/python3.9/site-packages (from Jinja2<4,>=2.11->mlflow) (2.1.5)\n",
- "Requirement already satisfied: fonttools>=4.22.0 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (4.51.0)\n",
- "Requirement already satisfied: pillow>=8 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (10.3.0)\n",
- "Requirement already satisfied: pyparsing>=2.3.1 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (3.1.2)\n",
- "Requirement already satisfied: kiwisolver>=1.3.1 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (1.4.5)\n",
- "Requirement already satisfied: cycler>=0.10 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (0.12.1)\n",
- "Requirement already satisfied: importlib-resources>=3.2.0 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (6.4.0)\n",
- "Requirement already satisfied: contourpy>=1.0.1 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (1.2.1)\n",
- "Requirement already satisfied: python-dateutil>=2.7 in /opt/app-root/lib/python3.9/site-packages (from matplotlib<4->mlflow) (2.9.0.post0)\n",
- "Requirement already satisfied: deprecated>=1.2.6 in /opt/app-root/lib/python3.9/site-packages (from opentelemetry-api<3,>=1.0.0->mlflow) (1.2.14)\n",
- "Collecting opentelemetry-semantic-conventions==0.46b0\n",
- " Downloading opentelemetry_semantic_conventions-0.46b0-py3-none-any.whl (130 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m130.5/130.5 kB\u001b[0m \u001b[31m116.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: tzdata>=2022.7 in /opt/app-root/lib/python3.9/site-packages (from pandas<3->mlflow) (2024.1)\n",
- "Requirement already satisfied: idna<4,>=2.5 in /opt/app-root/lib/python3.9/site-packages (from requests<3,>=2.17.3->mlflow) (3.7)\n",
- "Requirement already satisfied: certifi>=2017.4.17 in /opt/app-root/lib/python3.9/site-packages (from requests<3,>=2.17.3->mlflow) (2024.2.2)\n",
- "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/app-root/lib/python3.9/site-packages (from requests<3,>=2.17.3->mlflow) (3.3.2)\n",
- "Requirement already satisfied: joblib>=1.2.0 in /opt/app-root/lib/python3.9/site-packages (from scikit-learn<2->mlflow) (1.4.0)\n",
- "Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/app-root/lib/python3.9/site-packages (from scikit-learn<2->mlflow) (3.4.0)\n",
- "Collecting greenlet!=0.4.17\n",
- " Downloading greenlet-3.0.3-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (614 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m614.3/614.3 kB\u001b[0m \u001b[31m103.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: wrapt<2,>=1.10 in /opt/app-root/lib/python3.9/site-packages (from deprecated>=1.2.6->opentelemetry-api<3,>=1.0.0->mlflow) (1.16.0)\n",
- "Requirement already satisfied: smmap<6,>=3.0.1 in /opt/app-root/lib/python3.9/site-packages (from gitdb<5,>=4.0.1->gitpython<4,>=3.1.9->mlflow) (5.0.1)\n",
- "Installing collected packages: py4j, aniso8601, Werkzeug, sqlparse, querystring-parser, pyarrow, Mako, itsdangerous, gunicorn, greenlet, graphql-core, future, cloudpickle, blinker, xgboost, sqlalchemy, opentelemetry-api, markdown, hyperopt, graphql-relay, Flask, opentelemetry-semantic-conventions, graphene, alembic, opentelemetry-sdk, mlflow\n",
- " Attempting uninstall: pyarrow\n",
- " Found existing installation: pyarrow 16.0.0\n",
- " Uninstalling pyarrow-16.0.0:\n",
- " Successfully uninstalled pyarrow-16.0.0\n",
- "Successfully installed Flask-3.0.3 Mako-1.3.5 Werkzeug-3.0.3 alembic-1.13.1 aniso8601-9.0.1 blinker-1.8.2 cloudpickle-3.0.0 future-1.0.0 graphene-3.3 graphql-core-3.2.3 graphql-relay-3.2.0 greenlet-3.0.3 gunicorn-22.0.0 hyperopt-0.2.7 itsdangerous-2.2.0 markdown-3.6 mlflow-2.13.2 opentelemetry-api-1.25.0 opentelemetry-sdk-1.25.0 opentelemetry-semantic-conventions-0.46b0 py4j-0.10.9.7 pyarrow-15.0.2 querystring-parser-1.2.4 sqlalchemy-2.0.30 sqlparse-0.5.0 xgboost-2.0.3\n",
- "\n",
- "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
- "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
- "!pip install hyperopt cloudpickle mlflow xgboost"
+ "!pip install s3fs hyperopt cloudpickle mlflow xgboost"
]
},
{
"cell_type": "code",
- "execution_count": 22,
+ "execution_count": null,
"id": "fa3a0921-36d0-4b26-8378-cd5adab57fb8",
"metadata": {
"tags": []
@@ -280,7 +96,7 @@
},
{
"cell_type": "code",
- "execution_count": 23,
+ "execution_count": null,
"id": "6b6d8a33-66ff-492d-b85e-0e73587d93f8",
"metadata": {
"tags": []
@@ -342,7 +158,7 @@
},
{
"cell_type": "code",
- "execution_count": 54,
+ "execution_count": null,
"id": "f2aa0f94-5b37-4873-9467-44e6703af9c8",
"metadata": {
"tags": []
@@ -357,7 +173,7 @@
},
{
"cell_type": "code",
- "execution_count": 25,
+ "execution_count": null,
"id": "47ae3876-523d-47f1-b895-e26bfb65977f",
"metadata": {
"tags": []
@@ -371,66 +187,13 @@
},
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": null,
"id": "4d5bb280-9473-46a0-b575-dfa64f9872cf",
"metadata": {
"scrolled": true,
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'ResponseMetadata': {'RequestId': '17D8412F76BD3277',\n",
- " 'HostId': 'dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8',\n",
- " 'HTTPStatusCode': 200,\n",
- " 'HTTPHeaders': {'accept-ranges': 'bytes',\n",
- " 'content-length': '1265',\n",
- " 'content-security-policy': 'block-all-mixed-content',\n",
- " 'content-type': 'application/xml',\n",
- " 'server': 'MinIO',\n",
- " 'strict-transport-security': 'max-age=31536000; includeSubDomains',\n",
- " 'vary': 'Origin, Accept-Encoding',\n",
- " 'x-amz-id-2': 'dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8',\n",
- " 'x-amz-request-id': '17D8412F76BD3277',\n",
- " 'x-content-type-options': 'nosniff',\n",
- " 'x-xss-protection': '1; mode=block',\n",
- " 'date': 'Wed, 12 Jun 2024 12:22:09 GMT'},\n",
- " 'RetryAttempts': 0},\n",
- " 'IsTruncated': False,\n",
- " 'Marker': '',\n",
- " 'Contents': [{'Key': 'wine_quality.csv',\n",
- " 'LastModified': datetime.datetime(2024, 6, 12, 11, 43, 12, 11000, tzinfo=tzlocal()),\n",
- " 'ETag': '\"17fbffe83c746612cc247b182e9f7278\"',\n",
- " 'Size': 264425,\n",
- " 'StorageClass': 'STANDARD',\n",
- " 'Owner': {'DisplayName': 'minio',\n",
- " 'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'}},\n",
- " {'Key': 'winequality-red.csv',\n",
- " 'LastModified': datetime.datetime(2024, 6, 12, 11, 55, 5, 250000, tzinfo=tzlocal()),\n",
- " 'ETag': '\"2daeecee174368f8a33b82c8cccae3a5\"',\n",
- " 'Size': 84199,\n",
- " 'StorageClass': 'STANDARD',\n",
- " 'Owner': {'DisplayName': 'minio',\n",
- " 'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'}},\n",
- " {'Key': 'winequality-white.csv',\n",
- " 'LastModified': datetime.datetime(2024, 6, 12, 11, 55, 5, 250000, tzinfo=tzlocal()),\n",
- " 'ETag': '\"5d9ff0f7f716dace19e3ab4578775fd7\"',\n",
- " 'Size': 264426,\n",
- " 'StorageClass': 'STANDARD',\n",
- " 'Owner': {'DisplayName': 'minio',\n",
- " 'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'}}],\n",
- " 'Name': 'data',\n",
- " 'Prefix': '',\n",
- " 'MaxKeys': 1000,\n",
- " 'EncodingType': 'url'}"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
"s3_client.list_objects(Bucket='data')"
]
@@ -445,7 +208,7 @@
},
{
"cell_type": "code",
- "execution_count": 28,
+ "execution_count": null,
"id": "9207c9a6-bcc4-42fc-abf1-ff84c431dbbd",
"metadata": {
"tags": []
@@ -459,7 +222,7 @@
},
{
"cell_type": "code",
- "execution_count": 29,
+ "execution_count": null,
"id": "e472d265-819b-4881-9fd4-b2fda3933179",
"metadata": {
"tags": []
@@ -481,7 +244,7 @@
},
{
"cell_type": "code",
- "execution_count": 81,
+ "execution_count": null,
"id": "3131849f-2412-4468-93b2-b21e73f91aa7",
"metadata": {
"tags": []
@@ -503,7 +266,7 @@
},
{
"cell_type": "code",
- "execution_count": 82,
+ "execution_count": null,
"id": "b0ab7588-f839-4325-a4cb-22486498884d",
"metadata": {
"tags": []
@@ -515,7 +278,7 @@
},
{
"cell_type": "code",
- "execution_count": 83,
+ "execution_count": null,
"id": "6af49215-b766-4ca4-a140-6e21f8d7ecb7",
"metadata": {
"tags": []
@@ -532,7 +295,7 @@
},
{
"cell_type": "code",
- "execution_count": 84,
+ "execution_count": null,
"id": "fa3e4777-76c2-4687-a8da-6e10c18fe536",
"metadata": {
"tags": []
@@ -546,161 +309,12 @@
},
{
"cell_type": "code",
- "execution_count": 85,
+ "execution_count": null,
"id": "a8620947-ab99-43cb-9a56-128f9ff03fc5",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table of the first five rows of the dataset; the same columns and values appear in the text/plain output that follows]"
- ],
- "text/plain": [
- " fixed_acidity volatile_acidity citric_acid residual_sugar chlorides \\\n",
- "0 7.4 0.70 0.00 1.9 0.076 \n",
- "1 7.8 0.88 0.00 2.6 0.098 \n",
- "2 7.8 0.76 0.04 2.3 0.092 \n",
- "3 11.2 0.28 0.56 1.9 0.075 \n",
- "4 7.4 0.70 0.00 1.9 0.076 \n",
- "\n",
- " free_sulfur_dioxide total_sulfur_dioxide density pH sulphates \\\n",
- "0 11.0 34.0 0.9978 3.51 0.56 \n",
- "1 25.0 67.0 0.9968 3.20 0.68 \n",
- "2 15.0 54.0 0.9970 3.26 0.65 \n",
- "3 17.0 60.0 0.9980 3.16 0.58 \n",
- "4 11.0 34.0 0.9978 3.51 0.56 \n",
- "\n",
- " alcohol quality is_red \n",
- "0 9.4 5 1 \n",
- "1 9.8 5 1 \n",
- "2 9.8 5 1 \n",
- "3 9.8 6 1 \n",
- "4 9.4 5 1 "
- ]
- },
- "execution_count": 85,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
"data.head(5)"
]
@@ -715,76 +329,24 @@
},
{
"cell_type": "code",
- "execution_count": 69,
+ "execution_count": null,
"id": "cad8a920-4a11-4a5f-b60e-6eaf9d882bb2",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Collecting seaborn\n",
- " Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)\n",
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m294.9/294.9 kB\u001b[0m \u001b[31m9.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25hRequirement already satisfied: pandas>=1.2 in /opt/app-root/lib/python3.9/site-packages (from seaborn) (2.2.2)\n",
- "Requirement already satisfied: numpy!=1.24.0,>=1.20 in /opt/app-root/lib/python3.9/site-packages (from seaborn) (1.26.4)\n",
- "Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /opt/app-root/lib/python3.9/site-packages (from seaborn) (3.8.4)\n",
- "Requirement already satisfied: cycler>=0.10 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n",
- "Requirement already satisfied: pillow>=8 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.3.0)\n",
- "Requirement already satisfied: packaging>=20.0 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.0)\n",
- "Requirement already satisfied: python-dateutil>=2.7 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)\n",
- "Requirement already satisfied: fonttools>=4.22.0 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.51.0)\n",
- "Requirement already satisfied: pyparsing>=2.3.1 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)\n",
- "Requirement already satisfied: contourpy>=1.0.1 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.1)\n",
- "Requirement already satisfied: importlib-resources>=3.2.0 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (6.4.0)\n",
- "Requirement already satisfied: kiwisolver>=1.3.1 in /opt/app-root/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)\n",
- "Requirement already satisfied: tzdata>=2022.7 in /opt/app-root/lib/python3.9/site-packages (from pandas>=1.2->seaborn) (2024.1)\n",
- "Requirement already satisfied: pytz>=2020.1 in /opt/app-root/lib/python3.9/site-packages (from pandas>=1.2->seaborn) (2024.1)\n",
- "Requirement already satisfied: zipp>=3.1.0 in /opt/app-root/lib/python3.9/site-packages (from importlib-resources>=3.2.0->matplotlib!=3.6.1,>=3.4->seaborn) (3.18.1)\n",
- "Requirement already satisfied: six>=1.5 in /opt/app-root/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)\n",
- "Installing collected packages: seaborn\n",
- "Successfully installed seaborn-0.13.2\n",
- "\n",
- "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
- "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"!pip install seaborn"
]
},
{
"cell_type": "code",
- "execution_count": 86,
+ "execution_count": null,
"id": "b4e1d074-f5fb-41b1-afc7-1150550db01f",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 86,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
+ "outputs": [],
"source": [
"import seaborn as sns\n",
"sns.displot(data.quality, kde=False)"
@@ -802,7 +364,7 @@
},
{
"cell_type": "code",
- "execution_count": 87,
+ "execution_count": null,
"id": "5c06e500-1871-4c71-816a-b898cb8633d1",
"metadata": {
"tags": []
@@ -819,33 +381,12 @@
},
{
"cell_type": "code",
- "execution_count": 88,
+ "execution_count": null,
"id": "661daf61-7b01-4dcd-9f26-30a0dfa3c51a",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 88,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
+ "outputs": [],
"source": [
"import seaborn as sns\n",
"sns.displot(data.quality, kde=False)"
@@ -864,23 +405,12 @@
},
{
"cell_type": "code",
- "execution_count": 89,
+ "execution_count": null,
"id": "2f886882-1fd1-4633-af8b-602dc90d369a",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
+ "outputs": [],
"source": [
"dims = (3, 4)\n",
"\n",
@@ -906,7 +436,7 @@
},
{
"cell_type": "code",
- "execution_count": 92,
+ "execution_count": null,
"id": "dee4b83a-a069-4ed8-a9c5-935e45539cd3",
"metadata": {
"tags": []
@@ -921,36 +451,12 @@
},
{
"cell_type": "code",
- "execution_count": 91,
+ "execution_count": null,
"id": "31857042-ba20-42b5-9739-6a017f6b1951",
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "fixed_acidity False\n",
- "volatile_acidity False\n",
- "citric_acid False\n",
- "residual_sugar False\n",
- "chlorides False\n",
- "free_sulfur_dioxide False\n",
- "total_sulfur_dioxide False\n",
- "density False\n",
- "pH False\n",
- "sulphates False\n",
- "alcohol False\n",
- "quality False\n",
- "is_red False\n",
- "dtype: bool"
- ]
- },
- "execution_count": 91,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
"data.isna().any()"
]
@@ -970,7 +476,7 @@
},
{
"cell_type": "code",
- "execution_count": 95,
+ "execution_count": null,
"id": "757d77d2-d54e-4a2b-8b8d-1eba01ba61ce",
"metadata": {
"tags": []
@@ -991,7 +497,7 @@
},
{
"cell_type": "code",
- "execution_count": 96,
+ "execution_count": null,
"id": "61d6a7b2-974c-4c6c-a31f-3df48760c805",
"metadata": {
"tags": []
@@ -1023,7 +529,7 @@
},
{
"cell_type": "code",
- "execution_count": 20,
+ "execution_count": null,
"id": "4b53a514-9eab-491d-9237-80448e4cea20",
"metadata": {},
"outputs": [],
@@ -1033,7 +539,7 @@
},
{
"cell_type": "code",
- "execution_count": 21,
+ "execution_count": null,
"id": "6628b383-14a4-493e-9add-e6076adf6ad5",
"metadata": {},
"outputs": [],
@@ -1048,7 +554,7 @@
},
{
"cell_type": "code",
- "execution_count": 22,
+ "execution_count": null,
"id": "4f9e8cf7-0458-49d1-9dfa-a3061bbc00d4",
"metadata": {},
"outputs": [],
@@ -1072,7 +578,7 @@
},
{
"cell_type": "code",
- "execution_count": 23,
+ "execution_count": null,
"id": "9f2832df-baa7-4125-8ef3-681517dbe8b0",
"metadata": {},
"outputs": [],
@@ -1111,39 +617,20 @@
},
{
"cell_type": "code",
- "execution_count": 24,
+ "execution_count": null,
"id": "d8957943-d0dc-4803-aab5-6c6ceb8ba34d",
"metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "2022/10/14 04:29:17 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: \"/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\"\n",
- "2022/10/14 04:29:24 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: \"/opt/app-root/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.\"\n",
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"model = train_randomforest(X_train,y_train,X_test,y_test)"
]
},
{
"cell_type": "code",
- "execution_count": 25,
+ "execution_count": null,
"id": "6c8cdc21-a1df-429b-8be3-3e4d1ea5ff2d",
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "AUC: 0.8540300975814177\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"# Sanity-check: This should match the AUC logged by MLflow\n",
"print(f'AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:,1])}')"
@@ -1151,108 +638,10 @@
},
{
"cell_type": "code",
- "execution_count": 26,
+ "execution_count": null,
"id": "6eba2f19-7755-412e-bc67-2e586817582c",
"metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table of feature importances; the same features and values appear in the text/plain output that follows]"
- ],
- "text/plain": [
- " importance\n",
- "alcohol 0.160192\n",
- "density 0.117415\n",
- "volatile_acidity 0.093136\n",
- "chlorides 0.086618\n",
- "residual_sugar 0.082544\n",
- "free_sulfur_dioxide 0.080473\n",
- "pH 0.080212\n",
- "total_sulfur_dioxide 0.077798\n",
- "sulphates 0.075780\n",
- "citric_acid 0.071857\n",
- "fixed_acidity 0.071841\n",
- "is_red 0.002134"
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
"# Sanity-check: This should match the feature importance logged by MLflow\n",
"feature_importances = pd.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])\n",
@@ -1270,978 +659,13 @@
},
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": null,
"id": "84cf68d6-86a4-4daa-927b-b4dc53f8cc9d",
"metadata": {
"scrolled": true,
"tags": []
},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.44333 \n",
- "[1]\tvalidation-logloss:0.41397 \n",
- "[2]\tvalidation-logloss:0.42087 \n",
- "[3]\tvalidation-logloss:0.40797 \n",
- "[4]\tvalidation-logloss:0.41671 \n",
- "[5]\tvalidation-logloss:0.42035 \n",
- "[6]\tvalidation-logloss:0.42613 \n",
- "[7]\tvalidation-logloss:0.42852 \n",
- "[8]\tvalidation-logloss:0.43271 \n",
- "[9]\tvalidation-logloss:0.44060 \n",
- "[10]\tvalidation-logloss:0.44639 \n",
- "[11]\tvalidation-logloss:0.45534 \n",
- "[12]\tvalidation-logloss:0.45760 \n",
- "[13]\tvalidation-logloss:0.46125 \n",
- "[14]\tvalidation-logloss:0.46433 \n",
- "[15]\tvalidation-logloss:0.47167 \n",
- "[16]\tvalidation-logloss:0.47850 \n",
- "[17]\tvalidation-logloss:0.48394 \n",
- "[18]\tvalidation-logloss:0.48653 \n",
- "[19]\tvalidation-logloss:0.48643 \n",
- "[20]\tvalidation-logloss:0.48780 \n",
- "[21]\tvalidation-logloss:0.49005 \n",
- "[22]\tvalidation-logloss:0.49160 \n",
- "[23]\tvalidation-logloss:0.49308 \n",
- "[24]\tvalidation-logloss:0.49859 \n",
- "[25]\tvalidation-logloss:0.49639 \n",
- "[26]\tvalidation-logloss:0.49793 \n",
- "[27]\tvalidation-logloss:0.50148 \n",
- "[28]\tvalidation-logloss:0.50193 \n",
- "[29]\tvalidation-logloss:0.50504 \n",
- "[30]\tvalidation-logloss:0.50576 \n",
- "[31]\tvalidation-logloss:0.50856 \n",
- "[32]\tvalidation-logloss:0.50839 \n",
- "[33]\tvalidation-logloss:0.51065 \n",
- "[34]\tvalidation-logloss:0.51173 \n",
- "[35]\tvalidation-logloss:0.51465 \n",
- "[36]\tvalidation-logloss:0.51601 \n",
- "[37]\tvalidation-logloss:0.51602 \n",
- "[38]\tvalidation-logloss:0.51612 \n",
- "[39]\tvalidation-logloss:0.51567 \n",
- "[40]\tvalidation-logloss:0.51560 \n",
- "[41]\tvalidation-logloss:0.51591 \n",
- "[42]\tvalidation-logloss:0.51794 \n",
- "[43]\tvalidation-logloss:0.52040 \n",
- "[44]\tvalidation-logloss:0.52113 \n",
- "[45]\tvalidation-logloss:0.52005 \n",
- "[46]\tvalidation-logloss:0.52180 \n",
- "[47]\tvalidation-logloss:0.52510 \n",
- "[48]\tvalidation-logloss:0.52464 \n",
- "[49]\tvalidation-logloss:0.52464 \n",
- "[50]\tvalidation-logloss:0.52719 \n",
- "[51]\tvalidation-logloss:0.52812 \n",
- "[52]\tvalidation-logloss:0.52910 \n",
- "[53]\tvalidation-logloss:0.52952 \n",
- " 0%| | 0/10 [00:01, ?trial/s, best loss=?]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "2022/10/14 04:30:35 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: \"/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\"\n",
- "\n",
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.66207 \n",
- "[1]\tvalidation-logloss:0.63397 \n",
- "[2]\tvalidation-logloss:0.60908 \n",
- "[3]\tvalidation-logloss:0.58602 \n",
- "[4]\tvalidation-logloss:0.56529 \n",
- "[5]\tvalidation-logloss:0.54639 \n",
- "[6]\tvalidation-logloss:0.52885 \n",
- "[7]\tvalidation-logloss:0.51333 \n",
- "[8]\tvalidation-logloss:0.49891 \n",
- "[9]\tvalidation-logloss:0.48621 \n",
- "[10]\tvalidation-logloss:0.47480 \n",
- "[11]\tvalidation-logloss:0.46292 \n",
- "[12]\tvalidation-logloss:0.45166 \n",
- "[13]\tvalidation-logloss:0.44162 \n",
- "[14]\tvalidation-logloss:0.43190 \n",
- "[15]\tvalidation-logloss:0.42311 \n",
- "[16]\tvalidation-logloss:0.41422 \n",
- "[17]\tvalidation-logloss:0.40691 \n",
- "[18]\tvalidation-logloss:0.40052 \n",
- "[19]\tvalidation-logloss:0.39411 \n",
- "[20]\tvalidation-logloss:0.38866 \n",
- "[21]\tvalidation-logloss:0.38339 \n",
- "[22]\tvalidation-logloss:0.37805 \n",
- "[23]\tvalidation-logloss:0.37279 \n",
- "[24]\tvalidation-logloss:0.36834 \n",
- "[25]\tvalidation-logloss:0.36418 \n",
- "[26]\tvalidation-logloss:0.36038 \n",
- "[27]\tvalidation-logloss:0.35615 \n",
- "[28]\tvalidation-logloss:0.35357 \n",
- "[29]\tvalidation-logloss:0.35025 \n",
- "[30]\tvalidation-logloss:0.34754 \n",
- "[31]\tvalidation-logloss:0.34511 \n",
- "[32]\tvalidation-logloss:0.34243 \n",
- "[33]\tvalidation-logloss:0.34034 \n",
- "[34]\tvalidation-logloss:0.33777 \n",
- "[35]\tvalidation-logloss:0.33605 \n",
- "[36]\tvalidation-logloss:0.33432 \n",
- "[37]\tvalidation-logloss:0.33272 \n",
- "[38]\tvalidation-logloss:0.33087 \n",
- "[39]\tvalidation-logloss:0.32931 \n",
- "[40]\tvalidation-logloss:0.32804 \n",
- "[41]\tvalidation-logloss:0.32656 \n",
- "[42]\tvalidation-logloss:0.32520 \n",
- "[43]\tvalidation-logloss:0.32441 \n",
- "[44]\tvalidation-logloss:0.32317 \n",
- "[45]\tvalidation-logloss:0.32262 \n",
- "[46]\tvalidation-logloss:0.32225 \n",
- "[47]\tvalidation-logloss:0.32154 \n",
- "[48]\tvalidation-logloss:0.32064 \n",
- "[49]\tvalidation-logloss:0.31889 \n",
- "[50]\tvalidation-logloss:0.31792 \n",
- "[51]\tvalidation-logloss:0.31717 \n",
- "[52]\tvalidation-logloss:0.31751 \n",
- "[53]\tvalidation-logloss:0.31688 \n",
- "[54]\tvalidation-logloss:0.31658 \n",
- "[55]\tvalidation-logloss:0.31562 \n",
- "[56]\tvalidation-logloss:0.31531 \n",
- "[57]\tvalidation-logloss:0.31475 \n",
- "[58]\tvalidation-logloss:0.31417 \n",
- "[59]\tvalidation-logloss:0.31378 \n",
- "[60]\tvalidation-logloss:0.31314 \n",
- "[61]\tvalidation-logloss:0.31257 \n",
- "[62]\tvalidation-logloss:0.31203 \n",
- "[63]\tvalidation-logloss:0.31142 \n",
- "[64]\tvalidation-logloss:0.31103 \n",
- "[65]\tvalidation-logloss:0.31064 \n",
- "[66]\tvalidation-logloss:0.31028 \n",
- "[67]\tvalidation-logloss:0.31011 \n",
- "[68]\tvalidation-logloss:0.30948 \n",
- "[69]\tvalidation-logloss:0.30930 \n",
- "[70]\tvalidation-logloss:0.30910 \n",
- "[71]\tvalidation-logloss:0.30885 \n",
- "[72]\tvalidation-logloss:0.30935 \n",
- "[73]\tvalidation-logloss:0.30907 \n",
- "[74]\tvalidation-logloss:0.30869 \n",
- "[75]\tvalidation-logloss:0.30880 \n",
- "[76]\tvalidation-logloss:0.30869 \n",
- "[77]\tvalidation-logloss:0.30907 \n",
- "[78]\tvalidation-logloss:0.30869 \n",
- "[79]\tvalidation-logloss:0.30863 \n",
- "[80]\tvalidation-logloss:0.30873 \n",
- "[81]\tvalidation-logloss:0.30889 \n",
- "[82]\tvalidation-logloss:0.30847 \n",
- "[83]\tvalidation-logloss:0.30833 \n",
- "[84]\tvalidation-logloss:0.30839 \n",
- "[85]\tvalidation-logloss:0.30865 \n",
- "[86]\tvalidation-logloss:0.30851 \n",
- "[87]\tvalidation-logloss:0.30872 \n",
- "[88]\tvalidation-logloss:0.30883 \n",
- "[89]\tvalidation-logloss:0.30864 \n",
- "[90]\tvalidation-logloss:0.30878 \n",
- "[91]\tvalidation-logloss:0.30871 \n",
- "[92]\tvalidation-logloss:0.30883 \n",
- "[93]\tvalidation-logloss:0.30873 \n",
- "[94]\tvalidation-logloss:0.30853 \n",
- "[95]\tvalidation-logloss:0.30827 \n",
- "[96]\tvalidation-logloss:0.30851 \n",
- "[97]\tvalidation-logloss:0.30842 \n",
- "[98]\tvalidation-logloss:0.30886 \n",
- "[99]\tvalidation-logloss:0.30880 \n",
- " 10%|█ | 1/10 [00:15<01:51, 12.40s/trial, best loss: -0.8633680208124874]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.64928 \n",
- "[1]\tvalidation-logloss:0.61253 \n",
- "[2]\tvalidation-logloss:0.57966 \n",
- "[3]\tvalidation-logloss:0.55328 \n",
- "[4]\tvalidation-logloss:0.52902 \n",
- "[5]\tvalidation-logloss:0.50749 \n",
- "[6]\tvalidation-logloss:0.48774 \n",
- "[7]\tvalidation-logloss:0.47102 \n",
- "[8]\tvalidation-logloss:0.45586 \n",
- "[9]\tvalidation-logloss:0.44157 \n",
- "[10]\tvalidation-logloss:0.42889 \n",
- "[11]\tvalidation-logloss:0.41840 \n",
- "[12]\tvalidation-logloss:0.40951 \n",
- "[13]\tvalidation-logloss:0.40049 \n",
- "[14]\tvalidation-logloss:0.39269 \n",
- "[15]\tvalidation-logloss:0.38599 \n",
- "[16]\tvalidation-logloss:0.37907 \n",
- "[17]\tvalidation-logloss:0.37259 \n",
- "[18]\tvalidation-logloss:0.36764 \n",
- "[19]\tvalidation-logloss:0.36193 \n",
- "[20]\tvalidation-logloss:0.35649 \n",
- "[21]\tvalidation-logloss:0.35234 \n",
- "[22]\tvalidation-logloss:0.34860 \n",
- "[23]\tvalidation-logloss:0.34556 \n",
- "[24]\tvalidation-logloss:0.34262 \n",
- "[25]\tvalidation-logloss:0.34042 \n",
- "[26]\tvalidation-logloss:0.33767 \n",
- "[27]\tvalidation-logloss:0.33534 \n",
- "[28]\tvalidation-logloss:0.33396 \n",
- "[29]\tvalidation-logloss:0.33259 \n",
- "[30]\tvalidation-logloss:0.33078 \n",
- "[31]\tvalidation-logloss:0.32946 \n",
- "[32]\tvalidation-logloss:0.32694 \n",
- "[33]\tvalidation-logloss:0.32502 \n",
- "[34]\tvalidation-logloss:0.32331 \n",
- "[35]\tvalidation-logloss:0.32178 \n",
- "[36]\tvalidation-logloss:0.32090 \n",
- "[37]\tvalidation-logloss:0.31960 \n",
- "[38]\tvalidation-logloss:0.31903 \n",
- "[39]\tvalidation-logloss:0.31756 \n",
- "[40]\tvalidation-logloss:0.31628 \n",
- "[41]\tvalidation-logloss:0.31559 \n",
- "[42]\tvalidation-logloss:0.31449 \n",
- "[43]\tvalidation-logloss:0.31376 \n",
- "[44]\tvalidation-logloss:0.31333 \n",
- "[45]\tvalidation-logloss:0.31340 \n",
- "[46]\tvalidation-logloss:0.31240 \n",
- "[47]\tvalidation-logloss:0.31235 \n",
- "[48]\tvalidation-logloss:0.31115 \n",
- "[49]\tvalidation-logloss:0.31095 \n",
- "[50]\tvalidation-logloss:0.31065 \n",
- "[51]\tvalidation-logloss:0.31105 \n",
- "[52]\tvalidation-logloss:0.31076 \n",
- "[53]\tvalidation-logloss:0.31034 \n",
- "[54]\tvalidation-logloss:0.31015 \n",
- "[55]\tvalidation-logloss:0.30969 \n",
- "[56]\tvalidation-logloss:0.30955 \n",
- "[57]\tvalidation-logloss:0.30945 \n",
- "[58]\tvalidation-logloss:0.30866 \n",
- "[59]\tvalidation-logloss:0.30923 \n",
- "[60]\tvalidation-logloss:0.30889 \n",
- "[61]\tvalidation-logloss:0.30846 \n",
- "[62]\tvalidation-logloss:0.30836 \n",
- "[63]\tvalidation-logloss:0.30789 \n",
- "[64]\tvalidation-logloss:0.30792 \n",
- "[65]\tvalidation-logloss:0.30808 \n",
- "[66]\tvalidation-logloss:0.30767 \n",
- "[67]\tvalidation-logloss:0.30790 \n",
- "[68]\tvalidation-logloss:0.30765 \n",
- "[69]\tvalidation-logloss:0.30719 \n",
- "[70]\tvalidation-logloss:0.30717 \n",
- "[71]\tvalidation-logloss:0.30725 \n",
- "[72]\tvalidation-logloss:0.30755 \n",
- "[73]\tvalidation-logloss:0.30846 \n",
- "[74]\tvalidation-logloss:0.30815 \n",
- "[75]\tvalidation-logloss:0.30848 \n",
- "[76]\tvalidation-logloss:0.30848 \n",
- "[77]\tvalidation-logloss:0.30869 \n",
- "[78]\tvalidation-logloss:0.30909 \n",
- "[79]\tvalidation-logloss:0.30936 \n",
- "[80]\tvalidation-logloss:0.31006 \n",
- "[81]\tvalidation-logloss:0.30963 \n",
- "[82]\tvalidation-logloss:0.30952 \n",
- "[83]\tvalidation-logloss:0.30970 \n",
- "[84]\tvalidation-logloss:0.30997 \n",
- "[85]\tvalidation-logloss:0.31026 \n",
- "[86]\tvalidation-logloss:0.30983 \n",
- "[87]\tvalidation-logloss:0.31007 \n",
- "[88]\tvalidation-logloss:0.31046 \n",
- "[89]\tvalidation-logloss:0.31053 \n",
- "[90]\tvalidation-logloss:0.31083 \n",
- "[91]\tvalidation-logloss:0.31128 \n",
- "[92]\tvalidation-logloss:0.31136 \n",
- "[93]\tvalidation-logloss:0.31143 \n",
- "[94]\tvalidation-logloss:0.31128 \n",
- "[95]\tvalidation-logloss:0.31190 \n",
- "[96]\tvalidation-logloss:0.31220 \n",
- "[97]\tvalidation-logloss:0.31199 \n",
- "[98]\tvalidation-logloss:0.31199 \n",
- "[99]\tvalidation-logloss:0.31237 \n",
- " 20%|██ | 2/10 [00:30<01:49, 13.65s/trial, best loss: -0.8922007050384075]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.44452 \n",
- "[1]\tvalidation-logloss:0.38512 \n",
- "[2]\tvalidation-logloss:0.37306 \n",
- "[3]\tvalidation-logloss:0.35673 \n",
- "[4]\tvalidation-logloss:0.35491 \n",
- "[5]\tvalidation-logloss:0.35783 \n",
- "[6]\tvalidation-logloss:0.36192 \n",
- "[7]\tvalidation-logloss:0.35998 \n",
- "[8]\tvalidation-logloss:0.36303 \n",
- "[9]\tvalidation-logloss:0.36328 \n",
- "[10]\tvalidation-logloss:0.36600 \n",
- "[11]\tvalidation-logloss:0.36452 \n",
- "[12]\tvalidation-logloss:0.36587 \n",
- "[13]\tvalidation-logloss:0.37022 \n",
- "[14]\tvalidation-logloss:0.37104 \n",
- "[15]\tvalidation-logloss:0.37906 \n",
- "[16]\tvalidation-logloss:0.38189 \n",
- "[17]\tvalidation-logloss:0.38283 \n",
- "[18]\tvalidation-logloss:0.38164 \n",
- "[19]\tvalidation-logloss:0.38464 \n",
- "[20]\tvalidation-logloss:0.38333 \n",
- "[21]\tvalidation-logloss:0.38800 \n",
- "[22]\tvalidation-logloss:0.39003 \n",
- "[23]\tvalidation-logloss:0.39583 \n",
- "[24]\tvalidation-logloss:0.39788 \n",
- "[25]\tvalidation-logloss:0.39817 \n",
- "[26]\tvalidation-logloss:0.40219 \n",
- "[27]\tvalidation-logloss:0.40261 \n",
- "[28]\tvalidation-logloss:0.40122 \n",
- "[29]\tvalidation-logloss:0.40181 \n",
- "[30]\tvalidation-logloss:0.40343 \n",
- "[31]\tvalidation-logloss:0.40646 \n",
- "[32]\tvalidation-logloss:0.40479 \n",
- "[33]\tvalidation-logloss:0.40610 \n",
- "[34]\tvalidation-logloss:0.40858 \n",
- "[35]\tvalidation-logloss:0.41033 \n",
- "[36]\tvalidation-logloss:0.41255 \n",
- "[37]\tvalidation-logloss:0.41407 \n",
- "[38]\tvalidation-logloss:0.41448 \n",
- "[39]\tvalidation-logloss:0.41266 \n",
- "[40]\tvalidation-logloss:0.41274 \n",
- "[41]\tvalidation-logloss:0.41709 \n",
- "[42]\tvalidation-logloss:0.41994 \n",
- "[43]\tvalidation-logloss:0.42096 \n",
- "[44]\tvalidation-logloss:0.42298 \n",
- "[45]\tvalidation-logloss:0.42435 \n",
- "[46]\tvalidation-logloss:0.42436 \n",
- "[47]\tvalidation-logloss:0.42337 \n",
- "[48]\tvalidation-logloss:0.42300 \n",
- "[49]\tvalidation-logloss:0.42483 \n",
- "[50]\tvalidation-logloss:0.42523 \n",
- "[51]\tvalidation-logloss:0.42426 \n",
- "[52]\tvalidation-logloss:0.42530 \n",
- "[53]\tvalidation-logloss:0.42766 \n",
- " 30%|███ | 3/10 [00:42<01:37, 13.93s/trial, best loss: -0.8922007050384075]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.56695 \n",
- "[1]\tvalidation-logloss:0.49378 \n",
- "[2]\tvalidation-logloss:0.44155 \n",
- "[3]\tvalidation-logloss:0.40086 \n",
- "[4]\tvalidation-logloss:0.37521 \n",
- "[5]\tvalidation-logloss:0.35622 \n",
- "[6]\tvalidation-logloss:0.34453 \n",
- "[7]\tvalidation-logloss:0.33476 \n",
- "[8]\tvalidation-logloss:0.32688 \n",
- "[9]\tvalidation-logloss:0.32048 \n",
- "[10]\tvalidation-logloss:0.31849 \n",
- "[11]\tvalidation-logloss:0.31509 \n",
- "[12]\tvalidation-logloss:0.31343 \n",
- "[13]\tvalidation-logloss:0.31198 \n",
- "[14]\tvalidation-logloss:0.31275 \n",
- "[15]\tvalidation-logloss:0.31377 \n",
- "[16]\tvalidation-logloss:0.31565 \n",
- "[17]\tvalidation-logloss:0.31619 \n",
- "[18]\tvalidation-logloss:0.31828 \n",
- "[19]\tvalidation-logloss:0.31830 \n",
- "[20]\tvalidation-logloss:0.31754 \n",
- "[21]\tvalidation-logloss:0.31889 \n",
- "[22]\tvalidation-logloss:0.31802 \n",
- "[23]\tvalidation-logloss:0.31898 \n",
- "[24]\tvalidation-logloss:0.32136 \n",
- "[25]\tvalidation-logloss:0.32137 \n",
- "[26]\tvalidation-logloss:0.32283 \n",
- "[27]\tvalidation-logloss:0.32306 \n",
- "[28]\tvalidation-logloss:0.32445 \n",
- "[29]\tvalidation-logloss:0.32455 \n",
- "[30]\tvalidation-logloss:0.32481 \n",
- "[31]\tvalidation-logloss:0.32602 \n",
- "[32]\tvalidation-logloss:0.32544 \n",
- "[33]\tvalidation-logloss:0.32587 \n",
- "[34]\tvalidation-logloss:0.32652 \n",
- "[35]\tvalidation-logloss:0.32734 \n",
- "[36]\tvalidation-logloss:0.32829 \n",
- "[37]\tvalidation-logloss:0.32764 \n",
- "[38]\tvalidation-logloss:0.32824 \n",
- "[39]\tvalidation-logloss:0.32808 \n",
- "[40]\tvalidation-logloss:0.32791 \n",
- "[41]\tvalidation-logloss:0.32828 \n",
- "[42]\tvalidation-logloss:0.32913 \n",
- "[43]\tvalidation-logloss:0.32989 \n",
- "[44]\tvalidation-logloss:0.33035 \n",
- "[45]\tvalidation-logloss:0.33047 \n",
- "[46]\tvalidation-logloss:0.33107 \n",
- "[47]\tvalidation-logloss:0.33194 \n",
- "[48]\tvalidation-logloss:0.33252 \n",
- "[49]\tvalidation-logloss:0.33290 \n",
- "[50]\tvalidation-logloss:0.33255 \n",
- "[51]\tvalidation-logloss:0.33298 \n",
- "[52]\tvalidation-logloss:0.33288 \n",
- "[53]\tvalidation-logloss:0.33393 \n",
- "[54]\tvalidation-logloss:0.33407 \n",
- "[55]\tvalidation-logloss:0.33420 \n",
- "[56]\tvalidation-logloss:0.33412 \n",
- "[57]\tvalidation-logloss:0.33457 \n",
- "[58]\tvalidation-logloss:0.33480 \n",
- "[59]\tvalidation-logloss:0.33463 \n",
- "[60]\tvalidation-logloss:0.33562 \n",
- "[61]\tvalidation-logloss:0.33614 \n",
- "[62]\tvalidation-logloss:0.33670 \n",
- " 40%|████ | 4/10 [00:55<01:19, 13.23s/trial, best loss: -0.8922007050384075]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.65860 \n",
- "[1]\tvalidation-logloss:0.62888 \n",
- "[2]\tvalidation-logloss:0.60210 \n",
- "[3]\tvalidation-logloss:0.57870 \n",
- "[4]\tvalidation-logloss:0.55768 \n",
- "[5]\tvalidation-logloss:0.53812 \n",
- "[6]\tvalidation-logloss:0.52052 \n",
- "[7]\tvalidation-logloss:0.50554 \n",
- "[8]\tvalidation-logloss:0.49123 \n",
- "[9]\tvalidation-logloss:0.47832 \n",
- "[10]\tvalidation-logloss:0.46684 \n",
- "[11]\tvalidation-logloss:0.45579 \n",
- "[12]\tvalidation-logloss:0.44639 \n",
- "[13]\tvalidation-logloss:0.43718 \n",
- "[14]\tvalidation-logloss:0.42907 \n",
- "[15]\tvalidation-logloss:0.42136 \n",
- "[16]\tvalidation-logloss:0.41442 \n",
- "[17]\tvalidation-logloss:0.40782 \n",
- "[18]\tvalidation-logloss:0.40120 \n",
- "[19]\tvalidation-logloss:0.39528 \n",
- "[20]\tvalidation-logloss:0.39027 \n",
- "[21]\tvalidation-logloss:0.38543 \n",
- "[22]\tvalidation-logloss:0.38061 \n",
- "[23]\tvalidation-logloss:0.37629 \n",
- "[24]\tvalidation-logloss:0.37283 \n",
- "[25]\tvalidation-logloss:0.36860 \n",
- "[26]\tvalidation-logloss:0.36539 \n",
- "[27]\tvalidation-logloss:0.36233 \n",
- "[28]\tvalidation-logloss:0.35868 \n",
- "[29]\tvalidation-logloss:0.35576 \n",
- "[30]\tvalidation-logloss:0.35282 \n",
- "[31]\tvalidation-logloss:0.35055 \n",
- "[32]\tvalidation-logloss:0.34898 \n",
- "[33]\tvalidation-logloss:0.34718 \n",
- "[34]\tvalidation-logloss:0.34523 \n",
- "[35]\tvalidation-logloss:0.34337 \n",
- "[36]\tvalidation-logloss:0.34115 \n",
- "[37]\tvalidation-logloss:0.33884 \n",
- "[38]\tvalidation-logloss:0.33745 \n",
- "[39]\tvalidation-logloss:0.33592 \n",
- "[40]\tvalidation-logloss:0.33454 \n",
- "[41]\tvalidation-logloss:0.33341 \n",
- "[42]\tvalidation-logloss:0.33188 \n",
- "[43]\tvalidation-logloss:0.33123 \n",
- "[44]\tvalidation-logloss:0.33081 \n",
- "[45]\tvalidation-logloss:0.33005 \n",
- "[46]\tvalidation-logloss:0.32927 \n",
- "[47]\tvalidation-logloss:0.32827 \n",
- "[48]\tvalidation-logloss:0.32771 \n",
- "[49]\tvalidation-logloss:0.32638 \n",
- "[50]\tvalidation-logloss:0.32534 \n",
- "[51]\tvalidation-logloss:0.32447 \n",
- "[52]\tvalidation-logloss:0.32348 \n",
- "[53]\tvalidation-logloss:0.32239 \n",
- "[54]\tvalidation-logloss:0.32164 \n",
- "[55]\tvalidation-logloss:0.32145 \n",
- "[56]\tvalidation-logloss:0.32054 \n",
- "[57]\tvalidation-logloss:0.32011 \n",
- "[58]\tvalidation-logloss:0.31933 \n",
- "[59]\tvalidation-logloss:0.31893 \n",
- "[60]\tvalidation-logloss:0.31832 \n",
- "[61]\tvalidation-logloss:0.31776 \n",
- "[62]\tvalidation-logloss:0.31691 \n",
- "[63]\tvalidation-logloss:0.31653 \n",
- "[64]\tvalidation-logloss:0.31624 \n",
- "[65]\tvalidation-logloss:0.31597 \n",
- "[66]\tvalidation-logloss:0.31526 \n",
- "[67]\tvalidation-logloss:0.31500 \n",
- "[68]\tvalidation-logloss:0.31523 \n",
- "[69]\tvalidation-logloss:0.31553 \n",
- "[70]\tvalidation-logloss:0.31504 \n",
- "[71]\tvalidation-logloss:0.31470 \n",
- "[72]\tvalidation-logloss:0.31405 \n",
- "[73]\tvalidation-logloss:0.31320 \n",
- "[74]\tvalidation-logloss:0.31287 \n",
- "[75]\tvalidation-logloss:0.31275 \n",
- "[76]\tvalidation-logloss:0.31279 \n",
- "[77]\tvalidation-logloss:0.31260 \n",
- "[78]\tvalidation-logloss:0.31188 \n",
- "[79]\tvalidation-logloss:0.31245 \n",
- "[80]\tvalidation-logloss:0.31228 \n",
- "[81]\tvalidation-logloss:0.31216 \n",
- "[82]\tvalidation-logloss:0.31200 \n",
- "[83]\tvalidation-logloss:0.31176 \n",
- "[84]\tvalidation-logloss:0.31213 \n",
- "[85]\tvalidation-logloss:0.31218 \n",
- "[86]\tvalidation-logloss:0.31140 \n",
- "[87]\tvalidation-logloss:0.31108 \n",
- "[88]\tvalidation-logloss:0.31158 \n",
- "[89]\tvalidation-logloss:0.31125 \n",
- "[90]\tvalidation-logloss:0.31184 \n",
- "[91]\tvalidation-logloss:0.31166 \n",
- "[92]\tvalidation-logloss:0.31194 \n",
- "[93]\tvalidation-logloss:0.31184 \n",
- "[94]\tvalidation-logloss:0.31168 \n",
- "[95]\tvalidation-logloss:0.31159 \n",
- "[96]\tvalidation-logloss:0.31162 \n",
- "[97]\tvalidation-logloss:0.31183 \n",
- "[98]\tvalidation-logloss:0.31187 \n",
- "[99]\tvalidation-logloss:0.31173 \n",
- " 50%|█████ | 5/10 [01:09<01:05, 13.16s/trial, best loss: -0.8990124844137253]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.42932 \n",
- "[1]\tvalidation-logloss:0.39324 \n",
- "[2]\tvalidation-logloss:0.40056 \n",
- "[3]\tvalidation-logloss:0.41010 \n",
- "[4]\tvalidation-logloss:0.40816 \n",
- "[5]\tvalidation-logloss:0.41112 \n",
- "[6]\tvalidation-logloss:0.42020 \n",
- "[7]\tvalidation-logloss:0.42298 \n",
- "[8]\tvalidation-logloss:0.42552 \n",
- "[9]\tvalidation-logloss:0.42105 \n",
- "[10]\tvalidation-logloss:0.42624 \n",
- "[11]\tvalidation-logloss:0.43340 \n",
- "[12]\tvalidation-logloss:0.43680 \n",
- "[13]\tvalidation-logloss:0.44089 \n",
- "[14]\tvalidation-logloss:0.43846 \n",
- "[15]\tvalidation-logloss:0.43967 \n",
- "[16]\tvalidation-logloss:0.43642 \n",
- "[17]\tvalidation-logloss:0.43892 \n",
- "[18]\tvalidation-logloss:0.44035 \n",
- "[19]\tvalidation-logloss:0.44177 \n",
- "[20]\tvalidation-logloss:0.44254 \n",
- "[21]\tvalidation-logloss:0.44911 \n",
- "[22]\tvalidation-logloss:0.44750 \n",
- "[23]\tvalidation-logloss:0.44964 \n",
- "[24]\tvalidation-logloss:0.45131 \n",
- "[25]\tvalidation-logloss:0.44954 \n",
- "[26]\tvalidation-logloss:0.45093 \n",
- "[27]\tvalidation-logloss:0.45663 \n",
- "[28]\tvalidation-logloss:0.45510 \n",
- "[29]\tvalidation-logloss:0.45451 \n",
- "[30]\tvalidation-logloss:0.45456 \n",
- "[31]\tvalidation-logloss:0.45562 \n",
- "[32]\tvalidation-logloss:0.45818 \n",
- "[33]\tvalidation-logloss:0.45927 \n",
- "[34]\tvalidation-logloss:0.46194 \n",
- "[35]\tvalidation-logloss:0.46172 \n",
- "[36]\tvalidation-logloss:0.46339 \n",
- "[37]\tvalidation-logloss:0.46478 \n",
- "[38]\tvalidation-logloss:0.46743 \n",
- "[39]\tvalidation-logloss:0.46883 \n",
- "[40]\tvalidation-logloss:0.46953 \n",
- "[41]\tvalidation-logloss:0.47075 \n",
- "[42]\tvalidation-logloss:0.46811 \n",
- "[43]\tvalidation-logloss:0.46913 \n",
- "[44]\tvalidation-logloss:0.46751 \n",
- "[45]\tvalidation-logloss:0.47039 \n",
- "[46]\tvalidation-logloss:0.47111 \n",
- "[47]\tvalidation-logloss:0.47470 \n",
- "[48]\tvalidation-logloss:0.47745 \n",
- "[49]\tvalidation-logloss:0.47837 \n",
- "[50]\tvalidation-logloss:0.47768 \n",
- " 60%|██████ | 6/10 [01:22<00:54, 13.74s/trial, best loss: -0.8990124844137253]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.61724 \n",
- "[1]\tvalidation-logloss:0.56129 \n",
- "[2]\tvalidation-logloss:0.51815 \n",
- "[3]\tvalidation-logloss:0.48310 \n",
- "[4]\tvalidation-logloss:0.45328 \n",
- "[5]\tvalidation-logloss:0.42723 \n",
- "[6]\tvalidation-logloss:0.40678 \n",
- "[7]\tvalidation-logloss:0.39061 \n",
- "[8]\tvalidation-logloss:0.37696 \n",
- "[9]\tvalidation-logloss:0.36382 \n",
- "[10]\tvalidation-logloss:0.35407 \n",
- "[11]\tvalidation-logloss:0.34475 \n",
- "[12]\tvalidation-logloss:0.33907 \n",
- "[13]\tvalidation-logloss:0.33296 \n",
- "[14]\tvalidation-logloss:0.32818 \n",
- "[15]\tvalidation-logloss:0.32523 \n",
- "[16]\tvalidation-logloss:0.32203 \n",
- "[17]\tvalidation-logloss:0.32108 \n",
- "[18]\tvalidation-logloss:0.31959 \n",
- "[19]\tvalidation-logloss:0.31742 \n",
- "[20]\tvalidation-logloss:0.31573 \n",
- "[21]\tvalidation-logloss:0.31460 \n",
- "[22]\tvalidation-logloss:0.31394 \n",
- "[23]\tvalidation-logloss:0.31415 \n",
- "[24]\tvalidation-logloss:0.31475 \n",
- "[25]\tvalidation-logloss:0.31461 \n",
- "[26]\tvalidation-logloss:0.31400 \n",
- "[27]\tvalidation-logloss:0.31506 \n",
- "[28]\tvalidation-logloss:0.31547 \n",
- "[29]\tvalidation-logloss:0.31747 \n",
- "[30]\tvalidation-logloss:0.31789 \n",
- "[31]\tvalidation-logloss:0.31839 \n",
- "[32]\tvalidation-logloss:0.31780 \n",
- "[33]\tvalidation-logloss:0.31845 \n",
- "[34]\tvalidation-logloss:0.31856 \n",
- "[35]\tvalidation-logloss:0.32030 \n",
- "[36]\tvalidation-logloss:0.32136 \n",
- "[37]\tvalidation-logloss:0.32147 \n",
- "[38]\tvalidation-logloss:0.32228 \n",
- "[39]\tvalidation-logloss:0.32241 \n",
- "[40]\tvalidation-logloss:0.32333 \n",
- "[41]\tvalidation-logloss:0.32391 \n",
- "[42]\tvalidation-logloss:0.32416 \n",
- "[43]\tvalidation-logloss:0.32444 \n",
- "[44]\tvalidation-logloss:0.32452 \n",
- "[45]\tvalidation-logloss:0.32469 \n",
- "[46]\tvalidation-logloss:0.32542 \n",
- "[47]\tvalidation-logloss:0.32518 \n",
- "[48]\tvalidation-logloss:0.32577 \n",
- "[49]\tvalidation-logloss:0.32633 \n",
- "[50]\tvalidation-logloss:0.32756 \n",
- "[51]\tvalidation-logloss:0.32771 \n",
- "[52]\tvalidation-logloss:0.32859 \n",
- "[53]\tvalidation-logloss:0.33004 \n",
- "[54]\tvalidation-logloss:0.33124 \n",
- "[55]\tvalidation-logloss:0.33118 \n",
- "[56]\tvalidation-logloss:0.33230 \n",
- "[57]\tvalidation-logloss:0.33298 \n",
- "[58]\tvalidation-logloss:0.33374 \n",
- "[59]\tvalidation-logloss:0.33370 \n",
- "[60]\tvalidation-logloss:0.33393 \n",
- "[61]\tvalidation-logloss:0.33462 \n",
- "[62]\tvalidation-logloss:0.33478 \n",
- "[63]\tvalidation-logloss:0.33518 \n",
- "[64]\tvalidation-logloss:0.33584 \n",
- "[65]\tvalidation-logloss:0.33648 \n",
- "[66]\tvalidation-logloss:0.33599 \n",
- "[67]\tvalidation-logloss:0.33655 \n",
- "[68]\tvalidation-logloss:0.33690 \n",
- "[69]\tvalidation-logloss:0.33802 \n",
- "[70]\tvalidation-logloss:0.33882 \n",
- "[71]\tvalidation-logloss:0.33947 \n",
- "[72]\tvalidation-logloss:0.34012 \n",
- " 70%|███████ | 7/10 [01:36<00:39, 13.28s/trial, best loss: -0.8990124844137253]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.65597 \n",
- "[1]\tvalidation-logloss:0.62338 \n",
- "[2]\tvalidation-logloss:0.59350 \n",
- "[3]\tvalidation-logloss:0.56914 \n",
- "[4]\tvalidation-logloss:0.54644 \n",
- "[5]\tvalidation-logloss:0.52693 \n",
- "[6]\tvalidation-logloss:0.50854 \n",
- "[7]\tvalidation-logloss:0.49154 \n",
- "[8]\tvalidation-logloss:0.47700 \n",
- "[9]\tvalidation-logloss:0.46381 \n",
- "[10]\tvalidation-logloss:0.45050 \n",
- "[11]\tvalidation-logloss:0.43849 \n",
- "[12]\tvalidation-logloss:0.42775 \n",
- "[13]\tvalidation-logloss:0.41765 \n",
- "[14]\tvalidation-logloss:0.40835 \n",
- "[15]\tvalidation-logloss:0.40034 \n",
- "[16]\tvalidation-logloss:0.39337 \n",
- "[17]\tvalidation-logloss:0.38604 \n",
- "[18]\tvalidation-logloss:0.38021 \n",
- "[19]\tvalidation-logloss:0.37427 \n",
- "[20]\tvalidation-logloss:0.36930 \n",
- "[21]\tvalidation-logloss:0.36503 \n",
- "[22]\tvalidation-logloss:0.36102 \n",
- "[23]\tvalidation-logloss:0.35654 \n",
- "[24]\tvalidation-logloss:0.35270 \n",
- "[25]\tvalidation-logloss:0.34936 \n",
- "[26]\tvalidation-logloss:0.34621 \n",
- "[27]\tvalidation-logloss:0.34330 \n",
- "[28]\tvalidation-logloss:0.33954 \n",
- "[29]\tvalidation-logloss:0.33645 \n",
- "[30]\tvalidation-logloss:0.33361 \n",
- "[31]\tvalidation-logloss:0.33187 \n",
- "[32]\tvalidation-logloss:0.33028 \n",
- "[33]\tvalidation-logloss:0.32814 \n",
- "[34]\tvalidation-logloss:0.32791 \n",
- "[35]\tvalidation-logloss:0.32613 \n",
- "[36]\tvalidation-logloss:0.32440 \n",
- "[37]\tvalidation-logloss:0.32372 \n",
- "[38]\tvalidation-logloss:0.32295 \n",
- "[39]\tvalidation-logloss:0.32168 \n",
- "[40]\tvalidation-logloss:0.32060 \n",
- "[41]\tvalidation-logloss:0.31976 \n",
- "[42]\tvalidation-logloss:0.31886 \n",
- "[43]\tvalidation-logloss:0.31840 \n",
- "[44]\tvalidation-logloss:0.31685 \n",
- "[45]\tvalidation-logloss:0.31590 \n",
- "[46]\tvalidation-logloss:0.31450 \n",
- "[47]\tvalidation-logloss:0.31419 \n",
- "[48]\tvalidation-logloss:0.31348 \n",
- "[49]\tvalidation-logloss:0.31344 \n",
- "[50]\tvalidation-logloss:0.31316 \n",
- "[51]\tvalidation-logloss:0.31244 \n",
- "[52]\tvalidation-logloss:0.31219 \n",
- "[53]\tvalidation-logloss:0.31124 \n",
- "[54]\tvalidation-logloss:0.31058 \n",
- "[55]\tvalidation-logloss:0.31009 \n",
- "[56]\tvalidation-logloss:0.30970 \n",
- "[57]\tvalidation-logloss:0.30989 \n",
- "[58]\tvalidation-logloss:0.30933 \n",
- "[59]\tvalidation-logloss:0.30969 \n",
- "[60]\tvalidation-logloss:0.30924 \n",
- "[61]\tvalidation-logloss:0.30882 \n",
- "[62]\tvalidation-logloss:0.30780 \n",
- "[63]\tvalidation-logloss:0.30772 \n",
- "[64]\tvalidation-logloss:0.30751 \n",
- "[65]\tvalidation-logloss:0.30751 \n",
- "[66]\tvalidation-logloss:0.30702 \n",
- "[67]\tvalidation-logloss:0.30709 \n",
- "[68]\tvalidation-logloss:0.30667 \n",
- "[69]\tvalidation-logloss:0.30674 \n",
- "[70]\tvalidation-logloss:0.30659 \n",
- "[71]\tvalidation-logloss:0.30628 \n",
- "[72]\tvalidation-logloss:0.30622 \n",
- "[73]\tvalidation-logloss:0.30611 \n",
- "[74]\tvalidation-logloss:0.30597 \n",
- "[75]\tvalidation-logloss:0.30635 \n",
- "[76]\tvalidation-logloss:0.30636 \n",
- "[77]\tvalidation-logloss:0.30573 \n",
- "[78]\tvalidation-logloss:0.30526 \n",
- "[79]\tvalidation-logloss:0.30570 \n",
- "[80]\tvalidation-logloss:0.30604 \n",
- "[81]\tvalidation-logloss:0.30574 \n",
- "[82]\tvalidation-logloss:0.30548 \n",
- "[83]\tvalidation-logloss:0.30558 \n",
- "[84]\tvalidation-logloss:0.30559 \n",
- "[85]\tvalidation-logloss:0.30583 \n",
- "[86]\tvalidation-logloss:0.30603 \n",
- "[87]\tvalidation-logloss:0.30640 \n",
- "[88]\tvalidation-logloss:0.30655 \n",
- "[89]\tvalidation-logloss:0.30676 \n",
- "[90]\tvalidation-logloss:0.30665 \n",
- "[91]\tvalidation-logloss:0.30703 \n",
- "[92]\tvalidation-logloss:0.30689 \n",
- "[93]\tvalidation-logloss:0.30701 \n",
- "[94]\tvalidation-logloss:0.30730 \n",
- "[95]\tvalidation-logloss:0.30752 \n",
- "[96]\tvalidation-logloss:0.30731 \n",
- "[97]\tvalidation-logloss:0.30750 \n",
- "[98]\tvalidation-logloss:0.30737 \n",
- "[99]\tvalidation-logloss:0.30738 \n",
- " 80%|████████ | 8/10 [01:51<00:27, 13.65s/trial, best loss: -0.8990124844137253]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[0]\tvalidation-logloss:0.65702 \n",
- "[1]\tvalidation-logloss:0.62577 \n",
- "[2]\tvalidation-logloss:0.59710 \n",
- "[3]\tvalidation-logloss:0.57311 \n",
- "[4]\tvalidation-logloss:0.55046 \n",
- "[5]\tvalidation-logloss:0.53164 \n",
- "[6]\tvalidation-logloss:0.51396 \n",
- "[7]\tvalidation-logloss:0.49871 \n",
- "[8]\tvalidation-logloss:0.48454 \n",
- "[9]\tvalidation-logloss:0.47101 \n",
- "[10]\tvalidation-logloss:0.45840 \n",
- "[11]\tvalidation-logloss:0.44730 \n",
- "[12]\tvalidation-logloss:0.43741 \n",
- "[13]\tvalidation-logloss:0.42839 \n",
- "[14]\tvalidation-logloss:0.41996 \n",
- "[15]\tvalidation-logloss:0.41202 \n",
- "[16]\tvalidation-logloss:0.40536 \n",
- "[17]\tvalidation-logloss:0.39882 \n",
- "[18]\tvalidation-logloss:0.39293 \n",
- "[19]\tvalidation-logloss:0.38739 \n",
- "[20]\tvalidation-logloss:0.38258 \n",
- "[21]\tvalidation-logloss:0.37814 \n",
- "[22]\tvalidation-logloss:0.37355 \n",
- "[23]\tvalidation-logloss:0.36959 \n",
- "[24]\tvalidation-logloss:0.36643 \n",
- "[25]\tvalidation-logloss:0.36290 \n",
- "[26]\tvalidation-logloss:0.35995 \n",
- "[27]\tvalidation-logloss:0.35751 \n",
- "[28]\tvalidation-logloss:0.35510 \n",
- "[29]\tvalidation-logloss:0.35303 \n",
- "[30]\tvalidation-logloss:0.35114 \n",
- "[31]\tvalidation-logloss:0.34883 \n",
- "[32]\tvalidation-logloss:0.34649 \n",
- "[33]\tvalidation-logloss:0.34441 \n",
- "[34]\tvalidation-logloss:0.34323 \n",
- "[35]\tvalidation-logloss:0.34123 \n",
- "[36]\tvalidation-logloss:0.33954 \n",
- "[37]\tvalidation-logloss:0.33859 \n",
- "[38]\tvalidation-logloss:0.33794 \n",
- "[39]\tvalidation-logloss:0.33690 \n",
- "[40]\tvalidation-logloss:0.33546 \n",
- "[41]\tvalidation-logloss:0.33384 \n",
- "[42]\tvalidation-logloss:0.33290 \n",
- "[43]\tvalidation-logloss:0.33248 \n",
- "[44]\tvalidation-logloss:0.33231 \n",
- "[45]\tvalidation-logloss:0.33139 \n",
- "[46]\tvalidation-logloss:0.33070 \n",
- "[47]\tvalidation-logloss:0.32985 \n",
- "[48]\tvalidation-logloss:0.32950 \n",
- "[49]\tvalidation-logloss:0.32878 \n",
- "[50]\tvalidation-logloss:0.32798 \n",
- "[51]\tvalidation-logloss:0.32706 \n",
- "[52]\tvalidation-logloss:0.32602 \n",
- "[53]\tvalidation-logloss:0.32508 \n",
- "[54]\tvalidation-logloss:0.32440 \n",
- "[55]\tvalidation-logloss:0.32343 \n",
- "[56]\tvalidation-logloss:0.32260 \n",
- "[57]\tvalidation-logloss:0.32177 \n",
- "[58]\tvalidation-logloss:0.32132 \n",
- "[59]\tvalidation-logloss:0.32068 \n",
- "[60]\tvalidation-logloss:0.31944 \n",
- "[61]\tvalidation-logloss:0.31925 \n",
- "[62]\tvalidation-logloss:0.31873 \n",
- "[63]\tvalidation-logloss:0.31814 \n",
- "[64]\tvalidation-logloss:0.31807 \n",
- "[65]\tvalidation-logloss:0.31760 \n",
- "[66]\tvalidation-logloss:0.31730 \n",
- "[67]\tvalidation-logloss:0.31684 \n",
- "[68]\tvalidation-logloss:0.31663 \n",
- "[69]\tvalidation-logloss:0.31657 \n",
- "[70]\tvalidation-logloss:0.31650 \n",
- "[71]\tvalidation-logloss:0.31569 \n",
- "[72]\tvalidation-logloss:0.31466 \n",
- "[73]\tvalidation-logloss:0.31462 \n",
- "[74]\tvalidation-logloss:0.31426 \n",
- "[75]\tvalidation-logloss:0.31395 \n",
- "[76]\tvalidation-logloss:0.31369 \n",
- "[77]\tvalidation-logloss:0.31315 \n",
- "[78]\tvalidation-logloss:0.31290 \n",
- "[79]\tvalidation-logloss:0.31303 \n",
- "[80]\tvalidation-logloss:0.31275 \n",
- "[81]\tvalidation-logloss:0.31255 \n",
- "[82]\tvalidation-logloss:0.31278 \n",
- "[83]\tvalidation-logloss:0.31270 \n",
- "[84]\tvalidation-logloss:0.31266 \n",
- "[85]\tvalidation-logloss:0.31258 \n",
- "[86]\tvalidation-logloss:0.31252 \n",
- "[87]\tvalidation-logloss:0.31239 \n",
- "[88]\tvalidation-logloss:0.31204 \n",
- "[89]\tvalidation-logloss:0.31141 \n",
- "[90]\tvalidation-logloss:0.31104 \n",
- "[91]\tvalidation-logloss:0.31091 \n",
- "[92]\tvalidation-logloss:0.31090 \n",
- "[93]\tvalidation-logloss:0.31114 \n",
- "[94]\tvalidation-logloss:0.31106 \n",
- "[95]\tvalidation-logloss:0.31056 \n",
- "[96]\tvalidation-logloss:0.31044 \n",
- "[97]\tvalidation-logloss:0.31022 \n",
- "[98]\tvalidation-logloss:0.31044 \n",
- "[99]\tvalidation-logloss:0.31037 \n",
- " 90%|█████████ | 9/10 [02:05<00:13, 13.90s/trial, best loss: -0.8990124844137253]"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/app-root/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n",
- " inputs = _infer_schema(model_input)\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "100%|██████████| 10/10 [02:17<00:00, 13.72s/trial, best loss: -0.8990124844137253]\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"search_space = {\n",
" 'max_depth': scope.int(hp.quniform('max_depth', 50, 100, 10)),\n",
@@ -2282,18 +706,10 @@
},
{
"cell_type": "code",
- "execution_count": 28,
+ "execution_count": null,
"id": "12fb65af-5545-4563-a70a-ad62d11e6615",
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "AUC of Best Run: 0.8990124844137253\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"best_run = mlflow.search_runs(order_by=['metrics.auc DESC']).iloc[0]\n",
"best_run_id = best_run[\"run_id\"]\n",
@@ -2302,21 +718,10 @@
},
{
"cell_type": "code",
- "execution_count": 29,
+ "execution_count": null,
"id": "2df971e0-1748-489c-b36d-dd481c211a0d",
"metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'fbd116d8519d4119b32f0d7b7d56c980'"
- ]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
"best_run_id"
]
@@ -2333,18 +738,10 @@
},
{
"cell_type": "code",
- "execution_count": 30,
+ "execution_count": null,
"id": "db412034-daab-4942-b3ae-0c5410a3e5a5",
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "AUC: 0.903522359913793\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"# model = mlflow.pyfunc.load_model(f\"models:/TestModelD/production\")\n",
"model = mlflow.pyfunc.load_model(\"runs:/\" + best_run_id + \"/model\")\n",
@@ -2355,26 +752,10 @@
},
{
"cell_type": "code",
- "execution_count": 31,
+ "execution_count": null,
"id": "b5593581-c563-4c1a-aa80-f10d77f53209",
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- " precision recall f1-score support\n",
- "\n",
- " white wine 0.90 0.95 0.92 1044\n",
- " red wine 0.73 0.57 0.64 256\n",
- "\n",
- " accuracy 0.87 1300\n",
- " macro avg 0.81 0.76 0.78 1300\n",
- "weighted avg 0.87 0.87 0.87 1300\n",
- "\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
@@ -2385,23 +766,10 @@
},
{
"cell_type": "code",
- "execution_count": 32,
+ "execution_count": null,
"id": "b0ebe7ec-a7e2-49bc-88bb-2b5ea79f3807",
"metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "