RFM Segmentation (#680)

* init rfm_segments func
* TODOs
* docstrings and for loop
* docstrings and for loop
* WIP dev notebook debugging
* checkpoint commit for remote pull
* code testing in dev notebook
* unit tests added
* dev notebook cleanup
* clean up type hints
* comments and code cleanup
* docstrings
* move formatting to rfm_summary and quickstart edits
* fix rfm_train_test_split bug
* added test for rfm_quartile_labels
* added rfm score warning
pymc-labs · May 28, 2024 · b763c12 · coweal · Jun 12, 2024 · coweal
1 parent 3feec18
commit b763c12

Showing 5 changed files with 498 additions and 89 deletions.
diff --git a/docs/source/notebooks/clv/clv_quickstart.ipynb b/docs/source/notebooks/clv/clv_quickstart.ipynb
@@ -67,10 +67,10 @@
     "* `customer_id` represents a unique identifier for each customer.\n",
     "* `frequency` represents the number of _repeat_ purchases that a customer has made, i.e. one less than the total number of purchases.\n",
     "* `T` represents a customer's \"age\", i.e. the duration between a customer's first purchase and the end of the period of study. In this example notebook, the units of time are in weeks.\n",
-    "* `recency` represents the timepoint when a customer made their most recent purchase. This is also equal to the duration between a customer’s first non-repeat purchase (usually time 0) and last purchase. If a customer has made only 1 purchase, their recency is 0;\n",
+    "* `recency` represents the time period when a customer made their most recent purchase. This is equal to the duration between a customer’s first and last purchase. If a customer has made only 1 purchase, their recency is 0.\n",
     "* `monetary_value` represents the average value of a given customer’s repeat purchases. Customers who have only made a single purchase have monetary values of zero.\n",
     "\n",
-    "If working with raw transaction data, the `rfm_summary` function can be used to preprocess data for modeling:"
+    "The `rfm_summary` function can be used to preprocess raw transaction data for modeling:"
    ]
   },
   {
@@ -339,6 +339,8 @@
    "id": "514ee548",
    "metadata": {},
    "source": [
+    "Note that these definitions differ from those used in RFM segmentation, where the first purchase is included, `T` is not used, and `recency` is the number of time periods since a customer's most recent purchase.\n",
+    "\n",
     "To visualize data in RFM format, we can plot the recency and T of the customers with the `plot_customer_exposure` function. We see a large chunk (>60%) of customers haven't made another purchase in a while."
    ]
   },
@@ -2579,7 +2581,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.18"
+   "version": "3.10.14"
   },
   "toc": {
    "base_numbering": 1,
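
The quickstart text changed above contrasts two conventions for `frequency` and `recency`. A minimal pandas sketch of the difference, assuming one row per purchase and weeks as the time unit; the `transactions` frame, the `period_end` date, and the `weeks` helper are hypothetical illustrations and this is not the `rfm_summary` implementation:

```python
import pandas as pd

# Hypothetical raw transactions: one row per purchase.
transactions = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 3, 3],
        "date": pd.to_datetime(
            [
                "2015-01-01", "2015-01-15", "2015-01-29",
                "2015-01-08",
                "2015-01-01", "2015-01-08",
            ]
        ),
    }
)
period_end = pd.Timestamp("2015-02-05")  # assumed end of the study period

# First purchase, last purchase, and purchase count per customer.
grouped = transactions.groupby("id")["date"].agg(first="min", last="max", n="count")


def weeks(delta: pd.Series) -> pd.Series:
    """Convert a timedelta Series to fractional weeks."""
    return delta.dt.days / 7


# Modeling convention: repeat purchases only, recency measured from the first purchase.
modeling = pd.DataFrame(
    {
        "frequency": grouped["n"] - 1,
        "recency": weeks(grouped["last"] - grouped["first"]),
        "T": weeks(period_end - grouped["first"]),
    }
)

# Segmentation convention: first purchase counted, recency measured back from period end.
segmentation = pd.DataFrame(
    {
        "frequency": grouped["n"],
        "recency": weeks(period_end - grouped["last"]),
    }
)

print(modeling)
print(segmentation)
```

In this sketch a one-time buyer has modeling `recency` 0 regardless of when the purchase happened, while the segmentation `recency` for the same customer grows as the purchase recedes into the past.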

diff --git a/docs/source/notebooks/clv/dev/utilities_plotting.ipynb b/docs/source/notebooks/clv/dev/utilities_plotting.ipynb
@@ -5,15 +5,7 @@
    "execution_count": 1,
    "id": "435ed203-5c3c-4efc-93d1-abac66ce7187",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "from pymc_marketing.clv import utils\n",
     "\n",
@@ -30,7 +22,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 69,
+   "execution_count": 2,
    "id": "7de7f396-1d5b-4457-916b-c29ed90aa132",
    "metadata": {},
    "outputs": [],
@@ -66,7 +58,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 70,
+   "execution_count": 3,
    "id": "932e8db6-78cf-49df-aa4a-83ee6584e5dd",
    "metadata": {},
    "outputs": [
@@ -196,7 +188,7 @@
        "13   6  2015-02-02   True"
       ]
      },
-     "execution_count": 70,
+     "execution_count": 3,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -223,7 +215,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 74,
+   "execution_count": 4,
    "id": "4c0a7de5-8825-40af-84e5-6cd0ad26a0e3",
    "metadata": {},
    "outputs": [
@@ -259,57 +251,57 @@
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>1</td>\n",
-       "      <td>1.0</td>\n",
+       "      <td>2.0</td>\n",
        "      <td>5.0</td>\n",
        "      <td>5.0</td>\n",
-       "      <td>2.0</td>\n",
+       "      <td>1.5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>2</td>\n",
-       "      <td>0.0</td>\n",
+       "      <td>1.0</td>\n",
        "      <td>0.0</td>\n",
        "      <td>5.0</td>\n",
-       "      <td>0.0</td>\n",
+       "      <td>2.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>3</td>\n",
-       "      <td>1.0</td>\n",
+       "      <td>2.0</td>\n",
        "      <td>1.0</td>\n",
        "      <td>5.0</td>\n",
-       "      <td>5.0</td>\n",
+       "      <td>4.5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>4</td>\n",
-       "      <td>1.0</td>\n",
+       "      <td>2.0</td>\n",
        "      <td>3.0</td>\n",
        "      <td>3.0</td>\n",
-       "      <td>8.0</td>\n",
+       "      <td>7.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td>5</td>\n",
-       "      <td>0.0</td>\n",
+       "      <td>1.0</td>\n",
        "      <td>0.0</td>\n",
        "      <td>3.0</td>\n",
-       "      <td>0.0</td>\n",
+       "      <td>12.0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "   customer_id  frequency  recency    T  monetary_value\n",
-       "0            1        1.0      5.0  5.0             2.0\n",
-       "1            2        0.0      0.0  5.0             0.0\n",
-       "2            3        1.0      1.0  5.0             5.0\n",
-       "3            4        1.0      3.0  3.0             8.0\n",
-       "4            5        0.0      0.0  3.0             0.0"
+       "0            1        2.0      5.0  5.0             1.5\n",
+       "1            2        1.0      0.0  5.0             2.0\n",
+       "2            3        2.0      1.0  5.0             4.5\n",
+       "3            4        2.0      3.0  3.0             7.0\n",
+       "4            5        1.0      0.0  3.0            12.0"
       ]
      },
-     "execution_count": 74,
+     "execution_count": 4,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -323,7 +315,7 @@
     "    observation_period_end = \"2015-02-06\",\n",
     "    datetime_format = \"%Y-%m-%d\",\n",
     "    time_unit = \"W\",\n",
-    "    include_first_transaction=False,\n",
+    "    include_first_transaction=True,\n",
     ")\n",
     "\n",
     "rfm_df.head()"
@@ -339,7 +331,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 76,
+   "execution_count": 5,
    "id": "761edfe9-1b69-4966-83bf-4f1242eda2d5",
    "metadata": {},
    "outputs": [
@@ -450,7 +442,7 @@
        "4                  0.0     5.0  "
       ]
      },
-     "execution_count": 76,
+     "execution_count": 5,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -467,13 +459,137 @@
     "train_test.head()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "73dc1b93-6a4f-4171-b838-30759b2c1e0e",
+   "metadata": {},
+   "source": [
+    "`rfm_segments` assigns customers to segments based on their recency, frequency, and monetary value. It uses a quartile-based RFM score approach that is computationally efficient, but defining custom segments is a rather subjective exercise. The returned dataframe also cannot be used for modeling because it does not zero out the initial transactions."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 40,
    "id": "c7b3f800-8dfb-4e5a-b939-5f908281563c",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "segments = utils.rfm_segments(\n",
+    "    test_data, \n",
+    "    customer_id_col = \"id\", \n",
+    "    datetime_col = \"date\", \n",
+    "    monetary_value_col = \"monetary_value\",\n",
+    "    observation_period_end = \"2015-02-06\",\n",
+    "    datetime_format = \"%Y-%m-%d\",\n",
+    "    time_unit = \"W\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "932ac4e5-361e-42fa-97d3-d8e508128944",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>customer_id</th>\n",
+       "      <th>frequency</th>\n",
+       "      <th>recency</th>\n",
+       "      <th>monetary_value</th>\n",
+       "      <th>segment</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1</td>\n",
+       "      <td>2.0</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>Other</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>5.0</td>\n",
+       "      <td>2.0</td>\n",
+       "      <td>Inactive Customer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>3</td>\n",
+       "      <td>2.0</td>\n",
+       "      <td>4.0</td>\n",
+       "      <td>4.5</td>\n",
+       "      <td>At Risk Customer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>4</td>\n",
+       "      <td>2.0</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>7.0</td>\n",
+       "      <td>Top Spender</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>5</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>3.0</td>\n",
+       "      <td>12.0</td>\n",
+       "      <td>At Risk Customer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>6</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>5.0</td>\n",
+       "      <td>Top Spender</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   customer_id  frequency  recency  monetary_value            segment\n",
+       "0            1        2.0      0.0             1.5              Other\n",
+       "1            2        1.0      5.0             2.0  Inactive Customer\n",
+       "2            3        2.0      4.0             4.5   At Risk Customer\n",
+       "3            4        2.0      0.0             7.0        Top Spender\n",
+       "4            5        1.0      3.0            12.0   At Risk Customer\n",
+       "5            6        1.0      0.0             5.0        Top Spender"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "segments"
+   ]
   }
  ],
  "metadata": {
@@ -492,7 +608,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.18"
+   "version": "3.10.14"
   }
  },
  "nbformat": 4,
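
The markdown cell added in this notebook describes a quartile-based RFM score. A minimal sketch of that general idea using `pandas.qcut`, assuming a score of 4 is best on every dimension; the sample data, the `quartile_score` helper, and the tie-breaking choice are hypothetical and do not reproduce the internals of `rfm_segments`:

```python
import pandas as pd

# Hypothetical RFM table in the segmentation convention
# (recency = time periods since the most recent purchase).
rfm = pd.DataFrame(
    {
        "customer_id": range(1, 9),
        "recency": [0.0, 5.0, 4.0, 0.0, 3.0, 1.0, 2.0, 6.0],
        "frequency": [2.0, 1.0, 2.0, 4.0, 1.0, 3.0, 5.0, 1.0],
        "monetary_value": [1.5, 2.0, 4.5, 7.0, 12.0, 5.0, 9.0, 3.0],
    }
)


def quartile_score(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Return a 1-4 quartile score, where 4 is the best quartile."""
    # Ranking with method="first" breaks ties by row order, so qcut always
    # finds four distinct bins; tied values can land in different quartiles.
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, q=4, labels=[1, 2, 3, 4]).astype(int)


# A small recency is good (recent purchase), so its ranking is reversed.
rfm["r_score"] = quartile_score(rfm["recency"], higher_is_better=False)
rfm["f_score"] = quartile_score(rfm["frequency"])
rfm["m_score"] = quartile_score(rfm["monetary_value"])

# Concatenate the three digits into a single score string such as "443".
rfm["rfm_score"] = (
    rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
)

print(rfm[["customer_id", "r_score", "f_score", "m_score", "rfm_score"]])
```

Mapping score strings to named segments is the subjective step the cell mentions, so it is left out of the sketch.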

diff --git a/pymc_marketing/clv/__init__.py b/pymc_marketing/clv/__init__.py
@@ -25,6 +25,7 @@
 )
 from pymc_marketing.clv.utils import (
     customer_lifetime_value,
+    rfm_segments,
     rfm_summary,
     rfm_train_test_split,
 )
@@ -39,6 +40,7 @@
     "plot_customer_exposure",
     "plot_frequency_recency_matrix",
     "plot_probability_alive_matrix",
+    "rfm_segments",
     "rfm_summary",
     "rfm_train_test_split",
 )