RFM Segmentation (#680)
* init rfm_segments func

* TODOs

* docstrings and for loop

* docstrings and for loop

* WIP dev notebook debugging

* checkpoint commit for remote pull

* code testing in dev notebook

* unit tests added

* dev notebook cleanup

* clean up type hints

* comments and code cleanup

* docstrings

* move formatting to rfm_summary and quickstart edits

* fix rfm_train_test_split bug

* added test for rfm_quartile_labels

* added rfm score warning
ColtAllen authored May 28, 2024
1 parent 3feec18 commit b763c12
Showing 5 changed files with 498 additions and 89 deletions.
8 changes: 5 additions & 3 deletions docs/source/notebooks/clv/clv_quickstart.ipynb
@@ -67,10 +67,10 @@
  "* `customer_id` represents a unique identifier for each customer.\n",
  "* `frequency` represents the number of _repeat_ purchases that a customer has made, i.e. one less than the total number of purchases.\n",
  "* `T` represents a customer's \"age\", i.e. the duration between a customer's first purchase and the end of the period of study. In this example notebook, the units of time are in weeks.\n",
- "* `recency` represents the timepoint when a customer made their most recent purchase. This is also equal to the duration between a customer’s first non-repeat purchase (usually time 0) and last purchase. If a customer has made only 1 purchase, their recency is 0;\n",
+ "* `recency` represents the time period when a customer made their most recent purchase. This is equal to the duration between a customer’s first and last purchase. If a customer has made only 1 purchase, their recency is 0.\n",

coweal Jun 12, 2024

Hey! I am not sure this is a correct definition. For example, if a customer made one transaction a year ago and the period is in days, then as I understand it, recency for that customer should be one year, not zero as in the current implementation.

Another problem is with customers who make many purchases over a long period. For example, if a customer has been with the company almost from the beginning and has made many purchases over that time, with the last purchase, let's say, two days ago, their recency will be huge and equal to their whole lifespan (not two days, as I would expect)! When one runs expected_probability_alive, the probability of being alive for this customer will be equal to zero, which definitely is not correct.

It seems that recency should be equal to the difference between observation_period_end and the time of the latest purchase made by a customer.

coweal Jun 12, 2024

The same issue applies to customers with a recency equal to 0 who purchased just once, a few years ago: their expected_probability_alive equals 1.

ColtAllen Jun 12, 2024

Author Collaborator

These definitions are correct for the modeling definition of recency. The definition used in RFM segmentation is as you've described, and will be clarified in the next release. Docs for the next release are viewable here: https://www.pymc-marketing.io/en/latest/notebooks/clv/clv_quickstart.html

BetaGeoModel assumes all non-repeat customers have an alive probability of 1. If this is not a valid assumption for your use case, use ParetoNBDModel because it does not have this assumption.
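
To make the two definitions in this thread concrete, here is a minimal sketch in plain Python. It is not the `rfm_summary` or `rfm_segments` implementation; the function names `modeling_recency`, `segmentation_recency`, and `customer_age`, as well as the sample dates, are hypothetical. It computes recency both ways, in days, for the two kinds of customer described above: a one-time buyer from a year ago and a long-standing customer whose last purchase was two days ago.

```python
from datetime import date


def modeling_recency(purchases):
    """Days between a customer's first and last purchase (0 for a single purchase)."""
    return (max(purchases) - min(purchases)).days


def segmentation_recency(purchases, observation_period_end):
    """Days between the last purchase and the end of the observation period."""
    return (observation_period_end - max(purchases)).days


def customer_age(purchases, observation_period_end):
    """T: days between the first purchase and the end of the observation period."""
    return (observation_period_end - min(purchases)).days


# Hypothetical sample data, chosen to mirror the two cases raised in the thread.
observation_period_end = date(2024, 6, 12)
customers = {
    "one_time_buyer": [date(2023, 6, 13)],
    "loyal_buyer": [date(2020, 1, 1), date(2022, 3, 15), date(2024, 6, 10)],
}

for name, purchases in customers.items():
    print(
        name,
        "modeling recency:", modeling_recency(purchases),
        "segmentation recency:", segmentation_recency(purchases, observation_period_end),
        "T:", customer_age(purchases, observation_period_end),
    )
```

Under the modeling definition, it is a recency close to `T` that signals a recent purchase, whereas under the segmentation definition it is a small recency that does.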

  "* `monetary_value` represents the average value of a given customer’s repeat purchases. Customers who have only made a single purchase have monetary values of zero.\n",
  "\n",
- "If working with raw transaction data, the `rfm_summary` function can be used to preprocess data for modeling:"
+ "The `rfm_summary` function can be used to preprocess raw transaction data for modeling:"
]
},
{
@@ -339,6 +339,8 @@
  "id": "514ee548",
  "metadata": {},
  "source": [
+ "It is important to note these definitions differ from that used in RFM segmentation, where the first purchase is included, `T` is not used, and `recency` is the number of time periods since a customer's most recent purchase.\n",
+ "\n",
"To visualize data in RFM format, we can plot the recency and T of the customers with the `plot_customer_exposure` function. We see a large chunk (>60%) of customers haven't made another purchase in a while."
]
},
@@ -2579,7 +2581,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.9.18"
+ "version": "3.10.14"
},
"toc": {
"base_numbering": 1,
186 changes: 151 additions & 35 deletions docs/source/notebooks/clv/dev/utilities_plotting.ipynb
@@ -5,15 +5,7 @@
"execution_count": 1,
"id": "435ed203-5c3c-4efc-93d1-abac66ce7187",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n"
]
}
],
"outputs": [],
"source": [
"from pymc_marketing.clv import utils\n",
"\n",
@@ -30,7 +22,7 @@
},
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 2,
"id": "7de7f396-1d5b-4457-916b-c29ed90aa132",
"metadata": {},
"outputs": [],
@@ -66,7 +58,7 @@
},
{
"cell_type": "code",
"execution_count": 70,
"execution_count": 3,
"id": "932e8db6-78cf-49df-aa4a-83ee6584e5dd",
"metadata": {},
"outputs": [
@@ -196,7 +188,7 @@
"13 6 2015-02-02 True"
]
},
"execution_count": 70,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -223,7 +215,7 @@
},
{
"cell_type": "code",
"execution_count": 74,
"execution_count": 4,
"id": "4c0a7de5-8825-40af-84e5-6cd0ad26a0e3",
"metadata": {},
"outputs": [
@@ -259,57 +251,57 @@
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>2.0</td>\n",
" <td>1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>4.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>8.0</td>\n",
" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>12.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" customer_id frequency recency T monetary_value\n",
"0 1 1.0 5.0 5.0 2.0\n",
"1 2 0.0 0.0 5.0 0.0\n",
"2 3 1.0 1.0 5.0 5.0\n",
"3 4 1.0 3.0 3.0 8.0\n",
"4 5 0.0 0.0 3.0 0.0"
"0 1 2.0 5.0 5.0 1.5\n",
"1 2 1.0 0.0 5.0 2.0\n",
"2 3 2.0 1.0 5.0 4.5\n",
"3 4 2.0 3.0 3.0 7.0\n",
"4 5 1.0 0.0 3.0 12.0"
]
},
"execution_count": 74,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@@ -323,7 +315,7 @@
" observation_period_end = \"2015-02-06\",\n",
" datetime_format = \"%Y-%m-%d\",\n",
" time_unit = \"W\",\n",
" include_first_transaction=False,\n",
" include_first_transaction=True,\n",
")\n",
"\n",
"rfm_df.head()"
@@ -339,7 +331,7 @@
},
{
"cell_type": "code",
"execution_count": 76,
"execution_count": 5,
"id": "761edfe9-1b69-4966-83bf-4f1242eda2d5",
"metadata": {},
"outputs": [
@@ -450,7 +442,7 @@
"4 0.0 5.0 "
]
},
"execution_count": 76,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -467,13 +459,137 @@
"train_test.head()"
]
},
{
"cell_type": "markdown",
"id": "73dc1b93-6a4f-4171-b838-30759b2c1e0e",
"metadata": {},
"source": [
"`rfm_segments` will assign customer to segments based on their recency, frequency, and monetary value. It uses a quartile-based RFM score approach that is very computationally efficient, but defining custom segments is a rather subjective exercise. The returned dataframe also cannot be used for modeling because it does not zero out the initial transactions."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 40,
"id": "c7b3f800-8dfb-4e5a-b939-5f908281563c",
"metadata": {},
"outputs": [],
"source": []
"source": [
"segments = utils.rfm_segments(\n",
" test_data, \n",
" customer_id_col = \"id\", \n",
" datetime_col = \"date\", \n",
" monetary_value_col = \"monetary_value\",\n",
" observation_period_end = \"2015-02-06\",\n",
" datetime_format = \"%Y-%m-%d\",\n",
" time_unit = \"W\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "932ac4e5-361e-42fa-97d3-d8e508128944",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customer_id</th>\n",
" <th>frequency</th>\n",
" <th>recency</th>\n",
" <th>monetary_value</th>\n",
" <th>segment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>1.5</td>\n",
" <td>Other</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>5.0</td>\n",
" <td>2.0</td>\n",
" <td>Inactive Customer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" <td>4.5</td>\n",
" <td>At Risk Customer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>7.0</td>\n",
" <td>Top Spender</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>12.0</td>\n",
" <td>At Risk Customer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>Top Spender</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" customer_id frequency recency monetary_value segment\n",
"0 1 2.0 0.0 1.5 Other\n",
"1 2 1.0 5.0 2.0 Inactive Customer\n",
"2 3 2.0 4.0 4.5 At Risk Customer\n",
"3 4 2.0 0.0 7.0 Top Spender\n",
"4 5 1.0 3.0 12.0 At Risk Customer\n",
"5 6 1.0 0.0 5.0 Top Spender"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segments"
]
}
],
"metadata": {
@@ -492,7 +608,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
"version": "3.10.14"
}
},
"nbformat": 4,
2 changes: 2 additions & 0 deletions pymc_marketing/clv/__init__.py
@@ -25,6 +25,7 @@
)
from pymc_marketing.clv.utils import (
customer_lifetime_value,
rfm_segments,
rfm_summary,
rfm_train_test_split,
)
@@ -39,6 +40,7 @@
"plot_customer_exposure",
"plot_frequency_recency_matrix",
"plot_probability_alive_matrix",
"rfm_segments",
"rfm_summary",
"rfm_train_test_split",
)
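
The `utilities_plotting.ipynb` cell in the diff above describes `rfm_segments` as using a quartile-based RFM score. The sketch below illustrates that general idea in plain Python. It is not the library's implementation: the helper `quartile_score`, the use of `statistics.quantiles` with `method="inclusive"`, and the 1 to 4 scale are all assumptions made for illustration. The sample values are the `frequency`, `recency`, and `monetary_value` columns from the `segments` output shown in the notebook.

```python
from statistics import quantiles


def quartile_score(values, higher_is_better=True):
    """Score each value from 1 (worst) to 4 (best) by quartile (hypothetical helper)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    scores = []
    for v in values:
        # Count how many quartile cut points the value exceeds: bucket 1..4.
        bucket = 1 + (v > q1) + (v > q2) + (v > q3)
        # For recency, a smaller value is better, so the scale is flipped.
        scores.append(bucket if higher_is_better else 5 - bucket)
    return scores


# Columns copied from the segments table in the notebook output.
customer_id = [1, 2, 3, 4, 5, 6]
frequency = [2.0, 1.0, 2.0, 2.0, 1.0, 1.0]
recency = [0.0, 5.0, 4.0, 0.0, 3.0, 0.0]
monetary_value = [1.5, 2.0, 4.5, 7.0, 12.0, 5.0]

r = quartile_score(recency, higher_is_better=False)
f = quartile_score(frequency)
m = quartile_score(monetary_value)

# Concatenate the three digits into one RFM score string per customer.
rfm_score = {cid: f"{ri}{fi}{mi}" for cid, ri, fi, mi in zip(customer_id, r, f, m)}
print(rfm_score)
```

How score strings map to named segments such as "Top Spender" or "At Risk Customer" is a separate, subjective step that this sketch leaves out.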