
CLV API Standardization #527

Open
3 of 6 tasks
ColtAllen opened this issue Feb 9, 2024 · 5 comments


ColtAllen commented Feb 9, 2024

There are API inconsistencies in the CLV module. Standardization is a big task best broken down into multiple PRs:

PRs to be Completed in Order

Current API

Beta-Geo/NBD Transactions Model

```python
rfm_data = pd.DataFrame(
    {
        "customer_id": customer_id,
        "frequency": frequency,
        "recency": recency,
        "T": T,
    }
)

model = BetaGeoModel(data=rfm_data)
model.build_model()
model.fit()

model.expected_num_purchases(
    customer_id=rfm_data["customer_id"],
    t=10,
    frequency=rfm_data["frequency"],
    recency=rfm_data["recency"],
    T=rfm_data["T"],
)
```

Note how fit data is provided as a dataframe, but the predictive methods require individual arrays. Specifying one array at a time was one of the most annoying aspects of using the legacy lifetimes library, and sometimes even created indexing issues that caused the underlying scipy functions to crash.
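To make that hazard concrete, here is a minimal pandas sketch (toy data, not the library's API) of how separately passed Series with mismatched indices silently align on labels rather than position, producing NaNs that can crash downstream scipy routines:

```python
import pandas as pd

# Toy RFM frame standing in for real transaction summaries
rfm_data = pd.DataFrame(
    {
        "customer_id": [0, 1, 2],
        "frequency": [6, 3, 9],
        "recency": [10.0, 5.0, 12.0],
        "T": [14.0, 14.0, 14.0],
    }
)

# Filtering leaves a non-contiguous index (0, 2)...
subset = rfm_data[rfm_data["frequency"] > 4]

# ...so combining subset["recency"] with a differently-indexed
# T array aligns on index labels, not position, yielding NaNs:
misaligned = subset["recency"] + rfm_data["T"].iloc[:2]
print(misaligned.isna().any())  # NaNs appear where labels don't match
```

Passing a single dataframe keeps all columns on one shared index, so this class of bug cannot occur.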

For ParetoNBDModel, I streamlined this nonsense with a dataframe argument, and made it optional when running predictions on the fit dataset:

Pareto/NBD Transactions Model

```python
rfm_data = pd.DataFrame(
    {
        "customer_id": customer_id,
        "frequency": frequency,
        "recency": recency,
        "T": T,
    }
)

model = ParetoNBDModel(data=rfm_data)
model.build_model()
model.fit()

# Data param is optional and only required for out-of-sample data
model.expected_purchases(future_t=10)

model.expected_purchases(
    data=future_rfm_df,
    future_t=10,
)
```

(We will also need to resolve the naming inconsistencies between these models.)

I've been told passing in dataframes instead of arrays loses some xarray broadcasting functionality, which I'd be interested to hear more about. I'm not opposed to arrays being passed in provided it's optional for in-sample data.
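A hypothetical sketch of that convention (class and method names invented for illustration, not the pymc-marketing implementation): store the fit dataframe and fall back to it when no `data` argument is passed.

```python
import pandas as pd

class TransactionModelSketch:
    """Illustrative only -- not the actual library class."""

    def __init__(self, data: pd.DataFrame):
        self.data = data  # retained for in-sample predictions

    def expected_purchases(self, future_t, data=None):
        # `data` is only required for out-of-sample predictions;
        # otherwise the fit dataset is reused.
        df = self.data if data is None else data
        return df["frequency"] * future_t  # placeholder math

model = TransactionModelSketch(pd.DataFrame({"frequency": [2, 5]}))
in_sample = model.expected_purchases(future_t=10)
out_of_sample = model.expected_purchases(
    future_t=10, data=pd.DataFrame({"frequency": [1]})
)
```

The same `data=None` default could accept either a pandas or xarray object, leaving room for the broadcasting question above.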

The API discrepancies between these models necessitated a hotfix for the monetary value model, which follows the same conventions as BetaGeoModel:

Gamma/Gamma Monetary Value Model

```python
monetary_data = pd.DataFrame(
    {
        "customer_id": customer_id,
        "mean_transaction_value": monetary_value,
        "frequency": frequency,
    }
)

model = GammaGammaModel(data=monetary_data)
model.build_model()
model.fit()

model.expected_customer_lifetime_value(
    transaction_model=transaction_model,
    customer_id=rfm_data["customer_id"],
    mean_transaction_value=rfm_data["monetary_value"],
    frequency=rfm_data["frequency"],
    recency=rfm_data["recency"],
    T=rfm_data["T"],
    time=12,
    discount_rate=0.01,
    freq="W",
)
```

Lastly, ShiftedBetaGeoModelIndividual is a whole different animal since it handles contractual transactions, but I think it'd be a good idea to add support for it to the customer_lifetime_value utility:

Shifted Beta-Geo Contractual Model

```python
contract_data = pd.DataFrame(
    {
        "customer_id": customer_id,
        "t_churn": t_churn,
        "T": T,
    }
)

model = ShiftedBetaGeoModelIndividual(data=contract_data)
model.build_model()
model.fit()

model.distribution_customer_churn_time(customer_id=contract_data["customer_id"])
```
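As a reference point for that churn-time distribution, the shifted-Beta-Geometric pmf from Fader and Hardie can be sketched with log-gamma functions (standalone closed-form math, not the model's sampling-based output):

```python
from math import lgamma, exp

def betaln(x, y):
    # log of the Beta function via log-gamma
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def sbg_churn_pmf(t, a, b):
    # P(T = t) = B(a + 1, b + t - 1) / B(a, b), for t = 1, 2, ...
    return exp(betaln(a + 1, b + t - 1) - betaln(a, b))

# With a = b = 1 the pmf reduces to 1 / (t * (t + 1))
p1 = sbg_churn_pmf(1, 1.0, 1.0)
```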
@ricardoV94
Contributor

I think the best option would be to work with xarray Datasets. They have the organizational benefits of pandas with the broadcasting behavior of numpy. Internally, most predictive methods are already written with xarray code anyway. Users could pass pandas dataframes and we would convert to xarray, but the default type, which needs no conversion, would be xarray.

Definitely agree that passing separate numpy arrays is too cumbersome.
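A rough illustration of that flow, assuming xarray is installed (toy data; the internal code may differ): convert the indexed dataframe once, then let new dimensions broadcast automatically.

```python
import pandas as pd
import xarray as xr

rfm_data = pd.DataFrame(
    {"customer_id": [0, 1, 2], "frequency": [6, 3, 9]}
).set_index("customer_id")

# One-time conversion from pandas to an xarray Dataset
ds = rfm_data.to_xarray()

# A new dimension (e.g., several future horizons) broadcasts
# against the customer dimension with no manual reshaping:
future_t = xr.DataArray([1, 5, 10], dims="t")
grid = ds["frequency"] * future_t  # dims: (customer_id, t)
```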

@ColtAllen
Collaborator Author

Updated original comment with list of PRs to complete.

@wd60622
Contributor

wd60622 commented May 22, 2024

Is t / future_t meant to be vectorized in the API? I think the previous implementation had it as either a vector the same size as each other input or a scalar.

@ColtAllen
Collaborator Author

> Is t / future_t meant to be vectorized in the API? I think the previous implementation had it as either a vector the same size as each other input or a scalar.

Both forms of parametrization (vectorized or scalar) are supported:

```python
# scalar parametrization (here predictions are being run for in-sample data)
model.expected_purchases(future_t=10)

# equivalent vectorized parametrization
data = data.assign(future_t=10)
model.expected_purchases(data)
```

Vectorization support was added to facilitate xarray inputs in the future.
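The equivalence rests on ordinary numpy broadcasting; a simplified stand-in for the per-customer math:

```python
import numpy as np

frequency = np.array([6.0, 3.0, 9.0])  # one entry per customer

# A scalar future_t broadcasts across all customers...
scalar_form = frequency * 10

# ...matching a per-customer vector holding the same value:
vector_form = frequency * np.full(3, 10.0)
```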

ColtAllen mentioned this issue Jun 1, 2024
ColtAllen added this to the 0.7.0 milestone Jun 13, 2024
@ColtAllen
Collaborator Author

Steps 4-6 (along with adding CLV support for ShiftedBetaGeoModelIndividual) are extraneous and will be given their own issues after #758 is merged.

juanitorduz removed this from the 0.7.0 milestone Jun 28, 2024