Buycycle's recommendation system is designed to enhance the user experience in the pre-owned bike market by providing personalized bike suggestions. The system uses machine learning to tailor recommendations based on user preferences and behavior, addressing the challenges of information inefficiencies and high transaction costs in the used bike market.
Create_data.py implements reads in the DB of bikes periodically and saves the queried data to disc. It then applies feature engineering and constructs the m x n (m = bike_ids with status != new, n= bike_ids with status active) similarity matrix (default metric is pariwise cosine).
The above descibed data generation and read-in process is done periodically with a pvc cron job (k8s) and thread (app.py).
The model implements three algorithm for content based recommendation.
-
get_top_n_quality_prefiltered Returns the top n recommendation based on overall quality (quality_score), filtered by price, frame_size_code and family_id
Since the click data in the DB is discontinued, we randomized the order of the df until the new quality DB is red. Currently this is relevant for the edge case that the bike_id is not yet in the constructed similarity matrix The prefiltered features need to be send in the request.
-
get_top_n_recommendations Returns the top n recommendations for a bike_id
Features considered are categorical_features and numerical_features defined in src/driver_content.py Based on the constructed similarity matrix
-
get_top_n_recommendations_prefiltered Returns the top n recommendations as described above for a bike_id prefiltered by the prefilter_features defined in src/driver_content.py. This enables to prefilter the recommendation by for example model and brand, making the recommendation more specific. Currently we only filter for family_id. Currently only one prefilter_feature is supported, multiple lead to docker container performance issues (300ms).
Finally, get_top_n_recommendations_mix implements a mix of prefilters and generic recommendations. If interveav_prefiltered_general these selected these two lists are interwoven, meaning they are mixed alternatevly. Otherwise the generic bikes are appended to the prefiltered one. The ratio between the two can be determined with the ratio argument. If there are not enough recommendations available, populatity is used to append to the recommendation list.
Currently these features are use to construct the similarity matrix.
categorical_features = [
"motor",
"bike_component_id",
"bike_category_id",
"bike_type",
"brake_type_code",
"frame_material_code",
"shifting_code",
"color",
]
numerical_features = ["price", "frame_size_code", "year"]
We overwieight certain features to account for their different relevance.
numerical_features_to_overweight = ["price", "frame_size_code"]
numerical_features_overweight_factor = 4
categorical_features_to_overweight = ["bike_component_id", "bike_category_id", "bike_type"]
categorical_features_overweight_factor = 2
The data generation is implemented periodicaly in create_data.py. The script checks which data is available on the pod's pvc and compares to the mixpanel based click date on S3. It then updates the local data and saves it to the pvc, again the model reads this data periodically and retrains the model.
The data generation condenses all click data of unique user_id (distinct_id in the current case) and calculates the feedback according to the weights below.
Currently, the features used are:
user_id = 'distinct_id'
bike_id = 'bike.id'
item_features = [
"family_id",
"rider_height_min",
"rider_height_max",
"price",
]
implicit_feedback = {'Bike_view': 1,
'Choose_service': 5,
'Choose_shipping_method': 5,
'Save_filter': 5,
'add_discount': 10,
'add_payment_info': 20,
'add_shipping_info': 20,
'add_to_compare': 3,
'add_to_favorite': 10,
'ask_question': 10,
'begin_checkout': 12,
'choose_condition': 0,
'click_filter_no_result': 0,
'close_login_page': 0,
'comment_show_original': 3,
'counter_offer': 10,
'delete_from_favourites': -2,
'home_page_open': 0,
'login': 5,
'open_login_page': 2,
'purchase': 50,
'receive_block_comment_pop_up': 0,
'recom_bike_view': 3,
'register': 10,
'remove_bike_from_compare': -1,
'request_leasing': 20,
'sales_ad_created': 0,
'search': 0,
'search_without_result': 0,
'sell_click_family': 0,
'sell_click_template': 0,
'sell_condition': 0,
'sell_details': 0,
'sell_search': 0,
'sell_templates': 0,
'sell_toggle_components_changed': 0,
'sellpage_open': 0,
'share_bike': 10,
'shop_view': 2,
'toggle_language': 2,
}
Same as above but draw from a sample of more recommendations and randomize.
Rerank a given list of bike_rerank_id according to collaborative filtering model.
To account for business logic we introduce a preference dict that applies custom user specific filters such as continent_id. Currently this is used to avoid cross continent ebike shipping. We also filter the recommendations by the user specified preferenes in the df_user_preferences. These two filters are combined into the preference_mask.
The chat bot buyers guide is relying on the QualityFilter strategy that returns slugs of progressivly prefiltered bikes in desc order of quality_score.
The AB test is implemented with istio on Kubernetes. Istio ensures a weighted assignment of users by browser. Istio returns a 'version', if no or unknown --header "version: xx" is sent, that is then saved in local storage of the browser. Make sure the app_version in the model matches the Kubernetes version. There are three scenarios:
If traffic should be routed to the dev environment, name them -dev for the version name. This allows the Load Balancer to route to the dev environment.
Test two versions:
Prepare two versions and add -stable, -canary to the app_version
Define your image tag and name and weight under Values.api.version.
use the same name as in the model app_version
Sync Kubernetes.
After test finished:
set canary to 0% to stop allocation of new people
and/or rename Values.api.version.name to force also already assigned to 0%.
Change model image but keep user assignment:
This only works if the model input data is not changing.
Set Values.newVersion to false and change Values.api.version.tag to different value.
Sync Kubernets.
Start a new ab test:
same as 'test two versions'
The PVCs and cronjobs for the data creation are kept and assigned through meta_name, so change the metaname.
If data requirements are chaning for the model version first start version with
0 value and wait until the old data from the pvc is delete and new one created.
Then set to desired weight.
Try to apply the specified strategy in the request to get recommendations. If no stategy is passed, use the default strategy (product page). If an unknow strategy is passed use the fallback strategy (product_page). If the specified strategy or fallback strategy leads to < n recommendations, use a ContentMixed to ensure that always enough recommendations are generated.
strategy_dict = {
"product_page": CollaborativeRandomized,
"braze": Collaborative,
"homepage": CollaborativeRandomized,
"FallbackContentMixed": FallbackContentMixed,
"bot": QualityFilter,
}
FallbackContentMixed
:- This strategy is a fallback that uses the content based model.
- It uses a similarity matrix to find bikes similar to the given bike ID.
- If not enough similar bikes are found, quality-based recommendations are added.
- The recommendations are generated by the
get_top_n_recommendations_mix
function, which takes into account various parameters such as family ID, price, and frame size code. - The strategy returns a tuple containing the strategy name, a list of recommendations, and an optional error message.
ContentMixed
:- Similar to
FallbackContentMixed
, this strategy also mixes content-based recommendations with quality-based recommendations. - It uses the same
get_top_n_recommendations_mix
function and parameters. - The strategy is designed to return quality-based recommendations if there are too few specific recommendations available.
- Similar to
Collaborative
:- This strategy attempts to use collaborative filtering to generate recommendations.
- It uses the
get_top_n_collaborative
function, which leverages a pre-trained model and a dataset to predict items that a user might like. - The strategy returns a tuple with the strategy name, a list of recommendations, and a placeholder for an error message (which is always
None
in this case).
CollaborativeRandomized
:- This strategy is a variation of collaborative filtering that includes randomized sampling.
- It uses the
get_top_n_collaborative_randomized
function to generate recommendations, which introduces randomness to potentially increase the diversity of recommendations. - The strategy returns a tuple with the strategy name, a list of recommendations, and an optional error message.
QualityFilter
:- This strategy applies filters to the bikes list sorted by quality score and returns the slugs.
- Currently we consider, category, price and rider_height.
- The strategy uses the
get_top_n_quality_prefiltered_bot
function to generate recommendations after applying the specified filters. - The strategy returns a tuple with the strategy name, a list of recommendations, and an optional error message.
Most parameters are determined in the driver_.py files, such as numerical and categorical features to consider and which feature to prefilter for. The function create_data_model_content allows the overweighing num and cat feature, chose in the driver. get_data replaces NA values with the median, currently only for 'motor'.
The DB parameters are read-in from a config.ini, this should be moved to a vault based approach.
Currently, the status to recommend bike is active and the similarity metric is eucledian.
We setup the conda development environment with the following command.
make setup
Install requirements
make install
Prepare test data
python create_data.py data/ test
-
make lint
-
make format
when creating an docker image the data is downloaded and prepared. Build test and production stages and runs tests.
docker compose build
Run app.
docker compose up app
src/driver_content.py defines the SQL queries, categorical and numerical features as well as the prefilter_features. config/config.ini holdes DB credentials. src/driver_collaborative.py defines the bike and user column, which item and user features to use and the implicit feedback weights.
REST API
Path: /health
HTTP Verb: GET
Path: /recommendation
HTTP Verb: POST
Description: Create n recommednations given a bike_id, user_id and distinct_id, family_id, price and frame_size_code. Parameters:
strategy: str, recommendation strategy to apply
bike_id: int, bike_id for which recommendations are generated
user_id: int, if null translated to 0, user_id for which recommednations are generated
distinct_id: str, distinct_id for which recommednations are generated, mixpanel ID
n: int, number of recommendations to return
family_id: int, family id
price, int, price
frame_size_code, str, frame size
{ "strategy": "CollaborativeRandomized",
"bike_type": 1234,
"bike_id": 1234,
"user_id": 123,
"distinct_id": "abc",
"n": 4,
"family_id": 12334,
"price": 1999,
"frame_size_code": "56"
}
Return:
jsonify({"status": "success", "strategy": strategy, "recommendation": recommendation, "app_name": app_name, "app_version": app_version}),
strategy:
'product_page': ContentMixed
'braze': Collaborative
'homepage': CollaborativeRandomized
recommendation:
bike_id_1,
.
.
.
bike_id_n
HTTP:
200: Sucessfull
400: Bad request or request argument
404: Recommendation failed
Example:
ab
curl -i -X POST https://ab.recommendation.buycycle.com/recommendation \
-H "Content-Type: application/json" \
-d '{
"strategy": "product_page",
"bike_id": 78041,
"user_id": 123,
"distinct_id": "3bf240f7-aead-4227-8538-b204aaa58692",
"n": 8,
"family_id": 403,
"continent_id": 1
}' \
--header "version: stable"
dev
curl -i -X POST https://dev.recommendation.buycycle.com/recommendation \
-H "Content-Type: application/json" \
-d '{
"strategy": "product_page",
"bike_id": 78041,
"user_id": 123,
"distinct_id": "3bf240f7-aead-4227-8538-b204aaa58692",
"n": 8,
"family_id": 403,
"continent_id": 1
}' \
--header "version: stable-001-dev"
deployed
curl -i -X POST https://recommendation.buycycle.com/recommendation ... "user_id": 123,
local
curl -i -X POST 0.0.0.0:8000/recommendation ...
Merge get_top_n_quality_prefiltered_bot and get_top_n_quality_prefiltered.