This is the repository associated with our article "ContentWise Impressions: An Industrial Dataset with Impressions Included", accepted and presented at CIKM 2020. The full text is available on ACM DL, ArXiv, or ResearchGate.
You can obtain the link to download the dataset by filling out this form.
Filling out the form is completely optional and won't block you from getting the download link.
After you receive the dataset link, download the zip file and decompress it on your local environment.
You'll find a `README.md` file, which includes information about the dataset, authors, license, and more. You'll also find the `data` folder. Inside it you'll find the dataset (`interactions`, `impressions-direct-link`, and `impressions-non-direct-link`) alongside the URM splits that we used in our experiments. Moreover, if you wish to run the scripts inside the repository, you'll need the whole `data` folder.
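For a quick first look at the raw tables, a few lines of pandas are enough. The file names, paths, and format below are assumptions for illustration; check the dataset's own README for the actual layout:

```python
# Minimal sketch: peek at the raw tables with pandas.
# File names and locations are assumptions; adjust them to the layout
# described in the dataset's own README.
import pandas as pd

interactions = pd.read_csv("data/interactions.csv")
impressions_direct = pd.read_csv("data/impressions-direct-link.csv")
impressions_non_direct = pd.read_csv("data/impressions-non-direct-link.csv")

print(interactions.head())
print(interactions.columns.tolist())
```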
If you use this dataset in a publication, please cite our CIKM paper:
Fernando B. Pérez Maurera, Maurizio Ferrari Dacrema, Lorenzo Saule, Mario Scriminaci, and Paolo Cremonesi. 2020.
ContentWise Impressions: An Industrial Dataset with Impressions Included.
In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM '20).
Association for Computing Machinery, New York, NY, USA, 3093–3100. DOI: https://doi.org/10.1145/3340531.3412774
If you use BibTeX:
@inproceedings{contentwise-impressions,
author = {P\'{e}rez Maurera, Fernando B. and Ferrari Dacrema, Maurizio and Saule, Lorenzo and Scriminaci, Mario and Cremonesi, Paolo},
title = {ContentWise Impressions: An Industrial Dataset with Impressions Included},
year = {2020},
isbn = {9781450368599},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3340531.3412774},
doi = {10.1145/3340531.3412774},
booktitle = {Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages = {3093–3100},
numpages = {8},
keywords = {dataset, implicit feedback, impressions, collaborative filtering, open source},
location = {Virtual Event, Ireland},
series = {CIKM '20}
}
Full text is available on ArXiv, ResearchGate, or ACM DL. Source code of our experiments and results is available on GitHub.
You can download the results of our experiments at this link. There you'll find two folders: `statistics` and `result_experiments`. The first folder contains the statistical features of the dataset alongside several plots, including some that didn't make it into the paper. The second folder contains all the fine-tuned trained recommender models.
Note: As we exported the models for several recommenders, the results folder takes approximately 2GB on disk.
In this repository we provide several tools to load and use the dataset. We strongly recommend going through the Installation and Using the repo sections to learn which scripts we provide and how to run them.
Note: this repository requires Python 3.7.
We suggest you create an environment for this project using conda. We have tested the installation procedures on Linux 64-bit (Ubuntu 18.04), macOS Catalina 10.15, and Windows 10.
First, install Miniconda; instructions on how to install it are found in their docs.
Second, clone this repository, change into its folder, and create and activate the environment with the following:
git clone https://github.com/ContentWise/contentwise-impressions.git
cd contentwise-impressions
conda env create -f environment.yml
conda activate contentwise-impressions
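Optionally, you can run a quick sanity check inside the activated environment to confirm that it uses the interpreter version the repository expects:

```python
# Run inside the activated contentwise-impressions environment.
import sys

# The repository targets Python 3.7 (see the note above).
assert sys.version_info[:2] == (3, 7), f"Expected Python 3.7, got {sys.version}"
print("Environment OK:", sys.version)
```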
Now, depending on your platform, there are special installation procedures that you need to perform. If you:
- Are running on Linux, then see Linux dependencies.
- Are running on macOS, then see macOS dependencies.
- Are running on Windows, then see Windows dependencies.
After you have performed your environment specific steps, continue to Compiling Cython.
At this point, having installed all dependencies, you have to compile all Cython algorithms.
In order to compile, you must first have gcc and the Python 3 development headers installed. Under Linux, those can be installed with the following commands:
sudo apt install gcc
sudo apt-get install python3-dev
sudo apt-get install libopenblas-base libopenblas-dev
Now, continue to Compiling Cython.
You must install Xcode and its command-line tools in order to have a C compiler installed on your system. More information about Xcode is found in Apple's docs.
Now, continue to Compiling Cython.
If you are using Windows as your operating system, the installation procedure is a bit more complex. You may refer to THIS guide.
Continue to Compiling Cython.
Now you can compile all Cython algorithms by running the following command. The script compiles within the currently active environment. The code has been developed for Linux and Windows platforms. You may see some warnings during compilation; these are expected.
(contentwise-impressions): python run_compile_all_cython.py
Now that you have the environment set up, download the dataset and the splits. Please place the `data` folder inside the repository folder. After you've done this, you're ready to use the repo.
We have provided several Python scripts that use the dataset in different ways. In the following sections we describe each script that we provide.
Prerequisites: You need to have the environment fully installed.
NOTE: In our tests, this process consumes up to 16 GiB of RAM. Please make sure you have enough memory, or use our precomputed splits.
In order to download the data and generate the URM splits that we used in our experiments, you must use the `run_generate_splits.py` script.
- If it's run without arguments, it downloads the interactions, interacted impressions, and non-interacted impressions.
- If it's run with the `-i` or `--items` argument, it downloads the dataset and generates three URM splits of the interactions (Train, Validation, and Test), using proportions of 0.7, 0.1, and 0.2, respectively. Users are rows and items are columns (see the sketch after the example below).
- If it's run with the `-s` or `--series` argument, it downloads the dataset and generates three URM splits of the interactions (Train, Validation, and Test), using proportions of 0.7, 0.1, and 0.2, respectively. Users are rows and series are columns.
Examples:
(contentwise-impressions): python run_generate_splits.py -i -s
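If you want a picture of what such a split looks like, here is a minimal sketch of a random 0.7/0.1/0.2 holdout packed into sparse URMs. It is illustrative only and not the repository's code; the actual logic lives in `run_generate_splits.py` and may differ in details such as duplicate handling:

```python
# Illustrative only: a random 70/10/20 holdout over interaction pairs,
# then packed into user-item URMs as sparse matrices.
import numpy as np
import scipy.sparse as sps

def make_urm_splits(user_ids, item_ids, num_users, num_items, seed=42):
    """user_ids and item_ids are integer NumPy arrays of the same length."""
    rng = np.random.default_rng(seed)
    num_interactions = len(user_ids)

    # 0: train, 1: validation, 2: test
    assignment = rng.choice([0, 1, 2], size=num_interactions, p=[0.7, 0.1, 0.2])

    urms = []
    for split in range(3):
        mask = assignment == split
        urm = sps.csr_matrix(
            (np.ones(mask.sum()), (user_ids[mask], item_ids[mask])),
            shape=(num_users, num_items),
        )
        urms.append(urm)
    return urms  # [URM_train, URM_validation, URM_test]
```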
Prerequisites: You need to have the environment fully installed and, preferably, the data splits.
NOTE: Depending on your environment and available resources, the process could get killed because of insufficient memory. We used an r4.4xlarge Linux Amazon EC2 instance to run our experiments; it had 16 vCPUs and 128 GiB of RAM. However, we used this type of instance to run several experiments in parallel. By our own calculations, running each recommender should take less than 20 GiB of RAM using eight cores if the evaluation is done in parallel. Execution times for different recommenders vary.
In order to tune the hyperparameters of several recommendation algorithms, you must use the `run_hyperparameter_tuning.py` script. You need to provide the `-t` or `--tune_recommenders` argument to make it run. This is to ensure that you're willing to run the hyperparameter tuning.
We ran the experiments using the following recommenders:
- Random: recommends a list of random items,
- TopPop: recommends the most popular items,
- ItemKNN: Item-based collaborative KNN,
- RP3beta: collaborative graph-based algorithm with re-ranking,
- PureSVD: SVD decomposition of the user-item matrix,
- Impressions MatrixFactorization BPR (BPRMF): machine-learning-based matrix factorization optimizing ranking with BPR, with the possibility to sample negative items at random, inside the impressions, or outside the impressions (see the sketch after the example below).
Examples:
(contentwise-impressions): python run_hyperparameter_tuning.py -t
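The three negative-sampling strategies of the impressions-aware BPR variant can be pictured with the following sketch. It is not the repository's implementation (which is written in Cython); names and data structures are placeholders:

```python
# Illustrative sketch of the three negative-sampling strategies mentioned
# above: uniformly at random, inside the user's impressions, or outside them.
import random

def sample_negative(user_interactions, user_impressions, num_items, mode="random"):
    """Return an item id the user did not interact with.

    user_interactions: set of item ids the user interacted with
    user_impressions:  set of item ids shown (impressed) to the user
    mode: "random" | "inside_impressions" | "outside_impressions"
    """
    if mode == "inside_impressions":
        candidates = list(user_impressions - user_interactions)
    elif mode == "outside_impressions":
        candidates = [i for i in range(num_items)
                      if i not in user_impressions and i not in user_interactions]
    else:  # "random": any non-interacted item
        candidates = [i for i in range(num_items) if i not in user_interactions]

    return random.choice(candidates) if candidates else None
```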
Prerequisites: You need to have the environment fully installed, the data splits, and the `result_experiments` folder in the repository folder.
The `run_results_gathering.py` script outputs a table with the hyperparameter tuning results. You need to provide the `-s` or `--show_results` argument to make it run. This is to ensure that you're willing to run the script.
Examples:
(contentwise-impressions): python run_results_gathering.py -s
Prerequisites: You need to have the environment fully installed, and the dataset saved (not necessarily with the splits).
In order to generate the statistics of the dataset, we provide a Jupyter notebook, `notebook_generate_statistics.ipynb`, that lets you generate several statistics of the dataset in one place.
To run the notebook:
(contentwise-impressions): jupyter lab --no-browser
Inside the notebook, just run the different sections to obtain different statistics. We provide documentation of what kind of statistics we calculate. All the statistics and plots are generated into the `statistics` folder. This notebook generates all of the plots, numbers, and figures that we used in the paper.
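If you only need a few headline numbers without opening the notebook, something like the following works. The column names are assumptions; check the dataset's README for the actual schema:

```python
# Quick headline statistics, assuming an interactions table with
# `user_id` and `item_id` columns (rename to the dataset's actual columns).
import pandas as pd

interactions = pd.read_csv("data/interactions.csv")

print("users:        ", interactions["user_id"].nunique())
print("items:        ", interactions["item_id"].nunique())
print("interactions: ", len(interactions))
print("interactions per user (median):",
      interactions.groupby("user_id").size().median())
```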
Prerequisites: You need to have the environment fully installed, and the dataset saved (not necessarily with the splits).
We provide consistency tests of the dataset. They check several properties of the dataset that are reported in the paper.
We use `pytest` as the test runner. Running the tests is as easy as running the following:
(contentwise-impressions): pytest test_dataset_consistency.py --verbose --color=yes
This command doesn't write any report; instead, it shows the test results on the console in a PASS/FAIL fashion.
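To give an idea of what such a check looks like, here is an illustrative pytest-style test. It is not one of the tests in `test_dataset_consistency.py`, and the file and column names are assumptions:

```python
# Illustrative pytest-style check, not taken from test_dataset_consistency.py.
# It assumes an interactions table with `user_id` and `item_id` columns;
# adapt the names to the real schema.
import pandas as pd

def test_interactions_have_valid_ids():
    interactions = pd.read_csv("data/interactions.csv")

    # Example invariants: no missing identifiers and no negative ids.
    assert interactions["user_id"].notna().all()
    assert interactions["item_id"].notna().all()
    assert (interactions["user_id"] >= 0).all()
    assert (interactions["item_id"] >= 0).all()
```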
Please don't hesitate to let us know of any problems by opening an issue on the Issue Tracker. We highly appreciate your feedback.
Thanks for using ContentWise Impressions and this repo, and for supporting our work. We hope it's useful for your purposes.
This is not an official ContentWise product.
For help or issues using ContentWise Impressions, please submit a GitHub issue.
For personal communication related to ContentWise Impressions, please contact:
- Fernando Benjamín Pérez Maurera ([email protected] or [email protected])
- Maurizio Ferrari Dacrema ([email protected]).
- Lorenzo Saule ([email protected]).
- Mario Scriminaci ([email protected]).
- Paolo Cremonesi ([email protected]).