This branch offers:
- An initial test set with a small number of test examples for each dataset, together with their labels in the `exist` column. Note that this test set serves development purposes only:
  - The intermediate and final datasets will not contain the `exist` column.
  - This is not the intermediate dataset we will be using for ranking solutions.
- A simple baseline that trains on both datasets.
Download links to initial test set: Dataset A Dataset B
The baseline is only a minimal working example for both datasets, and it is certainly not optimal. You are encouraged to tweak it or propose your own solutions from scratch!
Here we summarize our baseline:
The baseline is an RGCN-like GNN model trained on the entire graph.
Event timestamps on the graph are encoded by decomposing the 10-digit decimal integers into 10-dimensional vectors, each element representing a digit.
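As a rough illustration of the digit decomposition described above, here is a minimal sketch. The function name `encode_timestamp`, the most-significant-digit-first ordering, and the use of raw digit values (no scaling) are assumptions made for this example; the baseline code may differ on each point.

```python
from typing import List


def encode_timestamp(ts: int, num_digits: int = 10) -> List[float]:
    """Split a non-negative integer timestamp into its decimal digits.

    Assumption: digits are ordered most significant first and kept as
    raw values in [0, 9]; shorter timestamps are left-padded with zeros.
    """
    if ts < 0 or ts >= 10 ** num_digits:
        raise ValueError(f"timestamp must fit in {num_digits} decimal digits")
    # Zero-pad to a fixed width, then turn each character into one feature.
    padded = f"{ts:0{num_digits}d}"
    return [float(ch) for ch in padded]


# Example: a 10-digit Unix timestamp becomes a 10-dimensional vector.
vec = encode_timestamp(1640995200)
print(vec)
```

Each element of the resulting vector carries one digit, so nearby timestamps share their leading elements and differ only in the trailing ones.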
We train the model as binary classification using a negative-sampling-like strategy.
Given a ground truth event `(s, d, r, t)` with source node `s`, destination node `d`, event type `r`, and timestamp `t`, we perturb `t` to obtain a new value `t'`.
We label the quadruplet with 1 if the new timestamp is larger than the original timestamp, and 0 otherwise. The model is essentially trained to predict `p(t < t' | s, d, r)`, i.e. the probability that an edge with type `r` exists from source `s` to destination `d` before timestamp `t'`.
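The labeling rule above can be sketched as follows. This is only an illustration: the helper name `perturb_event`, the uniform shift of up to `max_shift` seconds, and the random choice of direction are all assumptions, since the way the baseline perturbs `t` is not specified here.

```python
import random
from typing import Optional, Tuple

Event = Tuple[int, int, int, int]  # (s, d, r, t)


def perturb_event(
    event: Event,
    max_shift: int = 86400,
    rng: Optional[random.Random] = None,
) -> Tuple[Event, int]:
    """Return a perturbed event (s, d, r, t') and its binary label.

    Assumption: t is shifted forward or backward by a uniformly drawn
    number of seconds in [1, max_shift]; the baseline may sample differently.
    """
    s, d, r, t = event
    rng = rng or random.Random()
    shift = rng.randint(1, max_shift)
    direction = rng.choice((-1, 1))
    t_new = max(t + direction * shift, 0)  # keep the timestamp non-negative
    # Label 1 if the new timestamp is larger than the original, 0 otherwise.
    label = 1 if t_new > t else 0
    return (s, d, r, t_new), label


# Example: build a few training samples from one ground truth event.
rng = random.Random(0)
samples = [perturb_event((3, 7, 1, 1640995200), rng=rng) for _ in range(4)]
for sample, label in samples:
    print(sample, label)
```

With a symmetric perturbation like this one, the two labels are roughly balanced, which keeps the binary classification objective from being dominated by either class.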
To use the baseline you need to install DGL.
You also need at least 64GB of CPU memory. GPU is not required.
- Convert the CSV files to DGL graph objects:
  `python csv2DGLgraph.py --dataset [A or B]`
- Train the model:
  `python base_pipeline.py --dataset [A or B]`
The baseline achieves an AUC of 0.511 on Dataset A and 0.510 on Dataset B, only marginally above the 0.5 expected from random guessing.