Snoopy: an effective and efficient semantic join discovery framework powered by proxy-column-based column embeddings. The column embeddings are derived from column-to-proxy-column relationships captured by a lightweight, approximate-graph-matching-based column projection function. To acquire good proxy columns for guiding the column projection process, a rank-aware contrastive learning paradigm is introduced.
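To make the projection idea concrete, here is a minimal sketch. It assumes a softmax-based soft matching in place of the approximate graph matching actually used; `project_column`, the tensor shapes, and the pooling step are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F

def project_column(cell_embs, proxy_embs):
    """Illustrative column projection: soft-match every cell to the proxy
    columns and pool the matching scores into a fixed-size column embedding.

    cell_embs:  (n_cells, d) embeddings of the cells in one column
    proxy_embs: (n_proxy, d) learnable proxy-column embeddings
    returns:    (n_proxy,)   column embedding
    """
    sim = cell_embs @ proxy_embs.T        # (n_cells, n_proxy) cell-to-proxy scores
    match = F.softmax(sim, dim=-1)        # soft (approximate) matching per cell
    col_emb = match.mean(dim=0)           # pool cell-level matches to column level
    return F.normalize(col_emb, dim=-1)

# Example: a column with 5 cells, 16 proxy columns, 768-dim cell embeddings.
col = project_column(torch.randn(5, 768), torch.randn(16, 768))
```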
- Python 3.7
- PyTorch 1.10.1
- CUDA 11.5
- NVIDIA 3090 GPU
Please refer to the source code for the full list of required Python packages.
We use three datasets: WikiTable, Opendata, and WDC, and we provide our experimental datasets.
To construct training data:

```bash
python DataGen.py --datasets "WikiTable" --type mat --tau 0.2 --list_size 3
```
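As a rough picture of how `--tau` and `--list_size` interact, the sketch below scores candidate columns by the fraction of matched cells and keeps the best `list_size` of them. This is only a guess at the embedding-level ("mat") strategy; `ranked_positives` is not DataGen.py's actual logic:

```python
import torch
import torch.nn.functional as F

def ranked_positives(query, candidates, tau=0.2, list_size=3):
    """Illustrative positive-list construction: score each candidate column
    by the fraction of query cells whose best cosine similarity to a
    candidate cell exceeds tau, then keep the list_size highest-scoring
    columns as a ranked positive list.

    query:      (n, d) cell embeddings of the query column
    candidates: list of (m_i, d) cell-embedding tensors, one per column
    """
    q = F.normalize(query, dim=-1)
    scores = []
    for cand in candidates:
        sim = q @ F.normalize(cand, dim=-1).T           # (n, m_i) cell similarities
        matched = (sim.max(dim=1).values > tau).float().mean()
        scores.append(matched.item())                   # matched-cell ratio
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return order[:list_size]                            # indices of ranked positives
```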
To learn proxy columns using the generated data:

```bash
python train.py --datasets "WikiTable" --type mat --tau 0.2 --list_size 3 --version Your_Model_Version
```
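The rank-aware contrastive objective could take roughly the following form, where higher-ranked positives receive larger weights. The `1/(rank+1)` weighting and the function name are assumptions, not the loss implemented in train.py:

```python
import torch
import torch.nn.functional as F

def rank_aware_contrastive_loss(anchor, pos_list, negatives, temperature=0.05):
    """Illustrative rank-aware contrastive loss: each positive in the ranked
    list contributes an InfoNCE-style term, down-weighted by its rank so that
    higher-ranked (more joinable) columns are pulled closer to the anchor.

    anchor:    (d,)   embedding of the query column
    pos_list:  (k, d) positives, ordered from most to least joinable
    negatives: (m, d) in-batch negative column embeddings
    """
    neg_sim = negatives @ anchor / temperature            # (m,) negative logits
    loss = 0.0
    for rank, pos in enumerate(pos_list):
        pos_sim = (pos @ anchor) / temperature            # scalar positive logit
        logits = torch.cat([pos_sim.view(1), neg_sim])    # positive at index 0
        weight = 1.0 / (rank + 1)                         # rank-dependent weight
        loss = loss + weight * F.cross_entropy(
            logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
    return loss / len(pos_list)
```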
To perform semantic join search via the learned proxy columns:

```bash
python search.py --datasets "WikiTable" --version Your_Model_Version --topk 25
```
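Conceptually, the online search is nearest-neighbor retrieval in the column-embedding space. The sketch below uses exact cosine search; search.py may use a different (e.g., index-based) retrieval, and `topk_joinable` is a hypothetical helper:

```python
import numpy as np

def topk_joinable(query_emb, repo_embs, k=25):
    """Illustrative online search: rank all repository columns by cosine
    similarity to the query column embedding and return the indices of the
    top-k joinable candidates.

    query_emb: (d,)   embedding of the query column
    repo_embs: (N, d) precomputed embeddings of the repository columns
    """
    q = query_emb / np.linalg.norm(query_emb)
    r = repo_embs / np.linalg.norm(repo_embs, axis=1, keepdims=True)
    return np.argsort(-(r @ q))[:k]       # indices of the k most similar columns
```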
- `--datasets`: the dataset to use (e.g., "WikiTable")
- `--type`: the data generation strategy ("mat" for embedding-level, "text" for text-level)
- `--tau`: the threshold for cell matching
- `--list_size`: the size of the positive ranking list
- `--version`: the model version saved during the training phase and used for online search
- `--topk`: the number of top-k joinable columns to return
The original datasets are from WikiTable, Opendata, and the WDC Web Table Corpus.
The baseline DeepJoin is implemented following details provided by its authors, whom we contacted.