Snoopy: an effective and efficient semantic join discovery framework powered by proxy-column-based column embeddings. The column embeddings are derived from column-to-proxy-column relationships captured by a lightweight, approximate-graph-matching-based column projection function. To acquire good proxy columns for guiding the column projection process, a rank-aware contrastive learning paradigm is introduced.
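To make the projection idea concrete, here is a minimal sketch. It assumes a softmax-based soft matching in place of the approximate graph matching actually used; `project_column`, the tensor shapes, and the pooling step are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F

def project_column(cell_embs, proxy_embs):
    """Illustrative column projection: soft-match every cell to the proxy
    columns and pool the matching scores into a fixed-size column embedding.

    cell_embs:  (n_cells, d) embeddings of the cells in one column
    proxy_embs: (n_proxy, d) learnable proxy-column embeddings
    returns:    (n_proxy,)   column embedding
    """
    sim = cell_embs @ proxy_embs.T        # (n_cells, n_proxy) cell-to-proxy scores
    match = F.softmax(sim, dim=-1)        # soft (approximate) matching per cell
    col_emb = match.mean(dim=0)           # pool cell-level matches to column level
    return F.normalize(col_emb, dim=-1)

# Example: a column with 5 cells, 16 proxy columns, 768-dim cell embeddings.
col = project_column(torch.randn(5, 768), torch.randn(16, 768))
```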
- Python 3.7
- PyTorch 1.10.1
- CUDA 11.5
- NVIDIA 3090 GPU
Please refer to the source code for the full list of required Python packages.
We use three datasets: WikiTable, Opendata, and WDC, and we provide our experimental datasets.
To construct training data:

```bash
python DataGen.py --datasets "WikiTable" --type mat --tau 0.2 --list_size 3
```
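As a rough picture of how `--tau` and `--list_size` interact, the sketch below scores candidate columns by the fraction of matched cells and keeps the best `list_size` of them. This is only a guess at the embedding-level ("mat") strategy; `ranked_positives` is not DataGen.py's actual logic:

```python
import torch
import torch.nn.functional as F

def ranked_positives(query, candidates, tau=0.2, list_size=3):
    """Illustrative positive-list construction: score each candidate column
    by the fraction of query cells whose best cosine similarity to a
    candidate cell exceeds tau, then keep the list_size highest-scoring
    columns as a ranked positive list.

    query:      (n, d) cell embeddings of the query column
    candidates: list of (m_i, d) cell-embedding tensors, one per column
    """
    q = F.normalize(query, dim=-1)
    scores = []
    for cand in candidates:
        sim = q @ F.normalize(cand, dim=-1).T           # (n, m_i) cell similarities
        matched = (sim.max(dim=1).values > tau).float().mean()
        scores.append(matched.item())                   # matched-cell ratio
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return order[:list_size]                            # indices of ranked positives
```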
To learn proxy columns using the generated data:

```bash
python train.py --datasets "WikiTable" --type mat --tau 0.2 --list_size 3 --version Your_Model_Version
```
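The rank-aware contrastive objective could take roughly the following form, where higher-ranked positives receive larger weights. The `1/(rank+1)` weighting and the function name are assumptions, not the loss implemented in train.py:

```python
import torch
import torch.nn.functional as F

def rank_aware_contrastive_loss(anchor, pos_list, negatives, temperature=0.05):
    """Illustrative rank-aware contrastive loss: each positive in the ranked
    list contributes an InfoNCE-style term, down-weighted by its rank so that
    higher-ranked (more joinable) columns are pulled closer to the anchor.

    anchor:    (d,)   embedding of the query column
    pos_list:  (k, d) positives, ordered from most to least joinable
    negatives: (m, d) in-batch negative column embeddings
    """
    neg_sim = negatives @ anchor / temperature            # (m,) negative logits
    loss = 0.0
    for rank, pos in enumerate(pos_list):
        pos_sim = (pos @ anchor) / temperature            # scalar positive logit
        logits = torch.cat([pos_sim.view(1), neg_sim])    # positive at index 0
        weight = 1.0 / (rank + 1)                         # rank-dependent weight
        loss = loss + weight * F.cross_entropy(
            logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
    return loss / len(pos_list)
```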
To perform semantic join search via the learned proxy columns:

```bash
python search.py --datasets "WikiTable" --version Your_Model_Version --topk 25
```
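Conceptually, the online search is nearest-neighbor retrieval in the column-embedding space. The sketch below uses exact cosine search; search.py may use a different (e.g., index-based) retrieval, and `topk_joinable` is a hypothetical helper:

```python
import numpy as np

def topk_joinable(query_emb, repo_embs, k=25):
    """Illustrative online search: rank all repository columns by cosine
    similarity to the query column embedding and return the indices of the
    top-k joinable candidates.

    query_emb: (d,)   embedding of the query column
    repo_embs: (N, d) precomputed embeddings of the repository columns
    """
    q = query_emb / np.linalg.norm(query_emb)
    r = repo_embs / np.linalg.norm(repo_embs, axis=1, keepdims=True)
    return np.argsort(-(r @ q))[:k]       # indices of the k most similar columns
```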
- `--datasets`: the dataset to use (e.g., "WikiTable")
- `--type`: the data generation strategy ("mat" for embedding-level, "text" for text-level)
- `--tau`: the threshold for cell matching
- `--list_size`: the size of the positive ranking list
- `--version`: the model version saved during the training phase and used for online search
- `--topk`: the number of top-k joinable columns to return
The original datasets are from WikiTable, Opendata, and the WDC Web Table Corpus.
The baseline DeepJoin is implemented following details provided by its authors, whom we contacted.