CollaborEM: A Self-supervised Entity Matching Framework Using Multi-features Collaboration

CollaborEM is a self-supervised entity matching (EM) framework based on multi-feature collaboration. It is capable of (i) obtaining reliable entity resolution (ER) results with zero human annotations and (ii) discovering adequate tuple features in a fault-tolerant manner. CollaborEM consists of two phases: automatic label generation (ALG) and collaborative EM training (CEMT). In the first phase, ALG generates a set of positive tuple pairs and a set of negative tuple pairs. ALG guarantees the high quality of the generated pairs and hence ensures the training quality of the subsequent CEMT. In the second phase, CEMT learns the matching signals by discovering graph features and sentence features of tuples collaboratively.
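To make the first phase more concrete, here is a minimal sketch of generating positive and negative pairs without human labels. It is not the ALG procedure from the paper: it swaps in a simple mutual-nearest-neighbor heuristic over tuple embeddings, and the function names, the `neg_threshold` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def generate_pseudo_labels(left, right, neg_threshold=0.2):
    """Return (positives, negatives) as lists of (left_idx, right_idx).

    Hypothetical heuristic, not the paper's ALG:
    - positive: the two tuples are each other's most similar candidate
    - negative: similarity falls below `neg_threshold`
    """
    if not left or not right:
        return [], []

    sims = [[cosine(u, v) for v in right] for u in left]

    # Best right-side match for every left tuple, and vice versa.
    best_right = [max(range(len(right)), key=lambda j: sims[i][j])
                  for i in range(len(left))]
    best_left = [max(range(len(left)), key=lambda i: sims[i][j])
                 for j in range(len(right))]

    positives = [(i, j) for i, j in enumerate(best_right) if best_left[j] == i]
    negatives = [(i, j)
                 for i in range(len(left))
                 for j in range(len(right))
                 if sims[i][j] < neg_threshold]
    return positives, negatives


# Toy embeddings standing in for tuple representations from two tables.
left_table = [[1.0, 0.0], [0.0, 1.0]]
right_table = [[0.9, 0.1], [0.1, 0.9]]
positives, negatives = generate_pseudo_labels(left_table, right_table)
```

In practice the vectors would come from a sentence encoder rather than being written by hand, and the threshold would need tuning per dataset.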

For more technical details, see CollaborEM: A Self-supervised Entity Matching Framework using Multi-features Collaboration.

Requirements

Python 3.7
PyTorch 1.7.1
CUDA 11.0
HuggingFace Transformers 4.4.2
Sentence Transformers 1.0.4
NVIDIA Apex (fp16 training)
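As a quick sanity check of the environment, the following sketch compares installed package versions against the ones listed above. The helper name `find_mismatches` and the prefix-based comparison are assumptions for illustration; the distribution names `torch`, `transformers`, and `sentence-transformers` are the usual PyPI names for these libraries.

```python
try:
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Python 3.7 has no importlib.metadata; this needs the
    # importlib_metadata backport to be installed.
    from importlib_metadata import PackageNotFoundError, version

# Versions taken from the requirements list above.
EXPECTED = {
    "torch": "1.7.1",
    "transformers": "4.4.2",
    "sentence-transformers": "1.0.4",
}


def find_mismatches(expected, lookup=version):
    """Map package name -> (wanted, found) for every mismatch.

    `found` is None when the package is not installed. A prefix match
    is used so that local build tags such as "1.7.1+cu110" are accepted.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` inside the activated environment should return an empty dictionary when everything lines up. Apex and CUDA are not covered by this check.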

① Download er.tar.gz. We recommend using conda-pack to reproduce the environment:

pip install conda-pack
mkdir -p er
tar -xzf er.tar.gz -C er
./er/bin/python
source er/bin/activate

② Download and unzip lm_model.

Datasets

We conduct experiments on eight representative and widely used EM benchmarks of different sizes and from various domains, taken from the DeepMatcher paper.

The dataset configurations can be found in configs.json.
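For inspecting the dataset configurations programmatically, a minimal sketch follows. It assumes configs.json is a JSON list of objects that each carry a `name` key, a layout borrowed from DITTO-style config files; the actual schema in this repository may differ, and `load_configs` is a hypothetical helper, not part of the codebase.

```python
import json


def load_configs(path="configs.json"):
    """Load dataset configurations and index them by name.

    Assumes the file holds a JSON list of objects, each with a
    "name" key (an assumption, not confirmed by this README).
    """
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    return {entry["name"]: entry for entry in entries}
```

With that assumption, `sorted(load_configs())` would list the available dataset names.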

Training with CollaborEM

Download and unzip the preprocessed data.

To train the matching model with CollaborEM:

python run_all.py

You can download checkpoints here.

Acknowledgement

We use the code of DITTO and AttrGNN.
