CollaborEM is a self-supervised entity matching (EM) framework via multi-features collaboration. It is capable of (i) obtaining reliable EM results with zero human annotations and (ii) discovering adequate tuple features in a fault-tolerant manner. CollaborEM consists of two phases: automatic label generation (ALG) and collaborative EM training (CEMT). In the first phase, ALG generates a set of positive tuple pairs and a set of negative tuple pairs. ALG guarantees the high quality of the generated tuple pairs, and hence ensures the training quality of the subsequent CEMT. In the second phase, CEMT learns the matching signals by collaboratively discovering graph features and sentence features of tuples.
For more technical details, see CollaborEM: A Self-supervised Entity Matching Framework using Multi-features Collaboration.
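To give a rough feel for what label generation without human annotations can look like, here is a minimal sketch. It is not the ALG procedure from the paper: it assumes each tuple has already been embedded as a vector, treats mutual nearest neighbours with high cosine similarity as positives, and treats very dissimilar pairs as negatives. The function name `generate_labels` and both thresholds are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_labels(emb_a, emb_b, pos_threshold=0.9, neg_threshold=0.3):
    """Return (positives, negatives) as lists of (index_in_a, index_in_b).

    Assumption for illustration only: a pair is positive when the two tuples
    are each other's nearest neighbour and their similarity is at least
    pos_threshold; a pair is negative when its similarity is at most
    neg_threshold. Pairs in between stay unlabeled.
    """
    sims = [[cosine(u, v) for v in emb_b] for u in emb_a]
    rows, cols = range(len(emb_a)), range(len(emb_b))

    # Nearest neighbour in the other table, in both directions.
    best_b_for_a = [max(cols, key=lambda j: sims[i][j]) for i in rows]
    best_a_for_b = [max(rows, key=lambda i: sims[i][j]) for j in cols]

    positives = [
        (i, j)
        for i, j in enumerate(best_b_for_a)
        if best_a_for_b[j] == i and sims[i][j] >= pos_threshold
    ]
    negatives = [
        (i, j) for i in rows for j in cols if sims[i][j] <= neg_threshold
    ]
    return positives, negatives


# Toy embeddings standing in for two small tables of tuples.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[0.9, 0.1], [0.1, 0.9]]
positives, negatives = generate_labels(emb_a, emb_b)
print(positives, negatives)
```

Requiring agreement in both directions is one simple way to keep noisy labels out of the positive set, at the cost of leaving many pairs unlabeled.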
- Python 3.7
- PyTorch 1.7.1
- CUDA 11.0
- HuggingFace Transformers 4.4.2
- Sentence Transformers 1.0.4
- NVIDIA Apex (fp16 training)
①Download er.tar.gz. We recommend using conda-pack to reproduce the environment:
pip install conda-pack
mkdir -p er
tar -xzf er.tar.gz -C er
./er/bin/python
source er/bin/activate
②Download and unzip lm_model.
We conduct experiments on eight representative and widely used EM benchmarks of different sizes and from various domains, taken from the DeepMatcher paper.
The dataset configurations can be found in configs.json.
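If you want to inspect a single dataset entry programmatically, something like the following may help. It is a sketch under an assumption: that configs.json is a JSON list of objects, each with a `name` field. Check the actual file before relying on this layout. The helper `load_dataset_config` is hypothetical and not part of this repository.

```python
import json


def load_dataset_config(config_path, dataset_name):
    """Return the config entry whose "name" equals dataset_name.

    Assumes (unverified) that the file holds a JSON list of objects
    that each carry a "name" key.
    """
    with open(config_path, encoding="utf-8") as f:
        configs = json.load(f)

    by_name = {entry["name"]: entry for entry in configs}
    if dataset_name not in by_name:
        known = ", ".join(sorted(by_name))
        raise KeyError(f"unknown dataset {dataset_name!r}; available: {known}")
    return by_name[dataset_name]
```

Raising a `KeyError` that lists the available names makes a mistyped dataset name easy to spot.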
Download and unzip preprocessed data.
To train the matching model with CollaborEM:
python run_all.py
You can download checkpoints here.