-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
130 lines (112 loc) · 4.72 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
output:
md_document:
variant: gfm
bibliography: refs.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# exchanger: Bayesian Entity Resolution with Exchangeable Random Partition Priors
## Overview
This R package implements the Bayesian entity resolution model described
in [@marchant_exchangeable_2023]. The model assumes records are generated
from a latent population of entities via a distortion process. Entity
resolution is performed by inferring the latent _linkage structure_
using a Markov chain Monte Carlo algorithm.
## Features
* Supports user-defined attribute distance functions for modeling the
distortion process.
* Supports missing values.
* Implements the Ewens-Pitman (EP) family of exchangeable clustering priors
[@pitman_exchangeable_2006], with support for hyperpriors on the EP
parameters. This includes the Ewens partition (related to the Dirichlet
process), Pitman-Yor partition (related to the Pitman-Yor process) and
Generalized Coupon partition (related to finite mixture models).
* Inference is implemented in C++ using partially-collapsed Gibbs sampling
[@dyk_partially_2008]. Several computational tricks are used to
improve scalability.
## Installation
`exchanger` is not currently available on CRAN.
The latest development version can be installed from source using `devtools`
as follows:
```{r, eval=FALSE}
library(devtools)
install_github("cleanzr/exchanger")
```
## Example
We demonstrate how to perform entity resolution using the `RLdata500` test
data set included with the package. `RLdata500` contains 500 synthetic
personal records, 50 of which are duplicates with randomly-generated errors.
In the code block, below we load the package and examine the first few rows
of `RLdata500`.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
library(exchanger)
library(comparator) # provides string comparison functions
library(clevr) # provides functions for supervised evaluation
head(RLdata500)
```
Next we specify the model parameters for the entity attributes.
For simplicity, we use a uniform prior for the distortion probability
associated with each attribute.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
unif_prior <- BetaRV(1, 1)
```
We model the distortion for the name attributes (`fname_c1` and `lname_c1`)
using Levenshtein (edit) distance.
The `transform_dist_fn` applies a threshold at a distance of 3, so that any
pair of names `y`, `x` with an edit distance strictly greater than 3 are
assumed to be "infinitely" apart.
This improves computational efficiency, as the inference algorithm exploits
the fact that `x` can never be a distortion of `y` if their distance is
strictly greater than 3.
For the date attributes (`by`, `bm` and `bd`) we assume a constant distance
function, which means all values in the domain are equally likely a priori
as a distorted alternative.
The `CategoricalAttribute` constructor provides a shortcut for an attribute
with a constant distance function.
```{r, eval=TRUE, message=FALSE}
attr_params <- list(
fname_c1 = Attribute(transform_dist_fn(Levenshtein(), threshold=3),
distort_prob_prior = unif_prior),
lname_c1 = Attribute(transform_dist_fn(Levenshtein(), threshold=3),
distort_prob_prior = unif_prior),
by = CategoricalAttribute(distort_prob_prior = unif_prior),
bm = CategoricalAttribute(distort_prob_prior = unif_prior),
bd = CategoricalAttribute(distort_prob_prior = unif_prior)
)
```
Finally we specify the prior over the linkage structure (clustering). Here we
use a Pitman-Yor random partition for the prior, with hyperpriors on the
concentration and discount parameters.
```{r, eval=TRUE, message=FALSE}
clust_prior <- PitmanYorRP(alpha = GammaRV(1, 1), d = BetaRV(1, 1))
```
All that remains is to initialize the model and run inference.
For demonstration purposes, we only generate 100 posterior samples of the
linkage structure.
```{r, eval=TRUE, message=FALSE}
model <- exchanger(RLdata500, attr_params, clust_prior)
result <- run_inference(model, n_samples=100, thin_interval=10, burnin_interval=1000)
```
We can obtain a point estimate of the most likely linkage structure using the
_shared most probable maximal matching sets method_ [@steorts2016].
```{r, eval=TRUE}
pred_clust <- smp_clusters(result)
```
Then we evaluate with respect to the true matching.
```{r, eval=TRUE}
n_records <- nrow(RLdata500)
true_pairs <- membership_to_pairs(identity.RLdata500)
pred_pairs <- clusters_to_pairs(pred_clust)
measures <- eval_report_pairs(true_pairs, pred_pairs, num_pairs=n_records*(n_records-1)/2)
print(measures)
```
## License
GPL-3
## References