项目作者: cleanzr

项目描述 :
Duplicate detection in R using a Bayesian partitioning approach
高级语言: R
项目地址: git://github.com/cleanzr/BDD.git
创建时间: 2021-01-24T04:03:09Z
项目社区:https://github.com/cleanzr/BDD

开源协议:GNU General Public License v3.0

下载


BDD: Duplicate detection using a Bayesian partitioning approach

This R package implements a Bayesian model for duplicate detection as
proposed in (Sadinle 2014). It resembles the classic Fellegi-Sunter
model (Fellegi and Sunter 1969) in that coreference decisions are based
on pairwise attribute-level comparisons. However, unlike the
Fellegi-Sunter mdoel, this model targets a partition of the records,
rather than pairwise coreference predictions, thereby ensuring
transitivity. Inference is conducted using a Gibbs sampler, with parts
implemented in Rcpp.

Installation

BDD is not currently available on CRAN. The latest development version
can be installed from source using devtools as follows:

  1. library(devtools)
  2. install_github("cleanzr/BDD")

Example

We demonstrate how to perform duplicate detection using the RLdata500
test data set included with the package. RLdata500 contains 500
synthetic personal records, 50 of which are duplicates with
randomly-generated errors. In the code block, below we load the package
and examine the first few rows of RLdata500.

  1. library(magrittr) # pipe operator (must be loaded before BDD)
  2. library(comparator) # normalized Levenshtein distance
  3. library(clevr) # evaluation functions
  4. library(BDD)
  5. RLdata500[['rec_id']] <- seq.int(nrow(RLdata500))
  6. head(RLdata500)
  7. #> fname_c1 fname_c2 lname_c1 lname_c2 by bm bd rec_id
  8. #> 1 CARSTEN <NA> MEIER <NA> 1949 7 22 1
  9. #> 2 GERD <NA> BAUER <NA> 1968 7 27 2
  10. #> 3 ROBERT <NA> HARTMANN <NA> 1930 4 30 3
  11. #> 4 STEFAN <NA> WOLFF <NA> 1957 9 2 4
  12. #> 5 RALF <NA> KRUEGER <NA> 1966 1 13 5
  13. #> 6 JUERGEN <NA> FRANKE <NA> 1929 7 4 6

The model uses agreement levels between attributes in order to assess
whether a pair of records is a match (referring to the same entity) or
not. To compute the agreement levels, we’ll first compute real-valued
scores between the attributes, then maps the scores to discrete levels
of agreement. In the code block below, we specify scoring functions for
the five attributes we will use for matching (we don’t use fname_c2
and lname_c2 as more than 90% of the values are missing). Note that
the scoring functions must accept a pair of attribute vectors x and
y as arguments and return a vector of scores.

  1. scoring_fns <- list(
  2. fname_c1 = Levenshtein(normalize = TRUE),
  3. lname_c1 = Levenshtein(normalize = TRUE),
  4. by = function(x, y) abs(x - y),
  5. bm = function(x, y) abs(x - y),
  6. bd = function(x, y) abs(x - y)
  7. )

For each scoring function above, we provide a breaks vector which
specifies the discrete levels of agreement (from ‘high’ agreement to
‘low’).

  1. scoring_breaks <- list(
  2. fname_c1 = c(-Inf, .05, .2, .4, Inf),
  3. lname_c1 = c(-Inf, .05, .2, .4, Inf),
  4. by = c(-Inf, 0, 1, 3, Inf),
  5. bm = c(-Inf, 0, 1, 3, Inf),
  6. bd = c(-Inf, 0, 2, 7, Inf)
  7. )

Now we are ready to compute the agreement levels for the record pairs.
Since this is a small data set, we consider all pairs using the
pairs_all function. For larger data sets, blocking/indexing is
recommended using pairs_hamming, pairs_fuzzyblock or a custom
indexing function.

  1. pairs <- pairs_all(RLdata500$rec_id) %>%
  2. compute_scores(RLdata500, scoring_fns, id_col = 'rec_id') %>%
  3. discretize_scores(scoring_breaks)

To speed up inference, we only consider a subset of the pairs as
candidate matches. Specifically, we consider pairs that have a strong
agreement on name (accounting for missing names).

  1. pairs[['candidate']] <- (pairs$fname_c1 < 4) & (pairs$lname_c1 < 4) |
  2. is.na(pairs$fname_c1) | is.na(pairs$lname_c1)

Next we specify the priors on the m* and u* probabilities for each
attribute and agreement level. lambda specifies the lower truncation
points for the truncated Beta priors on the m* probabilities. By
default, a uniform prior is used over the truncated interval; a
non-uniform prior can be specified using the alpha1 and beta1
arguments to BDD below. The Beta priors on the u* probabilities are
also uniform by default and can be adjusted using the alpha0 and
beta0 arguments.

  1. lambda <- list(
  2. fname_c1 = c(0.8, 0.85, 0.99),
  3. lname_c1 = c(0.8, 0.85, 0.99),
  4. by = c(0.8, 0.85, 0.99),
  5. bm = c(0.8, 0.85, 0.99),
  6. bd = c(0.8, 0.85, 0.99)
  7. )

Finally, we initialize the model and run inference using Markov chain
Monte Carlo.

  1. model <- BDD(pairs, lambda, id_cols = c("rec_id.x", "rec_id.y"),
  2. candidate_col = "candidate")
  3. fit <- run_inference(model, 100, thin_interval = 10, burnin_interval = 100)
  4. #> Completed sampling in 1.464495 secs

The posterior samples of the linkage structure can be accessed by
calling extract(fit, "links"). However, these samples only cover the
records that were considered as candidate pairs. We can obtain samples
of the complete linkage structure (for all records) using the following
function.

  1. links_samples <- complete_links_samples(fit, RLdata500$rec_id)

Since RLdata500 comes with ground truth, we can evaluate the predicted
linkage structure using supervised metrics. We find that the model
performs well on this data set, achieving high pairwise precision and
recall.

  1. n_records <- nrow(RLdata500)
  2. true_pairs <- membership_to_pairs(identity.RLdata500)
  3. metrics <- apply(links_samples, 1, function(links) {
  4. pred_pairs <- membership_to_pairs(links)
  5. pr <- precision_pairs(true_pairs, pred_pairs)
  6. re <- recall_pairs(true_pairs, pred_pairs)
  7. c(Precision = pr, Recall = re)
  8. }) %>% t()
  9. summary(metrics)
  10. #> Precision Recall
  11. #> Min. :0.8772 Min. :0.9600
  12. #> 1st Qu.:0.9259 1st Qu.:1.0000
  13. #> Median :0.9434 Median :1.0000
  14. #> Mean :0.9386 Mean :0.9968
  15. #> 3rd Qu.:0.9615 3rd Qu.:1.0000
  16. #> Max. :0.9804 Max. :1.0000

Licence

GPL-3

References





Fellegi, Ivan P., and Alan B. Sunter. 1969. “A Theory for Record
Linkage.” Journal of the American Statistical Association 64 (328):
1183–1210. https://doi.org/10.1080/01621459.1969.10501049.



Sadinle, Mauricio. 2014. “Detecting Duplicates in a Homicide Registry
Using a Bayesian Partitioning Approach.” Ann. Appl. Stat. 8 (4):
2404–34. https://doi.org/10.1214/14-AOAS779.