Project author: PiotrNewt

Project description:
Clean data tables with a simulated knowledge base
Language: Python
Repository: git://github.com/PiotrNewt/scruber.git
Created: 2020-01-03T09:46:10Z
Project page: https://github.com/PiotrNewt/scruber

License: GNU General Public License v2.0


Scruber

Data Repair
python3.7


Problem Description

In relational-database data cleaning, KATARA (paper) has been able to clean single tables effectively by using a knowledge base and crowdsourcing.

However, because of the particular structure of a relational database, and in order to save storage space, the foreign-key columns of a relational table often store the IDs of entities in other tables rather than the full field values. A knowledge graph cannot judge such a bare, context-free ID field. The problem becomes even more serious when we assume the table's metadata is unknown during cleaning.



For example, in the figure above, the league_id and stadium_id values in the Team table actually reference data in other tables. Without knowing the table's metadata, if we feed 15, 91, or 938 directly into the knowledge base, we cannot get the result we want.
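To make the failure mode concrete, here is a minimal sketch (the table contents and KB entries are illustrative, not the project's actual data): querying the KB with a bare foreign-key ID yields nothing, while resolving the ID through the referenced table first succeeds.

```python
# Simulated knowledge base: maps entity labels to their type.
kb = {"NBA": "league", "Staples Center": "stadium"}

# The Team table stores only IDs for its league (a foreign key).
team = {"name": "Lakers", "league_id": 15}

# The League table maps those IDs back to entity labels.
league = {15: "NBA"}

def kb_lookup(value):
    """Return the KB type for a value, or None if the KB cannot judge it."""
    return kb.get(value)

# Querying the KB with the raw ID fails ...
assert kb_lookup(team["league_id"]) is None
# ... but resolving the foreign key through the League table succeeds.
assert kb_lookup(league[team["league_id"]]) == "league"
```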

Simple Idea

If we build a pattern for the entire database at once and then use that pattern to clean the tables, then even when the knowledge graph cannot return the result we expect, we can still look up other tables to correct the problematic data.

Experiment

Dataset

  • KB: use a reasonably sized RDF-like dataset to simulate knowledge-base queries

  • DB (Tables): create a table dataset following the data model below

Main Algorithm

1 Multi-table Pattern Creation

Building the pattern amounts to creating a directed graph whose edges carry semantics. The pattern's job is to mirror the structure of the knowledge graph: for any two matching entities (vertices), we can query the knowledge graph for their semantic relationship (edge), and, conversely, for any matching entity and semantic relationship, we can query for the other related entity.
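A minimal sketch of such a pattern graph (the class and the example edge labels are illustrative, not the project's actual implementation):

```python
from collections import defaultdict

class Pattern:
    """Directed graph whose edges carry semantic labels."""

    def __init__(self):
        # adjacency: source column -> {target column: edge label}
        self.edges = defaultdict(dict)

    def add_relation(self, src, label, dst):
        self.edges[src][dst] = label

    def relation(self, src, dst):
        """Given two matching entities (vertices), return the semantic
        relationship (edge), if any."""
        return self.edges[src].get(dst)

    def other_entity(self, src, label):
        """Given one entity and a relationship, return the related entity."""
        for dst, lbl in self.edges[src].items():
            if lbl == label:
                return dst
        return None

p = Pattern()
p.add_relation("Team.name", "playsIn", "Team.league_id")
assert p.relation("Team.name", "Team.league_id") == "playsIn"
assert p.other_entity("Team.name", "playsIn") == "Team.league_id"
```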

What we need to solve here is how to add ID columns to the pattern sensibly without knowing the table metadata:

```
 1. define table_list as all data tables of the entire database
 2. define col_list as all columns of a data table
 3. for t in table_list:
 4.     for C in col_list:
 5.         if the KB finds no correlation between C and the other data columns:
 6.             sort the nodes already in this table's pattern by degree
 7.             query the KB for a relation between the highest-degree node A
                and the highest-degree node B of each other table
 8.             if such a relation exists:
 9.                 add the found relation to the pattern
                    (avoids repeated queries later)
10.                 for the matched table:
11.                     count the number of unique values in each node column
12.                     take the column F with the most unique values and add a
                        'link' relation to the matched column B
13.                     verify A --'kbr'--> B --'link'--> F --'is'--> C
14.                     if it holds, build the encoded relation
                        A --'t2-c(A)-c(B)'--> C
15.                 else:
16.                     continue with the next-highest-degree node
17.             C's relation in the pattern stays empty if no matching column exists
```
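Steps 11-14 above can be sketched as follows. This is a hedged illustration under toy data: `best_link_column` picks, in a candidate match table, the column F with the most distinct values, and `verify_chain` checks that the foreign-key values in C actually resolve through F; the table contents are invented for the example.

```python
def best_link_column(table):
    """Steps 11-12: the column with the largest number of distinct values."""
    return max(table, key=lambda col: len(set(table[col])))

def verify_chain(c_values, match_table, f_col):
    """Step 13: every ID in column C must resolve through column F."""
    return set(c_values) <= set(match_table[f_col])

# Toy League table: 'id' has 3 distinct values, 'country' only 2.
league_table = {"id": [15, 91, 938], "country": ["USA", "China", "USA"]}
# Foreign-key values found in Team.league_id.
team_league_id = [15, 938]

f = best_link_column(league_table)       # -> "id"
assert verify_chain(team_league_id, league_table, f)
assert not verify_chain([7], league_table, f)  # unknown ID: chain fails
```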

Here kbr denotes a relation obtained from the KB (knowledge base), t2 is the number of the matched table, and c(A) is the column number where A is located. The encoded relation lets the repair phase decode it and locate the data quickly. The pattern produced by the algorithm looks as follows:
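A small sketch of this encoding and its inverse (the exact label format is an assumption based on the `t2-c(A)-c(B)` notation above):

```python
def encode_relation(table_no, col_a, col_b):
    """Pack the matched table number and the two column numbers into an
    edge label such as 't2-c0-c3'."""
    return f"t{table_no}-c{col_a}-c{col_b}"

def decode_relation(label):
    """Inverse encoding: recover (table number, column of A, column of B)
    so the repair phase can jump straight to the matching data."""
    t, ca, cb = label.split("-")
    return int(t[1:]), int(ca[1:]), int(cb[1:])

label = encode_relation(2, 0, 3)
assert label == "t2-c0-c3"
assert decode_relation(label) == (2, 0, 3)
```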



Because the pattern of the entire database is too large, only the relationship between two tables is shown here.

2 Table Data Cleaning

When cleaning data, we need to take the associations between different tables into account, as follows:

```
 1. find the connected components of the pattern; for each component:
 2.     sum the degrees of all nodes (columns) of each table in the component
 3.     for table t in table_list, starting from the lowest total degree
        (least correlated):
 4.         clean the data in t according to the pattern
 5.         for each dirty datum dd:
 6.             if the pattern has KB-related edges, clean with them first:
 7.                 get all correlated columns c_list
 8.                 for each relationship in the pattern:
 9.                     get the existing clean datum cd
10.                     query the KB with (relationship, cd) for candidate
                        values of dd
11.                 count the candidates and take the most frequent value
12.             else if there is an edge with an encoded relation:
13.                 decode it into t1, c1, c2
14.                 find the matching data quickly via the encoding
```
Cleaning starts from the table with the lowest total degree so that the repaired values can be reused as much as possible when cleaning the related tables.

Result

Here is a comparison with single-table cleaning.

As shown in subplots [1,0], [1,1], and [1,2] in the second row, all three of those tables have foreign-key columns. When multiple tables participate in cleaning at the same time, the result is better than single-table cleaning, and the cleaning effect gradually decreases as the dirty-data rate on the x-axis increases.

It can also be seen that the cleaning effect in subplot [0,0] is poor because the stadium entities have little coverage in the KB. The country table in subplot [0,1] is the opposite: since it has no foreign keys, its cleaning effect is about the same as single-table cleaning.