Record Linkage, a Spark ML use case, by Alexis Seigneurin
Record Linkage is the process of finding, within a data set, the records that represent the same entity. This operation is particularly tricky when, as in our case, you work with anonymized data. That is where Machine Learning comes to the rescue! We implemented a Record Linkage algorithm with Spark SQL (DataFrames) and Spark ML rather than relying on static rules. We will walk through the Feature Engineering process, explain why we had to extend Spark DataFrames to preserve metadata across the processing pipeline, and show how we used Machine Learning to reconcile the records. Finally, we will see how we industrialized this application.
Alexis Seigneurin: A software developer for 15 years, I care deeply about data processing, analysis, and storage. At Ippon, I mainly work on consulting and architecture engagements around Big Data technologies. I also run the Spark training course at Ippon.
2. Who I am
• Software engineer for 15 years
• Consultant at IpponTech in Paris, France
• Spark trainer
• Favorite subjects: Spark, Machine Learning, Cassandra
• @aseigneurin
3. The DIL’s mission: Help AXA become a data-driven company…
> By focusing on…
• BUILDING technological platforms using Big Data technologies
• SUPPORTING AXA entities’ Big Data projects (tools and/or expertise)
• EXPLORING innovative opportunities to transform the insurance business
> The DIL in figures:
• 1 team
• 3 sites
• 77 opportunities
• 100+ TB
• 3 sets of platforms
4. The project
• Record Linkage with Machine Learning
• Use cases:
• Find new clients who come from insurance comparison services → Commission
• Find duplicates in existing files (acquisitions)
• Related terms: Record Linkage, Entity resolution, Deduplication, Entity disambiguation…
7. Steps
1. Preprocessing
1.1. Find potential duplicates
1.2. Feature engineering
2. Manual labeling of a sample
3. Machine Learning to make predictions on the rest of the records
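Step 1.1 (finding potential duplicates) is usually done with a blocking strategy rather than comparing all n·(n−1)/2 record pairs. A minimal pure-Scala sketch, with an illustrative blocking key (the actual keys used in the project are not shown in the slides):

```scala
// Blocking sketch: only records sharing a blocking key (here, first
// letter of the surname + birth year) become candidate pairs.
case class Record(id: String, name: String, surname: String, birthYear: Int)

def blockingKey(r: Record): String =
  s"${r.surname.take(1).toLowerCase}-${r.birthYear}"

def candidatePairs(records: Seq[Record]): Seq[(Record, Record)] =
  records.groupBy(blockingKey).values.flatMap { block =>
    for {
      i <- block.indices
      j <- (i + 1) until block.size
    } yield (block(i), block(j))
  }.toSeq

val records = Seq(
  Record("000010", "jose", "lester", 1970),
  Record("000011", "jose", "lester ea", 1970),
  Record("000020", "anna", "martin", 1980)
)

// Only the two "lester" records share a block, so one pair is produced
println(candidatePairs(records).size) // 1
```

The blocking key trades recall for speed: pairs that do not share a key are never scored, so the key must be chosen so that true duplicates rarely end up in different blocks.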
8. Prototype
• Crafted by a Data Scientist
• Not architected, not versioned, not unit tested…
→ Not ready for production
• Spark, but a lot of Spark SQL (data processing)
• Machine Learning in Python (scikit-learn)
→ Objective: industrialize the code
18. From SQL…
• Generated SQL requests
• Hard to maintain (especially with regard to UDFs)
val cleaningRequest = tableSchema.map { x =>
  x.cleaning match {
    case Some(f) => f + "(" + x.name + ") as " + x.name
    case None    => x.name
  }
}.mkString(", ")
val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
cleanedTable.registerTempTable(schema.tableName + "_cleaned")
19. … to DataFrames
• DataFrame primitives
• More work done by the Scala compiler
val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
  case (currentDF, field) =>
    val udf: UserDefinedFunction = ... // get the cleaning UDF
    currentDF.withColumn(field.name + "_cleaned", udf.apply(currentDF(field.name)))
      .drop(field.name)
      .withColumnRenamed(field.name + "_cleaned", field.name)
}
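The same foldLeft pattern can be sketched without Spark, with rows modeled as Maps (field names and cleaning functions here are illustrative): each field that declares a cleaning function gets its column rewritten in place, and fields without one pass through untouched.

```scala
// Pure-Scala sketch of the foldLeft cleaning pattern
case class Field(name: String, cleaning: Option[String => String])

def cleanTable(schema: Seq[Field],
               rows: Seq[Map[String, String]]): Seq[Map[String, String]] =
  schema.filter(_.cleaning.isDefined).foldLeft(rows) {
    case (acc, field) =>
      val clean = field.cleaning.get
      acc.map(row => row.updated(field.name, clean(row(field.name))))
  }

val schema = Seq(
  Field("name", Some(_.toLowerCase.trim)),
  Field("id", None) // no cleaning declared: column left untouched
)

val cleaned = cleanTable(schema, Seq(Map("id" -> "000010", "name" -> " Jose ")))
println(cleaned.head("name")) // jose
```

The accumulator of the fold is the whole table, so each cleaning step sees the output of the previous one, exactly as each `withColumn` in the Spark version builds on the previous DataFrame.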
21. Unit testing
• ScalaTest + Scoverage
• Coverage of all the data processing operations
• Comparison of Row objects
22. Unit testing
Input data:
000010;Jose;Lester;10/10/1970
000011;Jose =-+;Lester éà;10/10/1970
000012;Jose;Lester;invalid date

val resDF = schema.cleanTable(rows)

"The cleaning process" should "clean text fields" in {
  val res = resDF.select("ID", "name", "surname").collect()
  val expected = Array(
    Row("000010", "jose", "lester"),
    Row("000011", "jose", "lester ea"),
    Row("000012", "jose", "lester")
  )
  res should contain theSameElementsAs expected
}

"The cleaning process" should "parse dates" in {
  ...
28. DataFrames extension
• 3 types of columns: data, distances, non-distance features
+------+...+------+...+-------------+--------------+...+----------------+
| ID_1|...| ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
+------+...+------+...+-------------+--------------+...+----------------+
|000010|...|000011|...| 0.0| 0.0|...| 0.0|
|000012|...|000013|...| 1.0| 0.0|...| 13.0|
+------+...+------+...+-------------+--------------+...+----------------+
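The distance columns above can be sketched in plain Scala. The exact functions used in the project are not shown in the slides, so these implementations (vowel-stripping in place of a real Soundex encoding, a day-level date difference) are illustrative assumptions, following the convention of the table that 0.0 means "identical":

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Equality feature: 0.0 when the values match, 1.0 otherwise
def equalityDistance(a: String, b: String): Double =
  if (a == b) 0.0 else 1.0

// Crude sound-alike distance (assumption: the real feature used a
// Soundex-style encoding): compare strings stripped of vowels
def soundLikeDistance(a: String, b: String): Double = {
  def skeleton(s: String) = s.toLowerCase.replaceAll("[aeiouy]", "")
  if (skeleton(a) == skeleton(b)) 0.0 else 1.0
}

// Date distance in days, e.g. for the birth-date column
def dateDiffDays(a: LocalDate, b: LocalDate): Double =
  math.abs(ChronoUnit.DAYS.between(a, b)).toDouble

println(equalityDistance("lester", "lester"))  // 0.0
println(soundLikeDistance("lester", "lister")) // 0.0 (same consonant skeleton)
println(dateDiffDays(LocalDate.of(1970, 10, 10), LocalDate.of(1970, 10, 23))) // 13.0
```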
29. DataFrames extension
• DataFrame columns have a name and a data type
• DataFrameExt = DataFrame + metadata over columns
case class OutputColumn(name: String, columnType: ColumnType)
class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
def show() = df.show()
def drop(colName: String): DataFrameExt = ...
def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
...
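A Spark-free model of the same idea (rows as Maps; class and method names beyond those shown above are illustrative): the wrapper keeps the per-column metadata in sync with every structural operation, so later pipeline stages can select, say, only the distance columns.

```scala
// Column roles, mirroring the three types of columns shown earlier
sealed trait ColumnType
case object Data extends ColumnType
case object Distance extends ColumnType
case object NonDistanceFeature extends ColumnType

case class OutputColumn(name: String, columnType: ColumnType)

// Spark-free stand-in for DataFrame + metadata
class TableExt(val rows: Seq[Map[String, Double]],
               val outputColumns: Seq[OutputColumn]) {

  def drop(colName: String): TableExt =
    new TableExt(rows.map(_ - colName),
                 outputColumns.filterNot(_.name == colName))

  def withColumn(colName: String, values: Seq[Double],
                 columnType: ColumnType): TableExt =
    new TableExt(rows.zip(values).map { case (r, v) => r.updated(colName, v) },
                 outputColumns :+ OutputColumn(colName, columnType))

  // Metadata lets later stages select e.g. only the distance columns
  def distanceColumns: Seq[String] =
    outputColumns.collect { case OutputColumn(n, Distance) => n }
}

val t = new TableExt(Seq(Map("ID" -> 10.0)), Seq(OutputColumn("ID", Data)))
  .withColumn("equality_name", Seq(0.0), Distance)

println(t.distanceColumns) // List(equality_name)
```

Keeping the metadata next to the data is what makes the last step possible: the ML stage can assemble its feature vector from the distance and non-distance feature columns without hard-coding column names.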
34. Predictions
• Machine Learning
• Random Forests
• (Gradient Boosted Trees also give good results)
• Training on the potential duplicates labeled by hand
• Predictions on the potential duplicates not labeled by hand
35. Predictions
• Sample: 1000 records
• Training set: 800 records
• Test set: 200 records
• Results
• True positives: 53
• False positives: 2
• True negatives: 126
• False negatives: 5
→ Found 53 of the 58 expected duplicates (53 + 5), with only 2 errors
• Precision = 53 / (53 + 2) ≈ 96%
• Recall = 53 / (53 + 5) ≈ 91%
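Both metrics follow directly from the confusion-matrix counts above (a minimal Scala sketch):

```scala
// Confusion-matrix counts from the slide
val (tp, fp, tn, fn) = (53, 2, 126, 5)

// Precision: of the pairs predicted as duplicates, how many really are
val precision = tp.toDouble / (tp + fp) // 53 / 55 ≈ 0.96

// Recall: of the real duplicates, how many were found
val recall = tp.toDouble / (tp + fn) // 53 / 58 ≈ 0.91

println(precision) // ≈ 0.96
println(recall)    // ≈ 0.91
```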
37. Summary
✓ Single engine for Record Linkage and Deduplication
✓ Machine Learning → Specific rules for each dataset
✓ More matches identified
• Previously ~50% → Now 80-90%