Record Linkage, a Spark ML use case, by Alexis Seigneurin
Record Linkage is the process of finding, within a data set, the records that represent the same entity. This operation is particularly tricky when, as in our case, you work with anonymized data. That is where Machine Learning comes to the rescue! We implemented a Record Linkage algorithm with Spark SQL (DataFrames) and Spark ML rather than relying on static rules. We will walk through the Feature Engineering process, explain why we had to extend Spark DataFrames to preserve metadata across the processing pipeline, and show how we used Machine Learning to reconcile the records. Finally, we will see how we industrialized this application.
Alexis Seigneurin: A software developer for 15 years, I care deeply about data processing, analysis, and storage. At Ippon, I mainly work on consulting and architecture engagements around Big Data technologies. I also run the Spark training course at Ippon.
2. Who I am
• Software engineer for 15 years
• Consultant at IpponTech in Paris, France
• Spark trainer
• Favorite subjects: Spark, Machine Learning, Cassandra
• @aseigneurin
3. The DIL’s mission: Help AXA become a data-driven company…
> By focusing on…
• BUILDING technological platforms using Big Data technologies
• SUPPORTING AXA entities’ Big Data projects (tools and/or expertise)
• EXPLORING innovative opportunities to transform the insurance business
> The DIL in figures:
• 1 team
• 3 sites
• 77 opportunities
• 100+ TB
• 3 sets of platforms
4. The project
• Record Linkage with Machine Learning
• Use cases:
• Find new clients who come from insurance comparison services → Commission
• Find duplicates in existing files (acquisitions)
• Related terms: Record Linkage, Entity resolution, Deduplication, Entity disambiguation…
7. Steps
1. Preprocessing
1.1. Find potential duplicates
1.2. Feature engineering
2. Manual labeling of a sample
3. Machine Learning to make predictions on the rest of the records
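Step 1.1 (finding potential duplicates) is usually done with a blocking strategy rather than comparing all n·(n−1)/2 record pairs. A minimal pure-Scala sketch, with an illustrative blocking key (the actual keys used in the project are not shown in the slides):

```scala
// Blocking sketch: only records sharing a blocking key (here, first
// letter of the surname + birth year) become candidate pairs.
case class Record(id: String, name: String, surname: String, birthYear: Int)

def blockingKey(r: Record): String =
  s"${r.surname.take(1).toLowerCase}-${r.birthYear}"

def candidatePairs(records: Seq[Record]): Seq[(Record, Record)] =
  records.groupBy(blockingKey).values.flatMap { block =>
    for {
      i <- block.indices
      j <- (i + 1) until block.size
    } yield (block(i), block(j))
  }.toSeq

val records = Seq(
  Record("000010", "jose", "lester", 1970),
  Record("000011", "jose", "lester ea", 1970),
  Record("000020", "anna", "martin", 1980)
)

// Only the two "lester" records share a block, so one pair is produced
println(candidatePairs(records).size) // 1
```

The blocking key trades recall for speed: pairs that do not share a key are never scored, so the key must be chosen so that true duplicates rarely end up in different blocks.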
8. Prototype
• Crafted by a Data Scientist
• Not architected, not versioned, not unit tested…
→ Not ready for production
• Spark, but a lot of Spark SQL (data processing)
• Machine Learning in Python (scikit-learn)
→ Objective: industrialize the code
18. From SQL…
• Generated SQL requests
• Hard to maintain (especially with regard to UDFs)
val cleaningRequest = tableSchema.map { x =>
  x.cleaning match {
    case Some(f) => f + "(" + x.name + ") as " + x.name
    case None    => x.name
  }
}.mkString(", ")
val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
cleanedTable.registerTempTable(schema.tableName + "_cleaned")
19. … to DataFrames
• DataFrame primitives
• More work done by the Scala compiler
val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
  case (currentDF, field) =>
    val udf: UserDefinedFunction = ... // get the cleaning UDF
    currentDF.withColumn(field.name + "_cleaned", udf.apply(currentDF(field.name)))
      .drop(field.name)
      .withColumnRenamed(field.name + "_cleaned", field.name)
}
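The same foldLeft pattern can be sketched without Spark, with rows modeled as Maps (field names and cleaning functions here are illustrative): each field that declares a cleaning function gets its column rewritten in place, and fields without one pass through untouched.

```scala
// Pure-Scala sketch of the foldLeft cleaning pattern
case class Field(name: String, cleaning: Option[String => String])

def cleanTable(schema: Seq[Field],
               rows: Seq[Map[String, String]]): Seq[Map[String, String]] =
  schema.filter(_.cleaning.isDefined).foldLeft(rows) {
    case (acc, field) =>
      val clean = field.cleaning.get
      acc.map(row => row.updated(field.name, clean(row(field.name))))
  }

val schema = Seq(
  Field("name", Some(_.toLowerCase.trim)),
  Field("id", None) // no cleaning declared: column left untouched
)

val cleaned = cleanTable(schema, Seq(Map("id" -> "000010", "name" -> " Jose ")))
println(cleaned.head("name")) // jose
```

The accumulator of the fold is the whole table, so each cleaning step sees the output of the previous one, exactly as each `withColumn` in the Spark version builds on the previous DataFrame.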
21. Unit testing
• ScalaTest + Scoverage
• Coverage of all the data processing operations
• Comparison of Row objects
22. Unit testing
Input data:
000010;Jose;Lester;10/10/1970
000011;Jose =-+;Lester éà;10/10/1970
000012;Jose;Lester;invalid date

val resDF = schema.cleanTable(rows)

"The cleaning process" should "clean text fields" in {
  val res = resDF.select("ID", "name", "surname").collect()
  val expected = Array(
    Row("000010", "jose", "lester"),
    Row("000011", "jose", "lester ea"),
    Row("000012", "jose", "lester")
  )
  res should contain theSameElementsAs expected
}

"The cleaning process" should "parse dates" in {
  ...
28. DataFrames extension
• 3 types of columns: data, distances, non-distance features
+------+...+------+...+-------------+--------------+...+----------------+
| ID_1|...| ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
+------+...+------+...+-------------+--------------+...+----------------+
|000010|...|000011|...| 0.0| 0.0|...| 0.0|
|000012|...|000013|...| 1.0| 0.0|...| 13.0|
+------+...+------+...+-------------+--------------+...+----------------+
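The distance columns above can be sketched in plain Scala. The exact functions used in the project are not shown in the slides, so these implementations (vowel-stripping in place of a real Soundex encoding, a day-level date difference) are illustrative assumptions, following the convention of the table that 0.0 means "identical":

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Equality feature: 0.0 when the values match, 1.0 otherwise
def equalityDistance(a: String, b: String): Double =
  if (a == b) 0.0 else 1.0

// Crude sound-alike distance (assumption: the real feature used a
// Soundex-style encoding): compare strings stripped of vowels
def soundLikeDistance(a: String, b: String): Double = {
  def skeleton(s: String) = s.toLowerCase.replaceAll("[aeiouy]", "")
  if (skeleton(a) == skeleton(b)) 0.0 else 1.0
}

// Date distance in days, e.g. for the birth-date column
def dateDiffDays(a: LocalDate, b: LocalDate): Double =
  math.abs(ChronoUnit.DAYS.between(a, b)).toDouble

println(equalityDistance("lester", "lester"))  // 0.0
println(soundLikeDistance("lester", "lister")) // 0.0 (same consonant skeleton)
println(dateDiffDays(LocalDate.of(1970, 10, 10), LocalDate.of(1970, 10, 23))) // 13.0
```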
29. DataFrames extension
• DataFrame columns have a name and a data type
• DataFrameExt = DataFrame + metadata over columns
case class OutputColumn(name: String, columnType: ColumnType)
class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
def show() = df.show()
def drop(colName: String): DataFrameExt = ...
def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
...
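A Spark-free model of the same idea (rows as Maps; class and method names beyond those shown above are illustrative): the wrapper keeps the per-column metadata in sync with every structural operation, so later pipeline stages can select, say, only the distance columns.

```scala
// Column roles, mirroring the three types of columns shown earlier
sealed trait ColumnType
case object Data extends ColumnType
case object Distance extends ColumnType
case object NonDistanceFeature extends ColumnType

case class OutputColumn(name: String, columnType: ColumnType)

// Spark-free stand-in for DataFrame + metadata
class TableExt(val rows: Seq[Map[String, Double]],
               val outputColumns: Seq[OutputColumn]) {

  def drop(colName: String): TableExt =
    new TableExt(rows.map(_ - colName),
                 outputColumns.filterNot(_.name == colName))

  def withColumn(colName: String, values: Seq[Double],
                 columnType: ColumnType): TableExt =
    new TableExt(rows.zip(values).map { case (r, v) => r.updated(colName, v) },
                 outputColumns :+ OutputColumn(colName, columnType))

  // Metadata lets later stages select e.g. only the distance columns
  def distanceColumns: Seq[String] =
    outputColumns.collect { case OutputColumn(n, Distance) => n }
}

val t = new TableExt(Seq(Map("ID" -> 10.0)), Seq(OutputColumn("ID", Data)))
  .withColumn("equality_name", Seq(0.0), Distance)

println(t.distanceColumns) // List(equality_name)
```

Keeping the metadata next to the data is what makes the last step possible: the ML stage can assemble its feature vector from the distance and non-distance feature columns without hard-coding column names.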
34. Predictions
• Machine Learning
• Random Forests
• (Gradient Boosted Trees also give good results)
• Training on the potential duplicates labeled by hand
• Predictions on the potential duplicates not labeled by hand
35. Predictions
• Sample: 1000 records
• Training set: 800 records
• Test set: 200 records
• Results
• True positives: 53
• False positives: 2
• True negatives: 126
• False negatives: 5
→ Found 53 of the 58 expected duplicates (53 + 5), with only 2 errors
• Precision = 53 / (53 + 2) ≈ 96%
• Recall = 53 / (53 + 5) ≈ 91%
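Both metrics follow directly from the confusion-matrix counts above (a minimal Scala sketch):

```scala
// Confusion-matrix counts from the slide
val (tp, fp, tn, fn) = (53, 2, 126, 5)

// Precision: of the pairs predicted as duplicates, how many really are
val precision = tp.toDouble / (tp + fp) // 53 / 55 ≈ 0.96

// Recall: of the real duplicates, how many were found
val recall = tp.toDouble / (tp + fn) // 53 / 58 ≈ 0.91

println(precision) // ≈ 0.96
println(recall)    // ≈ 0.91
```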
37. Summary
✓ Single engine for Record Linkage and Deduplication
✓ Machine Learning → Specific rules for each dataset
✓ More matches identified
• Previously ~50% → Now 80-90%