
[DSC Europe 22] Latest Techniques of Entity Matching in NLP - Avinash Pathak



Entity matching is the task of deciding whether two records refer to the same real-world entity, for example, two songs. In that example, the match is decided from attributes such as song name, composer, lyricist, artist, and genre. When two records have to be matched in the real world, things get complicated: attributes are missing, there are typos, and so on. Techniques have evolved considerably with the advent of NLP in data science. This talk covers the journey and best practices for entity matching in industry and research, which anyone can apply directly.


  1. Latest Techniques of Entity Matching in NLP. Avinash Pathak, Expert Data Scientist, TomTom
  2. Agenda
     - What is entity matching?
     - Why is entity matching important?
     - History of entity matching
     - Entity matching models
     - How to measure the success of entity matching
     - Entity Embed: an open-source tool for entity matching
  3. What is entity matching?
     Entity matching refers to the problem of determining whether two different data representations refer to the same real-world entity.
     Example: song matching.
     Id | Title                    | Album                 | Composer      | Songwriter
     1  | Me and Mrs Jones         | Call me irresponsible | Michael Bublé | Michael Bublé
     2  | Me and Mrs Jones [Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
     Example use cases:
     - Song matching
     - Address matching
     - Social profile matching (Facebook, Twitter)
     - Clothes matching (for that matter, any item in retail)
  4. Things get a little complicated
     - Colloquial word usage
     - Additional information
     - Un-normalized values
     - Unstructured data
     - Missing data
     - Dirty data
     - Unavailability of supervised data
     - Scale
     Id | Title                    | Album                      | Composer      | Songwriter
     1  | Me and Mrs Jones         | Call me irresponsible      |               | Michael Bublé
     2  | Me and Mrs Jones [Remix] |                            | Michael Bublé |
     3  | Blowin' in the Wind      | The Freewheelin' Bob Dylam |               | Bob Dylan
     4  | Blowing in the Wind      |                            |               | Bob Dylan
  5. The problem of scale, hence blocking
     An exhaustive pairwise comparison grows quadratically with the number of records, which is unaffordable for datasets of even moderate size. As a result, virtually every entity matching task on large datasets requires blocking: a step that effectively reduces the number of record pairs to be considered for matching without ruling out true matches.
     A successful application of blocking to an entity matching task should fulfil four desiderata:
     - High recall: blocking should ideally not leave out any true matches, since only the candidate record pairs generated by blocking are examined in the downstream matching step.
     - Few candidate pairs: the number of candidate pairs should be small, so that the cost of applying a usually computationally expensive matching algorithm is controlled.
     - Low human effort: human effort should not be overspent during the blocking process.
     - Scalability: the blocking algorithm should be able to handle millions of records.
  6. Blocking: the arithmetic
     Without blocking: nC2 rigorous, computationally expensive comparisons.
     With blocking: cheap comparisons first assign records to clusters; expensive comparisons run only within clusters, giving roughly kC2 × M expensive comparisons (k = cluster size, M = number of clusters).
     Example with n = 10 records:
     - Without blocking: 10C2 = 10 × 9 / 2 = 45 expensive comparisons.
     - With 5 clusters of size k = 2: kC2 × M = 1 × 5 = 5 expensive comparisons.
     - With 3 clusters of size k = 3 plus one singleton: 3C2 × 3 = 3 × 3 = 9 expensive comparisons (the singleton contributes none).
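The arithmetic above can be sketched in Python. The records and the first-four-characters blocking key below are hypothetical stand-ins for a cheap, non-expensive comparison:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical song titles; the blocking key is a cheap normalization.
records = [
    "Blowin' in the Wind", "Blowing in the Wind",
    "Me and Mrs Jones", "Me and Mrs Jones [Remix]",
    "Marooned", "Maroon 5 - Memories",
]

def blocking_key(title: str) -> str:
    # Cheap comparison: lowercase, keep alphanumerics, take the first 4 chars.
    return "".join(ch for ch in title.lower() if ch.isalnum())[:4]

# Without blocking: nC2 expensive comparisons.
n = len(records)
print(n * (n - 1) // 2)  # 15

# With blocking: expensive comparisons only within each block.
blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)
candidate_pairs = [p for block in blocks.values() for p in combinations(block, 2)]
print(len(candidate_pairs))  # 3
```

Note that blocking can still admit non-matches into the same block (here "Marooned" and "Maroon 5 - Memories"); the expensive matcher decides those downstream.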
  7. History of EM
     - Pattern matching / fuzzy matching
     - Deep-learning-based blocking
     - Self-supervised blocking
  8. Pattern matching / fuzzy matching
     - Very specific solution
     - Needs incremental updates whenever new examples come in
     - Cross-attribute mismatches exist and would need enormous effort to cover
     Example:
     Id | Title                    | Album                 | Composer      | Songwriter
     1  | Me and Mrs Jones         | Call me irresponsible | Michael Bublé | Michael Bublé
     2  | Me and Mrs Jones [Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
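A minimal fuzzy-matching sketch using the standard library's `difflib`; the titles and the 0.8 match threshold are illustrative, not part of the talk:

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a: str, b: str) -> float:
    # Similarity in [0, 1] based on longest matching subsequences.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Blowin' in the Wind", "Blowing in the Wind"),   # typo variant
    ("Me and Mrs Jones", "Me and Mrs Jones [Remix]"), # extra information
    ("Me and Mrs Jones", "Blowin' in the Wind"),      # different songs
]
for a, b in pairs:
    score = fuzzy_ratio(a, b)
    print(f"{a!r} vs {b!r}: {score:.2f} -> {'match' if score > 0.8 else 'no match'}")
```

This shows why the approach stays very specific: the threshold and normalization must be re-tuned whenever new kinds of variation appear.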
  9. EM: deep-learning-based blocking (Autoblock)
     Architecture of Autoblock:
     - Token embedding: a word-embedding model transforms each token into a token embedding.
     - Attribute embedding: for each attribute value of a tuple, an attention-based neural-network encoder converts the input sequence of token embeddings into an attribute embedding.
     - Tuple signature: multiple signature functions combine the attribute embeddings of each tuple and produce multiple tuple signatures (one per signature function).
     - Model training: equipped with the positive label set, the model is trained with an objective that maximizes the difference between the cosine similarities of the tuple signatures of matched pairs and those of unmatched pairs.
     - Fast NN search: the learned model computes the signatures for all tuples, and an LSH family for cosine similarity retrieves the nearest neighbours of each tuple to generate the candidate pairs for blocking.
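The last two steps can be sketched with hypothetical tuple signatures and a brute-force nearest-neighbour search; Autoblock itself uses a trained encoder for the signatures and an LSH index instead of the exhaustive loop below:

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical tuple signatures (in practice, output of the trained model).
signatures = {
    "song_1": [0.90, 0.10, 0.00],
    "song_2": [0.85, 0.15, 0.05],
    "song_3": [0.00, 0.20, 0.95],
}

def candidate_pairs(sigs, threshold=0.9):
    # Keep pairs whose signatures are close in cosine space.
    ids = sorted(sigs)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(sigs[a], sigs[b]) >= threshold
    ]

print(candidate_pairs(signatures))  # [('song_1', 'song_2')]
```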
  10. Self-supervised blocking for entity matching (DeepBlock)
      - Encoder-decoder for self-supervision: take a tuple t and feed it into a neural network (NN) to output a compact embedding vector u_t, such that if we feed u_t into a second NN, we can recover the original tuple t (or a good approximation of t). If this happens, u_t can be viewed as a good compact summary of tuple t, and can be used as the tuple embedding of t. The two NNs are called the encoder and the decoder, respectively.
      - We can create u_t in various ways. For example, one can feed in the original tuple t with a few attributes missing and recover the original tuple with all the attributes; this helps the model match two entities even when one of them has missing attributes.
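The attribute-dropping idea yields training pairs for free, with no labels. A minimal sketch of generating such (corrupted input, reconstruction target) pairs; the record, drop probability, and function name are illustrative:

```python
import random

def make_training_pairs(record, n_samples=3, drop_prob=0.5, seed=42):
    # Self-supervised pairs for a denoising encoder-decoder:
    # input = tuple with some attributes blanked, target = the original tuple.
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        corrupted = {
            k: ("" if k != "id" and rng.random() < drop_prob else v)
            for k, v in record.items()
        }
        pairs.append((corrupted, record))
    return pairs

record = {"id": 1, "title": "Blowin' in the Wind", "artist": "Bob Dylan"}
for noisy, target in make_training_pairs(record):
    print(noisy, "->", target)
```

The encoder is then trained so that the embedding of the corrupted tuple still lets the decoder recover the full tuple, which is exactly the robustness to missing attributes the slide describes.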
  11. After blocking: the computationally expensive comparisons
      - Edit distance between strings
      - Embedding distance between entities
      - Phonetic similarity
      - Length
      - Jaro-Winkler
      - Sequence matcher
      - Jaccard similarity
      - Entity-specific features
      Reference: Location Matching Kaggle competition, https://www.kaggle.com/code/icfstat/lightgbm-feature-engineering-training-0-888-pv-lb
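A few of these features can be computed in plain Python; the sketch below implements token-set Jaccard similarity and Levenshtein edit distance on the deck's Dylan example, and is a simplification of the feature engineering in the linked Kaggle notebook:

```python
def jaccard(a: str, b: str) -> float:
    # Token-set overlap between two strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

pair = ("Blowin' in the Wind", "Blowing in the Wind")
features = {
    "jaccard": jaccard(*pair),
    "edit_distance": levenshtein(*pair),
    "length_diff": abs(len(pair[0]) - len(pair[1])),
}
print(features)  # {'jaccard': 0.6, 'edit_distance': 1, 'length_diff': 0}
```

In practice such features for each candidate pair feed a classifier (a gradient-boosted model in the Kaggle reference) that makes the final match decision.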
  12. How to measure the success of entity matching?
      - Model verification in isolation: precision, recall.
      - Business metrics: business-specific.
      - Can we ever know whether we did perfect entity matching? Consider using entity matching for social-media profile deduplication: there is no perfect way of knowing whether all duplicate/redundant profiles were identified.
      - Human in the loop: sample the population and check whether entity matching worked for your cases. Be mindful of testing with various sample sizes and various mixtures of populations.
      Note: you can also check the benchmarks defined by Machamp (https://github.com/megagonlabs/machamp).
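Pairwise precision and recall can be computed directly on sets of record-id pairs; the true and predicted pairs below are made up for illustration:

```python
# Ground-truth matching pairs and the model's predicted pairs (hypothetical ids).
true_pairs = {(1, 2), (3, 4), (5, 6)}
predicted_pairs = {(1, 2), (3, 4), (5, 7)}

tp = len(true_pairs & predicted_pairs)          # correctly predicted matches
precision = tp / len(predicted_pairs)           # how many predictions were right
recall = tp / len(true_pairs)                   # how many true matches were found
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Recall is the metric to watch for the blocking step (a missed pair can never be recovered downstream), while precision matters most for the final matcher.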
  13. Applications of entity matching
      - Song matching
      - Address matching
      - Social profile matching (Facebook, Twitter)
      - Clothes matching (for that matter, any item in retail)
      - Profile matching on matrimony and dating sites
  14. Entity Embed: installation and data preparation
      Installation:
          pip install entity-embed
      Preparing the data: the data needs to be a list of dict objects, each of which must contain 'id' and 'cluster' keys:
          [{'id': 0, 'cluster': 0, 'title': '001-Berimbou', 'artist': 'Astrud Gilberto', 'album': 'Look to the Rainbow (2008)'},
           {'id': 1, 'cluster': 0, 'title': 'Berimbau', 'artist': 'Astrud Gilberto', 'album': 'Look to the Rainbow (1966)'},
           {'id': 2, 'cluster': 1, 'title': '4 - Marooned - Pink Floyd', 'artist': '', 'album': 'The Division Bell'}]
  15. Entity Embed: defining the fields
      We need to define how record fields will be numericalized and encoded by Entity Embed's deep neural network:
          field_config_dict = {
              'title': {
                  'field_type': "MULTITOKEN",
                  'tokenizer': "entity_embed.default_tokenizer",
                  'alphabet': DEFAULT_ALPHABET,
                  'max_str_len': None,  # compute
              },
              'title_semantic': {
                  'key': 'title',
                  'field_type': "SEMANTIC_MULTITOKEN",
                  'tokenizer': "entity_embed.default_tokenizer",
                  'vocab': "fasttext.en.300d",
              },
          }
  16. Entity Embed: building the model
      Under the hood, Entity Embed uses pytorch-lightning, so we need to create a datamodule object:
          from entity_embed import DeduplicationDataModule

          datamodule = DeduplicationDataModule(
              train_record_dict=train_record_dict,
              valid_record_dict=valid_record_dict,
              test_record_dict=test_record_dict,
              cluster_field="cluster",
              record_numericalizer=record_numericalizer,
              batch_size=32,
              eval_batch_size=64,
              random_seed=42,
          )
  17. Entity Embed: training the model
      We must choose the K of the approximate nearest neighbours, i.e., the top-K neighbours our model will use to find duplicates in the embedding space:
          from entity_embed import EntityEmbed

          model = EntityEmbed(
              record_numericalizer,
              ann_k=100,
          )
  18. Entity Embed: finding candidate pairs
      When running in production, you only have access to the trained model object and the production record_dict (without the true clusters filled in, of course). You can get the embedding vectors of a production record_dict using the predict method:
          vector_dict = model.predict(
              record_dict=test_record_dict,
              batch_size=64,
          )
      But what you usually want instead is the ANN pairs. You can get them with the predict_pairs method:
          found_pair_set = model.predict_pairs(
              record_dict=test_record_dict,
              batch_size=64,
              ann_k=100,
              sim_threshold=0.3,
          )
  19. References
      1) Deep Learning for Blocking in Entity Matching
      2) Autoblock
      3) Entity Embed
      4) Location Matching
