
Linking Entities for Enriching and Structuring Social Media Content


Keynote talk given at the 2nd International Workshop on Natural Language Processing for Informal Text (NLPIT 2016),
held in conjunction with the 25th International World Wide Web Conference (WWW 2016), April 11-15, 2016, Montreal, Canada.



  1. 1. Linking Entities for Enriching and Structuring Social Media Content Raphaël Troncy <raphael.troncy@eurecom.fr> @rtroncy
  2. 2. 12/04/2016 NLPIT Workshop @ WWW 2016 - 2
  3. 3. Extracting and Linking Entities (NER/NEL)
      • “Tampa Bay Lightning vs Canadiens in Montreal tonight with @erikmannens #hockey #NHL”
      • https://www.youtube.com/watch?v=Rmug-PUyIzI
  4. 4. Part of Speech (GATE Twitter POS), https://gate.ac.uk/wiki/twitter-postagger.html
      • Tampa/NNP Bay/NNP Lightning/NNP vs/CC Canadiens/NNP in/IN Montreal/NNP tonight/NN with/IN @erikmannens/USR #hockey/HT #NHL/HT
      • NER: What is NHL?
      • NEL: Which Montreal are we talking about?
  5. 5. What is #NHL? Type ambiguity: Sports League, Organization, Place, Railway Line
  6. 6. What is #NHL? Type ambiguity (different infobox templates)
      • http://schema.org/SportsEvent
      • http://dbpedia.org/ontology/Event
      • http://schema.org/Organization
      • http://dbpedia.org/ontology/IceHockeyLeague
  7. 7. Named Entity Recognition (NER), shown as token/POS/type
      • Tampa/NNP/ORG Bay/NNP/ORG Lightning/NNP/ORG vs/CC/O Canadiens/NNP/ORG in/IN/O Montreal/NNP/LOC tonight/NN/O with/IN/O @erikmannens/USR/PER #hockey/HT/THG #NHL/HT/ORG
  8. 8. What is Montreal? Name ambiguity: Montréal, Ardèche; Montréal, Aude; Montréal, Gers; Montreal, Wisconsin; Mont-ral, Catalonia
  9. 9. Named Entity Linking (NEL), shown as token/POS/type, with the link where one is given
      • Tampa/NNP/ORG Bay/NNP/ORG Lightning/NNP/ORG vs/CC/O Canadiens/NNP/ORG in/IN/O
      • Montreal/NNP/LOC, linked to http://dbpedia.org/resource/Montreal
      • tonight/NN/O with/IN/O
      • @erikmannens/USR/PER, linked to NIL
      • #hockey/HT/THG #NHL/HT/ORG
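The token-level output of slides 4, 7 and 9 can be held in one record per token. The following is a minimal sketch, assuming a hypothetical `Token` dataclass whose field names are illustrative and are not NERD or ADEL types. Grouping adjacent tokens of the same type into one mention is a simplification (no BIO scheme is used).

```python
from dataclasses import dataclass
from typing import Optional

NIL = "NIL"  # marker for a mention with no knowledge-base entry


@dataclass
class Token:
    text: str
    pos: str                    # part-of-speech tag, e.g. "NNP"
    ner: str = "O"              # entity type, "O" when outside any entity
    link: Optional[str] = None  # KB URI, NIL, or None when the slide gives no link


tweet = [
    Token("Tampa", "NNP", "ORG"),
    Token("Bay", "NNP", "ORG"),
    Token("Lightning", "NNP", "ORG"),
    Token("vs", "CC"),
    Token("Canadiens", "NNP", "ORG"),
    Token("in", "IN"),
    Token("Montreal", "NNP", "LOC", "http://dbpedia.org/resource/Montreal"),
    Token("tonight", "NN"),
    Token("with", "IN"),
    Token("@erikmannens", "USR", "PER", NIL),
    Token("#hockey", "HT", "THG"),
    Token("#NHL", "HT", "ORG"),
]


def mentions(tokens):
    """Group consecutive tokens that share an entity type into (text, type) mentions."""
    spans, current = [], []
    for tok in tokens:
        if tok.ner != "O" and current and current[-1].ner == tok.ner:
            current.append(tok)      # same type as the previous token: extend the mention
            continue
        if current:
            spans.append(current)    # close the mention that just ended
        current = [tok] if tok.ner != "O" else []
    if current:
        spans.append(current)
    return [(" ".join(t.text for t in span), span[0].ner) for span in spans]


print(mentions(tweet))
```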
  10. 10. NERD: a framework for comparing NER APIs, http://nerd.eurecom.fr/
      • NER: Stanford CoreNLP
      • Web APIs
  11. 11. NERD: AlchemyAPI
      • Incorrect boundaries
      • No disambiguation
      • No dereferencing for @mention
  12. 12. NERD: Dandelion
      • Everything is a Thing
      • No dereferencing for @mention
  13. 13. NERDML
      • No dereferencing for @mention
  14. 14. Research Questions
      • How to adapt an entity linking system depending on different criteria?
      • How to design an entity linking system that can process a large amount of data in near real time?
  15. 15. ADEL: Adaptive Framework for NER
      • POS tagger:
        • uses a bidirectional dependency network
        • combines left-to-right and right-to-left conditional Markov models (CMMs)
      • NER:
        • uses a CRF with Gibbs sampling (Monte Carlo approximate inference) to take n words into account instead of only the previous and next one
  16. 16. ADEL: Overlap Resolution
      • Detect overlaps among extractors using the boundaries of the entities
      • Different heuristics can be applied:
        • Merge (default behavior): “United States” and “States of America” => “United States of America”
        • Simple Substring: “Florence” and “Florence May Harding” => “Florence” and “May Harding”
        • Smart Substring: “Giants of New York” and “New York” => “Giants” and “New York”
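The default Merge heuristic from slide 16 can be sketched as an interval union over character offsets. This is a minimal illustration under the assumption that each extractor reports half-open `(start, end)` spans over the same text. The function name `merge_overlaps` is hypothetical and this is not ADEL's code; the two substring heuristics are not covered.

```python
def merge_overlaps(text, spans):
    """Merge heuristic: fuse overlapping (start, end) character spans into their union."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:   # overlaps the previously kept span
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [text[s:e] for s, e in merged]


text = "He moved to the United States of America in 1990."
a = text.index("United States")       # span reported by one extractor
b = text.index("States of America")   # span reported by another extractor
spans = [(a, a + len("United States")), (b, b + len("States of America"))]
print(merge_overlaps(text, spans))    # ['United States of America']
```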
  17. 17. ADEL: KB Indexing
      • Create an index from DBpedia and Wikipedia
      • Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
  18. 18. ADEL: Adaptive Framework for NEL
      • Generate candidate links for all extracted mentions:
        • if a mention has candidates, they go to the linking method
        • if not, the mention is linked to NIL
      • Linking method: ADEL linear formula, where
        • r(l): the score of the candidate l
        • L: the Levenshtein distance
        • m: the extracted mention
        • title: the title of the candidate l
        • R: the set of redirect pages associated with the candidate l
        • D: the set of disambiguation pages associated with the candidate l
        • PR: the PageRank associated with the candidate l
        • a, b and c are weights with the properties a > b > c and a + b + c = 1
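The transcript lists the symbols of the linear formula but not the formula itself. The sketch below shows one linear combination that is consistent with those symbols: string scores against the title, the redirects and the disambiguation pages, weighted by a, b and c and scaled by PageRank. The exact form, the conversion of the Levenshtein distance into a similarity in [0, 1], the default weights and the PageRank values in the usage lines are all assumptions made for illustration, not values taken from ADEL.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Assumption: map the distance to [0, 1] so that a higher value means a closer match."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s.lower(), t.lower()) / max(len(s), len(t))


def score(mention, title, redirects, disambiguations, pagerank, a=0.5, b=0.3, c=0.2):
    """Assumed linear form: weighted string similarities, scaled by PageRank."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9
    best_r = max((similarity(mention, r) for r in redirects), default=0.0)
    best_d = max((similarity(mention, d) for d in disambiguations), default=0.0)
    return (a * similarity(mention, title) + b * best_r + c * best_d) * pagerank


# Illustrative candidates for the mention "Montreal"; the PageRank values are made up.
city = score("Montreal", "Montreal", ["Montréal"], ["Montreal (disambiguation)"], 0.9)
village = score("Montreal", "Montréal, Aude", [], ["Montreal (disambiguation)"], 0.1)
print(city > village)
```

With these made-up inputs the Canadian city ranks first, because it has both the closer title match and the larger PageRank.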
  19. 19. ADEL: Pruning for NER/NEL
      • k-NN machine learning algorithm
      • Why a pruning module?
        • It corrects extractor errors by removing wrong annotations. Example:
          • France played against Russia for a friendly match
          • Yesterday, I went to see Against in concert
        • It adapts the annotations to a given guideline. Example: suppose we take part in two different challenges; the first one counts dates as entities and the second one does not
          • NEEL challenge: Jimmy Page was born the January 9th, 1944.
          • OKE challenge: Jimmy Page was born the January 9th, 1944.
  20. 20. ADEL Evaluation
      • Challenges entered:
        • #Microposts2014 NEEL Challenge: ADEL v1
        • #Microposts2015 NEEL Challenge: ADEL v1
        • #Microposts2016 NEEL Challenge: ADEL v2
        • OKE2015 Challenge: ADEL v1
        • OKE2016 Challenge: ADEL v2
      • F-measure results, as grouped on the slide:
        • E2E 70.06, UTwente 54.93, DataTXT 49.9, ADEL 46.29, AIDA 45.37, Hyderabad 45.23, SAP 39.02
        • ADEL 60.75, FOX 49.88, FRED 34.73
        • ousia 76.2, acubelab 52.3, ADEL 47.9, uniba 46.4, ualberta 41.5, uva 31.6, cen_neel 0
        • ADEL 78.8
        • ADEL 56.5
  21. 21. ADEL Live Demo
  22. 22. Social Media: some definitions
      • Media Item: a photo or a video that is shared on a social network
      • Micropost: a text status message that can optionally accompany a media item
      • Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities
      • Media Sharing Platforms: the emphasis is on sharing media, but the boundary with social networks is blurred since users are encouraged to react to media content (like, comment, favorite, etc.)
  23. 23. Media Server, https://github.com/tomayac/media-server
      • Composition of media item extractors (12 social networks)
      • Relies on search APIs and a fixed 30 s timeout window to provide results
      • Falls back on screen scraping when necessary (Twitter ecosystem)
      • Implemented as a NodeJS server
      • Serializes results in a common schema (JSON)
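The fan-out pattern described on slide 23 (query every network concurrently, return whatever arrived inside a fixed window) can be sketched as follows. The real Media Server is a NodeJS application; this Python version only illustrates the pattern. The `fake_extractor` stand-in, the example URLs and the result fields are hypothetical, and the demo uses a 0.5 s window instead of 30 s so that it finishes quickly.

```python
import asyncio


async def fake_extractor(network, term, delay):
    """Stand-in for a per-network search call; a real extractor would query a search API."""
    await asyncio.sleep(delay)
    return [{"network": network, "url": f"https://example.org/{network}/{term}"}]


async def search_all(term, extractors, timeout):
    """Run every extractor concurrently and keep what finished within `timeout` seconds."""
    tasks = [asyncio.create_task(fn(term)) for fn in extractors]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:                  # too slow: drop it rather than delay the response
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    items = []
    for task in done:
        if task.exception() is None:      # skip extractors that failed
            items.extend(task.result())
    return items


extractors = [
    lambda term: fake_extractor("fast", term, 0.01),
    lambda term: fake_extractor("slow", term, 5.0),
]
results = asyncio.run(search_all("elezioni2013", extractors, timeout=0.5))
print(results)
```

Only the fast extractor contributes here, because the slow one is cancelled when the window closes.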
  24. 24. Deep link; Permalink; Clean text for NLP processing; Aggregate view of ALL social interactions; 12 Social Networks
  25. 25. Media Finder (www2013)
  26. 26. Media Finder (zooming on media items)
  27. 27. Media Finder (timeline view)
  28. 28. Media Finder Architecture
      • Media item harvesting using the Media Server
        • http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}
        • https://github.com/vuknje/media-server (fork of @tomayac's repository)
      • Image near-de-duplication
        • DCT signature on image and video frames, Hamming distance between image pairs
      • Clustering and disambiguation
        • Named Entity Extraction using NERD
        • Topic Generation using LDA
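The near-de-duplication step on slide 28 compares DCT signatures with a Hamming distance. The sketch below is one common way to build such a signature, in the style of a perceptual hash: it is not taken from Media Finder. It assumes frames were already downscaled to a small square grayscale grid, and the `keep` size and the `threshold` value are hypothetical parameters.

```python
import math


def dct_2d(block):
    """Naive 2-D DCT-II of a square grayscale block (list of rows)."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def dct_signature(block, keep=4):
    """Bit signature: low-frequency coefficients (DC excluded) compared with their median."""
    coeffs = dct_2d(block)
    low = [coeffs[u][v] for u in range(keep) for v in range(keep)][1:]  # drop the DC term
    median = sorted(low)[len(low) // 2]
    bits = 0
    for value in low:
        bits = (bits << 1) | (value > median)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def is_near_duplicate(frame_a, frame_b, threshold=2):
    """Two frames count as near-duplicates when their signatures differ in few bits."""
    return hamming(dct_signature(frame_a), dct_signature(frame_b)) <= threshold


frame = [[(3 * x + 5 * y) % 256 for x in range(8)] for y in range(8)]
copy = [row[:] for row in frame]
print(is_near_duplicate(frame, copy))  # True: identical frames have distance 0
```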
  29. 29. Media Finder (named entities clustering)
  30. 30. Media Finder (zooming in a cluster)
  31. 31. Media Finder
      • Live Topic Generation from Event Streams
      • Published at WWW 2013 Demo Track
      • http://www.youtube.com/watch?v=8iRiwz7cDYY
  32. 32. Tracking an event: Italian Election
      • Repeated queries over a period of time
        • We tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03
        • Cron job: every 30 minutes over the 6 days
        • Data sliced into 24-hour slots
      • Research question: can we re-create the news headlines?
      • Storyboarding: http://mediafinder.eurecom.fr/story/elezioni2013
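Slicing the harvested posts into 24-hour slots, as described on slide 32, amounts to integer division of each timestamp's offset from the start of the tracking period. This is a minimal sketch: the function name `slice_by_day` and the `(timestamp, payload)` pair layout are assumptions, and the sample posts are placeholders rather than data from the study.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slice_by_day(posts, start):
    """Group (timestamp, payload) pairs into consecutive 24-hour slots counted from `start`."""
    slots = defaultdict(list)
    for ts, payload in posts:
        index = (ts - start) // timedelta(hours=24)   # 0 for the first day, 1 for the second...
        slots[index].append(payload)
    return dict(slots)


start = datetime(2013, 2, 26)
posts = [
    (datetime(2013, 2, 26, 9, 30), "a"),
    (datetime(2013, 2, 26, 23, 59), "b"),
    (datetime(2013, 2, 27, 0, 0), "c"),
    (datetime(2013, 3, 3, 12, 0), "d"),
]
print(slice_by_day(posts, start))  # {0: ['a', 'b'], 1: ['c'], 5: ['d']}
```

The six tracked days, 2013-02-26 to 2013-03-03, map to slot indices 0 through 5.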
  33. 33. Tracking an event: Italian Election
      • Dataset:
        • ~16501 microposts containing (duplicate) media items
        • ~21087 Named Entities extracted
      • Clustering with NER and LDA
        • Generate a Bag of Entities (BOE), each disambiguated with a DBpedia URI
      • Examples: Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
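A Bag of Entities, as named on slide 33, can be represented as a multiset of disambiguated URIs per cluster. This is a minimal sketch using `collections.Counter`: the function name is hypothetical, and the input lists and their URIs are illustrative placeholders, not the dataset from the talk.

```python
from collections import Counter


def bag_of_entities(annotated_posts):
    """Count how often each disambiguated entity URI occurs across a cluster of microposts."""
    return Counter(uri for post in annotated_posts for uri in post)


# Illustrative input: one list of entity URIs per micropost.
cluster = [
    ["http://dbpedia.org/resource/Beppe_Grillo", "http://dbpedia.org/resource/Italy"],
    ["http://dbpedia.org/resource/Beppe_Grillo"],
    ["http://dbpedia.org/resource/Silvio_Berlusconi", "http://dbpedia.org/resource/Italy"],
]
boe = bag_of_entities(cluster)
print(boe.most_common(2))
```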
  34. 34. Tracking an event: Italian Election
      • Tracking and Analyzing The 2013 Italian Election
      • Published at ESWC 2013 Demo Track
      • http://www.youtube.com/watch?v=jIMdnwMoWnk
  35. 35. Searching and browsing TED Talks GO!
  36. 36. MF: Chapters
  37. 37. Annotations: Named Entities
      • “This is Nikita, a security guard from one of the bars in St. Petersburg.”
      • Example taken from the transcript of https://www.ted.com/talks/2089
      • NER: a Natural Language Processing (NLP) task; types annotated in the example: PERSON, FUNCTION, LOCATION
      • Category: the type assigned in the NER task
      • Disambiguating URL in a knowledge base, e.g. http://dbpedia.org/resource/Saint_Petersburg
  38. 38. MF: Hot Spots
      • 1. Clustering of consecutive chapters which talk about similar topics and entities
      • 2. Ordering of those fragments based on annotation relevance (TF-IDF)
      • 3. Filtering: Hot Spots are fragments whose relative relevance falls within the first quarter of the final score distribution
      • [Figure labels: Chapters, Hot Spot 1, Hot Spot 2, Hot Spots]
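The three Hot Spot steps on slide 38 can be sketched end to end on chapters represented as lists of entity labels. Several choices here are assumptions rather than details from the talk: Jaccard overlap with a 0.3 threshold as the similarity test, summed TF-IDF as the fragment score, and reading "first quarter" as the top 25% of fragments by score. All function names are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Overlap of two entity sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_consecutive(chapters, threshold=0.3):
    """Step 1: merge a chapter into the previous fragment when their entities overlap enough."""
    fragments = []
    for entities in chapters:
        if fragments and jaccard(set(fragments[-1]), set(entities)) >= threshold:
            fragments[-1] = fragments[-1] + list(entities)
        else:
            fragments.append(list(entities))
    return fragments


def tfidf_scores(fragments):
    """Step 2: score each fragment by the summed TF-IDF of its annotations."""
    n = len(fragments)
    df = Counter(e for frag in fragments for e in set(frag))
    scores = []
    for frag in fragments:
        tf = Counter(frag)
        scores.append(sum((count / len(frag)) * math.log(n / df[e])
                          for e, count in tf.items()))
    return scores


def hot_spots(fragments):
    """Step 3: keep the top quarter of fragments by score (at least one)."""
    scores = tfidf_scores(fragments)
    ranked = sorted(range(len(fragments)), key=lambda i: scores[i], reverse=True)
    keep = max(1, len(fragments) // 4)
    return [fragments[i] for i in ranked[:keep]]


chapters = [
    ["Nikita", "Saint_Petersburg"],
    ["Nikita", "Saint_Petersburg", "bar"],
    ["Moscow"],
    ["photography"],
    ["photography", "portrait"],
]
fragments = cluster_consecutive(chapters)
print(len(fragments), hot_spots(fragments))
```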
  39. 39. Hyperlink: Indexing TED Talks
  40. 40. http://www.slideshare.net/troncy
