SlideShare una empresa de Scribd logo
1 de 23
1RANLP’2011, Hissar, Bulgaria, 12.09.2011
JRC-Names: A freely available, highly multilingual
named entity resource
Hissar, Bulgaria, 12 September 2011
Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov,
Jenya Belyaeva & Erik van der Goot
Technical details and publications: http://langtech.jrc.ec.europa.eu/
Applications: http://emm.newbrief.eu/overview.html
2RANLP’2011, Hissar, Bulgaria, 12.09.2011
Agenda
• What is JRC-Names; What can it be used for
• Related work: other named entity (NE) resources
• How JRC-Names was produced
• Recognition of named entities in news reports in 20 languages
• Introduction to EMM
• Automatic mapping of name variants to the same entity
• Enrichment with Wikipedia variants
• Partial manual moderation
• Statistics on JRC-Names
• Programming details / Usage of the tool
• Solutions to capture morphological variants
• Further multilingual linguistic resources
3RANLP’2011, Hissar, Bulgaria, 12.09.2011
What is JRC-Names?
• JRC-Names consists of:
• Lists of names and their many spelling variants,
• ~205,000 person and organisation names plus
• ~204,000 name spelling variants
• In 27 scripts and many more languages
• Software to recognise these names in multilingual text, with offset and unique name identifier
• Download from http://langtech.jrc.ec.europa.eu/
4RANLP’2011, Hissar, Bulgaria, 12.09.2011
Possible uses of JRC-Names
• Standardise name spellings in databases, text collections and the internet for
improved retrieval (Stern & Sagot 2010)
• Improve Machine Translation – names must be treated differently from other
words (Babych & Hartley 2003; Steinberger & Pouliquen 2009)
• Use as input to learn automatic transliteration rules (e.g. Pouliquen 2009)
• Use output of JRC-Names as seeds to learn NER rules (e.g. Buchholz & van
den Bosch 2000)
• Social networks are less biased by national viewpoints if based on information
extracted from multilingual texts
• NER results are useful for other text mining tasks (opinion mining; co-reference
resolution; summarisation; topic detection and tracking; cross-lingual linking of
related documents across languages; …)
5RANLP’2011, Hissar, Bulgaria, 12.09.2011
Related work – other multilingual (ml) NE resources
• Wentlant et al. (2008) – built a ml NE repository based on Wikipedia links and
case information; 2.5 Mio English names, 250K German, 3K Swahili, …
• Toral et al. (2008) – built Named Entity WordNet by searching NEs in WordNet
and Wikipedia: 310K entities, including 278K persons
• Stern & Sagot (2010) – exploit French Wikipedia and GeoNames to produce
French resource: 263K person names + 883K variants.
• Maurel (2009) –produced Prolexbase mostly manually: 75K entities of all types
•  Most resources are based on Wikipedia
• Strong at providing cross-lingual and cross-script variants
• Offers only few other spelling variants
• No morphological inflections
• JRC-Names contains mostly spelling variants from real-life text,
enriched with Wikipedia – up to 413 variants for the same NE.
6RANLP’2011, Hissar, Bulgaria, 12.09.2011
Name variants found and used in 6 hours (!) of EMM news analysis
26.08.2011, PM
7RANLP’2011, Hissar, Bulgaria, 12.09.2011
Agenda
• What is JRC-Names; What can it be used for
• Related work: other named entity (NE) resources
• How JRC-Names was produced
• Recognition of named entities in news reports in 20 languages
• Introduction to EMM
• Automatic mapping of name variants to the same entity
• Enrichment with Wikipedia variants
• Partial manual moderation
• Statistics on JRC-Names
• Programming details / Usage of the tool
• Solutions to capture morphological variants
• Further multilingual linguistic resources
8RANLP’2011, Hissar, Bulgaria, 12.09.2011
NER on news gathered by the Europe Media Monitor (EMM)
• ~ 3600 Sources (world-wide, with focus on Europe)
• ~ 3225 news sources (web portals)
• ~ 360 specialist medical sites
• ~ 20 commercial newswires
• Specialist pay-for sources (LexisMed)
• 24/7, updated every 10 minutes
• ~ 100,000 articles / day in ~ 50 languages
• Named Entity Recognition (NER) performed on 20 languages.
• Articles are fed into the various publicly accessible EMM applications:
9RANLP’2011, Hissar, Bulgaria, 12.09.2011
Multilingual NER in EMM – A brief overview
• Lookup of most frequent known names and their variants in all languages
• Database currently contains about 1,18 million names + 225.000 variants (status July 2011)
• Including morphological (and other) variants by pre-generating inflection forms
(Slovene example):
Tony(a|o|u|om|em|m|ju|jem|ja)?s+Blair(a|o|u|om|em|m|ju|jem|ja)
• Guessing new names using empirically-derived lexical patterns in 20 languages.
• President, Minister, Head of State, Sir, American
• “death of”, “[0-9]+-year-old”, …
• Known first names + uppercase words
• Identification of a current average of 1,000 unknown names per day.
• Only names found repeatedly will become known names (error reduction).
10RANLP’2011, Hissar, Bulgaria, 12.09.2011
Multilingual name recognition using lexical patterns
asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyóes
l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplomfr
na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna eennl
libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige Bde
danjega libanonskega premiera Rafika Haririja. Libanonska opozicija sisl
möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommiplet
death of former Prime Minister Rafik Hariri, blamed by many oppositionen
‫اغتيال‬‫السابق‬ ‫الوزراء‬ ‫رئيس‬‫الحريري‬ ‫رفيق‬‫سابقا‬ ‫حدث‬ ‫وما‬ ‫يهودية‬ ‫بأياد‬ar
Бывший премьер-министр Ливана Рафик Харири, которыйru
11RANLP’2011, Hissar, Bulgaria, 12.09.2011
Merging name variants for the same entity
• For all newly found name forms, detect whether they are a variant of an existing NE:
• Transliteration;
• Normalisation, using ~30 hand-written rules and removing vowels;
• Calculate similarity (threshold: 94%).
• Below threshold  new entity
20%
+
80%
Condition:
12RANLP’2011, Hissar, Bulgaria, 12.09.2011
Enriching the EMM data with Wikipedia name variants
• For frequent or highly visible names, manually launch a Wikipedia mining process.
• Check for each variant of a name whether there is a Wikipedia entry.
• New name variants, in all scripts, will be recognised in new EMM articles.
Хамид Карзай
Hamid Karzai
Hamid Karzaï
Hamid Karsai
‫كرزاي‬ ‫حامد‬
हामिद करजई
哈米德·卡尔扎伊
http://en.wikipedia.org/wiki/Hamid_Karzai
13RANLP’2011, Hissar, Bulgaria, 12.09.2011
Manual moderation of EMM name database
• Process is fully automatic, but it can be useful to make changes manually.
• Manual process only for frequent or important names (e.g. Nobel Prize winners):
• Name changes: (e.g. Cardinal Josef Ratzinger  Pope Benedict XVI)
• Correct NER mistakes (e.g. Genius Report, Opfer von Diskriminierung);
• Add new stop name parts (e.g. Monday, Report);
• Merge name variants with similarity below the threshold;
• Change the display name of an entity;
• Correct the entity type (PER, ORG, T, U, …);
• Launch Wikipedia mining process;
• …
• Caveat: Name database contains errors!
14RANLP’2011, Hissar, Bulgaria, 12.09.2011
Agenda
• What is JRC-Names; What can it be used for
• Related work: other named entity (NE) resources
• How JRC-Names was produced
• Recognition of named entities in news reports in 20 languages
• Introduction to EMM
• Automatic mapping of name variants to the same entity
• Enrichment with Wikipedia variants
• Partial manual moderation
• Statistics on JRC-Names
• Programming details / Usage of the tool
• Solutions to capture morphological variants
• Further multilingual linguistic resources
15RANLP’2011, Hissar, Bulgaria, 12.09.2011
Statistics on JRC-Names (1)
• JRC-Names include names from the EMM database if any of the following hold:
• Found in 5 or more news clusters;
• Manually verified;
• Retrieved from Wikipedia;
• Number of entries (status July 2011):
• 205,000 distinct names;
• 204,000 additional variants;
• ~3.2% names of organisations / events
• Number of variants:
• 413 variants for Muammar Gaddafi (entity 262)
• 256 variants for Mikhail Saakashvili (entity 472)
• 246 variants for Mahmoud Ahmadinejad (entity 101358)
• Grows by ~230 new entities and ~430 new variants per week.
Variant forms No. of entities
1 63.76%
2 22.52%
3 5.31%
10 or more 3760 entities
50 or more 242 entities
100 or more 37 entities
16RANLP’2011, Hissar, Bulgaria, 12.09.2011
Statistics on JRC-Names (2)
• Number of scripts: 27 Number of languages: ???
• News mentions names from
around the world.
• Frequency does not reflect origin
• European Union (10101) is most
frequent entity in German, and
second in English.
• It does not matter where a name
like Silvio Berlusconi comes from.
17RANLP’2011, Hissar, Bulgaria, 12.09.2011
Agenda
• What is JRC-Names; What can it be used for
• Related work: other named entity (NE) resources
• How JRC-Names was produced
• Recognition of named entities in news reports in 20 languages
• Introduction to EMM
• Automatic mapping of name variants to the same entity
• Enrichment with Wikipedia variants
• Partial manual moderation
• Statistics on JRC-Names
• Programming details / Usage of the tool
• Solutions to capture morphological variants
• Further multilingual linguistic resources
18RANLP’2011, Hissar, Bulgaria, 12.09.2011
Details about the JRC-Names software
• Java-implemented demonstrator
• Finite state automaton
• Reads the NE resource file entities.gzip (frequently updated)
• Searches for known names (and their variants) in UTF8-encoded text files. Returns:
• Numerical name identifier
• Main name for that entity
• Name string found in the text
• Position (Offset and string length)
• For any given name string, returns all variants.
• Software and NE resource file can be downloaded from
• http://langtech.jrc.ec.europe.eu/ , Section on ‘Resources’
• Free usage, according to accompanying end-user licence.
19RANLP’2011, Hissar, Bulgaria, 12.09.2011
Agenda
• What is JRC-Names; What can it be used for
• Related work: other named entity (NE) resources
• How JRC-Names was produced
• Recognition of named entities in news reports in 20 languages
• Introduction to EMM
• Automatic mapping of name variants to the same entity
• Enrichment with Wikipedia variants
• Partial manual moderation
• Statistics on JRC-Names
• Programming details / Usage of the tool
• Solutions to capture morphological variants
• Further multilingual linguistic resources
20RANLP’2011, Hissar, Bulgaria, 12.09.2011
Treatment of morphological inflections
• The recognition of morphological inflections used in EMM processing chain are
not currently part of JRC-Names.
• We are working on a solution to include morphological processing in a future
release of JRC-Names.
• Further variants will also be included more consistently:
• Hyphenation (e.g. Yves Saint-Laurent vs. Yves Saint Laurent)
• Names with and without name ‘infixes’ (e.g. Khan al Khalil vs. Khan Khalil)
• Abbreviations (e.g. Saint vs. St.)
• …
• Current solution: Add the approximately 45,000 full-forms of inflected names,
as found in EMM processing results since January 2011, to the resource file
entities.gzip
• This helps to recognise the most frequent inflection forms of the frequent names.
21RANLP’2011, Hissar, Bulgaria, 12.09.2011
Agenda
• What is JRC-Names; What can it be used for
• Related work: other named entity (NE) resources
• How JRC-Names was produced
• Recognition of named entities in news reports in 20 languages
• Introduction to EMM
• Automatic mapping of name variants to the same entity
• Enrichment with Wikipedia variants
• Partial manual moderation
• Statistics on JRC-Names
• Programming details / Usage of the tool
• Solutions to capture morphological variants
• Further multilingual linguistic resources
22RANLP’2011, Hissar, Bulgaria, 12.09.2011
Further JRC/EC-provided multilingual linguistic resources
• JRC-Acquis (2006): 1 billion word parallel corpus in 22 languages
• DGT-TM (2007): Translation Memory in 22 languages; up to 2 million segments
•  DGT-TM-2011 (forthcoming): 23 languages; 4 million segments? Yearly updates
• JEX (JRC Eurovoc Indexer) (forthcoming): software to automatically label texts
according to the thousands of categories of the Eurovoc thesaurus; 23 languages.
• Further smaller resources:
• Multilingual summary evaluation data (2010): 4 clusters for each of 7 languages
• Sentiment-annotated collection of quotations (2010): English (German forthcoming)
• Multilingual Named Entity-annotated parallel corpus (forthcoming)
• Available at http://langtech.jrc.ec.europa.eu/, section on ‘Resources’
23RANLP’2011, Hissar, Bulgaria, 12.09.2011
JRC-Names: A freely available, highly multilingual
named entity resource
Hissar, Bulgaria, 12 September 2011
Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov,
Jenya Belyaeva & Erik van der Goot
Technical details and publications: http://langtech.jrc.ec.europa.eu/
Applications: http://emm.newbrief.eu/overview.html

Más contenido relacionado

La actualidad más candente

Publishing Data Using Semantic Web Technologies
Publishing Data Using Semantic Web TechnologiesPublishing Data Using Semantic Web Technologies
Publishing Data Using Semantic Web Technologies
Nikolaos Konstantinou
 
Presentation of OpenNLP
Presentation of OpenNLPPresentation of OpenNLP
Presentation of OpenNLP
Robert Viseur
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
Gezim Sejdiu
 

La actualidad más candente (16)

Publishing Data Using Semantic Web Technologies
Publishing Data Using Semantic Web TechnologiesPublishing Data Using Semantic Web Technologies
Publishing Data Using Semantic Web Technologies
 
Publishing and Using Linked Open Data - Day 2
Publishing and Using Linked Open Data - Day 2Publishing and Using Linked Open Data - Day 2
Publishing and Using Linked Open Data - Day 2
 
Federated data stores using semantic web technology
Federated data stores using semantic web technologyFederated data stores using semantic web technology
Federated data stores using semantic web technology
 
Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
Hack U Barcelona 2011
Hack U Barcelona 2011Hack U Barcelona 2011
Hack U Barcelona 2011
 
Linguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsLinguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documents
 
A Context-Based Semantics for SPARQL Property Paths over the Web
A Context-Based Semantics for SPARQL Property Paths over the WebA Context-Based Semantics for SPARQL Property Paths over the Web
A Context-Based Semantics for SPARQL Property Paths over the Web
 
Presentation of OpenNLP
Presentation of OpenNLPPresentation of OpenNLP
Presentation of OpenNLP
 
Building OBO Foundry ontology using semantic web tools
Building OBO Foundry ontology using semantic web toolsBuilding OBO Foundry ontology using semantic web tools
Building OBO Foundry ontology using semantic web tools
 
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
 
cldr_overview
cldr_overviewcldr_overview
cldr_overview
 
Using Public RDF Resources in Neo4j
Using Public RDF Resources in Neo4jUsing Public RDF Resources in Neo4j
Using Public RDF Resources in Neo4j
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
 
Word Occurrence Based Extraction of Work Contributors from Statements of Resp...
Word Occurrence Based Extraction of Work Contributors from Statements of Resp...Word Occurrence Based Extraction of Work Contributors from Statements of Resp...
Word Occurrence Based Extraction of Work Contributors from Statements of Resp...
 

Destacado

RSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PME
RSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PMERSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PME
RSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PME
Adm Medef
 

Destacado (20)

Cellar: Semantic Database of EU Law - Diplohack Brussels
Cellar: Semantic Database of EU Law - Diplohack BrusselsCellar: Semantic Database of EU Law - Diplohack Brussels
Cellar: Semantic Database of EU Law - Diplohack Brussels
 
Unvote - Diplohack Brussels
 Unvote - Diplohack Brussels Unvote - Diplohack Brussels
Unvote - Diplohack Brussels
 
EU SectorLex - Diplohack Brussels
EU SectorLex - Diplohack BrusselsEU SectorLex - Diplohack Brussels
EU SectorLex - Diplohack Brussels
 
Open data De Lijn
Open data De LijnOpen data De Lijn
Open data De Lijn
 
refEUgo - Diplohack Brussels
refEUgo - Diplohack BrusselsrefEUgo - Diplohack Brussels
refEUgo - Diplohack Brussels
 
Open Data Governance as an Integral Part of a Smart City: How to start with?
Open Data Governance as an Integral Part of a Smart City: How to start with?Open Data Governance as an Integral Part of a Smart City: How to start with?
Open Data Governance as an Integral Part of a Smart City: How to start with?
 
Smartcities
SmartcitiesSmartcities
Smartcities
 
Estrategia smart city ajuntament bcn - la-salle_bcn
Estrategia smart city   ajuntament bcn - la-salle_bcnEstrategia smart city   ajuntament bcn - la-salle_bcn
Estrategia smart city ajuntament bcn - la-salle_bcn
 
Beyond the single act of creation - Sustainability of hackathons
Beyond the single act of creation - Sustainability of hackathonsBeyond the single act of creation - Sustainability of hackathons
Beyond the single act of creation - Sustainability of hackathons
 
EnergyZip - Diplohack Brussels
EnergyZip - Diplohack BrusselsEnergyZip - Diplohack Brussels
EnergyZip - Diplohack Brussels
 
The economic value behind (Linked) Open Data - Nikolaos Loutas
The economic value behind (Linked) Open Data - Nikolaos LoutasThe economic value behind (Linked) Open Data - Nikolaos Loutas
The economic value behind (Linked) Open Data - Nikolaos Loutas
 
Ciudades inteligentes
Ciudades inteligentesCiudades inteligentes
Ciudades inteligentes
 
When you are given Open Science, what will you do with it?
When you are given Open Science, what will you do with it?When you are given Open Science, what will you do with it?
When you are given Open Science, what will you do with it?
 
Smart City Analytics (español)
Smart City Analytics (español)Smart City Analytics (español)
Smart City Analytics (español)
 
RSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PME
RSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PMERSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PME
RSE : le Medef et EcoVadis publient un guide pratique pour accompagner les PME
 
The economic value behind (Linked) Open Data - Toon Vanagt
 The economic value behind (Linked) Open Data - Toon Vanagt The economic value behind (Linked) Open Data - Toon Vanagt
The economic value behind (Linked) Open Data - Toon Vanagt
 
Rapport de gestion 2015 medef
Rapport de gestion 2015 medefRapport de gestion 2015 medef
Rapport de gestion 2015 medef
 
Caso Práctico: Ciudad de Málaga. Smart Cities
Caso Práctico: Ciudad de Málaga. Smart CitiesCaso Práctico: Ciudad de Málaga. Smart Cities
Caso Práctico: Ciudad de Málaga. Smart Cities
 
Supporting the Marine Strategy Framework Directive - Diplohack Brussels
Supporting the Marine Strategy Framework Directive - Diplohack BrusselsSupporting the Marine Strategy Framework Directive - Diplohack Brussels
Supporting the Marine Strategy Framework Directive - Diplohack Brussels
 
Data Derby, Belgium versus the Netherlands
Data Derby, Belgium versus the NetherlandsData Derby, Belgium versus the Netherlands
Data Derby, Belgium versus the Netherlands
 

Similar a JRC-Names - EC - Diplohack Datamarket

Semantic Pipes and Semantic Mashups
Semantic Pipes and Semantic MashupsSemantic Pipes and Semantic Mashups
Semantic Pipes and Semantic Mashups
giurca
 
Linked Data and cultural heritage data: an overview of the approaches from Eu...
Linked Data and cultural heritage data: an overview of the approaches from Eu...Linked Data and cultural heritage data: an overview of the approaches from Eu...
Linked Data and cultural heritage data: an overview of the approaches from Eu...
The European Library
 

Similar a JRC-Names - EC - Diplohack Datamarket (20)

Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
 
LSDI.pptx
LSDI.pptxLSDI.pptx
LSDI.pptx
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4j
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization data
 
Semantic Pipes and Semantic Mashups
Semantic Pipes and Semantic MashupsSemantic Pipes and Semantic Mashups
Semantic Pipes and Semantic Mashups
 
Is the Calvet Language Barometer useful to measure linguistic justice?
Is the Calvet Language Barometer useful to measure linguistic justice?Is the Calvet Language Barometer useful to measure linguistic justice?
Is the Calvet Language Barometer useful to measure linguistic justice?
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
Facilitating semantic alignment.-biohackathon-jupp
Facilitating semantic alignment.-biohackathon-juppFacilitating semantic alignment.-biohackathon-jupp
Facilitating semantic alignment.-biohackathon-jupp
 
Linked Data and cultural heritage data: an overview of the approaches from Eu...
Linked Data and cultural heritage data: an overview of the approaches from Eu...Linked Data and cultural heritage data: an overview of the approaches from Eu...
Linked Data and cultural heritage data: an overview of the approaches from Eu...
 
Bringing parliamentary debates to the Semantic Web
Bringing parliamentary debates to the Semantic WebBringing parliamentary debates to the Semantic Web
Bringing parliamentary debates to the Semantic Web
 
Towards a Web Search Service for Minority Language Communities
Towards a Web Search Service for Minority Language CommunitiesTowards a Web Search Service for Minority Language Communities
Towards a Web Search Service for Minority Language Communities
 
Widening the limits of cognitive reception with online digital library graph ...
Widening the limits of cognitive reception with online digital library graph ...Widening the limits of cognitive reception with online digital library graph ...
Widening the limits of cognitive reception with online digital library graph ...
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
RDA 101: an introduction to RDA (2012)
RDA 101: an introduction to RDA (2012)RDA 101: an introduction to RDA (2012)
RDA 101: an introduction to RDA (2012)
 
Harnessing diversity in crowds and machines for better ner performance
Harnessing diversity in crowds and machines for better ner performanceHarnessing diversity in crowds and machines for better ner performance
Harnessing diversity in crowds and machines for better ner performance
 
Tillett, Hillmann, and Moen, "Bibliographic Control Alphabet Soup: AACR to R...
Tillett, Hillmann, and Moen, "Bibliographic Control Alphabet Soup:  AACR to R...Tillett, Hillmann, and Moen, "Bibliographic Control Alphabet Soup:  AACR to R...
Tillett, Hillmann, and Moen, "Bibliographic Control Alphabet Soup: AACR to R...
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...
 
CROSER
CROSERCROSER
CROSER
 
Oc wg-nif-20130711
Oc wg-nif-20130711Oc wg-nif-20130711
Oc wg-nif-20130711
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technology
 

Más de Open Knowledge Belgium

Más de Open Knowledge Belgium (20)

Open Data Stories You haven't heard!
Open Data Stories You haven't heard!Open Data Stories You haven't heard!
Open Data Stories You haven't heard!
 
A​ FUNUMENTARY:​ Take what you can, give nothing back...​ ​(NOT)
A​ FUNUMENTARY:​ Take what you can, give nothing back...​ ​(NOT)A​ FUNUMENTARY:​ Take what you can, give nothing back...​ ​(NOT)
A​ FUNUMENTARY:​ Take what you can, give nothing back...​ ​(NOT)
 
Smarter by Open Data: Process and Practice in Flevoland (NL)
Smarter by Open Data: Process and Practice in Flevoland (NL)Smarter by Open Data: Process and Practice in Flevoland (NL)
Smarter by Open Data: Process and Practice in Flevoland (NL)
 
Open Knowledge for Social Innovation
Open Knowledge for Social InnovationOpen Knowledge for Social Innovation
Open Knowledge for Social Innovation
 
Smart Flanders: Tackling urban challenges through Open Data
Smart Flanders: Tackling urban challenges through Open DataSmart Flanders: Tackling urban challenges through Open Data
Smart Flanders: Tackling urban challenges through Open Data
 
EIF and NIFO connecting public administrations, businesses, and citizens
EIF and NIFO connecting public administrations, businesses, and citizensEIF and NIFO connecting public administrations, businesses, and citizens
EIF and NIFO connecting public administrations, businesses, and citizens
 
Connecting Open data for solving the fiscal transparency puzzle in the EU
Connecting Open data for solving the fiscal transparency puzzle in the EUConnecting Open data for solving the fiscal transparency puzzle in the EU
Connecting Open data for solving the fiscal transparency puzzle in the EU
 
Open Government and Networked European Democracy
Open Government and Networked European DemocracyOpen Government and Networked European Democracy
Open Government and Networked European Democracy
 
Mundaneum Factories for Open Tokenomics
Mundaneum Factories for Open TokenomicsMundaneum Factories for Open Tokenomics
Mundaneum Factories for Open Tokenomics
 
MIRVA: The European Open Recognition Project
MIRVA: The European Open Recognition ProjectMIRVA: The European Open Recognition Project
MIRVA: The European Open Recognition Project
 
Bike for Brussels - Open Summer of Code 2017
Bike for Brussels - Open Summer of Code 2017Bike for Brussels - Open Summer of Code 2017
Bike for Brussels - Open Summer of Code 2017
 
The story behind SNCB alerts
The story behind SNCB alertsThe story behind SNCB alerts
The story behind SNCB alerts
 
Traffic safety - answering tough questions with open data
Traffic safety - answering tough questions with open dataTraffic safety - answering tough questions with open data
Traffic safety - answering tough questions with open data
 
Eliminating data roadbloacks to get by traffic roadblocks without pain
Eliminating data roadbloacks to get by traffic roadblocks without painEliminating data roadbloacks to get by traffic roadblocks without pain
Eliminating data roadbloacks to get by traffic roadblocks without pain
 
Linked Open Data in limbo: Open cultural heritage resources
Linked Open Data in limbo: Open cultural heritage resourcesLinked Open Data in limbo: Open cultural heritage resources
Linked Open Data in limbo: Open cultural heritage resources
 
A journey to Linked Open Touristic Data
A journey to Linked Open Touristic DataA journey to Linked Open Touristic Data
A journey to Linked Open Touristic Data
 
How we use the massive open lidar dataset for the benfit of our clients
How we use the massive open lidar dataset for the benfit of our clientsHow we use the massive open lidar dataset for the benfit of our clients
How we use the massive open lidar dataset for the benfit of our clients
 
mu.semte.ch: A transitional architecture for Linked Data
mu.semte.ch: A transitional architecture for Linked Datamu.semte.ch: A transitional architecture for Linked Data
mu.semte.ch: A transitional architecture for Linked Data
 
Linked Open Chatbots
Linked Open ChatbotsLinked Open Chatbots
Linked Open Chatbots
 
The role and value of making data inventories
The role and value of making data inventoriesThe role and value of making data inventories
The role and value of making data inventories
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

JRC-Names - EC - Diplohack Datamarket

  • 1. 1RANLP’2011, Hissar, Bulgaria, 12.09.2011 JRC-Names: A freely available, highly multilingual named entity resource Hissar, Bulgaria, 12 September 2011 Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva & Erik van der Goot Technical details and publications: http://langtech.jrc.ec.europa.eu/ Applications: http://emm.newbrief.eu/overview.html
  • 2. 2RANLP’2011, Hissar, Bulgaria, 12.09.2011 Agenda • What is JRC-Names; What can it be used for • Related work: other named entity (NE) resources • How JRC-Names was produced • Recognition of named entities in news reports in 20 languages • Introduction to EMM • Automatic mapping of name variants to the same entity • Enrichment with Wikipedia variants • Partial manual moderation • Statistics on JRC-Names • Programming details / Usage of the tool • Solutions to capture morphological variants • Further multilingual linguistic resources
  • 3. 3RANLP’2011, Hissar, Bulgaria, 12.09.2011 What is JRC-Names? • JRC-Names consists of: • Lists of names and their many spelling variants, • ~205,000 person and organisation names plus • ~204,000 name spelling variants • In 27 scripts and many more languages • Software to recognise these names in multilingual text, with offset and unique name identifier • Download from http://langtech.jrc.ec.europa.eu/
  • 4. 4RANLP’2011, Hissar, Bulgaria, 12.09.2011 Possible uses of JRC-Names • Standardise name spellings in databases, text collections and the internet for improved retrieval (Stern & Sagot 2010) • Improve Machine Translation – names must be treated differently from other words (Babych & Hartley 2003; Steinberger & Pouliquen 2009) • Use as input to learn automatic transliteration rules (e.g. Pouliquen 2009) • Use output of JRC-Names as seeds to learn NER rules (e.g. Buchholz & van den Bosch 2000) • Social networks are less biased by national viewpoints if based on information extracted from multilingual texts • NER results are useful for other text mining tasks (opinion mining; co-reference resolution; summarisation; topic detection and tracking; cross-lingual linking of related documents across languages; …)
  • 5. 5RANLP’2011, Hissar, Bulgaria, 12.09.2011 Related work – other multilingual (ml) NE resources • Wentlant et al. (2008) – built a ml NE repository based on Wikipedia links and case information; 2.5 Mio English names, 250K German, 3K Swahili, … • Toral et al. (2008) – built Named Entity WordNet by searching NEs in WordNet and Wikipedia: 310K entities, including 278K persons • Stern & Sagot (2010) – exploit French Wikipedia and GeoNames to produce French resource: 263K person names + 883K variants. • Maurel (2009) –produced Prolexbase mostly manually: 75K entities of all types •  Most resources are based on Wikipedia • Strong at providing cross-lingual and cross-script variants • Offers only few other spelling variants • No morphological inflections • JRC-Names contains mostly spelling variants from real-life text, enriched with Wikipedia – up to 413 variants for the same NE.
  • 6. 6RANLP’2011, Hissar, Bulgaria, 12.09.2011 Name variants found and used in 6 hours (!) of EMM news analysis 26.08.2011, PM
  • 7. 7RANLP’2011, Hissar, Bulgaria, 12.09.2011 Agenda • What is JRC-Names; What can it be used for • Related work: other named entity (NE) resources • How JRC-Names was produced • Recognition of named entities in news reports in 20 languages • Introduction to EMM • Automatic mapping of name variants to the same entity • Enrichment with Wikipedia variants • Partial manual moderation • Statistics on JRC-Names • Programming details / Usage of the tool • Solutions to capture morphological variants • Further multilingual linguistic resources
  • 8. 8RANLP’2011, Hissar, Bulgaria, 12.09.2011 NER on news gathered by the Europe Media Monitor (EMM) • ~ 3600 Sources (world-wide, with focus on Europe) • ~ 3225 news sources (web portals) • ~ 360 specialist medical sites • ~ 20 commercial newswires • Specialist pay-for sources (LexisMed) • 24/7, updated every 10 minutes • ~ 100,000 articles / day in ~ 50 languages • Named Entity Recognition (NER) performed on 20 languages. • Articles are fed into the various publicly accessible EMM applications:
  • 9. 9RANLP’2011, Hissar, Bulgaria, 12.09.2011 Multilingual NER in EMM – A brief overview • Lookup of most frequent known names and their variants in all languages • Database currently contains about 1,18 million names + 225.000 variants (status July 2011) • Including morphological (and other) variants by pre-generating inflection forms (Slovene example): Tony(a|o|u|om|em|m|ju|jem|ja)?s+Blair(a|o|u|om|em|m|ju|jem|ja) • Guessing new names using empirically-derived lexical patterns in 20 languages. • President, Minister, Head of State, Sir, American • “death of”, “[0-9]+-year-old”, … • Known first names + uppercase words • Identification of a current average of 1,000 unknown names per day. • Only names found repeatedly will become known names (error reduction).
  • 10. 10RANLP’2011, Hissar, Bulgaria, 12.09.2011 Multilingual name recognition using lexical patterns asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyóes l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplomfr na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna eennl libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige Bde danjega libanonskega premiera Rafika Haririja. Libanonska opozicija sisl möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommiplet death of former Prime Minister Rafik Hariri, blamed by many oppositionen ‫اغتيال‬‫السابق‬ ‫الوزراء‬ ‫رئيس‬‫الحريري‬ ‫رفيق‬‫سابقا‬ ‫حدث‬ ‫وما‬ ‫يهودية‬ ‫بأياد‬ar Бывший премьер-министр Ливана Рафик Харири, которыйru
  • 11. 11RANLP’2011, Hissar, Bulgaria, 12.09.2011 Merging name variants for the same entity • For all newly found name forms, detect whether they are a variant of an existing NE: • Transliteration; • Normalisation, using ~30 hand-written rules and removing vowels; • Calculate similarity (threshold: 94%). • Below threshold  new entity 20% + 80% Condition:
  • 12. 12RANLP’2011, Hissar, Bulgaria, 12.09.2011 Enriching the EMM data with Wikipedia name variants • For frequent or highly visible names, manually launch a Wikipedia mining process. • Check for each variant of a name whether there is a Wikipedia entry. • New name variants, in all scripts, will be recognised in new EMM articles. Хамид Карзай Hamid Karzai Hamid Karzaï Hamid Karsai ‫كرزاي‬ ‫حامد‬ हामिद करजई 哈米德·卡尔扎伊 http://en.wikipedia.org/wiki/Hamid_Karzai
  • 13. 13RANLP’2011, Hissar, Bulgaria, 12.09.2011 Manual moderation of EMM name database • Process is fully automatic, but it can be useful to make changes manually. • Manual process only for frequent or important names (e.g. Nobel Prize winners): • Name changes: (e.g. Cardinal Josef Ratzinger  Pope Benedict XVI) • Correct NER mistakes (e.g. Genius Report, Opfer von Diskriminierung); • Add new stop name parts (e.g. Monday, Report); • Merge name variants with similarity below the threshold; • Change the display name of an entity; • Correct the entity type (PER, ORG, T, U, …); • Launch Wikipedia mining process; • … • Caveat: Name database contains errors!
  • 14. 14RANLP’2011, Hissar, Bulgaria, 12.09.2011 Agenda • What is JRC-Names; What can it be used for • Related work: other named entity (NE) resources • How JRC-Names was produced • Recognition of named entities in news reports in 20 languages • Introduction to EMM • Automatic mapping of name variants to the same entity • Enrichment with Wikipedia variants • Partial manual moderation • Statistics on JRC-Names • Programming details / Usage of the tool • Solutions to capture morphological variants • Further multilingual linguistic resources
  • 15. 15RANLP’2011, Hissar, Bulgaria, 12.09.2011 Statistics on JRC-Names (1) • JRC-Names include names from the EMM database if any of the following hold: • Found in 5 or more news clusters; • Manually verified; • Retrieved from Wikipedia; • Number of entries (status July 2011): • 205,000 distinct names; • 204,000 additional variants; • ~3.2% names of organisations / events • Number of variants: • 413 variants for Muammar Gaddafi (entity 262) • 256 variants for Mikhail Saakashvili (entity 472) • 246 variants for Mahmoud Ahmadinejad (entity 101358) • Grows by ~230 new entities and ~430 new variants per week. Variant forms No. of entities 1 63.76% 2 22.52% 3 5.31% 10 or more 3760 entities 50 or more 242 entities 100 or more 37 entities
  • 16. 16RANLP’2011, Hissar, Bulgaria, 12.09.2011 Statistics on JRC-Names (2) • Number of scripts: 27 Number of languages: ??? • News mentions names from around the world. • Frequency does not reflect origin • European Union (10101) is most frequent entity in German, and second in English. • It does not matter where a name like Silvio Berlusconi comes from.
  • 17. 17RANLP’2011, Hissar, Bulgaria, 12.09.2011 Agenda • What is JRC-Names; What can it be used for • Related work: other named entity (NE) resources • How JRC-Names was produced • Recognition of named entities in news reports in 20 languages • Introduction to EMM • Automatic mapping of name variants to the same entity • Enrichment with Wikipedia variants • Partial manual moderation • Statistics on JRC-Names • Programming details / Usage of the tool • Solutions to capture morphological variants • Further multilingual linguistic resources
  • 18. 18RANLP’2011, Hissar, Bulgaria, 12.09.2011 Details about the JRC-Names software • Java-implemented demonstrator • Finite state automaton • Reads the NE resource file entities.gzip (frequently updated) • Searches for known names (and their variants) in UTF8-encoded text files. Returns: • Numerical name identifier • Main name for that entity • Name string found in the text • Position (Offset and string length) • For any given name string, returns all variants. • Software and NE resource file can be downloaded from • http://langtech.jrc.ec.europe.eu/ , Section on ‘Resources’ • Free usage, according to accompanying end-user licence.
  • 19. 19RANLP’2011, Hissar, Bulgaria, 12.09.2011 Agenda • What is JRC-Names; What can it be used for • Related work: other named entity (NE) resources • How JRC-Names was produced • Recognition of named entities in news reports in 20 languages • Introduction to EMM • Automatic mapping of name variants to the same entity • Enrichment with Wikipedia variants • Partial manual moderation • Statistics on JRC-Names • Programming details / Usage of the tool • Solutions to capture morphological variants • Further multilingual linguistic resources
  • 20. 20RANLP’2011, Hissar, Bulgaria, 12.09.2011 Treatment of morphological inflections • The recognition of morphological inflections used in EMM processing chain are not currently part of JRC-Names. • We are working on a solution to include morphological processing in a future release of JRC-Names. • Further variants will also be included more consistently: • Hyphenation (e.g. Yves Saint-Laurent vs. Yves Saint Laurent) • Names with and without name ‘infixes’ (e.g. Khan al Khalil vs. Khan Khalil) • Abbreviations (e.g. Saint vs. St.) • … • Current solution: Add the approximately 45,000 full-forms of inflected names, as found in EMM processing results since January 2011, to the resource file entities.gzip • This helps to recognise the most frequent inflection forms of the frequent names.
  • 21. 21RANLP’2011, Hissar, Bulgaria, 12.09.2011 Agenda • What is JRC-Names; What can it be used for • Related work: other named entity (NE) resources • How JRC-Names was produced • Recognition of named entities in news reports in 20 languages • Introduction to EMM • Automatic mapping of name variants to the same entity • Enrichment with Wikipedia variants • Partial manual moderation • Statistics on JRC-Names • Programming details / Usage of the tool • Solutions to capture morphological variants • Further multilingual linguistic resources
  • 22. 22RANLP’2011, Hissar, Bulgaria, 12.09.2011 Further JRC/EC-provided multilingual linguistic resources • JRC-Acquis (2006): 1 billion word parallel corpus in 22 languages • DGT-TM (2007): Translation Memory in 22 languages; up to 2 million segments •  DGT-TM-2011 (forthcoming): 23 languages; 4 million segments? Yearly updates • JEX (JRC Eurovoc Indexer) (forthcoming): software to automatically label texts according to the thousands of categories of the Eurovoc thesaurus; 23 languages. • Further smaller resources: • Multilingual summary evaluation data (2010): 4 clusters for each of 7 languages • Sentiment-annotated collection of quotations (2010): English (German forthcoming) • Multilingual Named Entity-annotated parallel corpus (forthcoming) • Available at http://langtech.jrc.ec.europa.eu/, section on ‘Resources’
  • 23. 23RANLP’2011, Hissar, Bulgaria, 12.09.2011 JRC-Names: A freely available, highly multilingual named entity resource Hissar, Bulgaria, 12 September 2011 Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva & Erik van der Goot Technical details and publications: http://langtech.jrc.ec.europa.eu/ Applications: http://emm.newbrief.eu/overview.html