SlideShare una empresa de Scribd logo
1 de 21
Descargar para leer sin conexión
ANEA: Automated (Named) Entity Annotation for
German Domain-Specific Texts
2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2021)
30 September 2021
Anastasia Zhukova, Felix Hamborg, Bela Gipp
Introduction
Introduction
3
• Named entity recognition (NER) is a well-known NLP task.
• NER datasets contain general categories, e.g., person, location, time, etc.
Problems
1. General NER reflects no categories of the other domains, e.g., technology, production
2. A small number of NLP datasets for German, i.e., a low-resources language
3. Domain NER requires annotating a dataset for training a NER model
Goal
• Minimize the time of creating a domain dataset for NER in German by automating the
annotation process
→ a very time-consuming task
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Research question
4
How to use knowledge graphs (e.g., Wiktionary) to automatically
1. extract domain terms (nouns),
2. derive entity categories,
3. annotate these terms into categories?
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Domain graph
Preprocessing and Wiktionary pages
6
German compound nouns
Sechszylindermotor (six-cylinder motor)
= sechs + Zylinder + Motor
A Wiktionary page (WP) matches “Motor”:
Extracted noun phrases (NPs)
NP has a matching
Wiktionary page (WP)
NP doesn’t match
any Wiktionary page
Extract head of NP
+ match the head to
Wiktionary page
Properties from WPs: (1) Hypernyms,
(2) Hyponyms, (3) Definitions and areas
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
→ Use the NPs that were mapped to
Wiktionary pages
Domain graph
7
Prune branches from
other areas
Get multiple label
candidates
Domain graph is a Wiktionary subgraph with nodes related
to the current domain
Leaves are the domain terms.
Nodes are candidate labels for the entity categories.
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
ANEA
Candidate Entity Categories
9
Specific
(Actor, Politician)
General
(Humanity)
Optimal
(Person)
The quality metric:
𝑸𝒊 = 𝑻𝒊 ∙ 𝑳𝒊 ∙ 𝑶𝒊 ∙ 𝐦𝐚𝒙(𝐥𝐨𝐠𝟐 𝑬𝑪𝒊 , 𝟏) ∙ 𝒅𝒂𝒗𝒈𝒊
𝑇𝑖 is a mean cross-term cosine similarity
𝐿𝑖 is a mean label-terms cosine similarity
𝑂𝑖 = 𝑇𝑖 + 𝐿𝑖 is an overall similarity
𝐸𝐶𝑖 is a number of terms in an entity category
𝑑𝑎𝑣𝑔𝑖
is an average of non-zero distances between terms and
a label
Word embeddings:
fastText represents well out-of-vocabulary terms and labels
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Optimization steps
10
1. Candidate filtering
1) a mean cross-term similarity too small
𝑇𝑖 < 0.2
2) a mean label-terms similarity too small
𝐿𝑖 < 0.3
3) EC is too broad (contains > 15% of all terms-
to- annotate)
4) EC is too narrow (contains < 5 terms)
2. Resolution of full overlaps
When containing same terms, keep an 𝐸𝐶𝑖 with
the largest 𝑄𝑖
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
11
3. Resolution of the substantial overlaps
When terms overlap ≥ 50%, keep a big 𝐸𝐶𝑖 that is a
best replacement to a small 𝐸𝐶𝑗 and to itself
4. Resolution of the conflicting terms
Resolve the conflicting terms to “clean” candidate
𝐸𝐶𝑖 with the highest overall similarity 𝑂𝑖
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Evaluation
Evaluation: User study
13
No German domain-specific dataset available →
performed a user study to evaluate the results
• 4 datasets: processing industry, software development, databases,
and travelling
• 9 native German study participants: 4 f, 5 m, aged between 23-60
• 2-4 evaluators per dataset
• 4 various configurations per dataset: different number of terms-to-
annotate
• 2 methods: ANEA and a hierarchical clustering baseline
Tasks
1) Evaluate cross-term relatedness within a category:
0-9 where 9 is the best
2) Evaluate relatedness of a label to terms in a category:
0-9 where 9 is the best
label terms
Label-to-terms
relatedness
cross--terms
relatedness
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Dataset properties
14
Dataset Databases
Software
development
Traveling Processing
All words 8161 8581 6293 7984
Terms 1209 1041 1040 552
Heads 713 673 801 328
Assessors 3 3 2 4
Cross-term
relatedness
Label-terms
relatedness
• Distribution of the relatedness
scores between the datasets
differ.
• The most frequent score per
dataset is used as thresholds
for creating silver datasets.
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Collection of a silver dataset
15
Silver datasets are required to compare configurations of ANEA against it.
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Evaluation methods
16
• Silver dataset
• Hierarchical clustering
– A baseline for terms relatedness
• ANEA
• ANEA voting
– A final result is derived in an ensemble/voting strategy of multiple ANEA configurations
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Results
17
Topic Method
# terms-to-
annotate
# entity
categories (ECs)
# annotated
terms
ECs’ average
size
Term
similarity
Label
similarity
Average
similarity
Databases silver 420 5 113 23 7.2 7 7.2
HC 253 8 52 7 7.2 -- 7.2*
ANEA 253 18 179 10 5.7 5 5.4
ANEA voting 253-316 12 122 10 6.3 5.9 6.1
Software dev. silver 356 6 57 10 6.2 6 6.1
HC 303 15 152 10 5.5 -- 5.5*
ANEA 191 10 119 12 5 5.3 5.2
ANEA voting 191-255 4 44 11 5.6 6.5 6.0
Traveling silver 363 6 115 19 7.8 6.7 7.3
HC 363 19 156 8 7.3 -- 7.3*
ANEA 363 22 239 11 5.4 4.8 5.1
ANEA voting 258-363 12 146 12 6.2 5.6 5.9
Processing silver 282 7 102 15 6.6 6.2 6.4
HC 183 7 56 8 6.1 -- 6.1*
ANEA 227 16 172 11 5.3 4.9 5.1
ANEA voting 181-282 9 157 17 5.7 5.6 5.6
ANEA voting shows improvement of 13-15% to the original ANEA average similarity scores.
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
ANEA summary
18
150
200
250
300
350
400
300 400 500 600 700 800 900
#
terms-to-annotate
# heads from the terms-to-annotate
• ANEA hasn’t achieved the relatedness
scores of the silver datasets yet.
• The voting strategy shows a significant
improvement to the ANEA results
Recommended configurations for ANEA with
voting:
1) 𝑦 = 158 + 0.167𝑥
2) 𝑦 + 50
3) 𝑦 − 50
where 𝑥 is a number of unique heads among the
terms to annotate and 𝑦 is a number of terms-to-
annotate by ANEA
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Conclusion
Conclusion
20
• Proposed ANEA, i.e., an unsupervised approach for automated creation of a small dataset for
domain-specific NER.
• Evaluated ANEA with a user study on four domain datasets.
• The produced entity categories required less than one hour, which is significantly faster than
manual annotation.
• The produced entity categories are slightly worse than the silver datasets but a voting strategy
improves the scores by 13-15%.
A suggested use case with using ANEA:
(1) annotate a small dataset,
(2) validate and improve the dataset with manual inspection,
(3) use the produced dataset in a semi-supervised or transfer learning
Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
Questions
21
Anastasia Zhukova
zhukova@uni-wuppertal.de

Más contenido relacionado

Similar a ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts

Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generationkrisztianbalog
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Topic Segmentation in Dialogue
Topic Segmentation in DialogueTopic Segmentation in Dialogue
Topic Segmentation in DialogueJinho Choi
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxmyworld93
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersVitomir Kovanovic
 
NLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inNLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inKumari Naveen
 
lect36-tasks.ppt
lect36-tasks.pptlect36-tasks.ppt
lect36-tasks.pptHaHa501620
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica"FENG "GEORGE"" YU
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Noemi Derzsy
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsPyData
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaSrikanth Vanama
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15Hans Höchtl
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Johann Petrak
 

Similar a ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts (20)

Eurolan 2005 Pedersen
Eurolan 2005 PedersenEurolan 2005 Pedersen
Eurolan 2005 Pedersen
 
Construction of knowledge base for automatic indexing and classification base...
Construction of knowledge base for automatic indexing and classification base...Construction of knowledge base for automatic indexing and classification base...
Construction of knowledge base for automatic indexing and classification base...
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Text summarization
Text summarizationText summarization
Text summarization
 
Language processors
Language processorsLanguage processors
Language processors
 
Topic Segmentation in Dialogue
Topic Segmentation in DialogueTopic Segmentation in Dialogue
Topic Segmentation in Dialogue
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
 
For project
For projectFor project
For project
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics Researchers
 
NLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inNLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful in
 
lect36-tasks.ppt
lect36-tasks.pptlect36-tasks.ppt
lect36-tasks.ppt
 
CICLing_2016_paper_52
CICLing_2016_paper_52CICLing_2016_paper_52
CICLing_2016_paper_52
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_Vanama
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
 

Más de Anastasia Zhukova

What's in the News? Towards Identification of Bias by Commission, Omission, a...
What's in the News? Towards Identification of Bias by Commission, Omission, a...What's in the News? Towards Identification of Bias by Commission, Omission, a...
What's in the News? Towards Identification of Bias by Commission, Omission, a...Anastasia Zhukova
 
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...Anastasia Zhukova
 
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...Anastasia Zhukova
 
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...Anastasia Zhukova
 
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Anastasia Zhukova
 
Putting News in a Perspective: Framing by Word Choice and Labeling
Putting News in a Perspective: Framing by Word Choice and LabelingPutting News in a Perspective: Framing by Word Choice and Labeling
Putting News in a Perspective: Framing by Word Choice and LabelingAnastasia Zhukova
 
Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...
Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...
Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...Anastasia Zhukova
 
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...Anastasia Zhukova
 
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...Anastasia Zhukova
 
Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...Anastasia Zhukova
 
XCoref: Cross-document Coreference Resolution in the Wild
XCoref: Cross-document Coreference Resolution in the WildXCoref: Cross-document Coreference Resolution in the Wild
XCoref: Cross-document Coreference Resolution in the WildAnastasia Zhukova
 

Más de Anastasia Zhukova (11)

What's in the News? Towards Identification of Bias by Commission, Omission, a...
What's in the News? Towards Identification of Bias by Commission, Omission, a...What's in the News? Towards Identification of Bias by Commission, Omission, a...
What's in the News? Towards Identification of Bias by Commission, Omission, a...
 
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
 
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
 
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
 
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
 
Putting News in a Perspective: Framing by Word Choice and Labeling
Putting News in a Perspective: Framing by Word Choice and LabelingPutting News in a Perspective: Framing by Word Choice and Labeling
Putting News in a Perspective: Framing by Word Choice and Labeling
 
Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...
Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...
Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...
 
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
 
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
 
Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...
 
XCoref: Cross-document Coreference Resolution in the Wild
XCoref: Cross-document Coreference Resolution in the WildXCoref: Cross-document Coreference Resolution in the Wild
XCoref: Cross-document Coreference Resolution in the Wild
 

Último

All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 

Último (20)

All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 

ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts

  • 1. ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2021) 30 September 2021 Anastasia Zhukova, Felix Hamborg, Bela Gipp
  • 3. Introduction 3 • Named entity recognition (NER) is a well-known NLP task. • NER datasets contain general categories, e.g., person, location, time, etc. Problems 1. General NER reflects no categories of the other domains, e.g., technology, production 2. A small number of NLP datasets for German, i.e., a low-resources language 3. Domain NER requires annotating a dataset for training a NER model Goal • Minimize the time of creating a domain dataset for NER in German by automating the annotation process → a very time-consuming task Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 4. Research question 4 How to use knowledge graphs (e.g., Wiktionary) to automatically 1. extract domain terms (nouns), 2. derive entity categories, 3. annotate these terms into categories? Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 6. Preprocessing and Wiktionary pages 6 German compound nouns Sechszylindermotor (six-cylinder motor) = sechs + Zylinder + Motor A Wiktionary page (WP) matches “Motor”: Extracted noun phrases (NPs) NP has a matching Wiktionary page (WP) NP doesn’t match any Wiktionary page Extract head of NP + match the head to Wiktionary page Properties from WPs: (1) Hypernyms, (2) Hyponyms, (3) Definitions and areas Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts” → Use the NPs that were mapped to Wiktionary pages
  • 7. Domain graph 7 Prune branches from other areas Get multiple label candidates Domain graph is a Wiktionary subgraph with nodes related to the current domain Leaves are the domain terms. Nodes are candidate labels for the entity categories. Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 9. Candidate Entity Categories 9 Specific (Actor, Politician) General (Humanity) Optimal (Person) The quality metric: 𝑸𝒊 = 𝑻𝒊 ∙ 𝑳𝒊 ∙ 𝑶𝒊 ∙ 𝐦𝐚𝒙(𝐥𝐨𝐠𝟐 𝑬𝑪𝒊 , 𝟏) ∙ 𝒅𝒂𝒗𝒈𝒊 𝑇𝑖 is a mean cross-term cosine similarity 𝐿𝑖 is a mean label-terms cosine similarity 𝑂𝑖 = 𝑇𝑖 + 𝐿𝑖 is an overall similarity 𝐸𝐶𝑖 is a number of terms in an entity category 𝑑𝑎𝑣𝑔𝑖 is an average of non-zero distances between terms and a label Word embeddings: fastText represents well out-of-vocabulary terms and labels Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 10. Optimization steps 10 1. Candidate filtering 1) a mean cross-term similarity too small 𝑇𝑖 < 0.2 2) a mean label-terms similarity too small 𝐿𝑖 < 0.3 3) EC is too broad (contains > 15% of all terms- to- annotate) 4) EC is too narrow (contains < 5 terms) 2. Resolution of full overlaps When containing same terms, keep an 𝐸𝐶𝑖 with the largest 𝑄𝑖 Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 11. 11 3. Resolution of the substantial overlaps When terms overlap ≥ 50%, keep a big 𝐸𝐶𝑖 that is a best replacement to a small 𝐸𝐶𝑗 and to itself 4. Resolution of the conflicting terms Resolve the conflicting terms to “clean” candidate 𝐸𝐶𝑖 with the highest overall similarity 𝑂𝑖 Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 13. Evaluation: User study 13 No German domain-specific dataset available → performed a user study to evaluate the results • 4 datasets: processing industry, software development, databases, and travelling • 9 native German study participants: 4 f, 5 m, aged between 23-60 • 2-4 evaluators per dataset • 4 various configurations per dataset: different number of terms-to- annotate • 2 methods: ANEA and a hierarchical clustering baseline Tasks 1) Evaluate cross-term relatedness within a category: 0-9 where 9 is the best 2) Evaluate relatedness of a label to terms in a category: 0-9 where 9 is the best label terms Label-to-terms relatedness cross--terms relatedness Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 14. Dataset properties 14 Dataset Databases Software development Traveling Processing All words 8161 8581 6293 7984 Terms 1209 1041 1040 552 Heads 713 673 801 328 Assessors 3 3 2 4 Cross-term relatedness Label-terms relatedness • Distribution of the relatedness scores between the datasets differ. • The most frequent score per dataset is used as thresholds for creating silver datasets. Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 15. Collection of a silver dataset 15 Silver datasets are required to compare configurations of ANEA against it. Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 16. Evaluation methods 16 • Silver dataset • Hierarchical clustering – A baseline for terms relatedness • ANEA • ANEA voting – A final result is derived in an ensemble/voting strategy of multiple ANEA configurations Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 17. Results 17 Topic Method # terms-to- annotate # entity categories (ECs) # annotated terms ECs’ average size Term similarity Label similarity Average similarity Databases silver 420 5 113 23 7.2 7 7.2 HC 253 8 52 7 7.2 -- 7.2* ANEA 253 18 179 10 5.7 5 5.4 ANEA voting 253-316 12 122 10 6.3 5.9 6.1 Software dev. silver 356 6 57 10 6.2 6 6.1 HC 303 15 152 10 5.5 -- 5.5* ANEA 191 10 119 12 5 5.3 5.2 ANEA voting 191-255 4 44 11 5.6 6.5 6.0 Traveling silver 363 6 115 19 7.8 6.7 7.3 HC 363 19 156 8 7.3 -- 7.3* ANEA 363 22 239 11 5.4 4.8 5.1 ANEA voting 258-363 12 146 12 6.2 5.6 5.9 Processing silver 282 7 102 15 6.6 6.2 6.4 HC 183 7 56 8 6.1 -- 6.1* ANEA 227 16 172 11 5.3 4.9 5.1 ANEA voting 181-282 9 157 17 5.7 5.6 5.6 ANEA voting shows improvement of 13-15% to the original ANEA average similarity scores. Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 18. ANEA summary 18 150 200 250 300 350 400 300 400 500 600 700 800 900 # terms-to-annotate # heads from the terms-to-annotate • ANEA hasn’t achieved the relatedness scores of the silver datasets yet. • The voting strategy shows a significant improvement to the ANEA results Recommended configurations for ANEA with voting: 1) 𝑦 = 158 + 0.167𝑥 2) 𝑦 + 50 3) 𝑦 − 50 where 𝑥 is a number of unique heads among the terms to annotate and 𝑦 is a number of terms-to- annotate by ANEA Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  • 20. Conclusion 20 • Proposed ANEA, i.e., an unsupervised approach for automated creation of a small dataset for domain-specific NER. • Evaluated ANEA with a user study on four domain datasets. • The produced entity categories required less than one hour, which is significantly faster than manual annotation. • The produced entity categories are slightly worse than the silver datasets but a voting strategy improves the scores by 13-15%. A suggested use case with using ANEA: (1) annotate a small dataset, (2) validate and improve the dataset with manual inspection, (3) use the produced dataset in a semi-supervised or transfer learning Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”