
ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts

Authors: Anastasia Zhukova, Felix Hamborg, Bela Gipp

Publication date: 2021/9/30
In Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021) co-located with JCDL 2021

Abstract: Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its common and viable use in many use cases, NER is barely applicable in domains where general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator to assist human annotators in creating domain-specific NER corpora for German text collections when given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., annotates text datasets into the domain (named) entities.

PDF: https://ceur-ws.org/Vol-3004/paper1.pdf

  1. ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
     2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021), 30 September 2021
     Anastasia Zhukova, Felix Hamborg, Bela Gipp
  2. Introduction
  3. Introduction
     • Named entity recognition (NER) is a well-known NLP task.
     • NER datasets contain general categories, e.g., person, location, and time.
     Problems
     1. General NER does not cover categories of other domains, e.g., technology or production.
     2. Few NLP datasets exist for German, i.e., it is a low-resource language.
     3. Domain NER requires annotating a dataset to train a NER model.
     Goal
     • Minimize the time needed to create a domain dataset for NER in German by automating the annotation process, which is very time-consuming when done manually.
     Zhukova et al. “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts”
  4. Research question
     How can knowledge graphs (e.g., Wiktionary) be used to automatically
     1. extract domain terms (nouns),
     2. derive entity categories, and
     3. annotate these terms into categories?
  5. Domain graph
  6. Preprocessing and Wiktionary pages
     German compound nouns: Sechszylindermotor (six-cylinder motor) = sechs + Zylinder + Motor. A Wiktionary page (WP) matches “Motor”.
     Extracted noun phrases (NPs):
     • The NP has a matching Wiktionary page (WP).
     • The NP does not match any Wiktionary page: extract the head of the NP and match the head to a Wiktionary page.
     Properties from WPs: (1) hypernyms, (2) hyponyms, (3) definitions and areas.
     → Use the NPs that were mapped to Wiktionary pages.
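
The fallback for noun phrases without their own Wiktionary page can be illustrated with a minimal sketch. The slides do not say how the head is extracted, so the longest-suffix heuristic, the function name `match_to_page`, and the toy `PAGE_TITLES` set are assumptions made for illustration only. The suffix search relies on German compounds being head-final.

```python
from typing import Optional

# Toy stand-in for the set of Wiktionary page titles (hypothetical data).
PAGE_TITLES = {"Motor", "Zylinder", "Datenbank", "Bank"}


def match_to_page(noun: str, titles: set = PAGE_TITLES, min_len: int = 3) -> Optional[str]:
    """Return the page title matching the noun or, failing that, its head."""
    if noun in titles:
        return noun
    # Try ever shorter suffixes, so the longest matching head wins.
    for start in range(1, len(noun) - min_len + 1):
        candidate = noun[start:].capitalize()
        if candidate in titles:
            return candidate
    return None  # no page found; such a noun would not be used


print(match_to_page("Sechszylindermotor"))
print(match_to_page("Graphdatenbank"))
```

With this toy title set, "Sechszylindermotor" resolves to "Motor", and "Graphdatenbank" resolves to "Datenbank" rather than the shorter "Bank".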
  7. Domain graph
     • The domain graph is a Wiktionary subgraph with nodes related to the current domain.
     • Leaves are the domain terms.
     • Nodes are candidate labels for the entity categories.
     • Branches from other areas are pruned.
     • Multiple label candidates are obtained.
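
As a rough sketch of how candidate labels could be read off such a graph, the code below walks hypernym links upward from each domain term and records which terms sit below each node. The `HYPERNYMS` dictionary is invented toy data, `candidate_labels` is a hypothetical helper name, and pruning by area is left out because the slides do not describe it in detail.

```python
from collections import defaultdict

# Hypothetical hypernym links (child -> parents), standing in for Wiktionary data.
HYPERNYMS = {
    "Dieselmotor": ["Motor"],
    "Elektromotor": ["Motor"],
    "Motor": ["Maschine"],
    "Pumpe": ["Maschine"],
}


def candidate_labels(terms, hypernyms=HYPERNYMS):
    """Map every ancestor node to the set of domain terms (leaves) below it."""
    labels = defaultdict(set)
    for term in terms:
        stack = list(hypernyms.get(term, []))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # guard against cycles and repeated parents
            seen.add(node)
            labels[node].add(term)
            stack.extend(hypernyms.get(node, []))
    return dict(labels)


for label, covered in candidate_labels(["Dieselmotor", "Elektromotor", "Pumpe"]).items():
    print(label, sorted(covered))
```

Each term ends up under several nodes of different generality, which is why a later step has to choose among label candidates.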
  8. ANEA
  9. Candidate Entity Categories
     Candidate labels differ in granularity: specific (Actor, Politician), general (Humanity), optimal (Person).
     The quality metric:
     $Q_i = T_i \cdot L_i \cdot O_i \cdot \max(\log_2 EC_i, 1) \cdot d_{avg_i}$
     where
     • $T_i$ is the mean cross-term cosine similarity,
     • $L_i$ is the mean label-terms cosine similarity,
     • $O_i = T_i + L_i$ is the overall similarity,
     • $EC_i$ is the number of terms in an entity category,
     • $d_{avg_i}$ is the average of the non-zero distances between the terms and the label.
     Word embeddings: fastText represents out-of-vocabulary terms and labels well.
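
The quality metric can be worked through with a small sketch. It uses plain lists as toy vectors in place of fastText embeddings, takes the mean cross-term similarity over distinct term pairs (an assumption, since the slide does not say whether self-similarities are included), and receives the average distance as an argument because the slide does not specify how the distance is measured. The function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def quality(term_vecs, label_vec, d_avg):
    """Q_i = T_i * L_i * O_i * max(log2(EC_i), 1) * d_avg_i for one category.

    Needs at least two term vectors so that a cross-term mean exists.
    """
    pairs = list(combinations(term_vecs, 2))
    t_i = sum(cosine(u, v) for u, v in pairs) / len(pairs)  # mean cross-term similarity
    l_i = sum(cosine(label_vec, u) for u in term_vecs) / len(term_vecs)  # mean label-terms similarity
    o_i = t_i + l_i  # overall similarity
    size_factor = max(math.log2(len(term_vecs)), 1.0)  # EC_i is the number of terms
    return t_i * l_i * o_i * size_factor * d_avg


# Toy vectors: four identical terms that coincide with the label.
terms = [[1.0, 0.0]] * 4
print(quality(terms, [1.0, 0.0], d_avg=0.5))
```

In this toy case $T_i = L_i = 1$, $O_i = 2$, and $\log_2 4 = 2$, so $Q_i = 1 \cdot 1 \cdot 2 \cdot 2 \cdot 0.5 = 2$.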
  10. Optimization steps
      1. Candidate filtering: a candidate entity category (EC) is removed if
         1) its mean cross-term similarity is too small: $T_i < 0.2$,
         2) its mean label-terms similarity is too small: $L_i < 0.3$,
         3) it is too broad (contains > 15% of all terms-to-annotate), or
         4) it is too narrow (contains < 5 terms).
      2. Resolution of full overlaps: when candidates contain the same terms, keep the $EC_i$ with the largest $Q_i$.
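
The first two optimization steps can be sketched as follows. The thresholds are the ones given on the slide; the `Candidate` class, the function names, and the reading of "same terms" as identical term sets are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    terms: frozenset
    t: float  # mean cross-term similarity T_i
    l: float  # mean label-terms similarity L_i
    q: float  # quality Q_i


def filter_candidates(cands, n_terms_total):
    """Step 1: drop candidates that are incoherent, too broad, or too narrow."""
    return [
        c for c in cands
        if c.t >= 0.2                                # cross-term similarity high enough
        and c.l >= 0.3                               # label-terms similarity high enough
        and len(c.terms) <= 0.15 * n_terms_total     # not too broad
        and len(c.terms) >= 5                        # not too narrow
    ]


def resolve_full_overlaps(cands):
    """Step 2: among candidates with identical term sets, keep the largest Q_i."""
    best = {}
    for c in cands:
        if c.terms not in best or c.q > best[c.terms].q:
            best[c.terms] = c
    return list(best.values())


terms = frozenset("abcde")
cands = [
    Candidate("Motor", terms, t=0.5, l=0.6, q=1.4),
    Candidate("Maschine", terms, t=0.5, l=0.4, q=0.9),
    Candidate("Ding", frozenset("ab"), t=0.5, l=0.6, q=2.0),  # too narrow
]
kept = resolve_full_overlaps(filter_candidates(cands, n_terms_total=100))
print([c.label for c in kept])
```

Here "Ding" is filtered out for having fewer than 5 terms, and "Motor" wins over "Maschine" because both cover the same terms and "Motor" has the larger quality score.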
  11. Optimization steps (continued)
      3. Resolution of substantial overlaps: when the terms of two candidates overlap by ≥ 50%, keep the larger $EC_i$ that is the best replacement for the smaller $EC_j$ and for itself.
      4. Resolution of conflicting terms: resolve each conflicting term to the “clean” candidate $EC_i$ with the highest overall similarity $O_i$.
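
One possible reading of the conflicting-terms step is sketched below: a term that appears in several candidates stays only in the candidate with the highest overall similarity $O_i$. Whether $O_i$ is taken per category (as here) or recomputed per term is not stated on the slide, so this is an assumption, as are the function name and the input layout.

```python
def resolve_conflicts(categories):
    """categories maps label -> (set of terms, O_i); returns label -> set of terms.

    Each term is kept only in the category with the highest O_i.
    """
    owner = {}
    for label, (terms, o_i) in categories.items():
        for term in terms:
            if term not in owner or o_i > categories[owner[term]][1]:
                owner[term] = label
    resolved = {label: set() for label in categories}
    for term, label in owner.items():
        resolved[label].add(term)
    return resolved


categories = {
    "Motor": ({"Dieselmotor", "Elektromotor"}, 1.2),
    "Maschine": ({"Elektromotor", "Pumpe"}, 0.9),
}
print(resolve_conflicts(categories))
```

In this toy input "Elektromotor" is claimed by both categories and stays with "Motor", which has the higher overall similarity.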
  12. Evaluation
  13. Evaluation: User study
      No German domain-specific dataset was available, so a user study was performed to evaluate the results.
      • 4 datasets: processing industry, software development, databases, and traveling
      • 9 native German study participants: 4 female, 5 male, aged 23 to 60
      • 2-4 evaluators per dataset
      • 4 configurations per dataset, differing in the number of terms-to-annotate
      • 2 methods: ANEA and a hierarchical clustering baseline
      Tasks
      1) Rate the cross-term relatedness within a category on a scale of 0-9, where 9 is the best.
      2) Rate the relatedness of a label to the terms in a category on a scale of 0-9, where 9 is the best.
      [Figure: a label and its terms, illustrating label-to-terms relatedness and cross-term relatedness]
  14. Dataset properties

      | Dataset   | Databases | Software development | Traveling | Processing |
      |-----------|-----------|----------------------|-----------|------------|
      | All words | 8161      | 8581                 | 6293      | 7984       |
      | Terms     | 1209      | 1041                 | 1040      | 552        |
      | Heads     | 713       | 673                  | 801       | 328        |
      | Assessors | 3         | 3                    | 2         | 4          |

      [Figure panels: cross-term relatedness, label-terms relatedness]
      • The distributions of the relatedness scores differ between the datasets.
      • The most frequent score per dataset is used as the threshold for creating the silver datasets.
  15. Collection of a silver dataset
      Silver datasets are required as a reference against which the configurations of ANEA are compared.
  16. Evaluation methods
      • Silver dataset
      • Hierarchical clustering: a baseline for term relatedness
      • ANEA
      • ANEA voting: the final result is derived with an ensemble/voting strategy over multiple ANEA configurations
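  17. Results

      | Topic         | Method      | # terms-to-annotate | # entity categories (ECs) | # annotated terms | ECs' average size | Term similarity | Label similarity | Average similarity |
      |---------------|-------------|---------------------|---------------------------|-------------------|-------------------|-----------------|------------------|--------------------|
      | Databases     | silver      | 420                 | 5                         | 113               | 23                | 7.2             | 7                | 7.2                |
      |               | HC          | 253                 | 8                         | 52                | 7                 | 7.2             | --               | 7.2*               |
      |               | ANEA        | 253                 | 18                        | 179               | 10                | 5.7             | 5                | 5.4                |
      |               | ANEA voting | 253-316             | 12                        | 122               | 10                | 6.3             | 5.9              | 6.1                |
      | Software dev. | silver      | 356                 | 6                         | 57                | 10                | 6.2             | 6                | 6.1                |
      |               | HC          | 303                 | 15                        | 152               | 10                | 5.5             | --               | 5.5*               |
      |               | ANEA        | 191                 | 10                        | 119               | 12                | 5               | 5.3              | 5.2                |
      |               | ANEA voting | 191-255             | 4                         | 44                | 11                | 5.6             | 6.5              | 6.0                |
      | Traveling     | silver      | 363                 | 6                         | 115               | 19                | 7.8             | 6.7              | 7.3                |
      |               | HC          | 363                 | 19                        | 156               | 8                 | 7.3             | --               | 7.3*               |
      |               | ANEA        | 363                 | 22                        | 239               | 11                | 5.4             | 4.8              | 5.1                |
      |               | ANEA voting | 258-363             | 12                        | 146               | 12                | 6.2             | 5.6              | 5.9                |
      | Processing    | silver      | 282                 | 7                         | 102               | 15                | 6.6             | 6.2              | 6.4                |
      |               | HC          | 183                 | 7                         | 56                | 8                 | 6.1             | --               | 6.1*               |
      |               | ANEA        | 227                 | 16                        | 172               | 11                | 5.3             | 4.9              | 5.1                |
      |               | ANEA voting | 181-282             | 9                         | 157               | 17                | 5.7             | 5.6              | 5.6                |

      HC: hierarchical clustering.
      ANEA voting shows an improvement of 13-15% over the original ANEA average similarity scores.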
  18. ANEA summary
      [Figure: number of terms-to-annotate (y-axis, 150-400) plotted against number of heads from the terms-to-annotate (x-axis, 300-900)]
      • ANEA has not yet reached the relatedness scores of the silver datasets.
      • The voting strategy shows a significant improvement over the ANEA results.
      Recommended configurations for ANEA with voting:
      1) $y = 158 + 0.167x$
      2) $y + 50$
      3) $y - 50$
      where $x$ is the number of unique heads among the terms-to-annotate and $y$ is the number of terms-to-annotate by ANEA.
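
The recommended configurations amount to a small piece of arithmetic, sketched here. The function name is hypothetical, and rounding $y$ to a whole number of terms is an assumption that the slide does not state.

```python
def recommended_configs(n_heads: int) -> tuple:
    """Return (y - 50, y, y + 50) with y = 158 + 0.167 * n_heads, rounded."""
    y = round(158 + 0.167 * n_heads)
    return (y - 50, y, y + 50)


# 713 is the number of heads listed for the Databases dataset.
print(recommended_configs(713))
```

For 713 heads, $y = 158 + 0.167 \cdot 713 \approx 277$, so the three configurations would use 227, 277, and 327 terms-to-annotate.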
  19. Conclusion
  20. Conclusion
      • Proposed ANEA, an unsupervised approach for the automated creation of a small dataset for domain-specific NER.
      • Evaluated ANEA with a user study on four domain datasets.
      • Producing the entity categories took less than one hour, which is significantly faster than manual annotation.
      • The produced entity categories are slightly worse than the silver datasets, but a voting strategy improves the scores by 13-15%.
      A suggested use case for ANEA: (1) annotate a small dataset, (2) validate and improve the dataset by manual inspection, (3) use the produced dataset for semi-supervised or transfer learning.
  21. Questions
      Anastasia Zhukova, zhukova@uni-wuppertal.de
