BOA extracts knowledge (binary relations) from unstructured data such as free text. This tutorial uses Korean as a running example to show how to adapt the BOA approach to your own language.
1. Daniel Gerber
Axel-Cyrille Ngonga Ngomo
AKSW, Universität Leipzig
BOA
How To Integrate Your Language
2. Bootstrapping the Data Web
General Overview
Corpus ➡ Indexing ➡ Background Knowledge ➡ Surface forms ➡ Korean features ➡ Search & Scoring ➡ RDF extraction ➡ Evaluation
AKSW@KAIST - 17.01.2012 - Page 2 http://boa.aksw.org
3. Bootstrapping the Data Web
1. Create a corpus in your language
๏ At least 25M sentences
๏ Chunked into one sentence per line
๏ No HTML
๏ UTF-8?
๏ For later coreference resolution, the resource URL needs to be available
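The corpus requirements above can be sketched in a few lines of Python. This is a minimal sketch, not the BOA tooling: the regex-based tag stripping and the naive sentence splitter are assumptions, and a real Korean corpus would need a proper sentence boundary detector. Writing the source URL and the sentence tab-separated on one line is also an assumed layout, chosen only so the resource URL stays available.

```python
import html
import re
from typing import Iterable, List, Tuple

TAG_RE = re.compile(r"<[^>]+>")
# Naive boundary rule (assumption): split after ., ! or ? followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def to_sentences(raw: str) -> List[str]:
    """Strip HTML tags and entities, then split into sentences."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENTENCE_END_RE.split(text) if s]


def write_corpus(documents: Iterable[Tuple[str, str]], path: str) -> None:
    """Write one sentence per line as UTF-8, prefixed by its source URL."""
    with open(path, "w", encoding="utf-8") as out:
        for url, raw in documents:
            for sentence in to_sentences(raw):
                out.write(f"{url}\t{sentence}\n")
```

Calling `to_sentences` on a small HTML snippet returns the plain sentences in order; `write_corpus` then produces the one-sentence-per-line file.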
4. Bootstrapping the Data Web
2. Corpus indexing
๏ Apache Lucene 3.4.0
๏ Set of >20 UTF-8 RegEx filters
๏ Whitespace Analyzer
➡ No stemming
➡ Tokenization on whitespace only
➡ Stop-words included in index
➡ Lowercase version
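To illustrate what these indexing choices mean in practice, here is a pure-Python stand-in for a whitespace-tokenized index. It does not use Lucene; the function names `build_index` and `search` are hypothetical, and the sketch only mirrors the properties listed above (no stemming, stop-words kept, lowercased tokens).

```python
from collections import defaultdict


def whitespace_tokens(sentence):
    """Split on whitespace only: no stemming, no stop-word removal."""
    return sentence.split()


def build_index(sentences):
    """Map each lowercased token to the ids of the sentences containing it."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for token in whitespace_tokens(sentence.lower()):
            index[token].add(sentence_id)
    return index


def search(index, phrase):
    """Return ids of sentences that contain every token of the phrase."""
    tokens = whitespace_tokens(phrase.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

One consequence worth noting for Korean: with whitespace tokenization, particles stay attached to the noun, so a query for 오바마 does not match the token 오바마는.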
5. Bootstrapping the Data Web
3. Background knowledge I
Object Properties vs Datatype Properties
6. Bootstrapping the Data Web
3. Background knowledge II
Field | Line #1 | Line #2
URI1 | http://dbpedia.org/resource/South_Korea | http://dbpedia.org/resource/KAIST
Label1 | 대한민국 | 한국 과학 기술원
Property | http://dbpedia.org/ontology/capital | http://dbpedia.org/ontology/country
URI2 | http://dbpedia.org/resource/Seoul | http://dbpedia.org/resource/South_Korea
Label2 | 서울 | 대한민국
Domain | http://dbpedia.org/ontology/PopulatedPlace | ⎯
Range | http://dbpedia.org/ontology/PopulatedPlace | http://dbpedia.org/ontology/Country
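Records like the two above could be gathered with a SPARQL query per property. The sketch below only builds the query string; the function name `background_knowledge_query`, the `ko` language tag, and the default limit are assumptions, and whether a given endpoint carries Korean labels has to be checked separately. It covers object properties only, since both sides need a label.

```python
def background_knowledge_query(property_uri, lang="ko", limit=1000):
    """Build a SPARQL query for triples of one property plus labels in `lang`."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?sLabel ?o ?oLabel WHERE {{
  ?s <{property_uri}> ?o .
  ?s rdfs:label ?sLabel .
  ?o rdfs:label ?oLabel .
  FILTER (lang(?sLabel) = "{lang}" && lang(?oLabel) = "{lang}")
}} LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(background_knowledge_query("http://dbpedia.org/ontology/capital"))
```

Each result row supplies URI1, Label1, URI2 and Label2 for one line of the table; domain and range come from the property definition in the ontology.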
7. Bootstrapping the Data Web
4. Surface form generation
๏ DBpedia Spotlight
๏ Labels
๏ Redirects
๏ Disambiguation
๏ Datatype Properties
๏ Person XY is born on 1st of October in 1972.
๏ Person XY is born on 1 October in 1972.
๏ Person XY is born on a Thursday in 1972
๏ Find and Create those surface forms
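For datatype properties such as dates, the surface forms have to be generated rather than looked up. A minimal sketch for English date forms, assuming the three variants from the examples above (ordinal day, plain day, weekday); the function names are hypothetical, and Korean would need its own set of templates.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]


def ordinal(day):
    """Return the day with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def date_surface_forms(d):
    """Generate a few textual variants under which a date may appear."""
    month = MONTHS[d.month - 1]
    weekday = WEEKDAYS[d.weekday()]  # date.weekday(): Monday is 0
    return [
        f"{ordinal(d.day)} of {month}",
        f"{d.day} {month}",
        f"a {weekday}",
    ]


if __name__ == "__main__":
    for form in date_surface_forms(date(1972, 10, 1)):
        print(form)
```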
8. Bootstrapping the Data Web
5. Korean feature extraction
Language dependent | Language independent
ReVerb | # of stopwords
Wordnet | # of words
Distance ? | # of occurrences
?
? ?
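The three language-independent features named above are simple counts. A rough sketch, assuming a pattern is a plain string and the stop-word list is supplied per language; the function name `pattern_features` and the dictionary keys are hypothetical.

```python
def pattern_features(pattern, sentences, stopwords):
    """Compute simple count-based features for one pattern string."""
    tokens = pattern.split()
    return {
        # number of whitespace-separated words in the pattern
        "num_words": len(tokens),
        # how many of those words are in the supplied stop-word list
        "num_stopwords": sum(1 for t in tokens if t.lower() in stopwords),
        # how often the pattern string occurs across the corpus sentences
        "num_occurrences": sum(s.count(pattern) for s in sentences),
    }


if __name__ == "__main__":
    corpus = ["Barack Obama was born in Honolulu."]
    print(pattern_features("was born in", corpus, {"was", "in"}))
```

Only the stop-word list changes between languages here; the counting itself stays the same.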
9. Bootstrapping the Data Web
6. Pattern search and scoring
English: Barack Obama was born in Honolulu.
Pattern: was born in
Korean: 버락 오바마는 호놀룰루에서 태어났습니다.
Subject? Predicate? Object?
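One way to see why the Korean sentence raises these questions is to replace the known subject and object labels with placeholders and look at what remains. This is a sketch under assumptions: the placeholder names `?D?` and `?R?` and the function `to_pattern` are chosen here for illustration and are not taken from the slides.

```python
def to_pattern(sentence, subject_label, object_label):
    """Replace the first occurrence of each label with a placeholder.

    Returns None when either label is missing from the sentence.
    """
    if subject_label not in sentence or object_label not in sentence:
        return None
    pattern = sentence.replace(subject_label, "?D?", 1)
    return pattern.replace(object_label, "?R?", 1)


if __name__ == "__main__":
    print(to_pattern("Barack Obama was born in Honolulu.",
                     "Barack Obama", "Honolulu"))
    print(to_pattern("버락 오바마는 호놀룰루에서 태어났습니다.",
                     "버락 오바마", "호놀룰루"))
```

In the English sentence the predicate sits between the two placeholders; in the Korean sentence only a particle sits between them and the predicate follows the object, so a search that keeps only the text between the labels would miss it.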
10. Bootstrapping the Data Web
7. RDF extraction
English: Barack Obama was born in Honolulu.
Pattern: was born in ➡ dbpedia-owl:birthPlace
Korean: 버락 오바마는 호놀룰루에서 태어났습니다.
Pattern: 에서 태어났습니다. ➡ dbpedia-owl:birthPlace
Subject: Barack Obama, Object: Honolulu
Named Entity Disambiguation!
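The extraction step can be sketched as matching a sentence against known patterns and mapping the matched labels to URIs. Everything here is a toy stand-in: the placeholder syntax, the whole-sentence patterns, and the dictionary lookup that plays the role of named entity disambiguation are assumptions, and a real system would need a proper disambiguation component.

```python
import re

BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"

# Hypothetical pattern library: pattern string -> property URI.
PATTERNS = {
    "?D? was born in ?R?.": BIRTH_PLACE,
    "?D?는 ?R?에서 태어났습니다.": BIRTH_PLACE,
}

# Toy label lookup standing in for named entity disambiguation.
LABEL_TO_URI = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "버락 오바마": "http://dbpedia.org/resource/Barack_Obama",
    "Honolulu": "http://dbpedia.org/resource/Honolulu",
    "호놀룰루": "http://dbpedia.org/resource/Honolulu",
}


def pattern_to_regex(pattern):
    """Turn a placeholder pattern into a regex with named groups d and r."""
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape("?D?"), "(?P<d>.+?)")
    escaped = escaped.replace(re.escape("?R?"), "(?P<r>.+?)")
    return re.compile(escaped)


def extract_triples(sentence):
    """Return N-Triples lines for every pattern that matches the sentence."""
    triples = []
    for pattern, prop in PATTERNS.items():
        match = pattern_to_regex(pattern).fullmatch(sentence)
        if match is None:
            continue
        subject = LABEL_TO_URI.get(match.group("d"))
        obj = LABEL_TO_URI.get(match.group("r"))
        if subject and obj:
            triples.append(f"<{subject}> <{prop}> <{obj}> .")
    return triples
```

With this setup the English and the Korean sentence both yield the same birthPlace triple, which is the point of the slide: different surface patterns, one property.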
11. Bootstrapping the Data Web
8. Evaluation
1. Select properties P to evaluate (T100)
2. Query DBpedia for triples (and labels) with p ∈ P
3. Find sentences containing the labels
4. Assess whether the triple can be found in the sentence
➡ Gold Standard with 1000 annotated sentence/triple pairs
5. Run one BOA iteration on Gold Standard
6. Measure Precision/Recall/F-Measure
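The final measurement step compares the extracted triples with the gold standard. A minimal sketch, assuming both are given as sets of comparable items such as (subject, property, object) tuples; the function name is hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F1 of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {("a", "p", "b"), ("c", "p", "d"), ("e", "p", "f"), ("g", "p", "h")}
    extracted = {("a", "p", "b"), ("c", "p", "d"), ("x", "p", "y")}
    print(precision_recall_f1(extracted, gold))
```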
12. Bootstrapping the Data Web
Necessary resources for new language
๏ 50M sentences (ideally general knowledge)
๏ Sentence Boundary Disambiguation
๏ Part-of-speech tagger (helpful)
๏ Named Entity Recognition
๏ Named Entity Disambiguation
๏ Labels for resources
๏ SPARQL endpoint