Daniel Gerber
Axel-Cyrille Ngonga Ngomo
AKSW, Universität Leipzig

BOA
How To Integrate Your Language
Bootstrapping the Data Web
General Overview

    1. Corpus
    2. Indexing
    3. Background knowledge
    4. Surface forms
    5. Korean features
    6. Search & scoring
    7. RDF extraction
    8. Evaluation

AKSW@KAIST - 17.01.2012 - Page 2                                       http://boa.aksw.org
1. Create a corpus in your language



             ๏      At least 25M sentences
             ๏      Chunked into one sentence per line
             ๏      No HTML
             ๏      UTF-8?
             ๏      For later coreference resolution, the resource URL needs to be available
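
The corpus requirements above can be sketched as a small cleaning step. This is only an illustration, not the BOA implementation: the regex-based tag stripping, the naive sentence splitter and the tab-separated `corpus_line` layout are all assumptions made for the example. A real setup would use a proper sentence boundary detector for the target language.

```python
import html
import re

# Assumption: HTML is removed with a simple tag regex, not a full parser.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(raw_text: str) -> list[str]:
    """Strip tags and entities, then return one sentence per list item."""
    text = TAG_RE.sub(" ", raw_text)
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []
    return [s for s in BOUNDARY_RE.split(text) if s]


def corpus_line(resource_url: str, sentence: str) -> str:
    """Keep the source URL next to the sentence (hypothetical layout)."""
    return f"{resource_url}\t{sentence}"
```

Writing the resulting lines with `open(path, "w", encoding="utf-8")` keeps the file in UTF-8, and storing the URL alongside each sentence keeps it available for later coreference resolution.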

2. Corpus indexing

             ๏       Apache Lucene 3.4.0
             ๏       Set of >20 UTF-8 RegEx filters
             ๏       Whitespace Analyzer
                 ➡       No stemming
                 ➡       Tokenization on whitespace only
                 ➡       Stop-words included in index
                 ➡       Lowercase version
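
To illustrate what these analyzer settings mean for search, here is a minimal in-memory inverted index. It does not use Lucene; it only mimics the behaviour listed above (whitespace tokenization, no stemming, stop-words kept), and it assumes that "Lowercase version" means tokens are lowercased before indexing. All function names are made up for this sketch.

```python
from collections import defaultdict


def whitespace_tokens(sentence: str) -> list[str]:
    """Lowercase and split on whitespace; no stemming, no stop-word removal."""
    return sentence.lower().split()


def build_index(sentences: list[str]) -> dict[str, set[int]]:
    """Map every token to the ids of the sentences that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for token in whitespace_tokens(sentence):
            index[token].add(sentence_id)
    return index


def search_all(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of sentences that contain every token of the query."""
    tokens = whitespace_tokens(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because stop-words stay in the index, a query such as "was born in" can be matched in full. In this sketch punctuation stays attached to tokens ("honolulu." rather than "honolulu"), which is one reason a cleaning step before indexing is useful.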
3. Background knowledge I




                 Object Properties     vs.     Datatype Properties



3. Background knowledge II

    Field       Line #1                                       Line #2

    URI1        http://dbpedia.org/resource/South_Korea       http://dbpedia.org/resource/KAIST
    Label1      대한민국                                      한국 과학 기술원
    Property    http://dbpedia.org/ontology/capital           http://dbpedia.org/ontology/country
    URI2        http://dbpedia.org/resource/Seoul             http://dbpedia.org/resource/South_Korea
    Label2      서울                                          대한민국
    Domain      http://dbpedia.org/ontology/PopulatedPlace    ⎯
    Range       http://dbpedia.org/ontology/PopulatedPlace    http://dbpedia.org/ontology/Country
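
One way to handle such background knowledge lines in code is a small record type with the seven fields shown above. The slide does not specify the file layout, so the tab separator, the use of an empty field for a missing domain or range, and the names `BackgroundTriple` and `parse_line` are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackgroundTriple:
    uri1: str
    label1: str
    prop: str
    uri2: str
    label2: str
    domain: Optional[str]   # None when the property has no domain
    range_: Optional[str]   # None when the property has no range


def parse_line(line: str, sep: str = "\t") -> BackgroundTriple:
    """Parse one line with exactly seven separator-delimited fields."""
    fields = [field.strip() for field in line.rstrip("\n").split(sep)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    uri1, label1, prop, uri2, label2, domain, range_ = fields
    return BackgroundTriple(
        uri1, label1, prop, uri2, label2, domain or None, range_ or None
    )
```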

4. Surface form generation

            ๏ DBpedia Spotlight
                ➡ Labels
                ➡ Redirects
                ➡ Disambiguation
            ๏ Datatype Properties
                ➡ Person XY was born on 1st of October in 1972.
                ➡ Person XY was born on 1 October in 1972.
                ➡ Person XY was born on a Thursday in 1972.
                ➡ Find and create those surface forms
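
For datatype properties such as dates, surface forms can be generated from the literal value. The sketch below produces a few English variants similar to the examples above. The chosen templates and the function names are assumptions for illustration; a Korean integration would need its own date templates.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def ordinal(n: int) -> str:
    """Return the English ordinal form of n, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def date_surface_forms(d: date) -> list[str]:
    """Generate several textual variants of the same date."""
    month = MONTHS[d.month - 1]
    return [
        f"{ordinal(d.day)} of {month} in {d.year}",
        f"{d.day} {month} in {d.year}",
        f"{d.day} {month} {d.year}",
        f"{month} {d.day}, {d.year}",
    ]
```

Each generated form can then be searched in the corpus in the same way as a resource label.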
5. Korean feature extraction

    Language dependent:      ReVerb, WordNet distance, ?
    Language independent:    # of words, # of stopwords, # of occurrences, ?
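
The three count-based features named on this slide need no language-specific tooling beyond a stop-word list. A minimal sketch follows, assuming whitespace tokenization and counting an occurrence as a sentence that contains the pattern string. The function name and the feature keys are invented for the example and are not BOA's own.

```python
def pattern_features(
    pattern: str, stopwords: set[str], corpus: list[str]
) -> dict[str, int]:
    """Compute simple count features for a natural-language pattern."""
    lowered = pattern.lower()
    tokens = lowered.split()
    return {
        "num_words": len(tokens),
        "num_stopwords": sum(1 for token in tokens if token in stopwords),
        # Assumption: one occurrence per sentence containing the pattern.
        "num_occurrences": sum(1 for s in corpus if lowered in s.lower()),
    }
```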


6. Pattern search and scoring




     Barack Obama was born in Honolulu.

         Subject?      Barack Obama
         Predicate?    was born in
         Object?       Honolulu

     버락 오바마는 호놀룰루에서 태어났습니다.
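
The contrast between the English and the Korean sentence can be made concrete with a small helper that splits a sentence around two known labels. This is a simplified sketch, not BOA's search procedure: it assumes exact string matches and that the subject label precedes the object label. The name `split_around_labels` is hypothetical.

```python
from typing import Optional, Tuple


def split_around_labels(
    sentence: str, subject_label: str, object_label: str
) -> Optional[Tuple[str, str]]:
    """Return (text between the labels, text after the object label)."""
    start = sentence.find(subject_label)
    if start == -1:
        return None
    subject_end = start + len(subject_label)
    object_start = sentence.find(object_label, subject_end)
    if object_start == -1:
        return None
    between = sentence[subject_end:object_start].strip()
    after = sentence[object_start + len(object_label):].strip()
    return between, after
```

For the English sentence the predicate candidate sits between the two labels ("was born in"). For the Korean sentence only the particle "는" lies between them, and the verb phrase "에서 태어났습니다." follows the object, so a between-the-entities search alone is not enough.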
7. RDF extraction


         Barack Obama was born in Honolulu.

             Pattern:     was born in
             Property:    dbpedia-owl:birthPlace
             Subject:     Barack Obama     (Named Entity Disambiguation!)
             Object:      Honolulu

         버락 오바마는 호놀룰루에서 태어났습니다.

             Pattern:     에서 태어났습니다.
             Property:    dbpedia-owl:birthPlace
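
Once a pattern is mapped to a property and both labels are disambiguated to URIs, emitting a triple is straightforward. In the sketch below the two lookup tables stand in for real pattern scoring and named entity disambiguation. Their contents and the function name `to_ntriple` are assumptions made for the example.

```python
from typing import Optional

BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"

# Hypothetical stand-in for named entity disambiguation: label -> URI.
LABEL_TO_URI = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "버락 오바마": "http://dbpedia.org/resource/Barack_Obama",
    "Honolulu": "http://dbpedia.org/resource/Honolulu",
    "호놀룰루": "http://dbpedia.org/resource/Honolulu",
}

# Hypothetical pattern library: natural-language pattern -> property URI.
PATTERN_TO_PROPERTY = {
    "was born in": BIRTH_PLACE,
    "에서 태어났습니다.": BIRTH_PLACE,
}


def to_ntriple(subject_label: str, pattern: str, object_label: str) -> Optional[str]:
    """Return an N-Triples line, or None if any part cannot be resolved."""
    subject = LABEL_TO_URI.get(subject_label)
    prop = PATTERN_TO_PROPERTY.get(pattern)
    obj = LABEL_TO_URI.get(object_label)
    if subject is None or prop is None or obj is None:
        return None
    return f"<{subject}> <{prop}> <{obj}> ."
```

With these tables, the English and the Korean sentence yield the same triple, which is the point of mapping language-specific patterns to a shared property.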
8. Evaluation


    1. Select properties P to evaluate (T100)
    2. Query DBpedia for triples (and labels) with p ∈ P
    3. Find sentences that contain the labels
    4. Assess whether the triple is expressed in the sentence
    ➡ Gold Standard with 1000 annotated sentence/triple pairs
    5. Run one BOA iteration on the Gold Standard
    6. Measure Precision/Recall/F-Measure
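
The final measurement step can be sketched as a comparison of two sets of triples. The representation of a triple as a tuple of strings and the function name are assumptions. The F-measure shown is the balanced F1, which the slide does not specify.

```python
def precision_recall_f1(
    extracted: set[tuple[str, str, str]], gold: set[tuple[str, str, str]]
) -> tuple[float, float, float]:
    """Compare extracted triples against gold-standard triples."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```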
Necessary resources for new language


            ๏   50M sentences (ideally general knowledge)
            ๏   Sentence Boundary Disambiguation
            ๏   Part-of-speech tagger (helpful)
            ๏   Named Entity Recognition
            ๏   Named Entity Disambiguation
            ๏   Labels for resources
            ๏   SPARQL endpoint
Thank you!
Questions?

Daniel Gerber
Johannisgasse 26, Room 5-21
04103 Leipzig, Germany
SIMBA@AKSW
http://bis.informatik.uni-leipzig.de/DanielGerber
http://boa.aksw.org
http://code.google.com/p/boa

   LOD2 Presentation . 02.09.2010 . Page                http://lod2.eu
