IS in African
 Languages

Chiarcos et al.

CRC IS

ANNIS

                   Information Structure in African Languages: Corpora and Tools

                    Christian Chiarcos, Ines Fiedler, Mira Grubic, Andreas
                    Haida, Katharina Hartmann, Julia Ritz, Anne Schwarz,
                              Amir Zeldes, Malte Zimmermann

                             Collaborative Research Centre ‘Information Structure’
                  Universität Potsdam, Germany & Humboldt-Universität zu Berlin, Germany


                                           March 31, 2009
                  Table of contents

                  1 The Collaborative Research Centre ‘Information Structure’
                  2 ANNIS
                  Introduction to the work of the CRC IS

                  The Collaborative Research Centre ‘Information Structure’
                    • 42 researchers
                    • 4 disciplines (Linguistics, Psychology, German Studies,
                      African Studies)
                    • 15 projects
                    • 2 universities (Humboldt-University Berlin, University of
                      Potsdam)
                    • Funded by the German Research Foundation
                    • Common goal: better understanding of information
                      structure across languages
                  What is Information Structure?

                  Information Structure
                  Information Structure is the structuring of linguistic
                  information in order to optimize information transfer relative to
                  the temporary communicative needs of interlocutors.

                  The same information needs to be ‘packaged’ in different ways
                  depending on the knowledge and goals of the speakers.
                    (1)   a. I have a cat, and I had to bring my cat to the vet.
                          b. #I had to bring my cat to the vet, and I have a cat.

                  Important concepts: Focus
                  Focus indicates the presence of alternatives that are relevant
                  for the interpretation of linguistic expressions.

                    (3)   a. Clyde had to marry [BERtha]F in order to be eligible
                             for the inheritance.
                          b. Clyde had to [MARry]F Bertha in order to be eligible
                             for the inheritance.

                  (5)   a. Who stole the cookie?
                        b. [PEter]F stole the cookie.
                        c. #Peter stole the [COOkie]F.

                  Important concepts: Givenness
                  Givenness is the indication that a concept is immediately
                  present in the shared knowledge of the speakers, e.g. previously
                  mentioned:

                    (6)   a. Who stole the cookie?
                          b. [PEter]F [stole the cookie]Given.

                  Important concepts: Givenness

                    (8)   a. I know that John stole a cookie. What did he do
                             then?
                          b. He [reTURNed [the cookie]Given]F

                  Important concepts: Topic
                  The topic constituent identifies the entity under which the
                  information expressed in the comment constituent should be ‘stored’.

                    (9)   a. [Aristotle Onassis]Topic [married Jacqueline Kennedy]Comment.
                          b. [Jacqueline Kennedy]Topic [married Aristotle Onassis]Comment.
                  Research at the CRC

                  [Diagram: Gur and Kwa, Chadic languages, Focus project;
                  Elicitation with QUIS; Transcription/Annotation;
                  Elicited Data, Hausar Baka, HIC]
                  Information Structure in African Languages

                   • Focus marking by movement (Ex-situ focus)

                  (11)   Kiifii nèe Kande ta-kèe       dafàa-waa.
                         fish   PRT Kande 3sg-rel.cont cook-NMLZ
                         (Hausa, Chadic)

                         ‘Kande is cooking FISH.’
                  (12)   padgo  taabéè  Kai (Tangale, Chadic)
                         bought tobacco Kai
                         ‘KAI bought tobacco.’

                   • Focus marking without movement (In-situ focus)

                  (13)   [Byali example sentence; transcription not recoverable from the extracted text]
                         woman buy CL.POSS child-CL FM book-CL
                         (Byali, Gur)

                         ‘The woman bought a book for her CHILD.’
                  (14)   Yaa      sòokee shì dà   wuƙaa. (Hausa, Chadic)
                         3sg.perf stab   him with knife
                         ‘He stabbed him with a KNIFE.’
                  Questionnaire on IS

                  • (Skopeteas et al., 2006)
                  • Elicitation on the basis of pictures / short movies
                  • Descriptions, Narration, Questions/answers, Games
                  • highly controlled as well as less controlled settings
IS in African
 Languages
                  Questionnaire on IS
Chiarcos et al.

CRC IS

ANNIS
IS in African
 Languages
                                         Research at the CRC
Chiarcos et al.

CRC IS

                             Chadic
ANNIS
                                          Focus
                  Gur and
                             lan-
                                          project
                  Kwa
                             guages


                       Elicitation with QUIS

                       Transcription/Annotation



                       Elicited Hausar
                                       HIC
                       Data Baka
                  Transcription and Annotation

                  • annotation scheme LISA (Dipper et al., 2007)
                  • applicable across typologically different languages
                  • guidelines for annotation of phonology, morphology,
                    syntax, semantics and information structure
                  • (Semi-)automatic annotation also possible
                  Elicited Data

                  • 19 Gur/Kwa languages: Baatonum, Buli, Byali, Dagbani,
                    Ditammari, Gurene, Konkomba, Konni, Nateni, Waama,
                    Yom (Gur languages) and Aja, Akan, Efutu, Ewe, Fon,
                    Foodo, Lelemi, Anii (Kwa languages).
                  • 6 Chadic languages: Hausa, Tangale, Guruntum (West
                    Chadic) and Bura, South Marghi, Tera (Central Chadic).
                  • elicited with QUIS and language-specific additional tasks.
                  Hausar Baka Corpus

                  • by Randell, Bature and Schuh, 1998
                  • collection of videotaped dialogues
                  • about 1500 Hausa sentences
                  • annotated using LISA
                  Hausa Internet Corpus

                  • current project
                  • in cooperation with another NLP project of the CRC
                  • large amounts of Hausa material available on the internet
                  • parallel sections: novel Ruwan Bagaja by Abubakar Imam,
                    Bible and Qur’an sections, Declaration of Human Rights.
                  • These parallel sections open the possibility of
                    semiautomatic annotation:
                       • POS annotation projection from English to Hausa
                       • projected annotation used to train tagger/chunker
                       • existing manual annotations used as a gold standard for
                         evaluation
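
The projection step can be illustrated with a minimal sketch. It assumes that word alignments between the English and Hausa sides of a parallel text are already available as index pairs. The function names, the tag set, the alignment, and the gold tags below are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from collections import Counter


def project_pos(source_tags, alignment, target_len):
    """Project POS tags from source tokens onto target tokens.

    alignment is a list of (source_index, target_index) pairs.
    A target token aligned to several source tokens takes the most
    frequent projected tag; unaligned target tokens get None.
    """
    votes = [Counter() for _ in range(target_len)]
    for src, tgt in alignment:
        votes[tgt][source_tags[src]] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def accuracy(predicted, gold):
    """Share of target tokens whose projected tag equals the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy sentence pair; tags and alignment are hypothetical.
english = ["He", "stabbed", "him", "with", "a", "knife"]
english_tags = ["PRON", "VERB", "PRON", "ADP", "DET", "NOUN"]
hausa = ["Yaa", "sòokee", "shì", "dà", "wuƙaa"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (5, 4)]  # "a" stays unaligned

projected = project_pos(english_tags, alignment, len(hausa))

# Hypothetical manual annotation used as the gold standard.
gold = ["AUX", "VERB", "PRON", "ADP", "NOUN"]
score = accuracy(projected, gold)
```

In this toy case the first Hausa token inherits the tag of the English pronoun, which disagrees with the assumed gold tag. Comparing against manual annotations is meant to expose exactly this kind of mismatch before the projected tags are used as training data.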
                  Research at the CRC

                  [Diagram: Gur and Kwa, Chadic languages, Focus project;
                  Elicitation with QUIS; Transcription/Annotation;
                  ANNIS Database]
                  Framework Architecture

                  [Figure: framework architecture]
                  ANNIS

                  • web-based corpus interface
                  • query and visualization of annotations
                      • (sequences of) tokens
                      • trees (labeled edges, crossing edges)
                      • pointing relations
                      • nested, overlapping, conflicting, discontinuous
                  • user management
                      • authorized access
                      • according to legal status of corpus
                  Querying in ANNIS

                  • ANNIS Query Language
                  • graphical Query Builder (drag & drop)

                    basic concepts:
                    nodes, relations between nodes
                  ANNIS Query Language

                  • nodes (sequentially numbered variables)
                       • generalized category
                         tok (= any token), node (= any annotation)
                       • regular expressions / exact expressions
                         pos=/ADJ[AD]/, pos=/P.*/, cat="NP"
                  • relations between nodes
                       • co-extension, overlapping, contained/adjacent span
                         lemma=/.*ing/ & pos="NN" & #1 = #2
                       • dominance (direct/indirect, left-/rightmost child, common
                         parent, etc., including edge labels)
                         cat="NP" & cat="PP" & #1 > #2
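
To make the node/relation model concrete, here is a toy evaluation of the co-extension query from the slide. It is a sketch of the idea only, not of how ANNIS processes queries. The Node class, the helper functions, and the sample data are hypothetical, and the sketch assumes that a regular expression must match the whole annotation value.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    name: str   # annotation name, e.g. "lemma" or "pos"
    value: str  # annotation value
    start: int  # index of the first covered token
    end: int    # index of the last covered token (inclusive)


def select(nodes, name, pattern=None, exact=None):
    """Return the nodes with the given name whose value matches."""
    result = []
    for n in nodes:
        if n.name != name:
            continue
        if exact is not None and n.value != exact:
            continue
        if pattern is not None and re.fullmatch(pattern, n.value) is None:
            continue
        result.append(n)
    return result


def coextensive(a, b):
    """True if both nodes cover exactly the same tokens."""
    return a.start == b.start and a.end == b.end


def query(nodes):
    """Toy counterpart of: lemma=/.*ing/ & pos="NN" & #1 = #2"""
    first = select(nodes, "lemma", pattern=r".*ing")   # node #1
    second = select(nodes, "pos", exact="NN")          # node #2
    return [(a, b) for a, b in product(first, second) if coextensive(a, b)]


# Hypothetical sample data: one lemma node and one pos node per token.
tokens = ["the", "building", "is", "standing"]
lemmas = ["the", "building", "be", "stand"]
pos_tags = ["DT", "NN", "VBZ", "VBG"]
nodes = []
for i in range(len(tokens)):
    nodes.append(Node("lemma", lemmas[i], i, i))
    nodes.append(Node("pos", pos_tags[i], i, i))

matches = query(nodes)
```

Each result is a pair of nodes, one per numbered variable in the query, and the number of pairs corresponds to a match count.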
                  Query Processing

                  [Figure: query processing]
                  Corpus Presentation

                  • match count for quantitative studies
                  • full Unicode support (diacritics, e.g. for tone)
                  • visualization of annotations
                       • tokens, spans
                       • trees
                       • pointing relations
                  • rendering of audio files (embedded media player)
                  • save and export facilities
                       • ‘deep links’ for citation
                       • export to tabular format ARFF
                         (WEKA machine learning environment)
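
For readers unfamiliar with ARFF, the sketch below writes a small table in that format. It reuses the tokens and glosses of example (14); the relation name, the column names, and the helper functions are hypothetical, and the columns that ANNIS actually exports may differ.

```python
def arff_quote(value):
    """Quote a value if it contains characters that are special in ARFF."""
    if value == "" or any(c in value for c in " ,'\"{}%\t"):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return value


def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes is a list of (name, values) pairs; values is a list of
    nominal values, or None for a free-text string attribute.
    """
    lines = ["@relation " + arff_quote(relation), ""]
    for name, values in attributes:
        if values is None:
            lines.append("@attribute " + name + " string")
        else:
            nominal = ",".join(arff_quote(v) for v in values)
            lines.append("@attribute " + name + " {" + nominal + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(arff_quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical columns: token, gloss, and whether the token is focused.
attributes = [("token", None), ("gloss", None), ("focus", ["yes", "no"])]
rows = [
    ("Yaa", "3sg.perf", "no"),
    ("sòokee", "stab", "no"),
    ("shì", "him", "no"),
    ("dà", "with", "no"),
    ("wuƙaa", "knife", "yes"),
]
arff_text = to_arff("hausa_example", attributes, rows)
```

Writing arff_text to a UTF-8 encoded file with the .arff extension should give a file that WEKA can load, with one row per token.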
                  Summary

                  • Resources
                      • deeply annotated
                      • specialized on IS
                      • tools allowing for query and evaluation
                  • extend corpus studies
                      • near-natural language
                      • larger amounts of data
                  • better understanding of IS

Más contenido relacionado

Más de Guy De Pauw

Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusGuy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of SantomeGuy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Guy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Guy De Pauw
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clusteringGuy De Pauw
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicGuy De Pauw
 
Ge'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGe'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGuy De Pauw
 
Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Guy De Pauw
 

Más de Guy De Pauw (20)

Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clustering
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for Amharic
 
Ge'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGe'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration Model
 
Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...
 

Information Structure in African Languages

  • 1. IS in African Languages Chiarcos et al. CRC IS Information Structure in African Languages: ANNIS Corpora and Tools Christian Chiarcos, Ines Fiedler, Mira Grubic, Andreas Haida, Katharina Hartmann, Julia Ritz, Anne Schwarz, Amir Zeldes, Malte Zimmermann Collaborative Research Centre ‘Information Structure’ Universit¨t Potsdam, Germany & Humboldt Universit¨t zu Berlin, Germany a a March 31, 2009
  • 2. Table of contents: 1 The Collaborative Research Centre ‘Information structure’; 2 ANNIS
  • 3. Introduction to the work of the CRC IS. The Collaborative Research Centre ‘Information structure’: • 42 researchers • 4 disciplines (Linguistics, Psychology, German Studies, African Studies) • 15 projects • 2 universities (Humboldt-University Berlin, University of Potsdam) • Funded by the German Research Foundation • Common goal: better understanding of information structure across languages
  • 5. What is Information Structure? Information Structure is the structuring of linguistic information in order to optimize information transfer relative to the temporary communicative needs of interlocutors.
  • 6. What is Information Structure? The same information needs to be ‘packaged’ in different ways depending on the knowledge and goals of the speakers. (1) a. I have a cat, and I had to bring my cat to the vet. b. #I had to bring my cat to the vet, and I have a cat.
  • 8. What is Information Structure? Important concepts: Focus. Focus indicates the presence of alternatives that are relevant for the interpretation of linguistic expressions. (3) a. Clyde had to marry BERtha_F in order to be eligible for the inheritance. b. Clyde had to MARry_F Bertha in order to be eligible for the inheritance.
  • 10. What is Information Structure? (5) a. Who stole the cookie? b. PEter_F stole the cookie. c. #Peter stole the COOkie_F.
  • 11. What is Information Structure? Important concepts: Givenness. Givenness is the indication that a concept is immediately present in the shared knowledge of the speakers, e.g. previously mentioned: (6) a. Who stole the cookie? b. PEter_F [stole the cookie]_Given.
  • 13. What is Information Structure? Important concepts: Givenness. (8) a. I know that John stole a cookie. What did he do then? b. He [reTURNed [the cookie]_Given]_F.
  • 14. What is Information Structure? Important concepts: Topic. The topic constituent identifies the entity under which the information expressed in the comment constituent should be ‘stored’. (9) a. [Aristotle Onassis]_Topic [married Jacqueline Kennedy]_Comment. b. [Jacqueline Kennedy]_Topic [married Aristotle Onassis]_Comment.
  • 16. Research at the CRC [diagram with the boxes: Focus project; Gur and Kwa; Chadic languages; Elicitation with QUIS; Transcription/Annotation; Elicited Data; Hausar Baka; HIC]
  • 17. Information Structure in African Languages • Focus marking by movement (Ex-situ focus) (11) Kiifii nèe Kande ta-kèe dafàa-waa. fish PRT Kande 3sg-rel.cont cook-NMLZ (Hausa, Chadic) ‘Kande is cooking FISH.’ (12) padgo taabéè Kai. bought tobacco Kai (Tangale, Chadic) ‘KAI bought tobacco.’
  • 18. Information Structure in African Languages • Focus marking without movement (In-situ focus) (13) p´u u¯ n¯nd´ u U @¯ y` s´b`-l´. @ a@@ ı´ @ b´ı-g¯ woman buy CL.POSS child-CL FM book-CL (Byali, Gur) ‘The woman bought a book for her CHILD.’ (14) Yaa sòokee shì dà wuƙaa. 3sg.perf stab him with knife (Hausa, Chadic) ‘He stabbed him with a KNIFE.’
  • 20. Questionnaire on IS • (Skopeteas et al., 2006) • Elicitation on the basis of pictures / short movies • Descriptions, Narration, Questions/answers, Games • highly controlled as well as less controlled settings
  • 23. Transcription and Annotation • annotation scheme LISA (Dipper et al., 2007) • applicable across typologically different languages • guidelines for annotation of phonology, morphology, syntax, semantics and information structure • (semi-)automatic annotation also possible
  • 26. Elicited Data • 19 Gur/Kwa languages: Baatonum, Buli, Byali, Dagbani, Ditammari, Gurene, Konkomba, Konni, Nateni, Waama, Yom (Gur languages) and Aja, Akan, Efutu, Ewe, Fon, Foodo, Lelemi, Anii (Kwa languages). • 6 Chadic languages: Hausa, Tangale, Guruntum (West Chadic) and Bura, South Marghi, Tera (Central Chadic). • elicited with QUIS and language-specific additional tasks.
  • 28. Hausar Baka Corpus • by Randell, Bature and Schuh, 1998 • collection of videotaped dialogues • about 1500 Hausa sentences • annotated using LISA
  • 30. Hausa Internet Corpus • current project • in cooperation with another NLP project of the CRC • large amounts of Hausa material available on the internet • parallel sections: novel Ruwan Bagaja by Abubakar Imam, Bible and Qur’an sections, Declaration of Human Rights. • These parallel sections open the possibility of semiautomatic annotation: • POS annotation projection from English to Hausa • Projected annotation used to train tagger/chunker • Existing manual annotations used as a gold standard for evaluation
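The projection step named on the Hausa Internet Corpus slide can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function name `project_pos`, the coarse tag labels, and the hand-written word alignment are hypothetical and are not taken from the project's actual pipeline. The sentence pair is the Hausa example (14) from the slides with its English translation.

```python
from collections import Counter, defaultdict


def project_pos(src_tags, alignment, tgt_len, unknown="X"):
    """Copy POS tags from source tokens to aligned target tokens.

    src_tags  : list of POS tags, one per source (English) token
    alignment : iterable of (source_index, target_index) pairs
    tgt_len   : number of target (Hausa) tokens
    unknown   : tag given to target tokens without any alignment link
    """
    votes = defaultdict(Counter)
    for src_i, tgt_i in alignment:
        votes[tgt_i][src_tags[src_i]] += 1
    # Each target token takes the tag it received most often.
    return [
        votes[i].most_common(1)[0][0] if votes[i] else unknown
        for i in range(tgt_len)
    ]


# English side with illustrative coarse tags (assumed, not from the slides).
english = ["He", "stabbed", "him", "with", "a", "knife"]
english_tags = ["PRON", "VERB", "PRON", "ADP", "DET", "NOUN"]

# Hausa side of example (14).
hausa = ["Yaa", "sòokee", "shì", "dà", "wuƙaa"]

# Hand-written alignment (assumed); the English article has no Hausa counterpart.
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (5, 4)]

projected = project_pos(english_tags, alignment, len(hausa))
for token, tag in zip(hausa, projected):
    print(f"{token}\t{tag}")
```

In a setup like the one the slide describes, tags projected this way would serve as noisy training data for a tagger, and the manually annotated material would be held out for evaluation.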
  • 32. Research at the CRC [diagram with the boxes: Focus project; Gur and Kwa; Chadic languages; Elicitation with QUIS; Transcription/Annotation; ANNIS Database]
  • 33. Framework Architecture [figure]
  • 34. ANNIS • web-based corpus interface • query and visualization of annotations • (sequences of) tokens • trees (labeled edges, crossing edges) • pointing relations • nested, overlapping, conflicting, discontinuous • user management • authorized access • according to legal status of corpus
  • 35. Querying in ANNIS • ANNIS Query Language • graphical Query Builder (drag & drop) • basic concepts: nodes, relations between nodes
  • 36. ANNIS Query Language • nodes (sequentially numbered variables) • generalized category: tok (= any token), node (= any annotation) • regular expressions / exact expressions: pos=/ADJ[AD]/, pos=/P.*/, cat="NP" • relations between nodes • co-extension, overlapping, contained/adjacent span: lemma=/.*ing/ & pos="NN" & #1 = #2 • dominance (direct/indirect, left-/rightmost child, common parent, etc., including edge labels): cat="NP" & cat="PP" & #1 > #2
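To make the node-plus-relation idea concrete, the following toy sketch evaluates the co-extension query from the slide over a handful of in-memory annotations. It is not how ANNIS processes queries; the `Anno` class, `select`, and `coextensive` are hypothetical names, the sample annotations are invented, and the sketch assumes that a regular expression has to match the whole annotation value.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Anno:
    name: str   # annotation layer, e.g. "pos" or "lemma"
    value: str  # annotation value
    start: int  # index of the first covered token
    end: int    # index of the last covered token (inclusive)


def select(annos, name, pattern=None, exact=None):
    """Return annotations on layer `name` matching a regex or an exact value."""
    result = []
    for anno in annos:
        if anno.name != name:
            continue
        if exact is not None and anno.value != exact:
            continue
        # Assumption: the regex must cover the entire value.
        if pattern is not None and re.fullmatch(pattern, anno.value) is None:
            continue
        result.append(anno)
    return result


def coextensive(first, second):
    """Pair up annotations that cover exactly the same token span (#1 = #2)."""
    return [
        (a, b)
        for a in first
        for b in second
        if (a.start, a.end) == (b.start, b.end)
    ]


# Invented sample annotations: "the building ... sing".
annos = [
    Anno("lemma", "the", 0, 0), Anno("pos", "DT", 0, 0),
    Anno("lemma", "building", 1, 1), Anno("pos", "NN", 1, 1),
    Anno("lemma", "sing", 4, 4), Anno("pos", "VB", 4, 4),
]

# lemma=/.*ing/ & pos="NN" & #1 = #2
lemmas = select(annos, "lemma", pattern=r".*ing")  # node #1
nouns = select(annos, "pos", exact="NN")           # node #2
matches = coextensive(lemmas, nouns)
print(matches)
```

Both "building" and "sing" satisfy the regular expression, but only "building" shares its span with an "NN" annotation, so the relation between the two nodes is what narrows the result.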
  • 37. Query Processing [figure]
  • 43. Corpus Presentation • match count for quantitative studies • full Unicode support (diacritics, e.g. for tone) • visualization of annotations • tokens, spans • trees • pointing relations • rendering of audio files (embedded media player) • save and export facilities • ‘deep links’ for citation • export to tabular format ARFF (WEKA machine learning environment)
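As a rough illustration of what a tabular ARFF export looks like, the sketch below serializes a few token rows into the ARFF text format read by WEKA. It is not ANNIS's exporter: the function names `quote` and `to_arff`, the relation name, and the column names and values are assumptions chosen for the example (the tokens come from Hausa example (14), where the knife is the focused element).

```python
def quote(text):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation, rows, nominal):
    """Serialize rows (dicts with identical keys) as ARFF text.

    Columns listed in `nominal` are declared with their set of observed
    values; all other columns are declared as quoted string attributes.
    """
    columns = list(rows[0])
    lines = [f"@relation {relation}", ""]
    for col in columns:
        if col in nominal:
            values = sorted({row[col] for row in rows})
            lines.append(f"@attribute {col} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {col} string")
    lines += ["", "@data"]
    for row in rows:
        cells = [row[c] if c in nominal else quote(row[c]) for c in columns]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Illustrative rows; the tags and focus labels are assumed for the example.
rows = [
    {"tok": "Yaa", "pos": "PRON", "focus": "no"},
    {"tok": "sòokee", "pos": "VERB", "focus": "no"},
    {"tok": "shì", "pos": "PRON", "focus": "no"},
    {"tok": "dà", "pos": "ADP", "focus": "no"},
    {"tok": "wuƙaa", "pos": "NOUN", "focus": "yes"},
]

arff_text = to_arff("hausa_focus", rows, nominal={"pos", "focus"})
print(arff_text)
```

Declaring annotation layers as nominal attributes is what lets a learner treat them as categorical features, while the token text stays a free string.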
  • 45. Summary • Resources • deeply annotated • specialized in IS • tools allowing for query and evaluation • extend corpus studies • near-natural language • larger amounts of data • better understanding of IS