               Compiling Apertium dictionaries with HFST
    leveraging generalised compilation formulas to get more and better end
                 applications with less language description

                                  Tommi A Pirinen, Francis Tyers
                                  tommi.pirinen@helsinki.fi

                                 University of Helsinki, Universitat d’Alacant


                                              May 22, 2012




    Tommi A Pirinen (Helsinki)        Compiling apertium monodix with HFST           May 22, 2012       1/8
Outline




1    Introduction
2    Benefits of this work
3    Conclusion




Finite-state automata, HFST and apertium

     Finite-state automata are an efficient way to encode dictionaries,
     morphological analysers etc.
     HFST stands for Helsinki Finite-State Technology. It consists of a
     library that works as a compatibility layer between different
     open-source finite-state implementations:
            SFST
            OpenFST
            Foma
     It also includes a set of finite-state tools built on top of the
     library, and a set of end products that use the automata in
     real-world applications (sold separately)
     HFST is still a research project in a computational linguistics
     research group, not a computer science or engineering one
     apertium is a machine-translation platform that uses finite-state
     dictionaries
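
To make the first point above concrete, here is a minimal sketch of a word list stored as an acyclic deterministic automaton (a trie). It is plain Python for illustration only and does not use the HFST library; the names `build_trie`, `accepts` and `FINAL` are made up for this example. Real HFST automata are transducers that also map surface forms to analyses and are minimised, which this sketch does not attempt.

```python
# Marker key for "this state is final"; it can never collide with a real
# symbol because iterating over a string never yields the empty string.
FINAL = ""


def build_trie(words):
    """Build a trie where each state is a dict from symbol to next state."""
    root = {}
    for word in words:
        state = root
        for symbol in word:
            # Reuse the existing transition if there is one, so common
            # prefixes are stored only once.
            state = state.setdefault(symbol, {})
        state[FINAL] = True
    return root


def accepts(trie, word):
    """Return True if the automaton accepts the word."""
    state = trie
    for symbol in word:
        if symbol not in state:
            return False
        state = state[symbol]
    return FINAL in state


lexicon = build_trie(["cat", "cats", "car"])
print(accepts(lexicon, "cats"))  # a word in the list
print(accepts(lexicon, "ca"))    # only a prefix, not a word
```

Lookup time depends on the length of the word, not on the size of the dictionary, which is one reason the encoding is efficient.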
Compiling apertium dictionaries with HFST—rationale

     “just an engineering exercise”
     getting all language descriptions to compile natively in HFST (as
     opposed to converting compiled automata)
     using existing (and future) HFST algorithms to improve the resulting
     automata
     using bits of linguistic information to get better auxiliary automata for
     HFST end applications: data that may not be possible to induce
     from converted compiled automata
     possibility to integrate more complex features of finite-state
     morphology in apertium dictionaries (morphophonetics, reduplication
     etc.) that may be supported by other HFST tools
     this paper fits nicely in my PhD thesis under “State of the art in
     language models”

Examples of immediate benefits to dictionary writers




     A lot of current work in building NLP software involves management
     of huge amounts of lexical data
     ...like generating different language models in different morphology
     programming formalisms: apertium, hunspell, Xerox tools
     getting native and uniform compilation formulas for all of them lets
     you write dictionaries once and use them everywhere
     or pick and mix tools and features from different formalisms




Examples of additional applications that can be generated
from apertium dictionaries with this work



     Spell-checkers! A basic spell-checker with a generic edit-distance
     suggestion generator can be generated automatically, and used in
     the majority of current open-source software without any extra effort
     Predictive text entry for mobiles, such as T9 and XT9, and possibly
     Swype and keyboards as well
     Morphological analysers, lemmatisers, segmenters, tokenisers, etc.,
     obviously
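
As a rough illustration of what a generic edit-distance suggestion generator does, the sketch below proposes every dictionary word within one edit (deletion, transposition, substitution or insertion) of a misspelling. It works on plain Python sets rather than automata and is not the HFST implementation; the alphabet, the toy lexicon and the function names are assumptions made for the example.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for the sketch


def edits1(word, alphabet=ALPHABET):
    """Return all strings exactly one edit operation away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, lexicon):
    """Return the word itself if it is known, else known words one edit away."""
    if word in lexicon:
        return [word]
    return sorted(edits1(word) & lexicon)


lexicon = {"cat", "cart", "cot", "dog"}  # toy stand-in for a compiled dictionary
print(suggest("cta", lexicon))   # transposition of "cat"
print(suggest("caat", lexicon))  # one deletion or one substitution away
```

The only language-specific inputs are the word list and the alphabet, and both can be read off the dictionary, which is why such a spell-checker needs no extra effort from the dictionary writer.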




Examples of benefits that come for free—automatic
optimisation


     depending on the library / end format you choose for compiled
     dictionaries, you get speed–space tradeoffs (or improvements in both)
     This is work in progress, but once done it can be used in all
     dictionaries without modifications to sources
     automatic flag diacritic induction
     hyperminimisation
     all this can be based on things like finding homomorphic components
     in the finite-state automaton
     the linguistic concepts present in source code but missing from the
     compiled automaton should prove very useful here!


What now?

     The reference material for the article is in our svn,
     http://hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium,
     which includes compilation of spell-checkers for most apertium dictionaries
     what do we do to remove duplicate work, duplicate versions of
     dictionaries, conversion scripts...
     more compilers? Conversion scripts? New programming languages?
     New “standards” that everyone will use?
     I’ll throw you this: I need more linguistic data and less engineering in
     the language model implementations to compile more applications
     from one source dictionary. Example: the LR/RL concept in apertium and
     asymmetric flags in Xerox FSM are engineering hacks from this point of
     view; had the description called the entry a substandard or dialectal
     word form, it would already be usable in all applications!
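
The LR/RL point can be seen in a small sketch. The dictionary fragment below is a simplified, hypothetical monodix excerpt (a real file also declares an alphabet, symbol definitions and paradigms), and the helper names are made up for this example. As far as I understand it, an entry marked `r="LR"` is used only for analysis and one marked `r="RL"` only for generation.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical monodix fragment: "color" is analysed as
# "colour" but is never generated.
SAMPLE_DIX = """\
<dictionary>
  <section id="main" type="standard">
    <e><p><l>colour</l><r>colour<s n="n"/></r></p></e>
    <e r="LR"><p><l>color</l><r>colour<s n="n"/></r></p></e>
  </section>
</dictionary>
"""


def surface_forms(dix_text):
    """Return (surface form, restriction) pairs; restriction is 'LR', 'RL' or None."""
    root = ET.fromstring(dix_text)
    forms = []
    for entry in root.iter("e"):
        left = entry.find("p/l")
        if left is not None:
            forms.append(("".join(left.itertext()), entry.get("r")))
    return forms


def analysable(forms):
    """Surface forms the analyser accepts: unrestricted or LR entries."""
    return [form for form, restriction in forms if restriction in (None, "LR")]


def generable(forms):
    """Surface forms the generator may produce: unrestricted or RL entries."""
    return [form for form, restriction in forms if restriction in (None, "RL")]


forms = surface_forms(SAMPLE_DIX)
print(analysable(forms))
print(generable(forms))
```

The direction flag says how the translation pipeline should treat "color", but not why. A spell-checker built from the same source cannot tell whether to accept the form, flag it, or suggest "colour" instead, which a label such as "dialectal" or "substandard" would make explicit.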


Más contenido relacionado

La actualidad más candente

Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traCamillaTonanzi
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...ijnlc
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translationRushdi Shams
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...John Tinsley
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translationMarcis Pinnis
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translationguest873a50
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine TranslationJaganadh Gopinadhan
 
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...ijnlc
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processingATHMAN HAJ-HAMOU
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficultiesijtsrd
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...Editor IJARCET
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Jaganadh Gopinadhan
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmMeetupDataScienceRoma
 

La actualidad más candente (18)

Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_tra
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
 
Machine translation
Machine translationMachine translation
Machine translation
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translation
 
Chat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian languageChat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian language
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
 
Using ontology for natural language processing
Using ontology for natural language processingUsing ontology for natural language processing
Using ontology for natural language processing
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic
 
Machine Translation
Machine TranslationMachine Translation
Machine Translation
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigm
 

Destacado

Lezione corso running
Lezione corso runningLezione corso running
Lezione corso runningReti
 
Prepostales
PrepostalesPrepostales
Prepostalesyinalis
 
2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiaridosmtpinov
 
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облакаЭволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облакаSQALab
 
Meetingin finland
Meetingin finlandMeetingin finland
Meetingin finlandanglimo
 

Destacado (8)

Power To People
Power To PeoplePower To People
Power To People
 
Lezione corso running
Lezione corso runningLezione corso running
Lezione corso running
 
Prepostales
PrepostalesPrepostales
Prepostales
 
2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido
 
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облакаЭволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
 
Meetingin finland
Meetingin finlandMeetingin finland
Meetingin finland
 
Cactus
CactusCactus
Cactus
 
Castilla La Mancha
Castilla La ManchaCastilla La Mancha
Castilla La Mancha
 

Similar a Compiling Apertium Dictionaries with HFST

G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageijnlc
 
MOLTO Annual Report 2011
MOLTO Annual Report 2011MOLTO Annual Report 2011
MOLTO Annual Report 2011Olga Caprotti
 
Lexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engineLexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engineYiannis Hatzopoulos
 
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and ExtensibilityThe Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and ExtensibilityChristoph Lange
 
A proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systemsA proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systemsMahara Hui
 
MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.Olga Caprotti
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)kevig
 
International Journal on Natural Language Computing(IJNLC)
 International Journal on Natural Language Computing(IJNLC) International Journal on Natural Language Computing(IJNLC)
International Journal on Natural Language Computing(IJNLC)kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)kevig
 
Call for papers - International Journal on Natural Language Computing (IJNLC)
Call for papers - International Journal on Natural Language Computing (IJNLC)Call for papers - International Journal on Natural Language Computing (IJNLC)
Call for papers - International Journal on Natural Language Computing (IJNLC)kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)kevig
 

Similar a Compiling Apertium Dictionaries with HFST (20)

Tlf2016
Tlf2016Tlf2016
Tlf2016
 
Lfnw2016
Lfnw2016Lfnw2016
Lfnw2016
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian language
 
MOLTO Annual Report 2011
MOLTO Annual Report 2011MOLTO Annual Report 2011
MOLTO Annual Report 2011
 
Lexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engineLexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engine
 
Olf2016
Olf2016Olf2016
Olf2016
 
Concordances
Concordances Concordances
Concordances
 
Lit mtap
Lit mtapLit mtap
Lit mtap
 
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and ExtensibilityThe Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
 
A proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systemsA proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systems
 
Pandoc: a universal document converter
Pandoc: a universal document converterPandoc: a universal document converter
Pandoc: a universal document converter
 
MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing(IJNLC)
 International Journal on Natural Language Computing(IJNLC) International Journal on Natural Language Computing(IJNLC)
International Journal on Natural Language Computing(IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
Call for papers - International Journal on Natural Language Computing (IJNLC)
Call for papers - International Journal on Natural Language Computing (IJNLC)Call for papers - International Journal on Natural Language Computing (IJNLC)
Call for papers - International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 

Más de Guy De Pauw

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Guy De Pauw
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Guy De Pauw
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingGuy De Pauw
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageGuy De Pauw
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguageGuy De Pauw
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)Guy De Pauw
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Guy De Pauw
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusGuy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of SantomeGuy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Guy De Pauw
 

Más de Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 

Compiling Apertium Dictionaries with HFST

  • 1. Compiling Apertium dictionaries with HFST: leveraging generalised compilation formulas to get more and better end applications with fewer language descriptions
      Tommi A Pirinen, Francis Tyers (tommi.pirinen@helsinki.fi)
      University of Helsinki, Universitat d’Alacant
      May 22, 2012
      Tommi A Pirinen (Helsinki), Compiling apertium monodix with HFST, May 22, 2012
  • 2. Outline
      1. Introduction
      2. Benefits of this work
      3. Conclusion
  • 3. Finite-state automata and HFST and apertium
      - Finite-state automata are one efficient way to encode dictionaries, morphological analysers, etc.
      - HFST stands for Helsinki Finite-State Technology. It consists of:
          - a library working as a compatibility layer between different open-source finite-state implementations: SFST, OpenFST, Foma
          - a set of finite-state tools built on top of the library
          - a set of end products using the automata in real-world applications (sold separately)
      - HFST is still a research project in a computational linguistics research group, not computer science or engineering
      - apertium is a machine-translation platform that uses finite-state dictionaries
  • 4. Compiling apertium dictionaries with HFST: rationale
      - “just an engineering exercise”
      - getting all language descriptions to compile natively in HFST (as opposed to converting compiled automata)
      - using existing (and future) HFST algorithms to improve the resulting automata
      - using bits of linguistic information to get better auxiliary automata for HFST end applications: data that may not be possible to induce from converted compiled automata
      - possibility to integrate more complex features of finite-state morphology in apertium dictionaries (morphophonetics, reduplication, etc.) that may be supported by other HFST tools
      - this paper fits nicely in my PhD thesis under “State of the art in language models”
  • 5. Examples of immediate benefits to dictionary writers
      - A lot of current work in building NLP software involves management of huge amounts of lexical data
      - ...like generating different language models in different morphology programming formalisms: apertium, hunspell, Xerox tools
      - getting native and uniform compilation formulas for all of them lets you write dictionaries once and use them everywhere
      - or pick and mix tools and features from different formalisms
  • 6. Examples of additional applications that can be generated from apertium dictionaries with this work
      - Spell-checkers! A basic spell-checker with a generic edit-distance suggestion generator can be generated automatically, and used in the majority of current open-source software without any extra effort
      - Predictive text entry for mobiles, such as T9, XT9, possibly Swype and keyboards as well
      - Morphological analysers, lemmatisers, segmenters, tokenisers, etc., obviously
  • 7. Examples of benefits that come for free: automatic optimisation
      - depending on the library / end format you choose for compiled dictionaries, you get speed–space tradeoffs (or improvements in both)
      - This is work in progress, but once done it can be used in all dictionaries without modifications to sources:
          - automatic flag diacritic induction
          - hyperminimisation
      - all this can be based on things like finding homomorphic components from the finite-state automaton
      - the linguistic concepts present in source code but missing from the compiled automaton should prove very useful here!
  • 8. What now?
      - The reference material for the article is in our svn, http://hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium, and includes compilation of spell-checkers for most apertium dictionaries
      - what do we do to remove duplicate work, duplicate versions of dictionaries, conversion scripts... more compilers? Conversion scripts? New programming languages? New “standards” that everyone will use?
      - I’ll throw you this: I need more linguistic data and less engineering in the language model implementations to compile more applications from one source dictionary.
      - Example: the LR/RL concept in apertium, or asymmetric flags in Xerox FSM, is an engineering-hack point of view; had the description called it a substandard or dialectal word form, it would already be usable in all applications!
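
The "basic spell-checker with generic edit distance suggestion generator" mentioned on slide 6 can be illustrated with a minimal sketch. This is not the HFST implementation and does not call any HFST API: a plain Python set stands in for the compiled dictionary automaton, and the names `LEXICON`, `edits1` and `suggest` are hypothetical, chosen only for this example. It assumes a lowercase ASCII alphabet and a maximum edit distance of 1 (insertion, deletion, substitution, transposition).

```python
import string

# Hypothetical toy lexicon; in the setting of the slides this role would be
# played by an automaton compiled from an apertium dictionary.
LEXICON = {"cat", "cats", "car", "cart", "dog", "dogs"}


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings reachable from `word` with at most one edit
    (deletion, transposition, substitution or insertion)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    # Substituting a letter with itself yields the word unchanged, which is
    # why the result covers distance 0 as well as distance 1.
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in alphabet
    ]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, lexicon=LEXICON):
    """Return [word] if it is in the lexicon, otherwise a sorted list of
    lexicon words within one edit of it."""
    if word in lexicon:
        return [word]
    return sorted(edits1(word) & lexicon)


if __name__ == "__main__":
    for query in ("cat", "cst", "dgo", "ca", "xyz"):
        print(query, "->", suggest(query))
```

A finite-state spell-checker would express the same idea with automata rather than enumerating candidate strings: an error model applied to the dictionary automaton. The set-based version above only shows the logic of generic edit-distance suggestions.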