Human Language Technologies for Ethiopian Languages: Challenges and Future Directions
1. Human Language Technologies for Ethiopian Languages: Challenges and Future Directions
Solomon Teferra Abate, Binyam Ephrem, Enchalew Yifru, Kassa Tilahun,
Lemlem Hagos, Mohammed-hussen Abubeker and Taye Girma
LIG, Université Joseph Fourier (UJF)
ITPhD Program, Addis Ababa University
solomon_teferra_7@yahoo.com
AGIS'11 Conference, Addis Ababa
2. Outline
● Ethiopian Languages
● Human Language Technology (HLT)
– Role in Development
– HLT in the World
● HLT for Ethiopian Languages
– Language and Technology Coverage
– Challenges and limitations
– Future Directions and Strategies
3. Ethiopian Languages
● There are about 90 languages
● Most belong to the Afro-Asiatic language family
● Amharic, Afan Oromo and Tigrinya are the 3 most widely spoken
● Amharic is the federal working language
– Regions have their own working languages
– The language policy states that everyone has the right to learn in
his/her mother tongue
– More than 20 languages are media of instruction (MOI) in primary
school (cycles I and II)
4. Human Language Technology
● An interdisciplinary field that draws on most subdisciplines of
linguistics, computational linguistics, natural language processing,
computer science, artificial intelligence, psychology, philosophy,
mathematics and statistics
● Covers areas like:
✔ Morphological analysis/synthesis,
✔ Stemming,
✔ ASR,
✔ Information Extraction,
✔ MT,
✔ TTS,
✔ Text/document categorization,
✔ OCR,
✔ POS tagging,
✔ Spelling and Grammar checking,
✔ Parsing,
✔ etc.
5. Human Language Technology - Role
● Enables ICT products to have knowledge of human language
● Increases the acceptance of the technology and the
productivity of its users in the information age
● Helps people collaborate, conduct business, share knowledge
and participate in social and political debates regardless of
language barriers or computer skills
● Gives disadvantaged groups access to information:
✔ the illiterate,
✔ the rural poor,
✔ the physically impaired population
6. HLT in the World
● Well developed for only a few of the world's languages, such as English
● IBM Watson Computer
● Passed its first test by winning a QA competition with a $1 M prize
● Its design goal is an intelligent computer that can interact in
natural language
✔ Understand any question asked in natural speech
✔ Answer questions as humans do
● Uses a number of HLT modules such as: ASR, QA, TTS
✗ Requires a lot of expensive servers (about $1 billion in total)
7. HLT in the World
● Siri is a simple iPhone-based system that:
● Receives commands in natural speech
● Sends messages
● Schedules meetings
● Places phone calls
● Siri has been claimed to:
● understand what you say
● know what you mean
● speak back in natural speech
8. HLT in the World: Europe
● Europe is a continent united into one multilingual economic union
with 23 official languages
● To technologically enable the European languages, the European Union:
✔ Invested over €130 M to promote language technologies
and language resource infrastructures in 2009-2011
✔ Allocated €35 M for SME action on Digital Content and
Languages and €50 M for Language Technologies in its
Work Program 2011-2012
✔ Proposed a simple platform that enables availability of any
online content and services in all European languages
9. HLT in the World: South Africa
● The South African government has identified HLT as a priority area
for technologically enabling its 11 official languages
➢ Various R&D projects and initiatives have been funded by
government through:
● Department of Arts and Culture (DAC),
● Department of Science and Technology (DST), and
● National Research Foundation (NRF)
● The key challenge is fragmentation of R&D activities in HLT
● Addressed by the South African HLT Audit (SAHLTA)
10. HLT for Ethiopian Languages
● Research on HLT for Ethiopian languages started in the 1990s
✔ There are now more than 200 encouraging and valuable works on:
➢ ASR,
➢ MT,
➢ Text-to-speech,
➢ OCR,
➢ Thesaurus construction,
➢ Stemming,
➢ Parsing,
➢ POS tagging,
➢ Spell checking,
➢ Text classification,
➢ Text categorization,
➢ Morphological analysis,
➢ Information Extraction
✗ Most of them are based on language resources (LRs) developed for
the experiment
11. HLT for Ethiopian Languages
✗ HLT research covers a limited number of Ethiopian languages
[Figure: HLT for Ethiopian Languages (Masters theses). Bar chart; x-axis: Languages (Amharic, Afan Oromo, Tigrinya, Welayta, Ge'ez, Sidama); y-axis: 0 to 25; legend (Research Areas): NLP, Speech Processing, OCR, CSE]
12. Challenges and Limitations
● Challenges that hinder Ethiopian HLT include:
– lack of language resources: speech and text corpora
– lack of standardized evaluation corpora and platforms
– lack of expertise in both language and technology
– time shortage
● research is done only for academic achievement within the given time
– absence of a national HLT research plan (HLT road-map)
● research is based only on individuals' interests
– lack of sustainable and coordinated research funding
13. Challenges and Limitations
➔ Existing research works have limitations:
– use of insufficient and low-quality language resources
➢ research results are not conclusive
– research results are not well evaluated, analyzed and
documented
➢ their achievements and gaps are unclear
– research attempts in HLT are fragmented
➢ lack of integration, consolidation and continuity
● Tokenizer → POS → Parser → LA → ASR/MT
14. Future Directions and Strategies
● Is there any other way to escape the cost of the language barrier,
or to cover it, without HLT in the information age? NO!!!
● Are we rich enough to keep spending only on academic
exercises? NO!!!
– 6 months of work every year by at least 10 research students doing
their theses in one of the HLT areas, plus their supervisors
– 3 years of work by at least 6 PhD research students (admitted every
year), plus their research supervisors
– The time of academic researchers doing research for publication
purposes (for academic promotion)
15. Future Directions and Strategies
● Give emphasis and recognition to R&D activities in HLT
● Develop a national HLT road-map (HLT Audit)
– Shows research priorities
– Avoids duplication (even across languages)
– Reduces R&D cost
– Provides a means of evaluation/assessment
– Enforces consolidation, integration and continuity
– Inspires researchers and developers
– Shows the benefit areas for the HLT industry
16. Future Directions and Strategies
● Establish Institutional/National R&D units
– Fund, coordinate and evaluate R&D projects
– Store, maintain, distribute language resources and R&D
outputs
– Promote the utility of R&D outputs
– Coordinate and support private industries
– Coordinate cooperation between academia and industry
– Promote/attract international investment in HLT industries
17. Conclusion
● We have 85 living languages
● All have speakers who need information and have the right
to get it in a language and a form they understand
– HLT is the way to realize it
● We need to have a strategy to put it in place
– Cooperation across:
● Time: past->present->future
● Language
● Research area
● Sector: academic<->industry
18. We can make it
BY