Linguistic Hacking
Martin.Haase@uni-bamberg.de
maha@jabber.ccc.de
How to know what a text in an unknown
language is about?
1
Contents
• how to identify the language of a written text
• in traditional ways,
• with the help of computer technology.
• how to get at least some information out of
an unknown text.
2
Hacking
“the intellectual challenge of creatively overcoming or
circumventing limitations.”
Eric Raymond (1996): The New Hackerʼs Dictionary.
3
Spoken texts?
multi-language corpus of telephone calls
4
Writing Systems
• Roman (thousands of languages)
• Cyrillic (> 60 languages)
• Arabic (> 20 languages)
• Devanāgarī (> 10 languages, not
counting derivative writing systems)
• Hebrew (~ 3 – 5 languages)
5
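As a rough illustration of the first step (narrowing the candidates by writing system), the sketch below guesses the script from Unicode character names. The function name `dominant_script` is hypothetical, and taking the first word of the character name is only a heuristic, not the official Unicode Script property.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Guess the writing system from the first word of each letter's Unicode name."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()  # skip digits, punctuation, spaces and combining marks
    )
    # Fall back to "UNKNOWN" when the text contains no letters at all.
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(dominant_script("Linguistic Hacking"))  # LATIN
print(dominant_script("Привет, мир"))         # CYRILLIC
print(dominant_script("देवनागरी"))              # DEVANAGARI
```

Knowing the script only shrinks the search space: a Roman-script text could still be in any of thousands of languages, while a Hebrew-script text leaves only a handful.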
Devanāgarī
देवनागरी
6
Hebrew
• Old and Modern Hebrew,
• Ladino (with different varieties),
• Judeo-Arabic,
• Yiddish.
11
Norman C. Ingle (1980):
Language Identification
Table. London: Technical
Translation International.
14
Computer-aided
identification
• frequencies of unique characters and
character strings
• recognition of common words
• n-gram analysis
16
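A minimal sketch of the common-words approach, assuming tiny hand-picked word lists. `COMMON_WORDS` and `guess_by_common_words` are hypothetical names, and a real system would derive much longer lists from corpora.

```python
import re

# Hypothetical, hand-picked function-word lists (assumption, not from the talk).
COMMON_WORDS = {
    "English": {"the", "and", "of", "to", "in", "is", "that"},
    "German": {"der", "die", "und", "das", "ist", "nicht", "ein"},
    "French": {"le", "la", "et", "les", "des", "est", "une"},
}

def guess_by_common_words(text: str) -> str:
    """Return the language whose common words occur most often in the text."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in COMMON_WORDS.items()
    }
    # On a tie (including all zeros) the first language in the dict wins.
    return max(scores, key=scores.get)

print(guess_by_common_words("Das ist nicht die Sprache, die der Autor spricht."))
```

The approach is cheap but needs a reasonably long text: a short snippet may contain none of the listed words at all.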
“Text”
17
( TE), (TEX), (EXT), (XT )
18
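The trigrams above can be reproduced with a few lines of code. This is a sketch that assumes the word is upper-cased and padded with one space on each side, as the slide suggests; `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word: str, n: int = 3, pad: str = " ") -> list[str]:
    """Return all character n-grams of the padded, upper-cased word."""
    padded = f"{pad}{word.upper()}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("Text"))  # [' TE', 'TEX', 'EXT', 'XT ']
```

Counting such n-grams over a whole text gives a frequency profile that can be compared with profiles of known languages.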
variant of the unique character string approach
20
compression efficiency
21
reference model + text in language to be identified → gzip → compression efficiency
25
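The compression idea can be sketched as follows: append the unknown text to a reference text in each candidate language, compress, and see which reference makes the addition cheapest. This sketch uses Python's `zlib` module (DEFLATE, the same algorithm gzip uses). The names `REFERENCES`, `extra_bytes` and `guess_by_compression` and the two toy reference texts are assumptions for illustration; realistic reference models would be far larger.

```python
import zlib

# Toy reference texts (assumption); real reference models need much more text.
REFERENCES = {
    "English": (
        "The quick brown fox jumps over the lazy dog. Language identification "
        "is the task of determining which natural language a given text is "
        "written in. "
    ),
    "German": (
        "Der schnelle braune Fuchs springt über den faulen Hund. "
        "Spracherkennung ist die Aufgabe, festzustellen, in welcher "
        "natürlichen Sprache ein gegebener Text geschrieben ist. "
    ),
}

def extra_bytes(reference: str, sample: str) -> int:
    """Compressed size of reference+sample minus compressed size of the reference."""
    ref = reference.encode("utf-8")
    both = ref + sample.encode("utf-8")
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))

def guess_by_compression(sample: str, references: dict[str, str]) -> str:
    """Pick the language whose reference text lets the sample compress best."""
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))

print(guess_by_compression("Which language is this text written in?", REFERENCES))
```

The fewer extra bytes the sample costs, the more character sequences it shares with the reference, which is why the slides call this a variant of the character string approach.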
Interesting
applications
• measuring linguistic difference → language families
• determining types of text
• spam detection?
26
• TextCat (http://odur.let.rug.nl/vannoord/TextCat/Demo/), n-gram based, 76 languages, usable as a web application,
• Languid (http://languid.cantbedone.org/), downloadable program, web application not running,
• Langid (http://complingone.georgetown.edu/~langid/), n-gram based, 65 languages, web application,
• LanguageGuesser (http://www.xrce.xerox.com/cgi-bin/mltt/LanguageGuesser), frequency tests on characters and character sequences, about 40 languages, web application,
• Polyglot 3000 (http://www.polyglot3000.com/), corpora, method unknown, currently 441 languages, closed-source Windows freeware. :-(
27
approaching “content analysis”
28
Hackerʼs approach
• numbers, dates, words from another
language
• typographic hints:
• bold or italic print,
• colored or underlined text chunks,
• capital letters
29
Zipfʼs law
Very frequent words tend to be shorter and carry less
lexical information, whereas infrequent words tend to be
longer and carry more lexical information.
30
less lexical information implies more grammatical
information and vice versa
31
most interesting for us:
words with more specific lexical information
32
Ignore all short words!
(even if they recur throughout the text)
33
Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘i
gagana, ‘ua ‘avea ma gagana lona lua a le tele o
tagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai le
manatu, ‘o le gagana fa‘aperetania, ‘ua matuā talitonu
i ai le tele o tagata Samoa e fa‘apea ‘o le gagana e
maua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o i
latou, ‘e lē aoga la latou gagana. E lē sa‘o lea tāofi,
‘auā e ‘avatu le gagana fa‘aperetania i Samoa, ‘ua
leva ona atamamai ma popoto tagata Samoa e fai lo
leva ona atamamai ma popoto tagata Samoa e fai lo
latou soifua ma lo latou lalolagi.
34
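The "ignore all short words" rule can be tried on the sample text with a short script. This is a sketch under assumptions: the length threshold of 5 characters is arbitrary (the slides give no number), and `SAMPLE` and `keyword_counts` are hypothetical names.

```python
from collections import Counter

SAMPLE = """Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘i
gagana, ‘ua ‘avea ma gagana lona lua a le tele o
tagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai le
manatu, ‘o le gagana fa‘aperetania, ‘ua matuā talitonu
i ai le tele o tagata Samoa e fa‘apea ‘o le gagana e
maua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o i
latou, ‘e lē aoga la latou gagana. E lē sa‘o lea tāofi,
‘auā e ‘avatu le gagana fa‘aperetania i Samoa, ‘ua
leva ona atamamai ma popoto tagata Samoa e fai lo
latou soifua ma lo latou lalolagi."""

def keyword_counts(text: str, min_len: int = 5) -> Counter:
    """Count words of at least min_len characters; shorter words are ignored."""
    # Strip surrounding punctuation and quote marks, keep word-internal ones.
    words = (w.strip(".,;:!?‘’'\"").lower() for w in text.split())
    return Counter(w for w in words if len(w) >= min_len)

# Print the five most frequent long words with their counts.
for word, count in keyword_counts(SAMPLE).most_common(5):
    print(count, word)
```

The most frequent long word in this text is "gagana" (7 occurrences), and the proper names "Samoa" and "Pasefika" survive the filter as well, which already hints at what the text is about.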

Martin Haase: Linguistic Hacking [24c3]
