The Rosetta Project:
Building a 10,000 Year Library
  of All Human Language
A bit of background…
The 10,000 Year Clock

               “I want to build a clock that ticks
               once a year. The century hand
               advances once every one hundred
               years, and the cuckoo comes out on
               the millennium.”

Danny Hillis
Prototype 1
Clock Mountain
The 10,000 Year Library
                “The Clock dramatizes the scope of
                historic time past and to come but
                offers no content. The Library is all
                content, especially past content with
                future significance…The value could
                lie in providing civilizations with a
                wisdom line: slow, robust, apparently
                inefficient.”
                         - from “Clock/Library” in The Clock of the Long Now
Stewart Brand
Library Projects: 
A Responsibility Record
Library Projects:
   All Species
Library Projects: 
             Time & Bits




…We risk creating a Digital Dark Age – a void in the continuity
of cultural record – because the formats and hardware which we
entrust with our data are unlikely to outlast even the next ten
years, much less our own lives.	

                                                     - Danny Hillis
Strategies That Improve
     Data Longevity
•  For starters, expand your scope: aim for at least
   500 years (can you do better than paper?)

•  Use it or lose it – unused data dies

•  Provide access – promotes use, reuse, LOCKSS
   (Lots of Copies Keep Stuff Safe)

•  Consider saving everything (e.g. Internet Archive)

•  Move it or lose it (“Movage”)

•  Consider atoms over bits (analog)
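
As a rough illustration of "Move it or lose it" and the many-copies idea behind LOCKSS, the sketch below copies a file to a new location and confirms the copy is bit-identical by comparing SHA-256 digests. The function names `sha256_of` and `migrate_copy` are hypothetical, chosen for this example, and this is only one possible way to check a migration, not a description of how Long Now or the Internet Archive actually does it.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_copy(source, destination):
    """Copy `source` to `destination` and verify the copy.

    Hypothetical helper: returns the shared digest on success and
    raises OSError if the copy does not match the original.
    """
    source, destination = Path(source), Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    before = sha256_of(source)
    shutil.copy2(source, destination)  # copies content and basic metadata
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"copy of {source} does not match the original")
    return after
```

Recording the returned digest alongside each copy would let a later migration confirm that nothing changed in between.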
Library Projects:
   Long Server
Library Projects:
           The Rosetta Project
•  Thousands of years ago we stored
   information on stone tablets –
   some of these are still around.

•  Hundreds of years ago we stored
   information in books – print on
   acid-free paper can reliably be
   preserved for 500 years.

•  Now we store information
   digitally, using hardware, software
   and encodings that are highly
   ephemeral.
The Rosetta Disk
(One Possible Solution)
Microscopic Analog 
   Data Storage
Microetched Pages
Human Eye Readable Side
Parallel Content in
           Multiple languages
•  The Rosetta Stone includes a
   decree establishing the divine cult of King
   Ptolemy V, carved in 196 BC

•  Same text written in three different
   forms: Egyptian Hieroglyphs,
   Demotic (Early Egyptian Script
   preceding Coptic), and Ancient
   Greek

•  Working back from the Greek and
   somewhat known Demotic, scholars were
   able to decipher the Hieroglyphs –
   thereby unlocking records of an
   entire ancient civilization
Rosetta Disk Goal - Parallel
 content for all languages
Vocabulary                     Maps
Sound Structure                Writing Systems
Word and Sentence Structure    Ethnographic Information
Parallel texts                 Numbering Systems
Other texts                    Color Systems
Building the Collection
    Book Scanning
Building the Collection:
  Swadesh Wordlists
Audio Digitization
Google Earth Interface
“Born Digital” Materials




      Endangered Language Documentation Project
6 First Edition Disks
•  Brewster Kahle, Internet Archive	


•  Charles Butcher, Lazy 8 Foundation –
   now in the permanent special collection
   of the University of Colorado Boulder
   Library	


•  William Lidwell, author of Universal
   Principles of Design	


•  Oliver Wilke – Oliver Wilke Stiftung für
   Sprachen	


•  One is held by an anonymous donor, and
   one is in the Long Now Museum
02004 Rosetta
European Space
Agency Mission
Rosetta Disk
           Museum Edition




In August 02009 we presented
the prototype of the Rosetta
Disk Museum Edition to
Secretary Wayne Clough for
the Smithsonian.
Endangered Languages


“The coming century will see either the death
or doom of 90% of mankind’s languages”
                                - Michael Krauss
Top Ten languages by
Native Speakers (Millions)
  Mandarin
  Spanish
  English
  Bengali
  Hindi
  Portuguese
  Russian
  Japanese
  German
  Javanese

  [Bar chart; horizontal axis: native speakers in millions, 0 to 900]
    Data: The Ethnologue (02009), available at www.ethnologue.com
Language Distribution
  Half the world population speaks one of 10 languages (1%)
  Most everyone else speaks one of 300 languages (4%)
  5% of the world speaks one of 6,500 languages (95%)

  [Chart; vertical axis ticks: 1 Billion, 100 Million, 10 Thousand; horizontal axis: Number of Languages]
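
The percentages on this slide can be recomputed from its own rounded counts. The sketch below is a minimal check that takes the three tiers (10, 300 and 6,500 languages) at face value. The 45% speaker share for the middle tier is an assumption, filled in as what remains after "half" and "5%", and the names `tiers` and `language_share` are made up for this example.

```python
# Rounded figures taken from the slide; not fresh census data.
# (tier name, number of languages, share of world population)
tiers = [
    ("top 10", 10, 0.50),
    ("next 300", 300, 0.45),   # assumption: remainder after 50% and 5%
    ("long tail", 6500, 0.05),
]

total_languages = sum(count for _, count, _ in tiers)


def language_share(count, total=total_languages):
    """Percentage of all languages that fall in a tier of `count` languages."""
    return 100.0 * count / total


for name, count, speakers in tiers:
    print(f"{name}: {language_share(count):.1f}% of languages, "
          f"{speakers:.0%} of speakers")
```

With these inputs the middle and long-tail tiers come out close to the slide's 4% and 95%, while the top tier works out to roughly 0.1% of languages, so the slide's "1%" reads as a generous rounding.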
Why does it matter?
Languages are...




   Great Works of Art
Languages are...




    Great Libraries
Languages are “How to” guides
  for Living on Planet Earth
Languages Provide 
a window into our minds
Freedom of Language - 
 an inalienable human right
Individually you have:	

•  The right to be recognized as a member of a language
   community	

•  The right to use your language in private and in public	

•  The right to use your own name	

•  The right to interrelate and associate with your native
   speech community	

•  The right to maintain and develop your own culture
Freedom of Language - 
an inalienable human right
Collectively your speech community has:	

•  The right for your own language and culture to be taught	

•  The right of access to cultural services	

•  The right to an equitable presence of your language and
   culture in the communications media	

•  The right to receive attention in your own language from
   government bodies and in socioeconomic relations 	


 From the Universal Declaration on Linguistic Rights, Barcelona, June 1996
Rosetta Project:
Long Now, Here & Now
Open Digital Collection on
  All Human Languages
Rosetta Special Collection 
  In the Internet Archive
Rosetta Language Base –
Linguistic Metastructure
             •  Freebase: over 10,000
                languages and linguistic
                entities linked by language
                family relationship	


             •  All data is linked to other
                kinds of data in Freebase	


             •  We have reconciled ~1,500
                Wikipedia pages about human
                languages with our data set
Rosetta Prototype Wiki
New Initiative
The Language Commons
The Language Commons
    Working Group
Language Commons
                Goals:
•  To scale the amount of open language data (PD/CCZero to
   GPL to CCNC-BY to MIT/BSD)

•  To seek the participation of holders of language data
   including publishers, corporations, and authors (including
   web authors), funders of research that generates language
   data, and the institutes, researchers, and projects who are
   themselves creating and/or curating language data.

•  To build open and available language data resources to
   further research, development, and global access to knowledge

•  To help preserve and promote endangered languages
Language Commons
               Participants
•  Translate.org, Meedan.net, Miro Project, Rosetta Project / Long Now
   Foundation, the Kamusi Project, Rosetta Foundation (translation
   service organization in Ireland), Fostering Language Resources
   Network (FLaReNet), European Language Resources Association
   (ELRA), The Berkman Center for Internet and Society

•  Bibliotheca Alexandrina, IBM Watson Language Group,
   Center for Research in Computational Linguistics,
   King Abdullah’s Initiative for Arabic Content,
   International Development Research Center (Canada)

•  Saint Louis University, University of Melbourne, University of
   Michigan, Vassar, Universitat d’Alacant, University of Edinburgh,
   University of Pittsburgh, University of Pennsylvania, Eastern Michigan
   University, Tufts University
Language Distribution
  Half the world population speaks one of 10 languages (1%)
  Most everyone else speaks one of 300 languages (4%)
  5% of the world speaks one of 6,500 languages (95%)

  [Chart; vertical axis ticks: 1 Billion, 100 Million, 10 Thousand; horizontal axis: Number of Languages]
Want to use your language
  in the digital domain?
1.  Is there a writing system for your language?

    a.  Yes! Continue to (2)

    b.  No! But you can still talk on your mobile phone, and post
        YouTube videos of yourself and your friends. Note you will
        need to type alphanumeric text (or use voice commands) in
        another, more widely used language.
Want to use your language
  in the digital domain?
2.  Is there a unique identifier (ISO 639 code) for your language?

    a.  Yes! Continue to (3)

    b.  No! Bummer. Go back to (1).
Want to use your language
  in the digital domain?
 3.  Is your writing system in Unicode?

     a.  Yes! Congratulations! Your script is now supported in the
         essential architecture of the digital domain.

     b.  No! Bummer. Either create one by adapting a supported
         script, or build a proposal to get your script/unique characters
         supported in Unicode (contact the Script Encoding Initiative
         for help on this). Then go back to (1).
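
One rough way to probe question 3 for a sample of text is to ask Python's standard `unicodedata` module how it classifies each character. The sketch below is only a first-pass check under stated assumptions: the helper names `unassigned_characters` and `script_hints` are hypothetical, the result depends on the Unicode version bundled with the Python build, and a character being encoded says nothing about whether fonts or keyboards exist for it.

```python
import unicodedata


def unassigned_characters(text):
    """Return characters the local Unicode database reports as
    unassigned ('Cn') or private use ('Co')."""
    return [ch for ch in text if unicodedata.category(ch) in ("Cn", "Co")]


def script_hints(text):
    """Return the first word of each character's Unicode name.

    This is a crude proxy for the script (e.g. 'CHEROKEE', 'LATIN'),
    not an official script property lookup.
    """
    hints = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name:
            hints.add(name.split()[0])
    return hints


sample = "\u13e3\u13b3\u13a9"  # three Cherokee syllabary letters
print(script_hints(sample), unassigned_characters(sample))
```

An empty list from `unassigned_characters` suggests every character in the sample has a code point assigned; private-use characters in the result often signal a script that is being typed through a stopgap font rather than real Unicode support.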
Want to use your language
  in the digital domain?
4.  Do you have a large corpus of natural texts – written and spoken?

    a.  Yes! Congratulations! You must be a speaker of a very
        economically powerful language. You continue to grow these
        corpora as you interact online every day (email, internet searches,
        SMS texts, depending somewhat on which ones you use) – and the
        services based on them keep getting better for you – natural
        language search, machine translation, speech recognition, etc.

    b.  No! Bummer. Go back to (3). You and billions of others are in the
        same circumstance. Many give up and simply use a mainstream
        language in the digital domain.
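
The four questions above form a simple ordered checklist, which can be sketched as a function that reports how far a language gets and what stops it. This is a minimal illustration, not a real assessment tool: the function name `digital_readiness` and its four keyword arguments are hypothetical, and the answers have to be supplied by someone who knows the language.

```python
def digital_readiness(*, has_writing_system, has_iso_639_code,
                      script_in_unicode, has_large_corpus):
    """Walk the four questions in order.

    Returns (steps_passed, obstacle): the number of questions answered
    "yes" before the first "no", and a short description of that first
    obstacle (None if all four pass).
    """
    checks = [
        (has_writing_system, "no writing system"),
        (has_iso_639_code, "no ISO 639 identifier"),
        (script_in_unicode, "script not encoded in Unicode"),
        (has_large_corpus, "no large corpus of natural texts"),
    ]
    for passed, (ok, obstacle) in enumerate(checks):
        if not ok:
            return passed, obstacle
    return len(checks), None


# Example: a written, identified, Unicode-encoded language without big corpora.
print(digital_readiness(has_writing_system=True, has_iso_639_code=True,
                        script_in_unicode=True, has_large_corpus=False))
```

Keeping the checks in a list makes the order explicit, which matters here because each question only makes sense once the previous one is answered "yes".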
The Growing 
Linguistic Digital Divide
“There are hundreds of seriously under-documented
languages that remain very much alive with hundreds
of thousands to tens of millions of speakers each.
The speakers of these languages number collectively
in the billions, and as linguistic technology grows in
importance, they find themselves on the far side of
an increasingly large digital divide.”


                - NSF Proposal “Seeding The Language Commons”
Enabling Top 300 Languages
   as well as The Long Tail
•  We have substantial machine-readable corpora for only about
   20-30 of the world’s 6,900 languages. [Bird and Abney, 2010]

•  There is a commercial motivation in enabling the 300 most
   widely spoken languages – if digital services and devices work
   for this group, that is 95% of humanity.

•  The other 6,500 or so – the long tail – offer no commercial
   motivation, but these languages can be documented and
   enabled by non-profit/academic/philanthropic efforts.

•  The Long Tail can benefit from development of the 300 (and
   vice versa) – if we are building better algorithms that can work
   with less data.
What we want to build…
The Language Commons
Proposal: Build an Encyclopedia of Human Language

An aggregation and discovery portal for information and
resources on all 6,900 human languages.





For use by:	

 • language speakers	

 • educators	

 • researchers	

 • general public
Why an Encyclopedia of
       human language?
•  To create the go-to place for information and resources on any and all
   human languages – for education, for research, for preservation

•  To provide resources on lower density languages in case of crisis or
   emergency

•  To take action in the face of impending language loss

•  To act as testament to the genius of human cultural and linguistic
   diversity, and stand for freedom of language as a basic human right

•  To provide a forward path for the use of the world’s languages in the
   digital domain (by building a massive repository of open linguistic
   corpora)
Basic Design Principles:
•  Comprehensive – One page (minimum) for every human language

•  Extensible – includes language families, subgroups, languages, dialects,
   maybe even unique/noteworthy idiolects

•  Flexible – multiple navigation options, suited for a variety of users
   and user views: by language taxonomy, by alternate taxonomy, by other
   grouping (such as linguistic area or geography), with robust search by
   language name, alternate names, ISO 639 code

•  Open – open content, open contribution – the world should build it

•  Visible – the site should be easily discoverable and references to it
   ubiquitous
Model: WikiLanguage
Model: 
The Encyclopedia of Life
Where will the Data Come
         From?
Where will the Data Come
         From?
Where will the Data Come
         From?




               Global Lives Project	

               World Premiere, February 02010	

               San Francisco, California	

               Yerba Buena Center for the Arts
Where will the Data Come
         From?




            Photo by Erik Hersman

 You! Everyone has a language and can help document it.
Language Commons
   What we’ve done this year
•  Established a special collection at the Internet Archive, built
   an uploader, and have accessioned several major corpora
   from working group participants

•  Declaration of purpose, Identity

•  Written grants, most notably to NSF for “Seeding the
   Language Commons: Software for Large Scale Transcription
   and Translation of Oral Literature”

•  Participants have made presentations about The Language
   Commons all over the world (Long Now presented at
   Wikimania in Gdansk last summer)
How Long Now is Helping

•  Long Now has offered to be the umbrella organization for
   The Language Commons, as a project closely related to the
   aims and goals of The Rosetta Project.

•  We are looking towards integrating the two digital
   collections – so that Rosetta’s parallel collection can seed the
   Language Commons.

•  The Language Commons collection would continue to serve
   as a source for future Rosetta Disks and other Long Now
   data preservation projects.
Language Commons
          How YOU can help
•  Please tell other people about The Language Commons –
   Tweet, Facebook, write blog posts or articles about the
   need for an open Language Commons.

•  We need serious funding to build the Encyclopedia of
   Human Language – and we are working on this! But if
   you have any leads or suggestions please let us know.

•  Consider a generous contribution of open language data.
Thank you!



laura@longnow.org

Más contenido relacionado

La actualidad más candente

Introduction to Library & Archives Canada
Introduction to Library & Archives Canada Introduction to Library & Archives Canada
Introduction to Library & Archives Canada Manisha Khetarpal
 
Bittinger & Hieber - Language revitalization: Issues with reference to Navajo
Bittinger & Hieber - Language revitalization: Issues with reference to NavajoBittinger & Hieber - Language revitalization: Issues with reference to Navajo
Bittinger & Hieber - Language revitalization: Issues with reference to NavajoDaniel Hieber
 
Martin Haase: Linguistic Hacking [24c3]
Martin Haase: Linguistic Hacking [24c3]Martin Haase: Linguistic Hacking [24c3]
Martin Haase: Linguistic Hacking [24c3]OpenSlidesArchive
 
APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.
APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.
APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.internetretailer25
 
Spanish in the U.S.: Developing an open linguistic corpus
Spanish in the U.S.: Developing an open linguistic corpusSpanish in the U.S.: Developing an open linguistic corpus
Spanish in the U.S.: Developing an open linguistic corpusSpanish in Texas Project
 
A Right to Our Voice: Linguistic Human Rights and Peace Education
A Right to Our Voice: Linguistic Human Rights and Peace EducationA Right to Our Voice: Linguistic Human Rights and Peace Education
A Right to Our Voice: Linguistic Human Rights and Peace EducationCheryl Woelk
 
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...wnradmin
 
Open Learning Resources
Open Learning ResourcesOpen Learning Resources
Open Learning Resourcesalwright1
 
ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...
ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...
ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...Nikki Andersen
 
Cultural Identities in Wikipedia (Wikimania 2016)
Cultural Identities in Wikipedia (Wikimania 2016)Cultural Identities in Wikipedia (Wikimania 2016)
Cultural Identities in Wikipedia (Wikimania 2016)Marc Miquel
 
How to use modern dictionaries and encyclopedia in
How to use modern dictionaries and encyclopedia inHow to use modern dictionaries and encyclopedia in
How to use modern dictionaries and encyclopedia inEika Matari
 
21stCenturyEducation
21stCenturyEducation21stCenturyEducation
21stCenturyEducationdaniels_99
 

La actualidad más candente (15)

Introduction to Library & Archives Canada
Introduction to Library & Archives Canada Introduction to Library & Archives Canada
Introduction to Library & Archives Canada
 
Bittinger & Hieber - Language revitalization: Issues with reference to Navajo
Bittinger & Hieber - Language revitalization: Issues with reference to NavajoBittinger & Hieber - Language revitalization: Issues with reference to Navajo
Bittinger & Hieber - Language revitalization: Issues with reference to Navajo
 
Martin Haase: Linguistic Hacking [24c3]
Martin Haase: Linguistic Hacking [24c3]Martin Haase: Linguistic Hacking [24c3]
Martin Haase: Linguistic Hacking [24c3]
 
APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.
APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.
APPLIED COMPUTER TECHNOLOGY IN CREE AND NASKAPI LANGUAGE PROGRAMS.
 
Language and human rights
Language and human rightsLanguage and human rights
Language and human rights
 
Spanish in the U.S.: Developing an open linguistic corpus
Spanish in the U.S.: Developing an open linguistic corpusSpanish in the U.S.: Developing an open linguistic corpus
Spanish in the U.S.: Developing an open linguistic corpus
 
A Right to Our Voice: Linguistic Human Rights and Peace Education
A Right to Our Voice: Linguistic Human Rights and Peace EducationA Right to Our Voice: Linguistic Human Rights and Peace Education
A Right to Our Voice: Linguistic Human Rights and Peace Education
 
Language rights
Language rightsLanguage rights
Language rights
 
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
 
Open Learning Resources
Open Learning ResourcesOpen Learning Resources
Open Learning Resources
 
ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...
ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...
ALIA QLD Mini Conference - Becoming a diversity and inclusion champion by Nik...
 
Cultural Identities in Wikipedia (Wikimania 2016)
Cultural Identities in Wikipedia (Wikimania 2016)Cultural Identities in Wikipedia (Wikimania 2016)
Cultural Identities in Wikipedia (Wikimania 2016)
 
How to use modern dictionaries and encyclopedia in
How to use modern dictionaries and encyclopedia inHow to use modern dictionaries and encyclopedia in
How to use modern dictionaries and encyclopedia in
 
Peacecorps
PeacecorpsPeacecorps
Peacecorps
 
21stCenturyEducation
21stCenturyEducation21stCenturyEducation
21stCenturyEducation
 

Similar a The Rosetta Project: Building a 10,000 Year Library of All Human Languages

Wikipedia & Libraries: Ideas to enrich content through collaboration
Wikipedia & Libraries: Ideas to enrich content through collaborationWikipedia & Libraries: Ideas to enrich content through collaboration
Wikipedia & Libraries: Ideas to enrich content through collaborationwittylama
 
MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...
MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...
MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...Dov Winer
 
Umd draft-2010 jun22
Umd draft-2010 jun22Umd draft-2010 jun22
Umd draft-2010 jun22Ed Bice
 
Linguistic Vitality (AILDI 2012)
Linguistic Vitality (AILDI 2012)Linguistic Vitality (AILDI 2012)
Linguistic Vitality (AILDI 2012)Rolando Coto
 
Communication skills about language
Communication skills about languageCommunication skills about language
Communication skills about languageIhsan Ullah Khan
 
Wikipedia & Museums - Qatar Presentation
Wikipedia & Museums - Qatar PresentationWikipedia & Museums - Qatar Presentation
Wikipedia & Museums - Qatar Presentationwittylama
 
Document Scanning Helps Preserve Endangered Languages
Document Scanning Helps Preserve Endangered LanguagesDocument Scanning Helps Preserve Endangered Languages
Document Scanning Helps Preserve Endangered Languagesmosmedicalreview
 
What is english; what is language (online new)
What is english; what is language (online new)What is english; what is language (online new)
What is english; what is language (online new)Simon Smith
 
Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...
Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...
Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...WGBH Media Library and Archives
 
Antwerp call2014 eamer
Antwerp call2014 eamerAntwerp call2014 eamer
Antwerp call2014 eamerAllyson Eamer
 
The wisdom of Motivated Crowds and use of new media in creating services & pr...
The wisdom of Motivated Crowds and use of new media in creating services & pr...The wisdom of Motivated Crowds and use of new media in creating services & pr...
The wisdom of Motivated Crowds and use of new media in creating services & pr...Teemu Leinonen
 
Indigenous and minority languages about the use of virtual worlds
Indigenous and minority languages about the use of virtual worldsIndigenous and minority languages about the use of virtual worlds
Indigenous and minority languages about the use of virtual worldsKristi Jauregi Ondarra
 
Language, Culture, and Software
Language, Culture, and SoftwareLanguage, Culture, and Software
Language, Culture, and SoftwareESUG
 
TEI for building multilingual corpora
TEI for building multilingual corporaTEI for building multilingual corpora
TEI for building multilingual corporaMokhtar Ben Henda
 
Endangered languages
Endangered languagesEndangered languages
Endangered languagesRick McKinnon
 
Cata de lenguas extranjeras presentaciones
Cata de lenguas extranjeras presentacionesCata de lenguas extranjeras presentaciones
Cata de lenguas extranjeras presentacionesCarlos Araujo Viloria
 
201203021 comphumanities 2012 lm (nlp)
201203021 comphumanities 2012 lm (nlp)201203021 comphumanities 2012 lm (nlp)
201203021 comphumanities 2012 lm (nlp)Stefano Lariccia
 
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...cneudecker
 
Hieber - Language Endangerment: A History
Hieber - Language Endangerment: A HistoryHieber - Language Endangerment: A History
Hieber - Language Endangerment: A HistoryDaniel Hieber
 

Similar a The Rosetta Project: Building a 10,000 Year Library of All Human Languages (20)

Wikipedia & Libraries: Ideas to enrich content through collaboration
Wikipedia & Libraries: Ideas to enrich content through collaborationWikipedia & Libraries: Ideas to enrich content through collaboration
Wikipedia & Libraries: Ideas to enrich content through collaboration
 
MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...
MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...
MOSAICA: Semantically Enhanced Multifaceted Collaborative Access to Cultural ...
 
Umd draft-2010 jun22
Umd draft-2010 jun22Umd draft-2010 jun22
Umd draft-2010 jun22
 
Linguistic Vitality (AILDI 2012)
Linguistic Vitality (AILDI 2012)Linguistic Vitality (AILDI 2012)
Linguistic Vitality (AILDI 2012)
 
Communication skills about language
Communication skills about languageCommunication skills about language
Communication skills about language
 
Wikipedia & Museums - Qatar Presentation
Wikipedia & Museums - Qatar PresentationWikipedia & Museums - Qatar Presentation
Wikipedia & Museums - Qatar Presentation
 
Document Scanning Helps Preserve Endangered Languages
Document Scanning Helps Preserve Endangered LanguagesDocument Scanning Helps Preserve Endangered Languages
Document Scanning Helps Preserve Endangered Languages
 
Corpus Linguistics
Corpus LinguisticsCorpus Linguistics
Corpus Linguistics
 
What is english; what is language (online new)
What is english; what is language (online new)What is english; what is language (online new)
What is english; what is language (online new)
 
Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...
Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...
Improving Access to Historic Public Broadcasting through Speech-to-Text, Crow...
 
Antwerp call2014 eamer
Antwerp call2014 eamerAntwerp call2014 eamer
Antwerp call2014 eamer
 
The wisdom of Motivated Crowds and use of new media in creating services & pr...
The wisdom of Motivated Crowds and use of new media in creating services & pr...The wisdom of Motivated Crowds and use of new media in creating services & pr...
The wisdom of Motivated Crowds and use of new media in creating services & pr...
 
Indigenous and minority languages about the use of virtual worlds
Indigenous and minority languages about the use of virtual worldsIndigenous and minority languages about the use of virtual worlds
Indigenous and minority languages about the use of virtual worlds
 
Language, Culture, and Software
Language, Culture, and SoftwareLanguage, Culture, and Software
Language, Culture, and Software
 
TEI for building multilingual corpora
TEI for building multilingual corporaTEI for building multilingual corpora
TEI for building multilingual corpora
 
Endangered languages
Endangered languagesEndangered languages
Endangered languages
 
Cata de lenguas extranjeras presentaciones
Cata de lenguas extranjeras presentacionesCata de lenguas extranjeras presentaciones
Cata de lenguas extranjeras presentaciones
 
201203021 comphumanities 2012 lm (nlp)
201203021 comphumanities 2012 lm (nlp)201203021 comphumanities 2012 lm (nlp)
201203021 comphumanities 2012 lm (nlp)
 
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Dat...
 
Hieber - Language Endangerment: A History
Hieber - Language Endangerment: A HistoryHieber - Language Endangerment: A History
Hieber - Language Endangerment: A History
 

The Rosetta Project: Building a 10,000 Year Library of All Human Languages

  • 1. The Rosetta Project: Building a 10,000 Year Library ! of All Human Language!
  • 2. A bit of background…
  • 3. The 10,000 Year Clock “I want to build a clock that ticks once a year. The century hand advances once every one hundred years, and the cuckoo comes out on the millennium.”! Danny Hillis
  • 6.
  • 7. The 10,000 Year Library “The Clock dramatizes the scope of historic time past and to come but offers no content. The Library is all content, especially past content with future significance…The value could lie in providing civilizations with a wisdom line: slow, robust, apparently Stewart brand inefficient. ”! -from “Clock/Library” in The Clock of the Long Now!
  • 8. Library Projects: A Responsibility Record
  • 9. Library Projects: All Species
  • 10. Library Projects: Time Bits …We risk creating a Digital Dark Age – a void in the continuity of cultural record – because the formats and hardware which we entrust with our data are unlikely to outlast even the next ten years, much less our own lives. - Danny Hillis
  • 11. Strategies That Improve Data Longevity •  For starters expand your scope: aim for at least 500 years (can you do better than paper?) •  Use it or lose it – unused data dies •  Provide access – promotes use, reuse, LOCKSS •  Consider saving everything (e.g. Internet Archive) •  Move it or lose it (“Movage”) •  Consider atoms over bits (analog)
  • 12. Library Projects: Long Server
  • 13. Library Projects: The Rosetta Project •  Thousands of years ago we stored information on stone tablets – some of these are still around. •  Hundreds of years ago we stored information in books – print on acid free paper can reliably be preserved 500 years. •  Now we store information digitally, using hardware, software and encodings that are highly ephemeral.
  • 14. The Rosetta Disk (One Possible Solution)
  • 15. Microscopic Analog Data Storage
  • 18. Parallel Content in Multiple languages •  The Rosetta Stone includes a decree of the divine cult of King Ptolomy V carved in 196 BC •  Same text written in three different forms: Egyptian Hieroglyphs, Demotic (Early Egyptian Script preceding Coptic), and Ancient Greek •  Working back from the Greek and somewhat known Demotic, were able to decipher the Hieroglyphs – thereby unlocking records of an entire ancient civilization
  • 19. Rosetta Disk Goal - Parallel content for all languages Vocabulary Maps Sound Structure Writing Systems Word and Sentence Ethnographic Information Structure Parallel texts Numbering Systems Other texts Color Systems
  • 20. Building the Collection Book Scanning
  • 21. Building the Collection: Swadesh Wordlists
  • 24. “Born Digital” Materials Endangered Language Documentation Project!
  • 25. 6 First Edition Disks •  Brewster Kahle, Internet Archive •  Charles Butcher, Lazy 8 Foundation – now in the permanent special collection of the University of Colorado Boulder Library •  William Lidwell, author of Universal Principles of Design •  Oliver Wilke – Oliver Wilke Stiftung für Sprachen •  One is held by an anonymous donor, and one is in the Long Now Museum
  • 27. Rosetta Disk Museum Edition In August 02009 we presented the prototype of the Rosetta Disk Museum Edition to Secretary Wayne Clough for the Smithsonian.!
  • 28. Endangered Languages “The coming century will see either the death or doom of 90% of mankind’s languages”! - Michael Krauss!
  • 29. Top Ten Languages by Native Speakers (Millions) [Bar chart listing, in the order shown: Mandarin, Spanish, English, Bengali, Hindi, Portuguese, Russian, Japanese, German, Javanese; axis from 0 to 900 million speakers.] Data: The Ethnologue (02009), available at www.ethnologue.com
  • 30. Language Distribution [Chart axis labels: 1 Billion, 100 Million, 10 Thousand; Number of Languages.] •  Half the world population speaks one of 10 languages (1%) •  Most everyone else speaks one of 300 languages (4%) •  5% of the world speaks one of 6,500 languages (95%)
  • 31. Why does it matter?
  • 32. Languages are... Great Works of Art
  • 33. Languages are... Great Libraries
  • 34. Languages are “How to” guides for Living on Planet Earth
  • 35. Languages Provide a window into our minds
  • 36. Freedom of Language - an inalienable human right Individually you have: •  The right to be recognized as a member of a language community •  The right to use your language in private and in public •  The right to use your own name •  The right to interrelate and associate with your native speech community •  The right to maintain and develop your own culture
  • 37. Freedom of Language - an inalienable human right Collectively your speech community has: •  The right for your own language and culture to be taught •  The right of access to cultural services •  The right to an equitable presence of your language and culture in the communications media •  The right to receive attention in your own language from government bodies and in socioeconomic relations From the Universal Declaration on Linguistic Rights, Barcelona, June 1996
  • 39. Open Digital Collection on All Human Languages
  • 40. Rosetta Special Collection In the Internet Archive
  • 41. Rosetta Language Base – Linguistic Metastructure •  Freebase: over 10,000 languages and linguistic entities linked by language family relationship •  All data is linked to other kinds of data in Freebase •  We have rectified ~1500 Wikipedia pages about human languages to our data set
  • 44. The Language Commons Working Group
  • 45. Language Commons Goals: •  To scale the amount of open language data (PD/CCZero to GPL to CCNC-BY to MIT/BSD) •  To seek the participation of holders of language data including publishers, corporations, and authors (including web authors), funders of research that generates language data, and the institutes, researchers, and projects who are themselves creating and/or curating language data. •  To build open and available language data resources to further research, development, and global access to knowledge •  To help preserve and promote endangered languages
  • 46. Language Commons Participants •  Translate.org, Meedan.net, Miro Project, Rosetta Project / Long Now Foundation, the Kamusi Project, Rosetta Foundation (translation service organization in Ireland), Fostering Language Resources Network (FLaReNet), European Language Resources Association (ELRA), The Berkman Center for Internet and Society •  Bibliotheca Alexandrina, IBM Watson Language Group, Center for Research in Computational Linguistics, King Abdullah’s Initiative for Arabic Content, International Development Research Center (Canada) •  Saint Louis University, University of Melbourne, University of Michigan, Vassar, Universitat d’Alacant, University of Edinburgh, University of Pittsburgh, University of Pennsylvania, Eastern Michigan University, Tufts University
  • 48. Language Distribution [Chart axis labels: 1 Billion, 100 Million, 10 Thousand; Number of Languages.] •  Half the world population speaks one of 10 languages (1%) •  Most everyone else speaks one of 300 languages (4%) •  5% of the world speaks one of 6,500 languages (95%)
  • 49. Want to use your language in the digital domain? 1. Is there a writing system for your language? a. Yes: Continue to (2). b. No: But you can still talk on your mobile phone, and post YouTube videos of yourself and your friends. Note you will need to type alphanumeric text (or use voice commands) in another, more widely used language.
  • 50. Want to use your language in the digital domain? 2. Is there a unique identifier (ISO 639 code) for your language? a. Yes: Continue to (3). b. No: Bummer. Go back to (1).
  • 51. Want to use your language in the digital domain? 3. Is your writing system in Unicode? a. Yes: Congratulations! Your script is now supported in the essential architecture of the digital domain. b. No: Bummer. Either create one by adapting a supported script, build a proposal to get your script/unique characters supported in Unicode (contact the Script Encoding Initiative for help on this), or go back to (1).
  • 52. Want to use your language in the digital domain? 4. Do you have a large corpus of natural texts – written and spoken? a. Yes: Congratulations! You must be a speaker of a very economically powerful language. You continue to grow these corpora as you interact online every day (email, internet searches, SMS texts, depending somewhat on which ones you use) – and the services based on them keep getting better for you – natural language search, machine translation, speech recognition, etc. b. No: Bummer. Go back to (3). You and billions of others are in the same circumstance. Many give up and simply use a mainstream language in the digital domain.
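The four questions on slides 49 to 52 form a simple checklist, and a minimal Python sketch may make the ordering easier to see. This is only an illustration, not part of the original talk: the `LanguageStatus` class, its field names, and the `first_barrier` function are hypothetical, and the playful "go back to" steps are modeled as simply reporting the first unmet prerequisite.

```python
from dataclasses import dataclass


@dataclass
class LanguageStatus:
    """Hypothetical record of the four prerequisites from slides 49-52."""
    has_writing_system: bool   # question 1
    has_iso_639_code: bool     # question 2
    script_in_unicode: bool    # question 3
    has_large_corpus: bool     # question 4


def first_barrier(status: LanguageStatus) -> str:
    """Return the first unmet prerequisite, checked in slide order."""
    if not status.has_writing_system:
        return "no writing system"
    if not status.has_iso_639_code:
        return "no ISO 639 code"
    if not status.script_in_unicode:
        return "script not in Unicode"
    if not status.has_large_corpus:
        return "no large corpus"
    return "fully enabled"


# Example: a written language with an ISO 639 code whose script is not yet encoded.
example = LanguageStatus(True, True, False, False)
print(first_barrier(example))
```

Checking the questions in this fixed order mirrors the slides, where each step only matters once the previous one has been answered "yes".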
  • 53. The Growing Linguistic Digital Divide “There are hundreds of seriously under-documented languages that remain very much alive with hundreds of thousands to tens of millions of speakers each. The speakers of these languages number collectively in the billions, and as linguistic technology grows in importance, they find themselves on the far side of an increasingly large digital divide.” - NSF Proposal “Seeding The Language Commons”
  • 54. Enabling Top 300 Languages as well as The Long Tail •  We have substantial machine readable corpora for only about 20-30 of the world’s 6,900 languages. [Bird and Abney, 2010] •  There is a commercial motivation in enabling the 300 most widely spoken languages – if digital services and devices work for this group, that is 95% of humanity. •  The other 6,500 or so – the long tail – has no commercial motivation, but these languages can be documented and enabled by non-profit/academic/philanthropic efforts. •  The Long Tail can benefit from development of the 300 (and vice versa – if we are building better algorithms that can work with less data).
  • 55. What we want to build…
  • 56. The Language Commons Proposal: Build an Encyclopedia of Human Language An aggregation and discovery portal for information and resources on all 6,900 human languages. For use by: • language speakers • educators • researchers • general public
  • 57. Why an Encyclopedia of human language? •  To create the go-to place for information and resources on any and all human languages – for education, for research, for preservation •  To provide resources on lower density languages in case of crisis or emergency •  To take action in the face of impending language loss •  To act as testament for the genius of human cultural and linguistic diversity, and stand for freedom of language as a basic human right •  To provide a forward path for the use of the world’s languages in the digital domain (by building a massive repository of open linguistic corpora)
  • 58. Basic Design Principles: •  Comprehensive – One page (minimum) for every human language •  Extensible – includes language families, subgroups, languages, dialects, maybe even unique/noteworthy idiolects •  Flexible – multiple navigation options and suited for a variety of users and user views: by language taxonomy, by alternate taxonomy, by other grouping – like linguistic area, geographic, with robust search by language name, alternate names, ISO 639 code •  Open – open content, open contribution – the world should build it •  Visible – the site should be easily discoverable and references to it ubiquitous
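As a rough illustration of the "search by language name, alternate names, ISO 639 code" principle on slide 58, the sketch below builds one case-insensitive lookup table over all three kinds of key. It is an assumption-laden toy, not the project's design: `LanguagePage`, `build_index`, and `search` are hypothetical names, and the two sample entries are only illustrative data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LanguagePage:
    """Hypothetical one-page-per-language record."""
    name: str
    iso_639_3: str
    alternate_names: List[str] = field(default_factory=list)


def build_index(pages: List[LanguagePage]) -> Dict[str, List[LanguagePage]]:
    """Map every name, alternate name, and ISO 639 code to its page(s)."""
    index: Dict[str, List[LanguagePage]] = {}
    for page in pages:
        for key in [page.name, page.iso_639_3, *page.alternate_names]:
            # casefold() gives case-insensitive matching across scripts
            index.setdefault(key.casefold(), []).append(page)
    return index


def search(index: Dict[str, List[LanguagePage]], query: str) -> List[LanguagePage]:
    """Exact, case-insensitive lookup; returns an empty list when nothing matches."""
    return index.get(query.strip().casefold(), [])


# Illustrative sample data only.
pages = [
    LanguagePage("Spanish", "spa", ["Castilian", "Español"]),
    LanguagePage("Mandarin Chinese", "cmn", ["Putonghua"]),
]
index = build_index(pages)
print([p.name for p in search(index, "castilian")])
```

A list of pages per key is used because alternate names are often shared between languages; a real portal would also need fuzzy matching, which this sketch leaves out.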
  • 61. Where will the Data Come From?
  • 62. Where will the Data Come From?
  • 63. Where will the Data Come From? Global Lives Project World Premiere, February 02010 San Francisco, California Yerba Buena Center for the Arts
  • 64. Where will the Data Come From? You! Everyone has a language and can help document it. (Photo by Erik Hersman)
  • 65. Language Commons: What we’ve done this year •  Established a special collection at the Internet Archive, built an uploader, and have accessioned several major corpora from working group participants •  Declaration of purpose, Identity •  Written grants, most notably to NSF for “Seeding the Language Commons: Software for Large Scale Transcription and Translation of Oral Literature” •  Participants have made presentations about The Language Commons all over the world (Long Now presented at Wikimania in Gdansk last summer)
  • 66. How Long Now is Helping •  Long Now has offered to be the umbrella organization for The Language Commons, as a project closely related to the aims and goals of The Rosetta Project. •  We are looking towards integrating the two digital collections – so that Rosetta’s parallel collection can seed the Language Commons. •  The Language Commons collection would continue to serve as a source for future Rosetta Disks and other Long Now data preservation projects.
  • 67. Language Commons: How YOU can help •  Please tell other people about The Language Commons – Tweet, Facebook, write blog posts or articles about the need for an open Language Commons. •  We need serious funding to build the Encyclopedia of Human Language – and we are working on this. But if you have any leads or suggestions, please let us know. •  Consider a generous contribution of open language data.