This document provides information about the Rosetta Project and its goal of creating a 10,000 year library of all human languages. It discusses the motivation to preserve languages and cultural knowledge for future generations. Specific initiatives described include creating Rosetta Disks with parallel text in multiple languages, building an open digital collection of language resources, and developing the proposed Language Commons Encyclopedia of Human Language to aggregate information on all 6,900 human languages. The role of the Long Now Foundation in supporting these initiatives is also outlined.
3. The 10,000 Year Clock
“I want to build a clock that ticks
once a year. The century hand
advances once every one hundred
years, and the cuckoo comes out on
the millennium.”
Danny Hillis
7. The 10,000 Year Library
“The Clock dramatizes the scope of
historic time past and to come but
offers no content. The Library is all
content, especially past content with
future significance…The value could
lie in providing civilizations with a
wisdom line: slow, robust, apparently
inefficient.”
- Stewart Brand, from “Clock/Library” in The Clock of the Long Now
10. Library Projects:
Time Bits
…We risk creating a Digital Dark Age – a void in the continuity
of cultural record – because the formats and hardware which we
entrust with our data are unlikely to outlast even the next ten
years, much less our own lives.
- Danny Hillis
11. Strategies That Improve
Data Longevity
• For starters, expand your scope: aim for at least
500 years (can you do better than paper?)
• Use it or lose it – unused data dies
• Provide access – promotes use, reuse, LOCKSS
• Consider saving everything (e.g. Internet Archive)
• Move it or lose it (“Movage”)
• Consider atoms over bits (analog)
13. Library Projects:
The Rosetta Project
• Thousands of years ago we stored
information on stone tablets –
some of these are still around.
• Hundreds of years ago we stored
information in books – print on
acid-free paper can reliably be
preserved for 500 years.
• Now we store information
digitally, using hardware, software
and encodings that are highly
ephemeral.
18. Parallel Content in
Multiple languages
• The Rosetta Stone includes a
decree of the divine cult of King
Ptolemy V carved in 196 BC
• Same text written in three different
forms: Egyptian Hieroglyphs,
Demotic (Early Egyptian Script
preceding Coptic), and Ancient
Greek
• Working back from the Greek and
somewhat known Demotic, scholars were
able to decipher the Hieroglyphs –
thereby unlocking records of an
entire ancient civilization
19. Rosetta Disk Goal - Parallel
content for all languages
• Vocabulary
• Maps
• Sound Structure
• Writing Systems
• Word and Sentence Structure
• Ethnographic Information
• Parallel texts
• Numbering Systems
• Other texts
• Color Systems
25. 6 First Edition Disks
• Brewster Kahle, Internet Archive
• Charles Butcher, Lazy 8 Foundation –
now in the permanent special collection
of the University of Colorado Boulder
Library
• William Lidwell, author of Universal
Principles of Design
• Oliver Wilke – Oliver Wilke Stiftung für
Sprachen
• One is held by an anonymous donor, and
one is in the Long Now Museum
27. Rosetta Disk
Museum Edition
In August 02009 we presented
the prototype of the Rosetta
Disk Museum Edition to
Secretary Wayne Clough for
the Smithsonian.
29. Top Ten languages by
Native Speakers (Millions)
Mandarin
Spanish
English
Bengali
Hindi
Portuguese
Russian
Japanese
German
Javanese
[Bar chart; horizontal axis: native speakers in millions, 0 to 900]
Data: The Ethnologue (02009), available at www.ethnologue.com
30. Language Distribution
• Half the world population speaks one of 10 languages (1%)
• Most everyone else speaks one of 300 languages (4%)
• 5% of the world speaks one of 6,500 languages (95%)
[Chart: speaker population, 10 thousand to 1 billion, plotted against number of languages]
36. Freedom of Language -
an inalienable human right
Individually you have:
• The right to be recognized as a member of a language
community
• The right to use your language in private and in public
• The right to use your own name
• The right to interrelate and associate with your native
speech community
• The right to maintain and develop your own culture
37. Freedom of Language -
an inalienable human right
Collectively your speech community has:
• The right for your own language and culture to be taught
• The right of access to cultural services
• The right to an equitable presence of your language and
culture in the communications media
• The right to receive attention in your own language from
government bodies and in socioeconomic relations
From the Universal Declaration of Linguistic Rights, Barcelona, June 1996
41. Rosetta Language Base –
Linguistic Metastructure
• Freebase: over 10,000
languages and linguistic
entities linked by language
family relationship
• All data is linked to other
kinds of data in Freebase
• We have reconciled ~1,500
Wikipedia pages about human
languages with our data set
45. Language Commons
Goals:
• To scale the amount of open language data (PD/CCZero to
GPL to CCNC-BY to MIT/BSD)
• To seek the participation of holders of language data
including publishers, corporations, and authors (including
web authors), funders of research that generates language
data, and the institutes, researchers, and projects who are
themselves creating and/or curating language data.
• To build open and available language data resources to
further research, development, and global access to knowledge
• To help preserve and promote endangered languages
46. Language Commons
Participants
• Translate.org, Meedan.net, Miro Project, Rosetta Project / Long Now
Foundation, the Kamusi Project, Rosetta Foundation (translation
service organization in Ireland), Fostering Language Resources
Network (FLaReNet), European Language Resources Association
(ELRA), The Berkman Center for Internet and Society
• Bibliotheca Alexandrina, IBM Watson Language Group, Center for
Research in Computational Linguistics, King Abdullah’s Initiative
for Arabic Content, International Development Research Center (Canada)
• Saint Louis University, University of Melbourne, University of
Michigan, Vassar, Universitat d’Alacant, University of Edinburgh,
University of Pittsburgh, University of Pennsylvania, Eastern Michigan
University, Tufts University
49. Want to use your language
in the digital domain?
1. Is there a writing system for your language?
a. Yes! Continue to (2).
b. No! But you can still talk on your mobile phone, and post
YouTube videos of yourself and your friends. Note that you will
need to type alphanumeric text (or use voice commands) in
another, more widely used language.
50. Want to use your language
in the digital domain?
2. Is there a unique identifier (ISO 639 code) for your language?
a. Yes! Continue to (3).
b. No! Bummer. Go back to (1).
51. Want to use your language
in the digital domain?
3. Is your writing system in Unicode?
a. Yes! Congratulations! Your script is now supported in the
essential architecture of the digital domain.
b. No! Bummer. Either create one by adapting a supported
script, or build a proposal to get your script/unique characters
supported in Unicode (contact the Script Encoding Initiative
for help on this), then go back to (1).
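As a rough illustration of step 3, a sample of text can be checked against the Unicode data that ships with a programming language. The sketch below uses Python's standard `unicodedata` module; the helper name `unencoded_characters` is hypothetical, and the answer depends on the Unicode version bundled with the interpreter, so treat it as a quick check rather than an authoritative one.

```python
import unicodedata

def unencoded_characters(text):
    """Return the characters in `text` that this Python build's Unicode
    database treats as unassigned (general category 'Cn')."""
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]

# Sample: U+A984 JAVANESE LETTER A, from the Javanese script block.
sample = "\ua984"
print(unicodedata.unidata_version)    # Unicode version bundled with Python
print(unencoded_characters(sample))   # empty list if every character is assigned
```

An empty list only says that the code points are assigned. Fonts, keyboards, and rendering support are separate questions that this check does not cover.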
52. Want to use your language
in the digital domain?
4. Do you have a large corpus of natural texts – written and spoken?
a. Yes! Congratulations! You must be a speaker of a very
economically powerful language. You continue to grow these
corpora as you interact online every day (email, internet searches,
SMS texts, depending somewhat on which ones you use) – and the
services based on them keep getting better for you – natural
language search, machine translation, speech recognition, etc.
b. No! Bummer. Go back to (3). You and billions of others are in the
same circumstance. Many give up and simply use a mainstream
language in the digital domain.
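The four questions above can be read as an ordered checklist. Here is a minimal sketch of that reading in Python; the class `LanguageStatus`, the function `first_missing_requirement`, and the example values are all hypothetical, introduced only to illustrate the flow of the questions.

```python
from dataclasses import dataclass

@dataclass
class LanguageStatus:
    has_writing_system: bool
    has_iso_639_code: bool
    script_in_unicode: bool
    has_large_corpus: bool

def first_missing_requirement(status):
    """Return the first unmet step of the four-question checklist, or None."""
    checks = [
        (status.has_writing_system, "1. writing system"),
        (status.has_iso_639_code, "2. ISO 639 code"),
        (status.script_in_unicode, "3. script in Unicode"),
        (status.has_large_corpus, "4. large corpus of natural texts"),
    ]
    for satisfied, label in checks:
        if not satisfied:
            return label
    return None

# Hypothetical language: written, coded, and encoded, but with little text online.
example = LanguageStatus(True, True, True, False)
print(first_missing_requirement(example))  # 4. large corpus of natural texts
```

The order matters: each later step assumes the earlier ones, which is why the function stops at the first gap instead of listing all of them.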
53. The Growing
Linguistic Digital Divide
“There are hundreds of seriously under-documented
languages that remain very much alive with hundreds
of thousands to tens of millions of speakers each.
The speakers of these languages number collectively
in the billions, and as linguistic technology grows in
importance, they find themselves on the far side of
an increasingly large digital divide.”
- NSF Proposal “Seeding The Language Commons”
54. Enabling Top 300 Languages
as well as The Long Tail
• We have substantial machine-readable corpora for only about
20-30 of the world’s 6,900 languages. [Bird and Abney, 2010]
• There is a commercial motivation in enabling the 300 most
widely spoken languages – if digital services and devices work
for this group, that is 95% of humanity.
• The other 6,500 or so – the long tail – has no commercial
motivation, but these languages can be documented and
enabled by non-profit/academic/philanthropic efforts.
• The Long Tail can benefit from development of the 300 (and
vice versa – if we are building better algorithms that can work
with less data).
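To put the figures above in proportion, the short calculation below works out what share of the world's languages have substantial corpora. It uses only the numbers quoted on this slide (6,900 languages, 20-30 with corpora, a top tier of 300); the variable names are illustrative.

```python
TOTAL_LANGUAGES = 6900   # figure used throughout the slides
CORPUS_RICH = (20, 30)   # range cited from Bird and Abney, 2010
TOP_TIER = 300           # most widely spoken languages

# Convert each end of the cited range into a percentage of all languages.
low, high = (100 * n / TOTAL_LANGUAGES for n in CORPUS_RICH)
print(f"Languages with substantial corpora: {low:.2f}% to {high:.2f}%")
print(f"Top 300 as a share of all languages: {100 * TOP_TIER / TOTAL_LANGUAGES:.1f}%")
print(f"Languages outside the top 300: {TOTAL_LANGUAGES - TOP_TIER}")
```

Even within the commercially motivated top 300, only about a tenth of the languages have substantial corpora today.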
56. The Language Commons
Proposal: Build an Encyclopedia of Human
Language
An aggregation and discovery portal for
information and resources on all 6,900 human
languages.
For use by:
• language speakers
• educators
• researchers
• general public
57. Why an Encyclopedia of
human language?
• To create the go-to place for information and resources on any and all
human languages – for education, for research, for preservation
• To provide resources on lower density languages in case of crisis or
emergency
• To take action in the face of impending language loss
• To act as testament to the genius of human cultural and linguistic
diversity, and stand for freedom of language as a basic human right
• To provide a forward path for the use of the world’s languages in the
digital domain (by building a massive repository of open linguistic
corpora)
58. Basic Design Principles:
• Comprehensive – One page (minimum) for every human language
• Extensible – includes language families, subgroups, languages, dialects,
maybe even unique/noteworthy idiolects
• Flexible – multiple navigation options and suited for a variety of users
and user views: by language taxonomy, by alternate taxonomy, by other
grouping – like linguistic area, geographic, with robust search by
language name, alternate names, ISO 639 code
• Open – open content, open contribution – the world should build it
• Visible – the site should be easily discoverable and references to it
ubiquitous
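The "Flexible" principle above calls for search by language name, alternate names, and ISO 639 code. One way this could work is sketched below in Python. The `LanguagePage` structure and `build_index` function are hypothetical and do not describe an existing Language Commons design. The two sample entries use the ISO 639-3 codes for Bengali (`ben`) and Javanese (`jav`).

```python
from dataclasses import dataclass, field

@dataclass
class LanguagePage:
    name: str
    iso_639_3: str
    alternate_names: list = field(default_factory=list)

def build_index(pages):
    """Map every name, alternate name, and ISO 639 code (lowercased) to its page."""
    index = {}
    for page in pages:
        for key in [page.name, page.iso_639_3, *page.alternate_names]:
            index[key.lower()] = page
    return index

pages = [
    LanguagePage("Bengali", "ben", ["Bangla"]),
    LanguagePage("Javanese", "jav"),
]
index = build_index(pages)
print(index["bangla"].name)   # Bengali
print(index["jav"].name)      # Javanese
```

In this sketch a later entry silently overwrites an earlier one with the same key. A real catalogue would need to handle names shared by several languages, for example by mapping each key to a list of pages.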
63. Where will the Data Come
From?
Global Lives Project
World Premiere, February 02010
San Francisco, California
Yerba Buena Center for the Arts
64. Where will the Data Come
From?
Photo by Erik Hersman
You! Everyone has a language and can help document it.
65. Language Commons
What we’ve done this year
• Established a special collection at the Internet Archive, built
an uploader, and have accessioned several major corpora
from working group participants
• Declaration of purpose, Identity
• Written grants, most notably to NSF for “Seeding the
Language Commons: Software for Large Scale Transcription
and Translation of Oral Literature”
• Participants have made presentations about The Language
Commons all over the world (Long Now presented at
Wikimania in Gdansk last summer)
66. How Long Now is Helping
• Long Now has offered to be the umbrella organization for
The Language Commons, as a project closely related to the
aims and goals of The Rosetta Project.
• We are looking towards integrating the two digital
collections – so that Rosetta’s parallel collection can seed the
Language Commons.
• The Language Commons collection would continue to serve
as a source for future Rosetta Disks and other Long Now
data preservation projects.
67. Language Commons
How YOU can help!
• Please tell other people about The Language Commons –
Tweet, Facebook, write blog posts or articles about the
need for an open Language Commons.
• We need serious funding to build the Encyclopedia of
Human Language – and we are working on this! But if
you have any leads or suggestions please let us know.
• Consider a generous contribution of open language data.