2. Kaur Alasoo
• Computer Science, University of Tartu (2007-2010)
• Intern at European Molecular Biology Laboratory (April - August 2010)
• Systems Biology at Aalto University (2010 - 2012)
7. Culturomics: Using Google Books to analyze culture
To keep up with the lexicon, dictionaries are updated regularly (13). We examined how well these changes corresponded with changes in actual usage by studying the 2077 1-gram headwords added to AHD4 in 2000. The overall frequency of these words, such as “buckyball” and “netiquette”, has soared since 1950: Two-thirds exhibited recent [...]
[...] to supplant them (Fig. 2E and fig. S5). High-frequency irregulars, which are more readily remembered, hold their ground better. For instance, we found “found” (frequency: 5 × 10⁻⁴) 200,000 times more often than we finded “finded.” In contrast, “dwelt” (frequency: 1 × 10⁻⁵) dwelt in our data only 60 times as often as “dwelled” [...]
[...] significant driver of re[...] 200 years. The regulariz[...] and spilt originated in [...] forms still cling to life i[...] E and F). But the -t irre[...] England too. Each year [...] Cambridge adopts “bur[...]
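The ratios quoted on this slide for “found”/“finded” and “dwelt”/“dwelled” can be illustrated with a minimal Python sketch. The corpus size and the counts below are hypothetical, chosen only so that the arithmetic reproduces the quoted figures; they are not the paper's raw data, and `usage_ratio` is an illustrative helper name.

```python
def usage_ratio(count_a: int, count_b: int) -> float:
    """Return how many times more often form A occurs than form B.

    Both counts must come from the same corpus, so the ratio of counts
    equals the ratio of relative frequencies.
    """
    if count_b <= 0:
        raise ValueError("count_b must be positive")
    return count_a / count_b


# Hypothetical corpus size and counts (assumptions for illustration only).
TOTAL_WORDS = 6_000_000_000
COUNTS = {"found": 3_000_000, "finded": 15, "dwelt": 60_000, "dwelled": 1_000}

# Relative frequency = occurrences / total words in the corpus.
found_frequency = COUNTS["found"] / TOTAL_WORDS
dwelt_frequency = COUNTS["dwelt"] / TOTAL_WORDS

found_ratio = usage_ratio(COUNTS["found"], COUNTS["finded"])
dwelt_ratio = usage_ratio(COUNTS["dwelt"], COUNTS["dwelled"])

print(f"found: frequency {found_frequency:.0e}, {found_ratio:,.0f}x 'finded'")
print(f"dwelt: frequency {dwelt_frequency:.0e}, {dwelt_ratio:,.0f}x 'dwelled'")
```

With these toy counts the high-frequency irregular outnumbers its regularized form by a far larger factor than the low-frequency one, which is the pattern the slide describes.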
Fig. 1. Culturomic analyses study millions of books at once. (A) Top row: Authors have been writing for millennia; ~129 million book editions have been published since the advent of the printing press (upper left). Second row: Libraries and publishing houses provide books to Google for scanning (middle left). Over 15 million books have been digitized. Third row: Each book is associated with metadata. Five million books are chosen for computational analysis (bottom left). Bottom row: A culturomic time line shows the frequency of “apple” in English books over time (1800–2000). (B) Usage frequency of [...]
[Figure labels, panels A to C: 129 million books published; 15 million books scanned; 5 million books analyzed; plot of the frequency of the word "apple" by year]
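A time line like the one for “apple” can be sketched as a per-year relative frequency: occurrences of the word in a year divided by the total number of words in the corpus for that year. The function name and the input layout (year mapped to an occurrences/total pair) are assumptions made for this sketch, not the format of the actual data set.

```python
from typing import Dict, Tuple


def frequency_timeline(
    yearly_counts: Dict[int, Tuple[int, int]]
) -> Dict[int, float]:
    """Map each year to occurrences / total words, in chronological order.

    yearly_counts maps year -> (occurrences of the word, total words that year).
    """
    timeline: Dict[int, float] = {}
    for year in sorted(yearly_counts):
        occurrences, total_words = yearly_counts[year]
        if total_words <= 0:
            raise ValueError(f"total word count for {year} must be positive")
        timeline[year] = occurrences / total_words
    return timeline


# Hypothetical counts for one word in three years (illustration only).
toy_counts = {1900: (5, 10_000), 1800: (2, 1_000), 2000: (30, 100_000)}

for year, frequency in frequency_timeline(toy_counts).items():
    print(year, frequency)
```

Normalizing by the yearly total matters because far more books were published in later years, so raw counts alone would rise even for a word whose share of the language stayed flat.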
8. Fame depends on profession
[Figure, panel F; y-axis: median frequency] [...] 3” 1871 (gray lines; median, thick dark gray line). Five examples are highlighted.
9. Tracking censorship
[...] birth date and [...] 1800 to 1950, [...] of the 50 most [...]
[...] (Fig. 3E). The age of peak celebrity has been consistent over time: about 75 years after birth. But the other parameters have been changing (fig. S8). [...]
[...] similar (7) (fig. S[...] famous than eve[...] more rapidly than [...]
[Figure, panels A and B; y-axis: Frequency; label: wikipedia.org]
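The "age of peak celebrity" mentioned on this slide can be read as the year in which a name's frequency is highest, minus the birth year. The sketch below is a simplification for a single hypothetical person; the paper works with cohorts and median frequencies, and the function name and the numbers are assumptions made for illustration.

```python
from typing import Dict


def age_of_peak_celebrity(birth_year: int, yearly_frequency: Dict[int, float]) -> int:
    """Return (year of highest name frequency) - (birth year)."""
    if not yearly_frequency:
        raise ValueError("yearly_frequency must not be empty")
    # max() over the dict iterates its keys (years); the key function
    # ranks each year by its frequency value.
    peak_year = max(yearly_frequency, key=yearly_frequency.get)
    return peak_year - birth_year


# Hypothetical frequency series for one person born in 1871 (not real data).
toy_series = {1900: 1e-9, 1920: 1.5e-8, 1946: 4e-8, 1980: 2e-8}

print(age_of_peak_celebrity(1871, toy_series))
```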
10. [Figure, panels B and D; y-axis: Frequency] [...] (fig. S8). [...] more rapidly than ever.
www.sciencemag.org on April 21, 2011
11. History of science
[...]
50. Conclusion
• There are already many successful examples of data-rich applications.
• More and more data will become available in many different fields.
• Collecting data is easy. Difficulties lie in analyzing it and understanding what it means.
51. References
• Quantitative Analysis of Culture Using Millions of Digitized Books,
Jean-Baptiste Michel, et al. Science 331, 176 (2011)
• OkTrends: Dating research from OkCupid
http://blog.okcupid.com
• The Data-Driven Life, Gary Wolf, The New York Times
http://www.nytimes.com/2010/05/02/magazine/02self-measurement-t.html
• The Quantified Self | self knowledge through numbers
http://quantifiedself.com