The document is a summary of a presentation given by George Oates on April 11, 2011 about the Internet Archive and its Open Library project. Some key points:
1) The Internet Archive is a non-profit that has been building a digital library since 1996 with the mission of "Universal Access to all Knowledge."
2) The Open Library project aims to create a "Wikipedia for Books" and currently has over 1 million free books available online that can be borrowed or read in the browser.
3) The Internet Archive scans and stores books, websites, videos, software, and other materials; its servers hold over 100,000 hours of TV, 250,000 moving images, 500,000 audio recordings, and 2.5 million scanned texts.
1. Hello.
MITH API Workshop
George Oates
Maryland, February 2011
Monday, April 11, 2011
2. Some rights reserved by mattdork
I work at the Internet Archive, leading The Open Library project. We recently moved into this
church in The Richmond in San Francisco. We’re turning it into a library.
3.
We’re based in San Francisco, California, where I happen to have been living for about 5
years.
4. Universal Access to
All Knowledge
Since 1996, the non-profit Internet Archive has been building a digital library of Internet sites
and other things in digital form. archive.org has a ton of texts, video, software, live music...
all sorts of things.
Our mission is Universal Access to all Knowledge. Not a bad reason to get out of bed each
day...
5. Some rights reserved by heather
It’s not your traditional non-profit... Lots of the staff are technologists and developers.
6. archive.org
We have many computers. They store over
- 100,000 hours of TV from channels all over the world
- 250,000 moving images or video
- 500,000 audio recordings
- 2.5 million scanned texts
- 150,000,000,000 web pages
7. By rkumar
Just the other day we had 2.88 petabytes of hard drives delivered. That’s enough storage for
about 2 billion books.
8.
Another major part of what we do is scanning books. This is a picture of one of the scanning
centers in San Francisco. We currently employ about 200 staff scanning books.
9.
And today, we have well over a million free texts available online, including over 1 million books:
- 150 million pages scanned
- 1,000 books scanned EVERY day
- 24 scanning centers in 5 countries, and we hope for more.
10.
We’re also scanning microfilm, which is much faster than individual books. Here’s an example of the record of the population census from
1790 to 1930, scanned from microfilm from the collections of the Allen County Public Library and originally from the United States
National Archives and Records Administration.
11.
Examples of Cross Writing from Boston Public Library
12.
Over 1 million free books that you can read on archive.org today, and access through the
Open Library site, by checking the little “Only eBooks” box as you search.
13.
As well as being able to download these books in a variety of different formats, from PDF to
TXT and more, we also have a web-based book reader, which you can use to read our
scanned texts within your web browser, without the need for any additional software. At the
end of 2010, we released a new version of our open source, browser-based BookReader.
I’ve actually come to Wellington direct from a meeting in San Francisco called Books in
Browsers, held at the Internet Archive last week. It was there that we announced an upcoming
new release of our bookreader, which will hopefully go live in the next few weeks... Here are
some screenshots...
14.
The main reason we wanted to improve on the current design was to try to build an “app-level
quality” book reading experience right in the browser. This included several
improvements for touch interfaces in browsers on devices like the iPad.
From a straightforward design perspective, there were also improvements to be made on
usability and simple stuff like making the book bigger in the browser window.
15.
This is a screenshot with the toolbar open, where you can see new features like a navigation
bar at the bottom that allows you to scroll through the book, a “read to me” feature which
plays the book in a computer-y voice, and highlights what’s being read. Also, if we know a
table of contents for the book, each chapter is mapped along the navigation bar.
We’ve also rewritten the full text search engine, and I’ll talk more about that a bit later.
16. By rkumar
Apologies for the slightly blurry picture, but this is my boss, Brewster Kahle, who founded the
Internet Archive back in 1996. He’s playing with a touchscreen which is displaying the new
bookreader. The screen’s been installed in one of the reading desks that used to sit in the
reading room of the Christian Science church before it became our new home. A big part of
the bookreader redesign was to evolve an app-level quality book reading experience within a
web browser. If you have an iPad, I’d encourage you to try it!
17.
The Open Library project was launched back in 2007. In May 2010, we launched a total site
redesign. Just last week, we released a revised home page, building on our new Lending
program, and generally trying to do a better job of communicating that you can come to
Open Library to find something to read for free, or a book to borrow. We also added activity
graphs to try to show that there’s stuff happening, all day, every day.
18. A “Wikipedia for Books”
There are a few different ways to describe what Open Library is, but I think the explanation
that makes the most sense is “a Wikipedia for Books”.
20.
We have a lending library of some 10,000 20th Century books. You can also access another
80,000 books if you’re (literally) sitting in one of the 150 or so libraries participating in our
“In-Library Lending” program. Each participating library contributes eBooks into the in-library
pool, and you can borrow anything in the pool, once you’re sitting in one of the libraries.
21.
Yay! Graphs going up! (That peak you can see across the graphs is our lending launch. For
more info, read “Get Thee to a Library!” http://blog.openlibrary.org/2011/02/22/get-thee-to-a-library/)
22.
Snapshot of the various combinations of links we can provide to get you to books... For books
we can’t lend through our own lending program, we’ve connected to Overdrive... We’re
hoping to make the vendors you can buy from more dynamic, and open up the sources for
online free texts. Right now, it’s just the Internet Archive texts that we link to in full.
23. lending ebooks
• map / openstreetmap
You can browse a map of (mainly North American) libraries participating in the In-Library
Lending program. If you’re interested in joining, please contact us!
24. borrow page
• screen
Here’s what a page looks like to borrow a book. You can see 3 options: In Browser, PDF, and
ePub.
In-browser is available immediately. You need to download/install Adobe Digital Editions to
read PDF or ePub versions.
36. API
http://openlibrary.org/developers/api
Open Library has a RESTful API, best used to link into Open Library data in JSON,
YAML and RDF/XML.
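As a rough illustration of reading Open Library data as JSON, the sketch below builds a record URL and fetches it with Python's standard library. The edition ID OL7440033M is one that appears later in this deck. The helper names `record_url` and `fetch_record` are made up for this example, and the assumption that a `.json`, `.yml` or `.rdf` suffix selects the output format should be checked against the API page above.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://openlibrary.org"  # host used throughout this deck


def record_url(path: str, fmt: str = "json") -> str:
    """Build a URL for a record path such as '/books/OL7440033M'.

    Assumption: a '.json', '.yml' or '.rdf' suffix selects the format.
    """
    if fmt not in {"json", "yml", "rdf"}:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{BASE_URL}/{path.strip('/')}.{fmt}"


def fetch_record(path: str) -> dict:
    """Fetch a record as JSON and decode it (needs network access)."""
    with urlopen(record_url(path), timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the URL; call fetch_record() to actually hit the site.
    print(record_url("/books/OL7440033M"))
```

Keeping URL construction separate from the network call makes the first part easy to check offline.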
37. API
http://openlibrary.org/developers/api
Books
Covers
Search inside
Subjects
Recent Changes
Lists
38. Request:
http://openlibrary.org/dev/docs/api/lists
41.
We built lists for a couple of reasons: 1, to help people collect things together, and 2, to
make it easy to get at smaller sets of records.
42. Covers
http://openlibrary.org/developers/api
43.
Where:
• key can be any one of ISBN, OCLC, LCCN, OLID and ID (case-insensitive)
• value is the value of the chosen key
• size can be one of S, M and L for small, medium and large respectively.
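The URL pattern these three parameters plug into did not survive in the slide text. The sketch below assumes the pattern from the published Covers API documentation, `http://covers.openlibrary.org/b/$key/$value-$size.jpg`; treat that pattern, and the helper name `cover_url`, as assumptions to verify against the API page.

```python
VALID_KEYS = {"isbn", "oclc", "lccn", "olid", "id"}
VALID_SIZES = {"S", "M", "L"}


def cover_url(key: str, value: str, size: str = "M") -> str:
    """Build a cover image URL from a key type, its value and a size."""
    key = key.lower()    # the key is described as case-insensitive
    size = size.upper()  # convenience only: accept 's', 'm', 'l' as well
    if key not in VALID_KEYS:
        raise ValueError(f"unknown key type: {key}")
    if size not in VALID_SIZES:
        raise ValueError(f"unknown size: {size}")
    # Assumed pattern: /b/$key/$value-$size.jpg on covers.openlibrary.org
    return f"http://covers.openlibrary.org/b/{key}/{value}-{size}.jpg"


if __name__ == "__main__":
    print(cover_url("ISBN", "0385472579", "S"))
```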
44. (we use this)
50.
One of quite a few examples of Open Library in the wild includes the National Library of
Australia’s new search engine, Trove.
51.
You can see there that there are links to Open Library books wherever one can be sourced.
There are a growing number of sites making use of Open Library data... and that’s what we’re
all about - data in, data out. The more interconnections we can make with other systems, the
easier it will be for people to land where they want to go inside Open Library.
52.
This is ImportBot. He gets new catalog records from the Library of Congress and puts them
into Open Library every Tuesday. We also import records from Amazon, and from the Internet
Archive. ImportBot looks for recently scanned books, and creates new records (or merges
them with existing ones) just a few minutes after the record is created on the Internet
Archive.
53.
You can see ImportBot working away, just like you can see the Wiki’s edit history for every
person who edits something.
54.
Another quick note on data in before I move on...
We’ve been experimenting with a couple of other “surgical” bots, that look across the catalog
and connect edition records directly to other services by stamping identifiers from other
systems into Open Library. This is a bot, written by a developer called Ben Gimpert, that takes
a file mapping ISBNs to Goodreads IDs, looks for ISBN matches in OL, and adds the
Goodreads ID to those records. This allows us to construct links to Goodreads, and to make
the Goodreads ID available through the API.
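The core of a “surgical” bot like this can be sketched in a few lines. This is not Ben Gimpert's code. The record shape (`isbn_10`, `isbn_13`, `identifiers`) is an assumption made for illustration, the function name is hypothetical, and the Goodreads ID in the usage example is a made-up placeholder.

```python
def stamp_goodreads_ids(editions: list, isbn_to_goodreads: dict) -> int:
    """Add a Goodreads ID to each edition whose ISBN appears in the mapping.

    Returns the number of editions that were changed.
    """
    updated = 0
    for edition in editions:
        isbns = edition.get("isbn_10", []) + edition.get("isbn_13", [])
        for isbn in isbns:
            goodreads_id = isbn_to_goodreads.get(isbn)
            if goodreads_id is None:
                continue
            ids = edition.setdefault("identifiers", {}).setdefault("goodreads", [])
            if goodreads_id not in ids:  # stay idempotent on re-runs
                ids.append(goodreads_id)
                updated += 1
            break  # one match per edition is enough
    return updated


if __name__ == "__main__":
    editions = [
        {"key": "/books/OL7440033M", "isbn_10": ["0385472579"]},
        {"key": "/books/OL0000000M", "isbn_10": ["0000000000"]},
    ]
    mapping = {"0385472579": "12345"}  # placeholder Goodreads ID
    print(stamp_goodreads_ids(editions, mapping), "edition(s) updated")
```

Because the update is idempotent, the same mapping file can be run again without creating duplicate identifiers.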
55.
You can see we’ve added a little widget on the page that connects to Goodreads. If you have
an account, you can add our records to your lists on Goodreads. There’s also a LibraryThing
ID, added by a similar batch bot update.
Writing bots to do things like this is the sort of development we’d like to open up to external
developers too...
56. BookReader
http://openlibrary.org/dev/docs/ia
57.
58.
The Library of Congress is using our Bookreader on read.gov. There are quite a few other
examples of the IA Bookreader out there on the web. Hopefully the redesign (with touch
interactions etc) will attract new people too...
62. Raw Full Text
> 4 million documents
with metadata
63. Stanford NLP thing
http://nlp.stanford.edu/
We’ve just begun experimenting with some of the software made by the Stanford Natural
Language Processing Group, which includes members of both the Linguistics Department and
the Computer Science Department. One idea is to fold this software into the scanning
process, so we can do a first pass on entity extraction on full text of a book, to extract things
like names, places and common subjects...
64.
But then of course, you can do cool stuff like this :)
66. http://flic.kr/p/6zyU3U Tension?
The Taxonomy vs Folksonomy debate may be represented thusly.
67. 1) Books are for use.
2) Every reader his [or her] book.
3) Every book its reader.
4) Save the time of the User.
5) The library is a growing organism.
So, on the basis of the idea of our current catalog being a substrate, as Ranganathan
suggests in his five laws of library science...
69.
So... Open Library is a virtual space. Its organization isn’t constrained like a physical catalog.
In fact, the more connections you can make into one of our “virtual index cards” the more
ways people have to discover and navigate its contents.
http://www.flickr.com/photos/brixton/1394845916/
70. http://flic.kr/p/6pmtQL
But, librarians are (very clever) humans too. And everyone who’s responsible for putting
books into a traditional catalogue must work within patterns. Patterns that have grown
semantically remarkable and deeply complex.
71. http://openlibrary.org/search?author=author
Unknown author 403
Unknown Author 358
Author unknown 254
No Author 145
Author Unknown 59
No Author. 54
Author 20
No author. 16
No author 12
unknown author 8
Unknown Author Unknown 7
no author 7
No Author Stated 7
(No Author) 6
No author noted 5
No author noted. 4
no author listed 4
(no author) 4
Author Not Stated 4
Author. 4
No author specified 3
Miscellaneous Author 3
no Author 3
Author One 3
Multi-Author 3
No Author Listed 3
No Stated Author 3
Author Anonymous 2
(no author given) 2
Author 2
Author Wright 2
Unkown Author 2
No author stated 2
Mms suspense author 2
Author Test 2
TEST AUTHOR 2
Duplicate authors (and editions) are an issue... This is an example search for author records
with “author” in their names... you can see the variety of ways that catalogers have noted
unknown authors...
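One way to start taming these variants is to normalize the strings before counting them. The sketch below is only an illustration, not how Open Library merges authors: it lower-cases each name, strips punctuation and collapses whitespace, then adds up the counts for names that become identical. The function names are made up, and the ASCII-only pattern is suitable for placeholder strings like these, not for real author names.

```python
import re
from collections import Counter
from typing import Mapping


def normalize_author(name: str) -> str:
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def group_variants(counts: Mapping[str, int]) -> Counter:
    """Sum record counts for names that normalize to the same string."""
    grouped = Counter()
    for name, n in counts.items():
        grouped[normalize_author(name)] += n
    return grouped


if __name__ == "__main__":
    # A few rows copied from the search results on this slide.
    sample = {
        "No Author": 145, "No Author.": 54, "No author.": 16,
        "No author": 12, "no author": 7, "(No Author)": 6,
        "(no author)": 4, "no Author": 3,
    }
    print(group_variants(sample).most_common())
```

Word-order variants such as “Author unknown” and “Unknown author” still end up in separate groups, so this only handles the easy cases.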
73. Substrate:
any surface on which a plant or animal lives or
on which a material sticks
Some rights reserved by Brynja Eldon
We have a repository that mostly contains records created by professionals. I find it useful to
consider these records as a substrate, something that can be reacted upon.
74. What if we consider the source
Open Library records like that?
Some rights reserved by Brynja Eldon
Now that we’ve begun to reveal this substrate, how will people react to it? What reactions has
it caused so far?
75.
Handwritten scribbles and scrawls; annotations; corrections
76. Some rights reserved by jared
What if a catalog looks like this? Is crystalline? What if it is unconstrained by the need to sort,
say, alphabetically?
From the artist of this image, Jared Tarbell: “Lines like crystals form at perpendicular angles
to existing lines. A complex form emerges.
1000 classic computational substrate, color palette stolen from Jackson Pollock: A simple
perpendicular growth rule creates intricate city-like structures. The simple rule, the complex
results, the enormous potential for modification; this has got to be one of my all time favorite
self-discovered algorithms. Lines likes crystals grow on a computational substrate.”
77.
What happens when you introduce turbulence into the catalog? Here are a few examples of
the sorts of edits we’re seeing... at a rate of about 100,000 edits per month.
http://www.flickr.com/photos/rreis/4859722551/sizes/l/
78. 000s of edits per month
if you don’t stimulate an organism, it atrophies
79. Activity/History
One of the key components to any happy social system is the visibility of other people, and a
sense of activity. This is one of the key elements we’re focussed on in the redesign. This
particular list shows all edits by humans on Open Library, and actually, turns out to be a
handy way to spot check what’s happening. You’ll notice too, there’s a special tab for the
variety of edits that we run across the system using bots. Often pretty mechanical and
repetitive, we found that the bots obscure the humans if you just mush everything up in a big
list, so we separated them.
80. Activity/History
Live Data
82. Shelf
http://www.flickr.com/photos/emdot/400280705/
I really like how Raymond described his book yesterday, that as soon as he’d written it, it
began to decay... Concrete, decay
84. Minimum
Viable Record
Now, I want to try a little exercise. I’m going to hand out an index card to all of you, and ask
you to nominate 5 fields that you think are enough to describe a book. I’ll collate the results
and report back later.
85.
Stamen Design in SF. Got funding from Knight Foundation to build Citytracking. Challenge is a “hodgepodge of
bits—including APIs [2] and official sources, scraped websites, sometimes-reusable data formats and datasets,
visualizations, embeddable widgets etc.—is fractured, overly technical and obscure, held in the knowledge base of
a relatively small number of people, and requires considerable expertise to harness.”
88. Open Publication Distribution System (OPDS)
http://bookserver.archive.org/catalog/new
This is an example of trying something very bare bones, to try to help systems
intercommunicate more easily. (Open Library plans to publish OPDS feeds soon.)
The Open Publication Distribution System (OPDS) Catalog specification is a syndication
format for electronic publications based on Atom (RFC 4287) and HTTP (RFC 2616).
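Because an OPDS catalog is an Atom feed, any XML parser can read it. The sketch below parses a small hand-written feed, not real output from bookserver.archive.org, and lists each entry's title, author and download links. The `example.org` URLs are placeholders, the function name is made up, and the acquisition link relation is taken from the OPDS Catalog specification.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ACQUISITION_REL = "http://opds-spec.org/acquisition"

# Hand-written sample for illustration; hrefs are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample catalog</title>
  <entry>
    <title>American notes for general circulation</title>
    <author><name>Dickens, Charles, 1812-1870</name></author>
    <link rel="http://opds-spec.org/acquisition"
          type="application/pdf" href="http://example.org/book.pdf"/>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip" href="http://example.org/book.epub"/>
  </entry>
</feed>"""


def list_entries(feed_xml: str) -> list:
    """Return title, author and download links for each feed entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        # Keep only acquisition links, keyed by their media type.
        downloads = {
            link.get("type"): link.get("href")
            for link in entry.findall(f"{ATOM}link")
            if link.get("rel") == ACQUISITION_REL
        }
        entries.append({
            "title": entry.findtext(f"{ATOM}title"),
            "author": entry.findtext(f"{ATOM}author/{ATOM}name"),
            "downloads": downloads,
        })
    return entries


if __name__ == "__main__":
    for item in list_entries(SAMPLE_FEED):
        print(item["title"], "by", item["author"], sorted(item["downloads"]))
```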
89. American notes for general circulation [microform]
February 25, 2011 10:22 AM
Author: Dickens, Charles, 1812-1870
Publisher: New York : Harper
Year published: 1842
Book contributor: Canadiana.org
Language: en
Download Ebook: (PDF) (EPUB)
90.
Individuals can also add new books with a few details like Title, Author, Publisher and Publish
Date. That’s enough for a stub, and then people are invited to add more details.
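A stub like this is small enough to model directly. The sketch below is a hypothetical data structure, not Open Library's schema: it holds the four fields named above and reports which ones are still blank. The sample values come from the Dickens record shown on the previous slide.

```python
from dataclasses import asdict, dataclass


@dataclass
class BookStub:
    """The four fields named above; all other details can come later."""
    title: str
    author: str
    publisher: str
    publish_date: str

    def missing_fields(self) -> list:
        """Names of fields that are empty or whitespace only."""
        return [name for name, value in asdict(self).items() if not value.strip()]


if __name__ == "__main__":
    stub = BookStub(
        title="American notes for general circulation",
        author="Dickens, Charles, 1812-1870",
        publisher="Harper",
        publish_date="1842",
    )
    print(stub.missing_fields() or "stub is complete")
```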
93.
Another experiment we’re looking forward to trying is about identifiers. We’re not particularly
concerned about canonical identifiers. Perhaps it’s a waste of time to wait for one, so instead,
we’re going to try and attach as many ID types to our records as we can. (This list is just a
braindump - not active yet.) The idea is that people could add a URL or actual identifier and
Open Library would just do the right thing. A suggestion (after this presentation was
delivered) was that people could ping Open Library with an identifier, not even knowing what
TYPE of ID it is. Perhaps Open Library could help “triangulate” this query towards a book
record. “Record laundering.”
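To make the “triangulation” idea concrete, here is a minimal sketch of guessing an identifier's type from its shape. It is speculative, like the idea itself. It recognizes only Open Library edition IDs, by an assumed `OL…M` pattern, and ISBNs, by their standard check digits. Everything else comes back as unknown, which is exactly the case where a lookup service would be needed.

```python
import re
from typing import Optional


def is_valid_isbn10(value: str) -> bool:
    """ISBN-10: weights 10..1, total divisible by 11, 'X' means 10."""
    if not re.fullmatch(r"\d{9}[\dX]", value):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(value))
    return total % 11 == 0


def is_valid_isbn13(value: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, total divisible by 10."""
    if not re.fullmatch(r"\d{13}", value):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(value))
    return total % 10 == 0


def guess_id_type(raw: str) -> Optional[str]:
    """Return 'olid', 'isbn', or None when the shape is ambiguous."""
    value = raw.strip().replace("-", "").upper()
    if re.fullmatch(r"OL\d+M", value):  # assumed edition ID pattern
        return "olid"
    if is_valid_isbn10(value) or is_valid_isbn13(value):
        return "isbn"
    return None  # e.g. a bare number could be an LCCN, an OCLC number, ...


if __name__ == "__main__":
    for candidate in ["OL7440033M", "0385472579", "978-0-385-47257-9", "28419896"]:
        print(candidate, "->", guess_id_type(candidate))
```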
95. http://openlibrary.org/books/olid/OL7440033M
http://openlibrary.org/books/isbn/0385472579
http://openlibrary.org/books/isbn/9780385472579
http://openlibrary.org/books/lccn/93005405
http://openlibrary.org/books/oclc/28419896
http://openlibrary.org/books/id/240727
http://openlibrary.org/books/amazon/...
http://openlibrary.org/books/bookmooch/...
http://openlibrary.org/books/goodreads/...
http://openlibrary.org/books/ocaid/...
http://openlibrary.org/books/librarything/...
http://openlibrary.org/books/paperback_swap/...
http://openlibrary.org/books/Your ID Here/...
You can already ping Open Library with an ID other than the Open Library identifier to see if
we have any matches.
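When the type of an identifier is unknown, one simple approach is to try it against each URL pattern in turn. The sketch below only generates the candidate URLs, following the `/books/<type>/<value>` patterns listed on this slide, and leaves the requests themselves to the caller. The function name and the choice of which types to try are assumptions.

```python
from urllib.parse import quote

# ID types that have a concrete example on the slide.
ID_TYPES = ("olid", "isbn", "lccn", "oclc", "id")


def candidate_urls(value: str, id_types=ID_TYPES) -> list:
    """Build one lookup URL per ID type for an identifier of unknown type."""
    safe_value = quote(value.strip(), safe="")
    return [f"http://openlibrary.org/books/{id_type}/{safe_value}"
            for id_type in id_types]


if __name__ == "__main__":
    for url in candidate_urls("28419896"):
        print(url)
```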