Open Library at the API Workshop

Hello.
MITH API Workshop
George Oates
Maryland, February 2011

Monday, April 11, 2011

Some rights reserved by mattdork


I work at the Internet Archive, leading the Open Library project. We recently moved into this
church in The Richmond in San Francisco. We’re turning it into a library.


We’re based in San Francisco, California, where I happen to have been living for about 5
years.

Universal Access to
All Knowledge


Since 1996, the non-proﬁt Internet Archive has been building a digital library of Internet sites
and other things in digital form. archive.org has a ton of texts, video, software, live music...
all sorts of things.

Our mission is Universal Access to all Knowledge. Not a bad reason to get out of bed each
day...

Some rights reserved by heather

It’s not your traditional non-proﬁt... Lots of the staff are technologists and developers.

archive.org

We have many computers. They store over
- 100,000 hours of TV from channels all over the world
- 250,000 moving images or video
- 500,000 audio recordings
- 2.5 million scanned texts
- 150,000,000,000 web pages

By rkumar

Just the other day we had 2.88 petabytes of hard drives delivered. That’s enough storage for
about 2 billion books.


Another major part of what we do is scanning books. This is a picture of one of the scanning
centers in San Francisco. We currently employ about 200 staff scanning books.


And today, we have millions of free texts available online, including over 1 million books:
- 150 million pages scanned
- 1,000 books scanned EVERY day
- 24 scanning centers in 5 countries, and we hope for more.


We’re also scanning microfilm, which is much faster than scanning individual books. Here’s an example: records of the population census from
1790 to 1930, scanned from microfilm from the collections of the Allen County Public Library and originally from the United States
National Archives and Records Administration.


Examples of Cross Writing from Boston Public Library


Over 1 million free books that you can read on archive.org today, and access through the
Open Library site, by checking the little “Only eBooks” box as you search.


As well as being able to download these books in a variety of different formats, from PDF to
TXT and more, we also have a web-based book reader, which you can use to read our
scanned texts within your web browser, without the need for any additional software. At the
end of 2010, we released a new version of our open source, browser-based BookReader.

I’ve actually come to Wellington direct from a meeting in San Francisco called Books in
Browsers, held at the Internet Archive last week. It was there that we announced an upcoming
new release of our BookReader, which will hopefully go live in the next few weeks... Here are
some screenshots...


The main reason we wanted to improve on the current design was to try to build an
“app-level quality” book reading experience right in the browser. This included several
improvements for touch interfaces in browsers on devices like the iPad.

From a straightforward design perspective, there were also improvements to be made on
usability and simple stuff like making the book bigger in the browser window.


This is a screenshot with the toolbar open, where you can see new features like a navigation
bar at the bottom that allows you to scroll through the book, a “read to me” feature which
plays the book in a computer-y voice, and highlights what’s being read. Also, if we know a
table of contents for the book, each chapter is mapped along the navigation bar.

We’ve also rewritten the full text search engine, and I’ll talk more about that a bit later.

By rkumar

Apologies for the slightly blurry picture, but this is my boss, Brewster Kahle, who founded the
Internet Archive back in 1996. He’s playing with a touchscreen which is displaying the new
bookreader. The screen’s been installed in one of the reading desks that used to sit in the
reading room of the Christian Science church before it became our new home. A big part of
the bookreader redesign was to evolve an app-level quality book reading experience within a
web browser. If you have an iPad, I’d encourage you to try it!


The Open Library project was launched back in 2007. In May 2010, we launched a total site
redesign. Just last week, we released a revised home page, building on our new Lending
program, and generally trying to do a better job of communicating that you can come to
Open Library to ﬁnd something to read for free, or a book to borrow. We also added activity
graphs to try to show that there’s stuff happening, all day, every day.

A “Wikipedia for Books”


There are a few different ways to describe what Open Library is, but I think the explanation
that makes the most sense is “a Wikipedia for Books”.


Scrolling down the home page...


We have a lending library of some 10,000 20th Century books. You can also access another
80,000 books if you’re (literally) sitting in one of the 150 or so libraries participating in our
“In-Library Lending” program. Each participating library contributes eBooks into the in-library
pool, and you can borrow anything in the pool, once you’re sitting in one of the libraries.


Yay! Graphs going up! (That peak you can see across the graphs is our lending launch. For
more info, read “Get Thee to a Library!”
http://blog.openlibrary.org/2011/02/22/get-thee-to-a-library/)


Snapshot of the various combinations of links we can provide to get you to books... For books
we can’t lend through our own lending program, we’ve connected to Overdrive... We’re
hoping to make the vendors you can buy from more dynamic, and open up the sources for
online free texts. Right now, it’s just the Internet Archive texts that we link to in full.

lending ebooks

• map / OpenStreetMap


You can browse a map of (mainly North American) libraries participating in the In-Library
Lending program. If you’re interested in joining, please contact us!

borrow page

• screen


Here’s what a page looks like to borrow a book. You can see 3 options: In Browser, PDF, and
ePub.

In-browser is available immediately. You need to download/install Adobe Digital Editions to
read PDF or ePub versions.

Developer
Resources

Open Library
http://openlibrary.org/developers


Python, Postgres, SOLR, JSON, REST

http://github.com/openlibrary

We certainly have our code online at github, but we rarely receive patches. I’m OK with this,
at least for now.

JSON/RDF
http://openlibrary.org/developers



JSON blob

• http://openlibrary.org/works/OL69181W/
• http://openlibrary.org/works/OL69181W.json
• http://openlibrary.org/works/OL69181W.rdf


HTML, JSON, RDF
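To make the pattern in those three links concrete, here is a minimal sketch in Python. It only builds the URLs shown on the slide; the function and constant names are made up for illustration and are not part of any Open Library client.

```python
BASE_URL = "http://openlibrary.org"

# Suffix added to a record's path for each representation named on the slide.
# These three suffixes are taken from the example links; nothing else is assumed.
FORMAT_SUFFIXES = {"html": "/", "json": ".json", "rdf": ".rdf"}


def representation_url(key: str, fmt: str = "html") -> str:
    """Return the URL for one representation of a record such as /works/OL69181W."""
    fmt = fmt.lower()
    if fmt not in FORMAT_SUFFIXES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}/{key.strip('/')}{FORMAT_SUFFIXES[fmt]}"


for fmt in FORMAT_SUFFIXES:
    print(representation_url("/works/OL69181W", fmt))
```

The same record key gives you a page for people and a blob for machines, which is the whole point of the slide.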

Data Dumps
http://archive.org/details/ol_data


archive.org/details/ol_data

There’s a copy of everything we’re using on the Internet Archive too.

API
http://openlibrary.org/developers/api


Open Library has a RESTful API, best used to link into Open Library data in JSON,
YAML and RDF/XML.

API

Books
Covers
Search inside
Subjects
Recent Changes
Lists



Request:


http://openlibrary.org/dev/docs/api/lists


We built lists for a couple of reasons: 1, to help people collect things together, and 2, to
make it easy to get at smaller sets of records.

Covers



Where:
• key can be any one of ISBN, OCLC, LCCN, OLID and ID (case-insensitive)
• value is the value of the chosen key
• size can be one of S, M and L for small, medium and large respectively.
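The URL pattern itself did not survive on this slide, so the sketch below assumes the covers service lives at covers.openlibrary.org and uses a key/value-size.jpg path. Treat the host, path and function name as assumptions; only the key, value and size rules come from the slide.

```python
COVERS_BASE = "http://covers.openlibrary.org/b"  # assumed host and path prefix

VALID_KEYS = {"isbn", "oclc", "lccn", "olid", "id"}  # case-insensitive per the slide
VALID_SIZES = {"S", "M", "L"}  # small, medium, large


def cover_url(key: str, value: str, size: str = "M") -> str:
    """Build a cover image URL from an identifier type, its value and a size."""
    key = key.lower()
    size = size.upper()
    if key not in VALID_KEYS:
        raise ValueError(f"unknown key type: {key!r}")
    if size not in VALID_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return f"{COVERS_BASE}/{key}/{value}-{size}.jpg"


print(cover_url("ISBN", "0385472579", "S"))
```

Because the result is just an image URL, it can be dropped straight into an img tag without any API call at all.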

(we use this)



Yay!


DOUBLE
Yay!



One of quite a few examples of Open Library in the wild includes the National Library of
Australia’s new search engine, Trove.


You can see there that there are links to Open Library books wherever one can be sourced.

There are a growing number of sites making use of Open Library data... and that’s what we’re
all about - data in, data out. The more interconnections we can make with other systems, the
easier it will be for people to land where they want to go inside Open Library.


This is ImportBot. He gets new catalog records from the Library of Congress and puts them
into Open Library every Tuesday. We also import records from Amazon, and from the Internet
Archive. ImportBot looks for recently scanned books, and creates new records (or merges
them with existing ones) just a few minutes after the record is created on the Internet
Archive.


You can see ImportBot working away, just like you can see the Wiki’s edit history for every
person who edits something.


Another quick note on data in before I move on...

We’ve been experimenting with a couple of other “surgical” bots that look across the catalog
and connect edition records directly to other services by stamping identifiers from other
systems into Open Library. This is a bot written by a developer called Ben Gimpert. It takes
a file mapping ISBNs to Goodreads IDs, looks for ISBN matches in OL, and adds the
Goodreads ID to those records. This allows us to construct links to Goodreads, and to make
the Goodreads ID available through the API.
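The core of a bot like that is small. Here is a rough sketch of the matching step in Python, working on plain dictionaries in memory. This is not Ben’s code: the record layout (isbn_10, isbn_13, identifiers) and the Goodreads ID in the example are assumptions made up for illustration, and a real bot would also have to read and save records through the site.

```python
from typing import Dict, List


def stamp_goodreads_ids(editions: List[dict], isbn_to_goodreads: Dict[str, str]) -> int:
    """Add a Goodreads ID to each edition whose ISBN appears in the mapping.

    Editions are modified in place. Returns how many editions were updated.
    """
    updated = 0
    for edition in editions:
        if "goodreads" in edition.get("identifiers", {}):
            continue  # already stamped, so leave the existing value alone
        isbns = edition.get("isbn_10", []) + edition.get("isbn_13", [])
        match = next((isbn_to_goodreads[i] for i in isbns if i in isbn_to_goodreads), None)
        if match is not None:
            edition.setdefault("identifiers", {})["goodreads"] = [match]
            updated += 1
    return updated


# Hypothetical records and a made-up Goodreads ID, for illustration only.
editions = [
    {"title": "Example A", "isbn_10": ["0385472579"]},
    {"title": "Example B", "isbn_10": ["0000000000"]},
]
print(stamp_goodreads_ids(editions, {"0385472579": "12345"}))
```

Skipping records that already carry an ID keeps the bot safe to run more than once over the same catalog.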


You can see we’ve added a little widget on the page that connects to Goodreads. If you have
an account, you can add our records to your lists on Goodreads. There’s a LibraryThing
ID too, added by a similar batch bot update.

Writing bots to do things like this is the sort of development we’d like to open up to external
developers too...

BookReader
http://openlibrary.org/dev/docs/ia



The Library of Congress is using our Bookreader on read.gov. There are quite a few other
examples of the IA Bookreader out there on the web. Hopefully the redesign (with touch
interactions etc) will attract new people too...


Princeton Digital Library

Internet Archive
http://openlibrary.org/dev/docs/ia


http://archive.org/help

Raw Full Text
> 4 million documents
with metadata


Stanford NLP thing

http://nlp.stanford.edu/

We’ve just begun experimenting with some of the software made by the Stanford Natural
Language Processing Group, which includes members of both the Linguistics Department and
the Computer Science Department. One idea is to fold this software into the scanning
process, so we can do a first pass on entity extraction on the full text of a book, to extract things
like names, places and common subjects...


But then of course, you can do cool stuff like this :)

Challenges


http://ﬂic.kr/p/6zyU3U Tension?

The Taxonomy vs Folksonomy debate may be represented thusly.

1) Books are for use.
2) Every reader his [or her] book.
3) Every book its reader.
4) Save the time of the User.
5) The library is a growing organism.


So, on the basis of the idea of our current catalog being a substrate, as Ranganathan
suggests in his ﬁve laws of library science...


So... Open Library is a virtual space. Its organization isn’t constrained like a physical catalog.
In fact, the more connections you can make into one of our “virtual index cards” the more
ways people have to discover and navigate its contents.

http://www.ﬂickr.com/photos/brixton/1394845916/

http://ﬂic.kr/p/6pmtQL

But, librarians are (very clever) humans too. And everyone who’s responsible for putting
books into a traditional catalogue must work within patterns. Patterns that have grown
semantically remarkable and deeply complex.

http://openlibrary.org/search?author=author

Unknown author 403
Unknown Author 358
Author unknown 254
No Author 145
Author Unknown 59
No Author. 54
Author 20
No author. 16
No author 12
unknown author 8
Unknown Author Unknown 7
no author 7
No Author Stated 7
(No Author) 6
No author noted 5
No author noted. 4
no author listed 4
(no author) 4
Author Not Stated 4
Author. 4
No author speciﬁed 3
Miscellaneous Author 3
no Author 3
Author One 3
Multi-Author 3
No Author Listed 3
No Stated Author 3
Author Anonymous 2
(no author given) 2
Author 2
Author Wright 2
Unkown Author 2
No author stated 2
Mms suspense author 2
Author Test 2
TEST AUTHOR 2


Duplicate authors (and editions) are an issue... This is an example search for author records
with “author” in their names... you can see the variety of ways that catalogers have noted
unknown authors...

http://www.ﬂickr.com/photos/blackbeltjones/4294354526/

We’ve noticed a TON of minor variations in the way cataloguers enter data... Trivial to us, but
very hard for computers to differentiate
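Some of that variation is cheap to fold together. As a small sketch (not what Open Library actually runs), the Python below lowercases each name, strips punctuation and collapses spaces, then adds up the record counts for names that end up identical. The sample pairs are copied from the search results above.

```python
import re
from collections import Counter

# A few (name, record count) pairs copied from the author search results.
SAMPLE = [
    ("Unknown author", 403),
    ("Unknown Author", 358),
    ("Author unknown", 254),
    ("No Author", 145),
    ("Author Unknown", 59),
    ("No Author.", 54),
    ("unknown author", 8),
    ("(No Author)", 6),
]


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def group_variants(pairs):
    """Sum record counts for names that normalize to the same string."""
    totals = Counter()
    for name, count in pairs:
        totals[normalize(name)] += count
    return totals


for name, total in group_variants(SAMPLE).most_common():
    print(name, total)
```

Eight spellings collapse to three here, but “unknown author” and “author unknown” still stay apart, and nothing in this approach knows that “no author” means the same thing. That last step is the part that is easy for people and hard for computers.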

Substrate:
any surface on which a plant or animal lives or
on which a material sticks

Some rights reserved by Brynja Eldon

We have a repository that mostly contains records created by professionals. I ﬁnd it useful to
consider these records as a substrate, something that can be reacted upon.

What if we consider the source
Open Library records like that?

Some rights reserved by Brynja Eldon

Now that we’ve begun to reveal this substrate, how will people react to it? What reactions has
it caused so far?


Handwritten scribbles and scrawls; annotations; corrections

Some rights reserved by jared

What if a catalog looks like this? Is crystalline? What if it is unconstrained by the need to sort,
say, alphabetically?

From the artist of this image, Jared Tarbell: “Lines like crystals form at perpendicular angles
to existing lines. A complex form emerges.
1000 classic computational substrate, color palette stolen from Jackson Pollock: A simple
perpendicular growth rule creates intricate city-like structures. The simple rule, the complex
results, the enormous potential for modiﬁcation; this has got to be one of my all time favorite
self-discovered algorithms. Lines likes crystals grow on a computational substrate.”


What happens when you introduce turbulence into the catalog? Here are a few examples of
the sorts of edits we’re seeing... at a rate of about 100,000 edits per month.

http://www.ﬂickr.com/photos/rreis/4859722551/sizes/l/

000s of edits per month



if you don’t stimulate an organism, it atrophies


Activity/History


One of the key components to any happy social system is the visibility of other people, and a
sense of activity. This is one of the key elements we’re focussed on in the redesign. This
particular list shows all edits by humans on Open Library, and actually, turns out to be a
handy way to spot check what’s happening. You’ll notice too, there’s a special tab for the
variety of edits that we run across the system using bots. Often pretty mechanical and
repetitive, we found that the bots obscure the humans if you just mush everything up in a big
list, so we separated them.
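The separation itself is simple once each change is flagged. Here is a minimal sketch in Python, assuming each change record carries a boolean "bot" field; that field name and the sample records are hypothetical and only stand in for whatever the real change log stores.

```python
from typing import Iterable, List, Tuple


def split_by_bot(changes: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Split a stream of change records into (human edits, bot edits).

    A record without a "bot" field is treated as a human edit.
    """
    humans: List[dict] = []
    bots: List[dict] = []
    for change in changes:
        (bots if change.get("bot", False) else humans).append(change)
    return humans, bots


# Made-up change records for illustration.
changes = [
    {"author": "ImportBot", "bot": True, "comment": "import new book"},
    {"author": "a_reader", "bot": False, "comment": "fixed a typo in the title"},
    {"author": "ImportBot", "bot": True, "comment": "import new book"},
]
humans, bots = split_by_bot(changes)
print(len(humans), "human edits,", len(bots), "bot edits")
```

Order within each list is preserved, so both tabs can still be shown newest first.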

Activity/History
Live Data



Solutions?


Shelf

http://www.ﬂickr.com/photos/emdot/400280705/

I really like how Raymond described his book yesterday, that as soon as he’d written it, it
began to decay... Concrete, decay

Network

http://www.ﬂickr.com/photos/arenamontanus/352130655/

Plastic, self-healing

Minimum
Viable Record


Now, I want to try a little exercise. I’m going to hand out an index card to all of you, and ask
you to nominate 5 fields that you think are enough to describe a book. I’ll collate the results
and report back later.


Stamen Design in SF. Got funding from Knight Foundation to build Citytracking. Challenge is a “hodgepodge of
bits—including APIs [2] and official sources, scraped websites, sometimes-reusable data formats and datasets,
visualizations, embeddable widgets etc.—is fractured, overly technical and obscure, held in the knowledge base of
a relatively small number of people, and requires considerable expertise to harness.”

Open Publication Distribution System (OPDS)
http://bookserver.archive.org/catalog/new


This is an example of trying something very bare bones, to try to help systems
intercommunicate more easily. (Open Library plans to publish OPDS feeds soon.)
The Open Publication Distribution System (OPDS) Catalog specification is a syndication
format for electronic publications based on Atom (RFC 4287) and HTTP (RFC 2616).

American notes for general circulation [microform]
February 25, 2011 10:22 AM
Author: Dickens, Charles, 1812-1870
Publisher: New York : Harper
Year published: 1842
Book contributor: Canadiana.org
Language: en
Download Ebook: (PDF) (EPUB)
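Because OPDS is built on Atom, a record like this one can be read with any XML parser. The sketch below uses Python’s standard library on a hand-written entry. The XML is an approximation for illustration, not the actual output of the BookServer feed: the download links point at example.org and the timezone on the timestamp is a guess.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ACQUISITION = "http://opds-spec.org/acquisition"  # OPDS link relation for downloads

# Hand-written sample entry; hrefs and timezone are placeholders.
SAMPLE_ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>American notes for general circulation [microform]</title>
  <author><name>Dickens, Charles, 1812-1870</name></author>
  <updated>2011-02-25T10:22:00Z</updated>
  <link rel="http://opds-spec.org/acquisition" type="application/pdf"
        href="http://example.org/american-notes.pdf"/>
  <link rel="http://opds-spec.org/acquisition" type="application/epub+zip"
        href="http://example.org/american-notes.epub"/>
</entry>
"""


def parse_entry(xml_text: str) -> dict:
    """Pull the title, author, timestamp and download links out of one Atom entry."""
    entry = ET.fromstring(xml_text)
    downloads = {
        link.get("type"): link.get("href")
        for link in entry.findall(f"{ATOM}link")
        if link.get("rel") == ACQUISITION
    }
    return {
        "title": entry.findtext(f"{ATOM}title"),
        "author": entry.findtext(f"{ATOM}author/{ATOM}name"),
        "updated": entry.findtext(f"{ATOM}updated"),
        "downloads": downloads,
    }


print(parse_entry(SAMPLE_ENTRY))
```

A title, an author, a date and a couple of typed links: that is about as bare bones as a shareable record gets.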



Individuals can also add new books with a few details like Title, Author, Publisher and Publish
Date. That’s enough for a stub, and then people are invited to add more details.

Canonical ID?


Canonical ID?
Collect them.



Another experiment we’re looking forward to trying is about identifiers. We’re not particularly
concerned about canonical identifiers. Perhaps it’s a waste of time to wait for one, so instead,
we’re going to try and attach as many ID types to our records as we can. (This list is just a
braindump - not active yet.) The idea is that people could add a URL or actual identifier and
Open Library would just do the right thing. A suggestion (after this presentation was
delivered) was that people could ping Open Library with an identifier, not even knowing what
TYPE of ID it is. Perhaps Open Library could help “triangulate” this query towards a book
record. “Record laundering.”
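To give a feel for what “triangulating” might involve, here is a speculative sketch in Python that guesses which ID types a bare string could be. Nothing like this is running on Open Library; the patterns are deliberately rough assumptions, and real identifier schemes have more forms than these.

```python
import re
from typing import List

# Rough, illustrative patterns only. Order here is the order of the results.
PATTERNS = {
    "olid": re.compile(r"OL\d+[AMW]", re.IGNORECASE),  # e.g. OL7440033M
    "isbn": re.compile(r"\d{9}[\dXx]|\d{13}"),         # ISBN-10 or ISBN-13, no checksum test
    "lccn": re.compile(r"\d{8}|\d{10}"),               # digits-only LCCN forms
    "oclc": re.compile(r"\d+"),                        # any run of digits
}


def candidate_id_types(value: str) -> List[str]:
    """Return every ID type whose pattern matches the whole (de-hyphenated) value."""
    cleaned = value.replace("-", "").strip()
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(cleaned)]


for value in ["OL7440033M", "0385472579", "93005405", "28419896"]:
    print(value, candidate_id_types(value))
```

Notice that the LCCN and OCLC examples come back with the same two candidates. Shape alone cannot tell them apart, so a real service would have to try each candidate against the catalog and see which ones land on a record.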

Canonical ID?
Exchange them.


http://openlibrary.org/books/olid/OL7440033M
http://openlibrary.org/books/isbn/0385472579
http://openlibrary.org/books/lccn/93005405
http://openlibrary.org/books/oclc/28419896
http://openlibrary.org/books/id/240727
http://openlibrary.org/books/amazon/...
http://openlibrary.org/books/bookmooch/...
http://openlibrary.org/books/goodreads/...
http://openlibrary.org/books/ocaid/...
http://openlibrary.org/books/librarything/...
http://openlibrary.org/books/paperback_swap/...
http://openlibrary.org/books/Your ID Here/...


You can already ping Open Library with an ID other than the Open Library identiﬁer to see if
we have any matches.

http://openlibrary.org/books/olid/OL7440033M
http://openlibrary.org/books/lccn/93005405
http://openlibrary.org/books/oclc/28419896
http://openlibrary.org/books/id/240727
http://openlibrary.org/books/amazon/...
http://openlibrary.org/books/bookmooch/...
http://openlibrary.org/books/goodreads/...
http://openlibrary.org/books/librarything/...
http://openlibrary.org/books/ocaid/...
http://openlibrary.org/books/paperback_swap/...
http://openlibrary.org/books/Your ID Here/...


Your ID


Your ID

Everyone else’s


Make nodes,
not cards
Some rights reserved by
yobink

Network,
not sequence


Thanks!
George Oates
glo@archive.org
@openlibrary

