A presentation from Museums and the Web 2009:
A prototype system that aggregates data from museum and related Web sites, including object and event records, was rapidly developed. By screen-scraping the existing pages of 17 Web sites, tens of thousands of data records were collected without any technical agreement, investment or consent from the participating institutions. In this paper, we examine the reasons for and benefits of aggregating this type of data, how our approach differs from other funded projects that have similar aspirations, and the relative strengths and weaknesses of each. An analysis of the data is presented, showing how the aggregate data set varies by assorted parameters, including location and date. We relate our work to the bigger picture of on-line data publishing, such as Semantic Web technologies, and offer some suggestions as to how the grand vision of the Semantic Web may be achievable without the complexity.
Session: Technology Strategies [Technology]
Dan Zambonini and Mike Ellis, hoard.it: Aggregating, displaying and mining object-data without consent
1. hoard.it: Stealing your data
Or... “Where is your online value?”
Or... “Originality sucks”
Dan Zambonini
www.boxuk.com
Museums and the Web 2009, Indianapolis, April 16
8. Cross-Collections Projects
“Search through the cultural collections of Europe”
“explore and comment on collections”
“find and explore digital collections from museums”
“Discover cultural objects, collections”
9. Why is this a Problem?
1. Some duplication of effort
• £25,000 - £100,000 to put collections online
• £1,500 - £6,500 per cross-collection project
2. Potential end-user confusion
3. Usually only include larger institutions
4. Is there really a need?
10. Our Approach
• Use data that already exists
• No cost/duplication of effort
• No input or changes from museums
• Lightweight, open to all
• Re-expose the data programmatically
• Enable easy re-use
14. Difficulties and Limitations
• Must have collections online
• Must have a consistent template
• Slow; not real-time
• Technical variations (encoding, standards)
• Rudimentary: Flash/Forms a barrier
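The "consistent template" requirement can be illustrated with a minimal sketch. This is not the hoard.it code: the class names in `TEMPLATE`, the field names and the `scrape` helper are all hypothetical, and the only assumption is that every object page on a site marks each field with the same HTML class.

```python
from html.parser import HTMLParser

# Hypothetical per-site template: HTML class -> field name in the record.
TEMPLATE = {"object-title": "title", "object-date": "created_date"}

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TemplateScraper(HTMLParser):
    """Collects the text of elements whose class is listed in the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.record = {}
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.template:
                self._field = self.template[cls]
                self._depth = 1
                self.record.setdefault(self._field, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] += data


def scrape(html, template=TEMPLATE):
    """Return one record (field -> whitespace-normalized text) for one page."""
    parser = TemplateScraper(template)
    parser.feed(html)
    parser.close()
    return {k: " ".join(v.split()) for k, v in parser.record.items()}
```

A page that delivers its content through Flash or only after a form submission gives a parser like this nothing to match, which is the barrier noted in the last bullet.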
15. Difficulties: Normalization
• Dates
• circa 19th century, 1960s, 2008-01, 1Jan ’52, 2000 BC, 30s, April 4 1934,
04-76, 1783-25-04, 10-11-64, about 200 AD, Victorian, 1100-1150, ...
• http://feeds.boxuk.com/convert/date/
• Location
• Points of interest, cities, towns, countries, administrative regions, political
regions, ancient names, continents, postal codes, co-ordinates, ...
• http://developer.yahoo.com/geo/
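To give a feel for the date problem, here is a minimal sketch that maps a few of the formats listed above to a year range. It is not the conversion service linked above; the `year_range` function, its rules and its conventions (negative years for BC, the 19th century taken as 1801 to 1900) are assumptions made for illustration. Anything it cannot interpret safely, such as "04-76" or "Victorian", is returned as `None` rather than guessed.

```python
import re


def year_range(text):
    """Return (start_year, end_year) for a free-text date, or None if unclear.

    BC years are returned as negative numbers (an assumed convention).
    """
    s = text.strip().lower()
    # Drop approximation words; a fuller version would record them as a flag.
    s = re.sub(r"^(circa|about|c\.)\s*", "", s)

    # "19th century" -> (1801, 1900)
    m = re.fullmatch(r"(\d{1,2})(st|nd|rd|th) century", s)
    if m:
        n = int(m.group(1))
        return ((n - 1) * 100 + 1, n * 100)

    # "1960s" -> (1960, 1969); a bare "30s" has no century, so it is skipped
    m = re.fullmatch(r"(\d{3}0)s", s)
    if m:
        start = int(m.group(1))
        return (start, start + 9)

    # "1100-1150" -> (1100, 1150)
    m = re.fullmatch(r"(\d{3,4})\s*-\s*(\d{3,4})", s)
    if m:
        return (int(m.group(1)), int(m.group(2)))

    # "2000 bc" -> (-2000, -2000); "200 ad" -> (200, 200)
    m = re.fullmatch(r"(\d{1,4}) (bc|ad)", s)
    if m:
        year = int(m.group(1))
        return (-year, -year) if m.group(2) == "bc" else (year, year)

    # Exactly one four-digit year anywhere: "April 4 1934", "2008-01"
    years = re.findall(r"\b(\d{4})\b", s)
    if len(years) == 1:
        return (int(years[0]), int(years[0]))

    return None  # ambiguous or unsupported, e.g. "04-76", "Victorian"
```

Returning a range instead of a single date keeps values such as "1960s" and "1100-1150" comparable with exact years when the aggregate set is filtered or sorted by date.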
16. The Data
[Bar chart: records per source; axis from 0 to 16,000. Sources shown:]
• Virtual Museum of Canada
• Carnegie Museum of Art
• Smithsonian NASM
• National Museum of Australia
• National Portrait Gallery
• Imperial War Museum
• National Museums of Scotland
• Ingenious
• Museum of London: E20CL
• British Museum
• Victoria and Albert Museum
• National Maritime Museum
• Powerhouse
• Science Museum
• 24 Hour Museum
• Freebase: Events
• Wikipedia: List of Painters
17. The Data
[Bar chart repeated from slide 16, annotated with the total:]
70,000 objects
18. The Data
• URL 100%
• Identifier 95%
• Title 100%
• Description 70%
• Image 85%
• Creator 50%
• Created Date 75%
• Copyright 50%
• Dimensions 45%
• Subject 65%
• Location 45%
• Materials 65%
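Assuming the percentages above are the share of collected records in which each field was found, they could be computed along these lines. The `field_coverage` function and the lower-case field names in the usage example are hypothetical; this is a sketch of the calculation, not the analysis code behind the slide.

```python
def field_coverage(records, fields):
    """Return {field: percentage of records with a non-empty value}."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    coverage = {}
    for field in fields:
        # A field counts as present only if it holds non-blank text.
        present = sum(1 for r in records if str(r.get(field) or "").strip())
        coverage[field] = 100.0 * present / total
    return coverage
```

For example, `field_coverage(records, ["title", "creator"])` on four records, all with a title and two with a creator, gives 100.0 for `title` and 50.0 for `creator`.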
19. Data Mining - Location
65% Europe
15% Asia
14% North America
4% Oceania
Percentage of objects from the same continent as museum:
• North America: 85%
• Europe: 75%
• Oceania: 65%
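Once locations are normalized to continents, the same-continent figure is a simple grouped ratio. The sketch below assumes each record carries two hypothetical fields, `museum_continent` and `object_continent`; records missing either value are left out of the calculation.

```python
from collections import defaultdict


def same_continent_share(records):
    """Per museum continent, the percentage of objects originating there."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for record in records:
        home = record.get("museum_continent")
        origin = record.get("object_continent")
        if not home or not origin:
            continue  # location could not be resolved for this record
        totals[home] += 1
        if origin == home:
            matches[home] += 1
    return {c: 100.0 * matches[c] / totals[c] for c in totals}
```

Skipping unresolved records matters here, because only part of the data set has a usable location.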
27. What can you offer?
• Expertise
• Media
• The Physical Space
• Reputation and Trust
• Audience
• Voice, Exposure and Influence
28. What’s changed?
“...not all information should flow everywhere; only the
meaningful should be transmitted.
But in the network economy only signals in real time (or
close to it) are truly meaningful.
Examine the speed of knowledge in your system. How
can it be brought closer to real time? If this requires the
cooperation of subcontractors, distant partners, and far-
flung customers, so much the better.”
Kevin Kelly
http://www.kk.org/newrules/blog/2009/04/if-you-are-not-in-real-time-yo.php
34. For example
• Let your patrons collaborate
• Let your patrons run your space
• Give local communities a voice
• Provide advice and guidance
• Collect & distribute niche knowledge
• ...
• You know better than I do.
35. What has to change?
• A focus on proven user needs
• Re-usable services, not more data
• Smaller projects
• Iterative approaches
• A real commitment to the web platform
• (At least some) In-house development
36. How do we get there?
• Should web projects generate revenue?
• Don’t be afraid of re-inventing the wheel
• Demand all projects use/expose APIs that
are easy (REST not SOAP/OAI) and publicized
• Show early, show often
• Annoy funding bodies to support more,
smaller, longer (i.e. iterative) ‘boring’ projects,
and fewer ‘big, audacious’ projects.
37. Summary
• We stole your data...
• But then so can lots of other people...
• So produce value elsewhere.
• Ideas are harmful: do what’s proven...
• But do it brilliantly.
• And to do that, we need change.
38. Thank you
www.boxuk.com
dan@boxuk.com
twitter.com/zambonini