1. London HUG
Common Crawl: An Open Repository of Web Data
What Does The Data World Mean to Society?
Lisa Green
1 October 2012
10 October 2012
2. Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
5. Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data
Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
10. Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
11. ARC Files - Raw Content
Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
Text Files - Text Only
http://commoncrawl.org/get-started
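To make the raw-content format above concrete, here is a minimal sketch of reading records from an ARC file. It assumes the ARC v1 layout, in which each record starts with a one-line header (URL, IP address, archive date, content type, length) followed by that many bytes of content. The names `ArcRecord` and `iter_arc_records` are hypothetical and are not part of any Common Crawl library, and the sketch expects an already decompressed stream.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple


class ArcRecord(NamedTuple):
    """One ARC v1 record (hypothetical helper type for this sketch)."""
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    length: int
    payload: bytes


def iter_arc_records(stream: BinaryIO) -> Iterator[ArcRecord]:
    """Yield records from an uncompressed ARC v1 byte stream.

    Assumption: every record header is a single line of five
    space-separated fields, with the payload length last.
    """
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        line = header.decode("utf-8", "replace").strip()
        if not line:
            continue  # blank separator line between records
        # Split from the right so a URL containing spaces stays intact.
        url, ip_address, archive_date, content_type, length = line.rsplit(" ", 4)
        size = int(length)
        payload = stream.read(size)
        yield ArcRecord(url, ip_address, archive_date, content_type, size, payload)


if __name__ == "__main__":
    body = b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>"
    sample = (
        b"http://example.com/ 192.0.2.1 20120101000000 text/html "
        + str(len(body)).encode()
        + b"\n"
        + body
        + b"\n"
    )
    for record in iter_arc_records(io.BytesIO(sample)):
        print(record.url, record.content_type, record.length)
```

For a gzip-compressed file, wrapping the file object with `gzip.open(path, "rb")` before passing it in should work, since Python's `gzip` module reads concatenated members as one stream.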
13. Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
http://webdatacommons.org
14. • 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
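Figures like the two above can be computed by scanning each page's HTML for the relevant markup. The following is a rough sketch, not the method used to produce the numbers on this slide. It assumes that "implements Open Graph" means at least one `<meta>` tag whose `property` starts with `og:`, and that "contains Facebook URLs" means any `href` or `src` attribute mentioning `facebook.com`. The names `SocialMarkupDetector` and `share_of_pages` are hypothetical.

```python
from html.parser import HTMLParser


class SocialMarkupDetector(HTMLParser):
    """Flags Open Graph meta tags and Facebook URLs in one HTML page."""

    def __init__(self):
        super().__init__()
        self.has_open_graph = False
        self.has_facebook_url = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Open Graph tags look like <meta property="og:title" content="...">
        if tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.has_open_graph = True
        # Assumption: any link or embedded resource pointing at facebook.com counts.
        for name in ("href", "src"):
            if "facebook.com" in (attr.get(name) or ""):
                self.has_facebook_url = True


def share_of_pages(pages):
    """Return the percentage of pages with each kind of markup."""
    totals = {"facebook_url": 0, "open_graph": 0}
    for html in pages:
        detector = SocialMarkupDetector()
        detector.feed(html)
        totals["facebook_url"] += detector.has_facebook_url
        totals["open_graph"] += detector.has_open_graph
    count = len(pages)
    if count == 0:
        return {key: 0.0 for key in totals}
    return {key: 100.0 * value / count for key, value in totals.items()}


if __name__ == "__main__":
    sample_pages = [
        '<html><head><meta property="og:title" content="Hi"></head></html>',
        '<html><body><a href="http://www.facebook.com/example">FB</a></body></html>',
        "<html><body><p>No social markup here.</p></body></html>",
    ]
    print(share_of_pages(sample_pages))
```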
15. http://wikientities.appspot.com
A corpus of anchortext-WikipediaConcept-Count
from the CommonCrawl dataset, to benefit
research on WSD, NLP and IR.
Explicit Topic Modeling: Given a concept (represented as a Wikipedia
page), it can tell what are the most common
terms people use to describe the concept.
Given a sentence, it can help identify entities
(person, location, organization) in the sentence
and map them onto Wikipedia concepts.
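An anchortext-WikipediaConcept-Count corpus of this kind could be built roughly as sketched below. This is an illustration under stated assumptions, not the wikientities implementation: it assumes the input is a list of (anchor text, link target) pairs such as those in the crawl metadata, treats the part of a Wikipedia URL after `/wiki/` as the concept, and lowercases anchor text. The function names are hypothetical.

```python
from collections import Counter
from urllib.parse import unquote, urlparse


def wikipedia_concept(url):
    """Return the article name for a Wikipedia URL, or None for other URLs."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        return None
    if not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):]) or None


def count_anchor_concepts(links):
    """Count (anchor text, concept) pairs over an iterable of (text, href)."""
    counts = Counter()
    for anchor_text, href in links:
        concept = wikipedia_concept(href)
        # Normalise whitespace and case so "NYC" and "nyc" count together.
        text = " ".join(anchor_text.split()).lower()
        if concept and text:
            counts[(text, concept)] += 1
    return counts


def top_terms(counts, concept, n=3):
    """Most common anchor texts people use for one concept."""
    per_concept = Counter(
        {text: c for (text, name), c in counts.items() if name == concept}
    )
    return per_concept.most_common(n)


if __name__ == "__main__":
    links = [
        ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
        ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
        ("nyc", "http://en.wikipedia.org/wiki/New_York_City"),
        ("Python", "http://example.com/python"),
    ]
    print(top_terms(count_anchor_concepts(links), "New_York_City"))
```

Looking up the counts by concept gives the "most common terms" view; looking them up by anchor text gives candidate concepts for a phrase found in a sentence.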
17. Other Use Examples
• Apache Giraph Testing
• Maplight
• Tineye
• Factual
• Sentiment Analysis Projects
18. In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources
19. Thank You
Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl
@boudicca