This presentation covers Common Crawl, a project that crawls the web and maintains an open archive of the results. The archive holds about 8 billion web pages, roughly 120 TB of data collected between 2008 and 2012, and is freely available to anyone. The data comes in three forms: ARC files with raw content, JSON metadata, and text-only files. The slides list current uses of the data, including Apache Giraph testing, MapLight, TinEye image search, Factual, and sentiment analysis projects, and close with work in development such as n-gram and link graph extracts and more frequent crawls.
1. London HUG
Common Crawl: An Open Repository of Web Data
What Does The Data World Mean to Society?
Lisa Green
1 October 2012
10 October 2012
2. Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
5. Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data
Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
10. Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
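To put the two headline figures side by side, here is a back-of-the-envelope average page size. It is only a sketch from the slide's rounded numbers: it assumes decimal units, and the slide does not say whether the ~120 TB figure is compressed or uncompressed.

```python
# Rough average size per page, using the slide's rounded figures.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 KB = 10**3 bytes).
total_bytes = 120 * 10**12   # ~120 TB
total_pages = 8 * 10**9      # ~8 billion pages

avg_bytes_per_page = total_bytes / total_pages
avg_kb_per_page = avg_bytes_per_page / 10**3

print(f"~{avg_kb_per_page:.0f} KB per page on average")
```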
11. ARC Files - Raw Content
Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
Text Files - Text Only
http://commoncrawl.org/get-started
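As an illustration of what "raw content" in an ARC file looks like, the sketch below parses one record header line. It assumes the ARC version 1 layout published by the Internet Archive (URL, IP address, archive date, content type, length, separated by single spaces); the slides do not state which ARC version is used. The names `ArcHeader` and `parse_arc_header` and the sample line are hypothetical.

```python
from typing import NamedTuple


class ArcHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str   # assumed format: YYYYMMDDhhmmss
    content_type: str
    length: int         # number of content bytes that follow the header line


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one ARC v1 record header line (assumed five-field layout)."""
    # Split from the right so that the last four fields are always isolated.
    # This assumes the content type itself contains no spaces.
    url, ip, date, ctype, length = line.rstrip("\r\n").rsplit(" ", 4)
    return ArcHeader(url, ip, date, ctype, int(length))


# Hypothetical header line, not taken from a real crawl.
sample = "http://example.com/ 93.184.216.34 20120301123456 text/html 1024\n"
header = parse_arc_header(sample)
print(header.url, header.content_type, header.length)
```

The `length` field is what lets a reader skip or extract the record body, which is also why the metadata can point at a record by file name and offset.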
12.
13. Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
http://webdatacommons.org
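To make the Microdata and RDFa categories concrete, here is a minimal sketch that flags a page as containing either kind of markup by looking for characteristic attributes. It is not the Web Data Commons extraction method; the attribute sets are a simplified assumption, and `MarkupDetector` and `detect` are hypothetical names.

```python
from html.parser import HTMLParser

# Simplified attribute sets (an assumption, not a full implementation
# of either specification).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "vocab"}


class MarkupDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_microdata = False
        self.has_rdfa = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names for us.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.has_microdata = True
        if names & RDFA_ATTRS:
            self.has_rdfa = True


def detect(html):
    """Return (has_microdata, has_rdfa) for one HTML document."""
    detector = MarkupDetector()
    detector.feed(html)
    return detector.has_microdata, detector.has_rdfa


microdata_page = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Ada</span></div>'
)
rdfa_page = '<p property="dc:title">Hello</p>'

print(detect(microdata_page))
print(detect(rdfa_page))
```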
14. • 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
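Shares like these can be computed by scanning each page once and counting. The sketch below shows the idea on a tiny made-up sample; how the original analysis defined "contains a Facebook URL" is not stated in the slides, so the rules here (an `href` or `src` pointing at a facebook.com host, and a `meta` tag whose `property` starts with `og:`) are assumptions. `PageScan` and `share` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageScan(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_facebook_url = False
        self.has_open_graph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Open Graph tags look like <meta property="og:title" content="...">.
        if tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.has_open_graph = True
        # Check link-like attributes for a facebook.com host.
        for key in ("href", "src"):
            host = urlparse(attrs.get(key) or "").hostname or ""
            if host == "facebook.com" or host.endswith(".facebook.com"):
                self.has_facebook_url = True


def share(pages):
    """Return (share with Facebook URLs, share with Open Graph tags).

    Assumes `pages` is a non-empty list of HTML strings.
    """
    facebook = open_graph = 0
    for html in pages:
        scan = PageScan()
        scan.feed(html)
        facebook += scan.has_facebook_url
        open_graph += scan.has_open_graph
    return facebook / len(pages), open_graph / len(pages)


# Made-up sample pages for illustration only.
pages = [
    '<a href="http://www.facebook.com/example">Like us</a>',
    '<meta property="og:title" content="Hello"><a href="/about">About</a>',
    "<p>No social markup here</p>",
    "<p>Plain page</p>",
]
print(share(pages))
```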
15. http://wikientities.appspot.com
A corpus of anchortext-WikipediaConcept-Count
from the CommonCrawl dataset, to benefit
research on WSD, NLP and IR.
Explicit Topic Modeling: Given a sentence, it can help identify entities
(person, location, organization) in the sentence and map them onto
Wikipedia concepts.
Given a concept (represented as a Wikipedia page), it can tell what are the
most common terms people use to describe the concept.
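The two lookups described on this slide can be sketched with a pair of counters over (anchor text, Wikipedia concept, count) rows. The rows below are made-up values for illustration, and the row layout, `concepts_for`, and `terms_for` are assumptions rather than the corpus's actual format or API.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (anchor text, Wikipedia concept, count). Values are made up.
rows = [
    ("big apple", "New_York_City", 40),
    ("nyc", "New_York_City", 120),
    ("new york", "New_York_City", 300),
    ("new york", "New_York_(state)", 90),
]

by_anchor = defaultdict(Counter)    # anchor text -> concept counts
by_concept = defaultdict(Counter)   # concept -> anchor text counts
for anchor, concept, count in rows:
    by_anchor[anchor][concept] += count
    by_concept[concept][anchor] += count


def concepts_for(anchor, k=3):
    """Most likely Wikipedia concepts for a piece of anchor text."""
    return by_anchor[anchor.lower()].most_common(k)


def terms_for(concept, k=3):
    """Most common anchor texts people use for a concept."""
    return by_concept[concept].most_common(k)


print(concepts_for("New York"))
print(terms_for("New_York_City"))
```

The first lookup is the disambiguation direction (text to concept); the second is the reverse direction (concept to the terms used for it).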
17. Other Use Examples
• Apache Giraph Testing
• MapLight
• TinEye
• Factual
• Sentiment Analysis Projects
18. In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources
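For readers unfamiliar with the first item, here is a toy sketch of what n-gram counts and a link graph edge list are. It says nothing about how the planned extracts are actually built; the whitespace tokenization, the input shapes, and the names `ngrams` and `link_graph` are assumptions.

```python
from collections import Counter


def ngrams(text, n=2):
    """Count n-grams over whitespace-separated, lowercased tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def link_graph(pages):
    """Turn {source_url: [target_url, ...]} into a sorted, de-duplicated edge list."""
    return sorted({(src, dst) for src, links in pages.items() for dst in links})


bigrams = ngrams("the web is the web")
print(bigrams.most_common(1))

# Hypothetical outgoing links per page.
edges = link_graph({
    "a.example": ["b.example", "c.example", "b.example"],
    "b.example": ["a.example"],
})
print(edges)
```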
19. Thank You
Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl
@boudicca