SlideShare una empresa de Scribd logo
1 de 19
London HUG
               Common Crawl :
               WhatRepository
              An Open
                      Does
             Theof Web Data
                  Data World
             Mean to Society?
                     Lisa Green
                   Lisa Green
                 1 October 2012
                 10 October 2012
Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
Still Nascent
                                                                    •      Even cheaper storage
                                                                    •      Even cheaper compute
                                                                    •      Education
                                                                    •      Open Data

Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
Gratis




Proprietary                Libre




              Commercial
Progress


Insight


Analysis


 Data
Gil Elbaz
Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
ARC Files - Raw Content
Metadata
•   Status information
•   HTTP response code
•   File names & offsets of ARC files
•   HTML title
•   HTML meta tags
•   RSS/Atom information
•   All anchors/hyperlinks

Text Files - Text Only

           http://commoncrawl.org/get-started
Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%

      http://webdatacommons.org
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
http://wikientities.appspot.com

A corpus of anchortext-WikipediaConcept-Count
   from the CommonCrawl dataset, to benefit
         research on WSD, NLP and IR.

Given a sentence, it can
Explicit Topic Modeling: help identify entities
(person, location, organization) in wikipedia
Given a concept (represented as a the sentence
and map them onto Wikipedia concepts.
page), it can tell what are the most common
terms people use to describe the concept.
Mapping French websites related to Open Data
Other Use Examples
•   Apache Giraph Testing
•   Maplight
•   Tineye
•   Factual
•   Sentiment Analysis Projects
In Development
•   N-gram and Link Graph Extracts
•   Pig Reader
•   More Frequent Full Crawls
•   Focused Subset Crawls at High Frequency
•   Open Educational Resources
Thank You
London HUG

               What Does
             The Data World
                       Lisa Green

             Mean to Society?
                  lisa@commoncrawl.org
                www.commoncrawl.org
                     @commoncrawl
                      Lisa Green
                       @boudicca
                   1 October 2012

Más contenido relacionado

La actualidad más candente

DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, BerlinDBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
Anja Jentzsch
 

La actualidad más candente (20)

Clipper, research data network
Clipper, research data networkClipper, research data network
Clipper, research data network
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data Support
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of Entities
 
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, BerlinDBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
 
Connecting the Dots: Constellations in the Linked Data Universe
Connecting the Dots: Constellations in the Linked Data UniverseConnecting the Dots: Constellations in the Linked Data Universe
Connecting the Dots: Constellations in the Linked Data Universe
 
The Web of Data is Our Opportunity
The Web of Data is Our OpportunityThe Web of Data is Our Opportunity
The Web of Data is Our Opportunity
 
OCLC Linked Data Progress
OCLC Linked Data ProgressOCLC Linked Data Progress
OCLC Linked Data Progress
 
5.15.17 Powering Linked Data and Hosted Solutions with Fedora Webinar Slides
5.15.17 Powering Linked Data and Hosted Solutions with Fedora Webinar Slides5.15.17 Powering Linked Data and Hosted Solutions with Fedora Webinar Slides
5.15.17 Powering Linked Data and Hosted Solutions with Fedora Webinar Slides
 
Digital Preservation in Production (DPN and DuraCloud Vault)
Digital Preservation in Production (DPN and DuraCloud Vault)Digital Preservation in Production (DPN and DuraCloud Vault)
Digital Preservation in Production (DPN and DuraCloud Vault)
 
Visualizing linkeddata aall2012d-ss
Visualizing linkeddata aall2012d-ssVisualizing linkeddata aall2012d-ss
Visualizing linkeddata aall2012d-ss
 
The Cultural Linked Data Backbone
The Cultural Linked Data BackboneThe Cultural Linked Data Backbone
The Cultural Linked Data Backbone
 
ResourceSync - NISO Update Jan 2014
ResourceSync - NISO Update Jan 2014ResourceSync - NISO Update Jan 2014
ResourceSync - NISO Update Jan 2014
 
Open data and linked data
Open data and linked dataOpen data and linked data
Open data and linked data
 
They have left the building: The Web Route to Library Users
They have left the building: The Web Route to Library UsersThey have left the building: The Web Route to Library Users
They have left the building: The Web Route to Library Users
 
ConfrencePres
ConfrencePresConfrencePres
ConfrencePres
 
Unlocking Doors: recent initiatives in open and linked data at the National L...
Unlocking Doors: recent initiatives in open and linked data at the National L...Unlocking Doors: recent initiatives in open and linked data at the National L...
Unlocking Doors: recent initiatives in open and linked data at the National L...
 
DBpedia/association Introduction The Hague 12.2.2016
DBpedia/association Introduction The Hague 12.2.2016DBpedia/association Introduction The Hague 12.2.2016
DBpedia/association Introduction The Hague 12.2.2016
 
Digital Infrastructure: Storage and Content Management
Digital Infrastructure: Storage and Content ManagementDigital Infrastructure: Storage and Content Management
Digital Infrastructure: Storage and Content Management
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in Libraries
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data Foundation
 

Destacado (7)

Planificació 2
Planificació 2Planificació 2
Planificació 2
 
Entrenament U
Entrenament UEntrenament U
Entrenament U
 
Entrenament Tres
Entrenament TresEntrenament Tres
Entrenament Tres
 
Entrenament Cinc
Entrenament CincEntrenament Cinc
Entrenament Cinc
 
Metodologia
MetodologiaMetodologia
Metodologia
 
Lliga futbol 11 12 a-b-c-d
Lliga futbol 11 12 a-b-c-dLliga futbol 11 12 a-b-c-d
Lliga futbol 11 12 a-b-c-d
 
MODELS PLANIFICACIÓ
MODELS PLANIFICACIÓMODELS PLANIFICACIÓ
MODELS PLANIFICACIÓ
 

Similar a London HUG

Global lodlam_communities and open cultural data
Global lodlam_communities and open cultural dataGlobal lodlam_communities and open cultural data
Global lodlam_communities and open cultural data
Minerva Lin
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
Anja Jentzsch
 
Vila LOD-innovacion- bib-semweb-redux
Vila LOD-innovacion- bib-semweb-reduxVila LOD-innovacion- bib-semweb-redux
Vila LOD-innovacion- bib-semweb-redux
LIS EPI Meeting
 

Similar a London HUG (20)

Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & Museums
 
Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.
 
Linked Data and OCLC
Linked Data and OCLCLinked Data and OCLC
Linked Data and OCLC
 
Global lodlam_communities and open cultural data
Global lodlam_communities and open cultural dataGlobal lodlam_communities and open cultural data
Global lodlam_communities and open cultural data
 
Intro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsIntro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & Museums
 
Linked Data Now & Next
Linked Data Now & NextLinked Data Now & Next
Linked Data Now & Next
 
Linked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsLinked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & Museums
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentation
 
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
 
BHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersBHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clusters
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Open Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LODOpen Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LOD
 
From Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and CollaborationsFrom Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and Collaborations
 
Linked Data Basics
Linked Data BasicsLinked Data Basics
Linked Data Basics
 
Your research as open science
Your research as open scienceYour research as open science
Your research as open science
 
Blockchain Beyond Finance - Cronos Groep - Jan 17, 2017
 Blockchain Beyond Finance - Cronos Groep - Jan 17, 2017 Blockchain Beyond Finance - Cronos Groep - Jan 17, 2017
Blockchain Beyond Finance - Cronos Groep - Jan 17, 2017
 
Vila LOD-innovacion- bib-semweb-redux
Vila LOD-innovacion- bib-semweb-reduxVila LOD-innovacion- bib-semweb-redux
Vila LOD-innovacion- bib-semweb-redux
 

London HUG

  • 1. London HUG Common Crawl : WhatRepository An Open Does Theof Web Data Data World Mean to Society? Lisa Green Lisa Green 1 October 2012 10 October 2012
  • 2. Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
  • 3. Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
  • 4. Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  • 5. Still Nascent • Even cheaper storage • Even cheaper compute • Education • Open Data Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  • 6. Gratis Proprietary Libre Commercial
  • 9.
  • 10. Common Crawl Data • ~8 Billion web pages • ~120 TB • 2008-2012 • ARC files, JSON metadata, text files • Available to anyone
  • 11. ARC Files - Raw Content Metadata • Status information • HTTP response code • File names & offsets of ARC files • HTML title • HTML meta tags • RSS/Atom information • All anchors/hyperlinks Text Files - Text Only http://commoncrawl.org/get-started
  • 12.
  • 13. Change between 2010 and 2012 • URLs with embedded data +6% • Microdata +14% • RDFa +26% http://webdatacommons.org
  • 14. • 22% of Web pages contain Facebook URLs • 8% of Web pages implement Open Graph tags
  • 15. http://wikientities.appspot.com A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR. Given a sentence, it can Explicit Topic Modeling: help identify entities (person, location, organization) in wikipedia Given a concept (represented as a the sentence and map them onto Wikipedia concepts. page), it can tell what are the most common terms people use to describe the concept.
  • 16. Mapping French websites related to Open Data
  • 17. Other Use Examples • Apache Giraph Testing • Maplight • Tineye • Factual • Sentiment Analysis Projects
  • 18. In Development • N-gram and Link Graph Extracts • Pig Reader • More Frequent Full Crawls • Focused Subset Crawls at High Frequency • Open Educational Resources
  • 19. Thank You London HUG What Does The Data World Lisa Green Mean to Society? lisa@commoncrawl.org www.commoncrawl.org @commoncrawl Lisa Green @boudicca 1 October 2012