Warcbase
Building a Scalable Platform
on HBase and Hadoop
Part Two: Historian Use Case
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada
Why should a historian care?
The sheer amount of social,
cultural, and political
information generated every
day presents new
opportunities for historians.
Could one even study the 1990s and beyond without web archives?
No.
Historians need to do this now, or
we’re going to be left behind.
Nightmare Scenario
• Wayback Machine won’t be enough. We won’t use that.
• Historians rely uncritically on date-ordered keyword
search results, putting them at mercy of search
algorithms they do not understand;
• Historians are completely left out of post-1996
research, letting everybody else do the work (a la
Culturomics project/Nature magazine article);
• Our profession gets left behind…
Unlocking an Archive-It
Collection
• Archive-It has amazing collections of social,
cultural, political, and economic records generated
by everyday people, leaders, businesses,
academics, and beyond.
• Stories waiting to be told.
• The data is there, but the problem is access.
Example Dataset
• Archive-It Collection 227,
Canadian Political Parties and
Political Interest Groups
(University of Toronto)
• October 2005 - Present
• All major and minor political
parties, as well as organized
political interest groups (Council
of Canadians, Coalition to
Oppose the Arms Trade,
Assembly of First Nations, etc.)
• Started by now-retired librarian,
hard to get details on seed list
Two Main Approaches
• Warcbase
• Link extraction and analytics
• Full-text extraction and analytics
• Full-text faceted search
• UK Web Archive’s Shine Solr front end
Using Warcbase to
analyze links and full-text
Basic Link Statistics
• Count number of pages per domain
• Count number of links for each crawl so they can
be normalized (very important)
• Run on the command line using relatively simple Pig
scripts
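Why normalization matters can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_link_counts`, the domain `example-party.ca`, and all numbers are made up, not taken from the collection. The same raw count of 50 links means something very different in a crawl of 1,000 links than in a crawl of 200.

```python
def normalize_link_counts(link_counts, crawl_totals):
    """Divide each (crawl, source, target) count by the total links in that crawl."""
    shares = {}
    for (crawl, source, target), count in link_counts.items():
        total = crawl_totals.get(crawl, 0)
        if total == 0:
            continue  # skip crawls with no recorded links to avoid dividing by zero
        shares[(crawl, source, target)] = count / total
    return shares


# Hypothetical counts for illustration only; not taken from the collection.
link_counts = {
    ("200810", "example-party.ca", "facebook.com"): 50,
    ("200902", "example-party.ca", "facebook.com"): 50,
}
crawl_totals = {"200810": 1000, "200902": 200}

shares = normalize_link_counts(link_counts, crawl_totals)
for key in sorted(shares):
    print(key, f"{shares[key]:.3f}")
```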
Example Script (counting
number of links for each crawl)
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

-- Load every record in the collection's ARC files.
raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- Keep only HTML pages that carry a crawl date.
a = filter raw by mime == 'text/html' and date is not null;
-- Reduce the date to YYYYMM and emit one row per extracted link.
b = foreach a generate SUBSTRING(date, 0, 6) as date, url,
  FLATTEN(ExtractLinks((chararray) content, url));
-- Group by crawl month and count the links in each group.
c = group b by $0;
d = foreach c generate group, COUNT(b);
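For readers who do not know Pig, the same filter, extract, group, and count steps can be sketched locally in Python using only the standard library. This is not how Warcbase works internally. It assumes a link is an `<a href>` anchor, and the `records` list is invented sample data standing in for what the ARC loader would supply.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def links_per_crawl(records):
    """Count extracted links per YYYYMM, skipping non-HTML and undated records."""
    counts = Counter()
    for url, date, mime, content in records:
        if mime != "text/html" or date is None:
            continue
        collector = LinkCollector(url)
        collector.feed(content)
        counts[date[:6]] += len(collector.links)
    return counts


# Invented sample records: (url, date, mime, content).
records = [
    ("http://example.org/", "20080930221618", "text/html",
     '<a href="/a">A</a> <a href="http://example.com/">B</a>'),
    ("http://example.org/logo.png", "20080930221700", "image/png", ""),
    ("http://example.org/b", "20081001000000", "text/html", '<a href="c">C</a>'),
]
counts = links_per_crawl(records)
print(sorted(counts.items()))
```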
Social Media Appearances -
Twitter
(20080611220246,http://creativecommons.org/,twitter)
(20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)
(20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)
(20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&id=1814,twitter)
(20080930221618,http://www.ndp.ca/home,twitter)
(20080930221618,http://www.ndp.ca/home,twitter)
(20080930221638,http://www.liberal.ca/default_e.aspx,twitter)
(20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)
(20080930221714,http://www.liberal.ca/video_e.aspx,twitter)
(20080930221903,http://www.ndp.ca/page/5246,twitter)
(20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-twitter.php?lang=en,twitter)
(20080930222049,http://greenparty.ca/en/action,twitter)
(20080930222124,http://www.ndp.ca/bloggingtools,twitter)
(20080930222825,http://greenparty.ca/en/campaign/35053,twitter)
(20080930223014,http://greenparty.ca/en/campaign/35068,twitter)
(20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)
(20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)
(20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)
(20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)
(20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
Social Media Appearances -
Facebook
(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)
(20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)
(20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)
(20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)
(20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)
(20070418141139,http://greenparty.ca/en/blog/431,facebook)
(20070418141930,http://greenparty.ca/en/blog?page=2,facebook)
(20070418143749,http://greenparty.ca/en/node/1280,facebook)
(20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)
(20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)
(20070418151727,http://www.equalvoice.ca/youth/,facebook)
(20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)
(20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)
(20070418153832,http://greenparty.ca/fr/node/1280,facebook)
(20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu?page=2,facebook)
(20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book?page=2,facebook)
(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
(20070518134941,http://www.ndp.ca/page/4733,facebook)
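One question such output invites is when each site first mentioned a platform. A minimal Python sketch, assuming each line has the `(timestamp,url,keyword)` shape shown in these slides; the helper name `first_appearances` and the choice to strip a leading `www.` are illustrative assumptions. The sample lines are copied from the output above.

```python
import re
from urllib.parse import urlparse

# Matches "(14-digit timestamp,url,keyword)"; the URL itself may contain commas.
TUPLE_RE = re.compile(r"^\((\d{14}),(.+),(\w+)\)$")


def first_appearances(lines):
    """Return the earliest timestamp seen for each (host, keyword) pair."""
    first = {}
    for line in lines:
        match = TUPLE_RE.match(line.strip())
        if not match:
            continue  # ignore blank or malformed lines
        timestamp, url, keyword = match.groups()
        host = urlparse(url).netloc
        if host.startswith("www."):
            host = host[4:]
        key = (host, keyword)
        # Fixed-width timestamps compare correctly as strings.
        if key not in first or timestamp < first[key]:
            first[key] = timestamp
    return first


sample = [
    "(20080930221618,http://www.ndp.ca/home,twitter)",
    "(20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)",
    "(20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)",
    "(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)",
    "(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)",
]
earliest = first_appearances(sample)
for key in sorted(earliest):
    print(key, earliest[key])
```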
Link Analysis
• Extracting links by domain (tab-separated values):
200810  conservative.ca  digg.com                    2325
200810  conservative.ca  facebook.com                2325
200810  conservative.ca  mycampaign.conservative.ca  7902
[..]
200902  liberal.ca       ctv.ca                      16
200902  liberal.ca       del.icio.us                 1118
200902  liberal.ca       digg.com                    1118
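Once the link table is exported as tab-separated values, ranking the most-linked targets for each site and crawl takes only a few lines. A sketch in Python, assuming four columns (crawl month, source domain, target domain, count); the function name `top_targets` is hypothetical, and the sample rows are the ones shown above, with the elided rows left out.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the slide; the elided "[..]" rows are not reconstructed.
SAMPLE_TSV = (
    "200810\tconservative.ca\tdigg.com\t2325\n"
    "200810\tconservative.ca\tfacebook.com\t2325\n"
    "200810\tconservative.ca\tmycampaign.conservative.ca\t7902\n"
    "200902\tliberal.ca\tctv.ca\t16\n"
    "200902\tliberal.ca\tdel.icio.us\t1118\n"
    "200902\tliberal.ca\tdigg.com\t1118\n"
)


def top_targets(tsv_text, n=2):
    """Return the n most-linked targets for each (crawl, source) pair."""
    by_source = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        crawl, source, target, count = row
        by_source[(crawl, source)].append((int(count), target))
    # Sort by descending count, breaking ties alphabetically by target.
    return {
        key: [target for _, target in sorted(pairs, key=lambda p: (-p[0], p[1]))[:n]]
        for key, pairs in by_source.items()
    }


ranked = top_targets(SAMPLE_TSV)
for key in sorted(ranked):
    print(key, ranked[key])
```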
  
Other Cases
• Extracting all links to the mainstream media, or
thinktanks, or other political parties
2005 Canadian Federal Election
Text Analysis
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();

raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- Keep only HTML pages that carry a crawl date.
a = filter raw by mime == 'text/html' and date is not null;
-- Reduce the date to YYYYMM and strip a leading "www." from the domain.
-- The regex backslashes were lost in the slide export and are restored here.
b = foreach a generate SUBSTRING(date, 0, 6) as date,
  REPLACE(ExtractTopLevelDomain(url), '^\\s*www\\.', '') as url, content;
-- Restrict the corpus to a single domain.
c = filter b by url == 'greenparty.ca';
-- Strip the markup and keep the plain text of each page.
d = foreach c generate date, url, ExtractRawText((chararray) content) as text;

store d into 'cpp.text-greenparty';
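The `REPLACE` step in the script appears intended to strip a leading `www.` so that `www.greenparty.ca` and `greenparty.ca` are treated as the same site. A Python sketch of that normalization, assuming the intended pattern is `^\s*www\.` (a whitespace class and a literal dot); `normalize_domain` is a hypothetical helper, not part of Warcbase. It also shows why the escapes matter: without them, `s*` matches the letter "s" and `.` matches any character.

```python
import re
from urllib.parse import urlparse


def normalize_domain(url):
    """Return the lower-cased host of a URL with any leading 'www.' removed."""
    host = urlparse(url).netloc.lower()
    return re.sub(r"^\s*www\.", "", host)


print(normalize_domain("http://www.greenparty.ca/en/action"))
print(normalize_domain("http://greenparty.ca/en/blog/431"))

# With the escapes missing, an unrelated host is mangled:
unescaped = re.sub(r"^s*www.", "", "wwwx.example.ca")
escaped = re.sub(r"^\s*www\.", "", "wwwx.example.ca")
print(repr(unescaped), repr(escaped))
```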
Text Analysis
• Now have a circumscribed corpus for the specified
query (e.g. liberal.ca, ndp.ca, or conservative.ca)
• Can now use standard text analysis tools, etc. to
extract meaning
• LDA (topic modeling)
• NER (named entity recognition)
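Whatever NER tool is used, the downstream step is usually a tally of entity mentions per crawl month. A minimal Python sketch of that tally follows. The `mentions` list is invented for illustration and stands in for the output of an unspecified NER tool, and `top_entities_by_month` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def top_entities_by_month(mentions, n=3):
    """Tally (date, entity) mentions per YYYYMM and return the n most frequent."""
    per_month = defaultdict(Counter)
    for date, entity in mentions:
        per_month[date[:6]][entity] += 1
    return {month: counter.most_common(n) for month, counter in per_month.items()}


# Invented mentions for illustration; real input would come from an NER tool.
mentions = [
    ("20051015", "Stephen Harper"),
    ("20051020", "Paul Martin"),
    ("20051021", "Stephen Harper"),
    ("20081103", "Stéphane Dion"),
]
tallies = top_entities_by_month(mentions)
for month in sorted(tallies):
    print(month, tallies[month])
```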
NER
October 2005
  62476  Stephen Harper
  30234  Michael Chong
  30109  Gwynne Dyer
  28011  ami Entrez
  26238  Paul Martin
  22303  Harper
  
NER
November 2008
  3188  Stéphane Dion
  2557  Stephen Harper
  2471  Stephen HarperLaureen
  2410  Dion
  2356  Harper
  
Visualizing Interface
Next Step?
Shine
• UK Web Archive’s Shine
(https://github.com/ukwa/shine)
• Indexing as bottleneck
• ~250GB of WARCs takes ~5 days
on a single machine
• Hadoop indexer available if
data in HDFS
• ~90GB index size
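The rough figures above imply a throughput and an index-to-source ratio that help when budgeting for other collections. A back-of-the-envelope Python sketch, assuming decimal units (1 GB = 1000 MB) and that indexing time scales linearly with collection size. The slides do not state either, so treat the results as order-of-magnitude estimates.

```python
# Approximate figures from the slide (both are marked "~" in the original).
WARC_GB = 250
INDEX_DAYS = 5
INDEX_GB = 90

gb_per_day = WARC_GB / INDEX_DAYS                            # about 50 GB per day
mb_per_second = WARC_GB * 1000 / (INDEX_DAYS * 24 * 3600)    # about 0.58 MB per second
index_ratio = INDEX_GB / WARC_GB                             # index is about 36% of source size


def estimate_days(collection_gb):
    """Estimate single-machine indexing time, assuming linear scaling."""
    return collection_gb / gb_per_day


print(f"{gb_per_day:.0f} GB/day, {mb_per_second:.2f} MB/s, index ratio {index_ratio:.0%}")
print(f"1 TB would take roughly {estimate_days(1000):.0f} days")
```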
Examples
Shine
• Advantages: accessible to the general public,
easy to use, interactive trend diagram allows
digging down for context, can move down to level
of document itself.
• Disadvantages: keyword searching requires you to
know what to look for; random sampling is misleading
when there are tens of thousands of records; etc.
• Doesn’t take advantage of what makes web
sources so powerful: hyperlinks
Building connections
between Warcbase and
Shine
Conclusions &
Thanks
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada

 

Último (20)

Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
AlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with FlowsAlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with Flows
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
 
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls KolkataLow Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 

Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Historian Use Case

  • 1. Warcbase Building a Scalable Platform on HBase and Hadoop Part Two: Historian Use Case Jimmy Lin University of Maryland College Park, MD Ian Milligan University of Waterloo Waterloo, ON Canada
  • 2. Why should a historian care? The sheer amount of social, cultural, and political information generated every day presents new opportunities for historians.
  • 3. Could one even study the 1990s and beyond without web archives?
  • 4. No. Historians need to do this now, or we’re going to be left behind.
  • 5. Nightmare Scenario • Wayback Machine won’t be enough. We won’t use that. • Historians rely uncritically on date-ordered keyword search results, putting them at the mercy of search algorithms they do not understand; • Historians are completely left out of post-1996 research, letting everybody else do the work (à la Culturomics project/Nature magazine article); • Our profession gets left behind…
  • 7. Unlocking an Archive-It Collection • Archive-It has amazing collections of social, cultural, political, and economic records generated by everyday people, leaders, businesses, academics, and beyond. • Stories waiting to be told. • The data is there, but the problem is access.
  • 8. Example Dataset • Archive-It Collection 227, Canadian Political Parties and Political Interest Groups (University of Toronto) • October 2005 - Present • All major and minor political parties, as well as organized political interest groups (Council of Canadians, Coalition to Oppose the Arms Trade, Assembly of First Nations, etc.) • Started by now-retired librarian, hard to get details on seed list
  • 9. Two Main Approaches • Warcbase • Link extraction and analytics • Full-text extraction and analytics • Full-text faceted search • UK Web Archive’s Shine Solr front end
  • 10. Using Warcbase to analyze links and full-text
  • 11. Basic Link Statistics • Count number of pages per domain • Count number of links for each crawl so they can be normalized (very important) • Run on command line using relatively simple Pig scripts
  • 12. Example Script (counting number of links for each crawl)
        register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
        DEFINE ArcLoader org.warcbase.pig.ArcLoader();
        DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();
        -- load the ARC records of the collection
        raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
          (url: chararray, date: chararray, mime: chararray, content: bytearray);
        -- keep only dated HTML pages
        a = filter raw by mime == 'text/html' and date is not null;
        -- reduce the timestamp to YYYYMM and emit one row per extracted link
        b = foreach a generate SUBSTRING(date, 0, 6) as date, url,
          FLATTEN(ExtractLinks((chararray) content, url));
        -- count the links in each crawl month
        c = group b by $0;
        d = foreach c generate group, COUNT(b);
  • 13. Social Media Appearances - Twitter
        (20080611220246,http://creativecommons.org/,twitter)
        (20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)
        (20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)
        (20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&id=1814,twitter)
        (20080930221618,http://www.ndp.ca/home,twitter)
        (20080930221618,http://www.ndp.ca/home,twitter)
        (20080930221638,http://www.liberal.ca/default_e.aspx,twitter)
        (20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)
        (20080930221714,http://www.liberal.ca/video_e.aspx,twitter)
        (20080930221903,http://www.ndp.ca/page/5246,twitter)
        (20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-twitter.php?lang=en,twitter)
        (20080930222049,http://greenparty.ca/en/action,twitter)
        (20080930222124,http://www.ndp.ca/bloggingtools,twitter)
        (20080930222825,http://greenparty.ca/en/campaign/35053,twitter)
        (20080930223014,http://greenparty.ca/en/campaign/35068,twitter)
        (20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)
        (20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)
        (20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)
        (20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)
        (20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
  • 14. Social Media Appearances - Facebook
        (20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)
        (20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)
        (20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)
        (20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)
        (20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)
        (20070418141139,http://greenparty.ca/en/blog/431,facebook)
        (20070418141930,http://greenparty.ca/en/blog?page=2,facebook)
        (20070418143749,http://greenparty.ca/en/node/1280,facebook)
        (20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)
        (20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)
        (20070418151727,http://www.equalvoice.ca/youth/,facebook)
        (20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)
        (20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)
        (20070418153832,http://greenparty.ca/fr/node/1280,facebook)
        (20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu?page=2,facebook)
        (20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book?page=2,facebook)
        (20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)
        (20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
        (20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
        (20070518134941,http://www.ndp.ca/page/4733,facebook)
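The tuples on the two social media slides have the shape (timestamp, URL, keyword). As a minimal sketch of how such output might be summarized downstream, the Python below counts appearances per crawl month and host. The function names `parse_tuple` and `appearances_by_month` are hypothetical and not part of Warcbase, and the sample lines are copied from the slides.

```python
from collections import Counter
from urllib.parse import urlparse


def parse_tuple(line):
    """Split one '(timestamp,url,keyword)' line into its three fields."""
    body = line.strip().strip("()")
    # The URL may itself contain commas, so split off the first and last field only.
    timestamp, rest = body.split(",", 1)
    url, keyword = rest.rsplit(",", 1)
    return timestamp, url, keyword


def appearances_by_month(lines):
    """Count tuples per (YYYYMM, host, keyword), dropping a leading 'www.'."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        timestamp, url, keyword = parse_tuple(line)
        host = urlparse(url).netloc
        if host.startswith("www."):
            host = host[4:]
        counts[(timestamp[:6], host, keyword)] += 1
    return counts


# Sample lines taken from the slides above.
sample = [
    "(20080611220246,http://creativecommons.org/,twitter)",
    "(20080930221618,http://www.ndp.ca/home,twitter)",
    "(20080930221618,http://www.ndp.ca/home,twitter)",
    "(20080930221638,http://www.liberal.ca/default_e.aspx,twitter)",
    "(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)",
]

counts = appearances_by_month(sample)
for key in sorted(counts):
    print(key, counts[key])
```

Sorting the keys puts the earliest month for each keyword first, which is one simple way to see when a party's site started mentioning a given platform.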
  • 15. Link Analysis • Extracting links by domain (tab-separated values):
        200810	conservative.ca	digg.com	2325
        200810	conservative.ca	facebook.com	2325
        200810	conservative.ca	mycampaign.conservative.ca	7902
        [..]
        200902	liberal.ca	ctv.ca	16
        200902	liberal.ca	del.icio.us	1118
        200902	liberal.ca	digg.com	1118
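The deck stresses that link counts should be normalized by the number of links in each crawl. A minimal sketch of that step, assuming rows of the form (month, source domain, target domain, count) as shown above: the per-crawl totals in `totals` are made-up placeholder values, not figures from the collection, and `parse_rows` and `link_shares` are hypothetical helper names.

```python
def parse_rows(text):
    """Parse whitespace-separated 'month source target count' rows, skipping other lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # e.g. the '[..]' elision marker
        month, source, target, count = parts
        rows.append((month, source, target, int(count)))
    return rows


def link_shares(rows, totals):
    """Divide each link count by the total number of links in that crawl month."""
    shares = {}
    for month, source, target, count in rows:
        total = totals.get(month)
        if not total:
            continue  # no total known for this crawl, so it cannot be normalized
        shares[(month, source, target)] = count / total
    return shares


# Rows copied from the slide above.
sample = """\
200810 conservative.ca digg.com 2325
200810 conservative.ca facebook.com 2325
200810 conservative.ca mycampaign.conservative.ca 7902
[..]
200902 liberal.ca ctv.ca 16
200902 liberal.ca del.icio.us 1118
200902 liberal.ca digg.com 1118
"""

# ASSUMPTION: placeholder totals for illustration only; real values would come
# from the per-crawl link-count script.
totals = {"200810": 100000, "200902": 50000}

rows = parse_rows(sample)
shares = link_shares(rows, totals)
for key in sorted(shares):
    print(key, f"{shares[key]:.4%}")
```

Comparing shares rather than raw counts keeps a large crawl from looking like a surge of interest in a target domain.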
  • 22. Other Cases • Extracting all links to the mainstream media, or think tanks, or other political parties
  • 25. Text Analysis
        register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
        DEFINE ArcLoader org.warcbase.pig.ArcLoader();
        DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
        DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();
        raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
          (url: chararray, date: chararray, mime: chararray, content: bytearray);
        -- keep only dated HTML pages
        a = filter raw by mime == 'text/html' and date is not null;
        -- reduce the timestamp to YYYYMM and the URL to its domain, without a leading "www."
        b = foreach a generate SUBSTRING(date, 0, 6) as date,
          REPLACE(ExtractTopLevelDomain(url), '^\\s*www\\.', '') as url, content;
        -- restrict the corpus to a single site
        c = filter b by url == 'greenparty.ca';
        d = foreach c generate date, url, ExtractRawText((chararray) content) as text;
        store d into 'cpp.text-greenparty';
  • 26. Text Analysis • Now have circumscribed corpus for specified query (e.g. liberal.ca, or ndp.ca, or conservative.ca) • Can now use standard text analysis tools, etc. to extract meaning • LDA (topic modeling) • NER (named entity recognition)
  • 27. NER: October 2005
        62476  Stephen Harper
        30234  Michael Chong
        30109  Gwynne Dyer
        28011  ami Entrez
        26238  Paul Martin
        22303  Harper
  • 28. NER: November 2008
        3188  Stéphane Dion
        2557  Stephen Harper
        2471  Stephen HarperLaureen
        2410  Dion
        2356  Harper
  • 30. Shine • UK Web Archive’s Shine (https://github.com/ukwa/shine) • Indexing as bottleneck • ~ 250GB of WARCs takes ~ 5 days on a single machine • Hadoop indexer available if data in HDFS • ~ 90GB index size
  • 32. Shine • Advantages: accessible to the general public, easy to use, interactive trend diagram allows digging down for context, can move down to level of document itself. • Disadvantages: keyword searching requires that you know what to look for; random sampling is misleading when there are tens of thousands of records; etc. • Doesn’t take advantage of what makes web sources so powerful: hyperlinks
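The figures quoted on the Shine slide (about 250 GB of WARCs, about 5 days on a single machine, about 90 GB of index) can be turned into a rough planning estimate. The sketch below assumes that indexing time and index size scale linearly with collection size, which the slides do not claim. The 1000 GB collection is a hypothetical example, and `estimate` is a hypothetical helper name.

```python
# Approximate figures quoted on the Shine slide.
WARC_GB = 250.0
INDEX_DAYS = 5.0
INDEX_GB = 90.0


def estimate(collection_gb):
    """Return (days to index, index size in GB), ASSUMING linear scaling."""
    gb_per_day = WARC_GB / INDEX_DAYS   # single-machine throughput
    index_ratio = INDEX_GB / WARC_GB    # index size relative to WARC size
    return collection_gb / gb_per_day, collection_gb * index_ratio


# Hypothetical 1000 GB collection, for illustration only.
days, index_gb = estimate(1000.0)
print(f"about {days:.0f} days on one machine, about {index_gb:.0f} GB of index")
```

On these numbers a single machine indexes roughly 50 GB per day, which is why the slide calls indexing the bottleneck and points to the Hadoop indexer.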
  • 34. Conclusions & Thanks Jimmy Lin University of Maryland College Park, MD Ian Milligan University of Waterloo Waterloo, ON Canada