SlideShare una empresa de Scribd logo
1 de 34
Descargar para leer sin conexión
Warcbase
Building a Scalable Platform
on HBase and Hadoop
Part Two: Historian Use Case
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada
Why should a
historian
care?
The sheer amount of social,
cultural, and political
information generated every
day presents new
opportunities for historians.
Could one
even study
the 1990s
and
beyond
without
web
archives?
No.
Historians need to do this now, or
we’re going to be left behind.
Nightmare Scenario
• Wayback Machine won’t be enough. We won’t use that.
• Historians rely uncritically on date-ordered keyword
search results, putting them at mercy of search
algorithms they do not understand;
• Historians are completely left out of post-1996
research, letting everybody else do the work (a la
Culturomics project/Nature magazine article);
• Our profession gets left behind…
Unlocking an Archive-It
Collection
• Archive-It has amazing collections of social,
cultural, political, and economic records generated
by everyday people, leaders, businesses,
academics, and beyond.
• Stories waiting to be hold.
• The data is there, but the problem is access.
Example Dataset
• Archive-It Collection 227,
Canadian Political Parties and
Political Interest Groups
(University of Toronto)
• October 2005 - Present
• All major and minor political
parties, as well as organized
political interest groups (Council
of Canadians, Coalition to
Oppose the Arms Trade
Assembly of First Nations, etc.)
• Started by now-retired librarian,
hard to get details on seed list
Two Main Approaches
• Warcbase
• Link extraction and analytics
• Full-text extraction and analytics
• Full-text faceted search
• UK Web Archive’s Shine solr front end
Using Warcbase to
analyze links and full-text
Basic Link Statistics
• Count number of pages per domain
• Count number of links for each crawl so they can
be normalized (very important)
• Run on command line using relatively simple pig
scripts
Example Script (counting
number of links for each crawl)
register	
  'target/warcbase-­‐0.1.0-­‐SNAPSHOT-­‐fatjar.jar';	
  
DEFINE	
  ArcLoader	
  org.warcbase.pig.ArcLoader();	
  
DEFINE	
  ExtractLinks	
  
org.warcbase.pig.piggybank.ExtractLinks();	
  
raw	
  =	
  load	
  '/shared/collections/CanadianPoliticalParties/
arc/'	
  using	
  ArcLoader	
  as	
  
	
  	
  (url:	
  chararray,	
  date:	
  chararray,	
  mime:	
  chararray,	
  
content:	
  bytearray);	
  
a	
  =	
  filter	
  raw	
  by	
  mime	
  ==	
  'text/html'	
  and	
  date	
  is	
  not	
  null;	
  
b	
  =	
  foreach	
  a	
  generate	
  SUBSTRING(date,	
  0,	
  6)	
  as	
  date,	
  url,	
  
FLATTEN(ExtractLinks((chararray)	
  content,	
  url));	
  
c	
  =	
  group	
  b	
  by	
  $0;	
  
d	
  =	
  foreach	
  c	
  generate	
  group,	
  COUNT(b);
Social Media Appearances -
Twitter
(20080611220246,http://creativecommons.org/,twitter)	
  
(20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)	
  
(20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)	
  
(20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&;id=1814,twitter)	
  
(20080930221618,http://www.ndp.ca/home,twitter)	
  
(20080930221618,http://www.ndp.ca/home,twitter)	
  
(20080930221638,http://www.liberal.ca/default_e.aspx,twitter)	
  
(20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)	
  
(20080930221714,http://www.liberal.ca/video_e.aspx,twitter)	
  
(20080930221903,http://www.ndp.ca/page/5246,twitter)	
  
(20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-­‐twitter.php?
lang=en,twitter)	
  
(20080930222049,http://greenparty.ca/en/action,twitter)	
  
(20080930222124,http://www.ndp.ca/bloggingtools,twitter)	
  
(20080930222825,http://greenparty.ca/en/campaign/35053,twitter)	
  
(20080930223014,http://greenparty.ca/en/campaign/35068,twitter)	
  
(20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)	
  
(20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)	
  
(20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)	
  
(20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)	
  
(20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
Social Media Appearances -
Facebook
(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)	
  
(20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)	
  
(20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)	
  
(20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)	
  
(20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)	
  
(20070418141139,http://greenparty.ca/en/blog/431,facebook)	
  
(20070418141930,http://greenparty.ca/en/blog?page=2,facebook)	
  
(20070418143749,http://greenparty.ca/en/node/1280,facebook)	
  
(20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)	
  
(20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)	
  
(20070418151727,http://www.equalvoice.ca/youth/,facebook)	
  
(20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)	
  
(20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)	
  
(20070418153832,http://greenparty.ca/fr/node/1280,facebook)	
  
(20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu?
page=2,facebook)	
  
(20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book?
page=2,facebook)	
  
(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)	
  
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)	
  
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)	
  
(20070518134941,http://www.ndp.ca/page/4733,facebook)
Link Analysis
• Extracting links by domain (tab-separated values):
200810	
  conservative.ca	
   digg.com	
   2325	
  
200810	
  conservative.ca	
   facebook.com	
   2325	
  
200810	
  conservative.ca	
   mycampaign.conservative.ca	
   7902	
  
[..]	
  
200902	
  liberal.ca	
  ctv.ca	
  16	
  
200902	
  liberal.ca	
  del.icio.us	
   1118	
  
200902	
  liberal.ca	
  digg.com	
   1118	
  
Other Cases
• Extracting all links to the mainstream media, or
thinktanks, or other political parties
2005 Canadian Federal Election
Text Analysis
register	
  'target/warcbase-­‐0.1.0-­‐SNAPSHOT-­‐fatjar.jar';	
  
DEFINE	
  ArcLoader	
  org.warcbase.pig.ArcLoader();	
  
DEFINE	
  ExtractRawText	
  org.warcbase.pig.piggybank.ExtractRawText();	
  
DEFINE	
  ExtractTopLevelDomain	
  
org.warcbase.pig.piggybank.ExtractTopLevelDomain();	
  
raw	
  =	
  load	
  '/shared/collections/CanadianPoliticalParties/arc/'	
  using	
  
ArcLoader	
  as	
  
	
  	
  (url:	
  chararray,	
  date:	
  chararray,	
  mime:	
  chararray,	
  content:	
  bytearray);	
  
a	
  =	
  filter	
  raw	
  by	
  mime	
  ==	
  'text/html'	
  and	
  date	
  is	
  not	
  null;	
  
b	
  =	
  foreach	
  a	
  generate	
  SUBSTRING(date,	
  0,	
  6)	
  as	
  date,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  REPLACE(ExtractTopLevelDomain(url),	
  '^s*www.',	
  
'')	
  as	
  url,	
  content;	
  
c	
  =	
  filter	
  b	
  by	
  url	
  ==	
  'greenparty.ca';	
  
d	
  =	
  foreach	
  c	
  generate	
  date,	
  url,	
  ExtractRawText((chararray)	
  content)	
  as	
  
text;	
  
store	
  d	
  into	
  'cpp.text-­‐greenparty';
Text Analysis
• Now have circumscribed corpus for specified
query (i.e. liberal.ca, or ndp.ca, or conservative.ca)
• Can now use standard text analysis tools, etc. to
extract meaning
• LDA (topic modeling)
• NER (named entity recognition)
NER
October	
  2005	
  
	
  	
  62476	
  Stephen	
  Harper	
  
	
  	
  30234	
  Michael	
  Chong	
  
	
  	
  30109	
  Gwynne	
  Dyer	
  
	
  	
  28011	
  ami	
  Entrez	
  
	
  	
  26238	
  Paul	
  Martin	
  
	
  	
  22303	
  Harper	
  
NER
November	
  2008	
  
	
  	
  	
  3188	
  Stéphane	
  Dion	
  
	
  	
  	
  2557	
  Stephen	
  Harper	
  
	
  	
  	
  2471	
  Stephen	
  HarperLaureen	
  
	
  	
  	
  2410	
  Dion	
  
	
  	
  	
  2356	
  Harper	
  
Visualizing Interface
Next Step?
Shine
• UK Web Archive’s Shine
(https://github.com/ukwa/
shine)
• Indexing as bottleneck
• ~ 250GB of WARCs takes ~
5 days on a single machine
• Hadoop indexer available if
data in HFDS
• ~ 90GB index size
Examples
Shine
• Advantages: accessible to the general public,
easy to use, interactive trend diagram allows
digging down for context, can move down to level
of document itself.
• Disadvantage: keyword searching requires you
know what to look for; random sampling misleading
when tens of thousands of records; etc.
• Doesn’t take advantage of what makes web
sources so powerful: hyperlinks
Building connections
between Warcbase and
Shine
Conclusions &
Thanks
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada

Más contenido relacionado

La actualidad más candente

Querying the Web of Data with XSPARQL 1.1
Querying the Web of Data with XSPARQL 1.1Querying the Web of Data with XSPARQL 1.1
Querying the Web of Data with XSPARQL 1.1Daniele Dell'Aglio
 
#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico
#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico
#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panicoDiego Valerio Camarda
 
(PROJEKTURA) Big Data Open Data story for TGG
(PROJEKTURA) Big Data Open Data story for TGG(PROJEKTURA) Big Data Open Data story for TGG
(PROJEKTURA) Big Data Open Data story for TGGRatko Mutavdzic
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis VZW
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQLOlaf Hartig
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hiveReza Ameri
 
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...Data Con LA
 
Maintaining scholarly standards in the digital age: Publishing historical gaz...
Maintaining scholarly standards in the digital age: Publishing historical gaz...Maintaining scholarly standards in the digital age: Publishing historical gaz...
Maintaining scholarly standards in the digital age: Publishing historical gaz...Humphrey Southall
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)dataSUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)dataDiego Valerio Camarda
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonDaniel Rodriguez
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)Daniele Dell'Aglio
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopDataWorks Summit
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiData Con LA
 
R Hadoop integration
R Hadoop integrationR Hadoop integration
R Hadoop integrationDzung Nguyen
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 

La actualidad más candente (20)

Querying the Web of Data with XSPARQL 1.1
Querying the Web of Data with XSPARQL 1.1Querying the Web of Data with XSPARQL 1.1
Querying the Web of Data with XSPARQL 1.1
 
#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico
#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico
#sod14 - ok, è un endpoint SPARQL non facciamoci prendere dal panico
 
(PROJEKTURA) Big Data Open Data story for TGG
(PROJEKTURA) Big Data Open Data story for TGG(PROJEKTURA) Big Data Open Data story for TGG
(PROJEKTURA) Big Data Open Data story for TGG
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertier
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQL
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...
 
Maintaining scholarly standards in the digital age: Publishing historical gaz...
Maintaining scholarly standards in the digital age: Publishing historical gaz...Maintaining scholarly standards in the digital age: Publishing historical gaz...
Maintaining scholarly standards in the digital age: Publishing historical gaz...
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)dataSUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Kyiv.py #16 october 2015
Kyiv.py #16 october 2015Kyiv.py #16 october 2015
Kyiv.py #16 october 2015
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
R Hadoop integration
R Hadoop integrationR Hadoop integration
R Hadoop integration
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 

Similar a Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Historian Use Case

Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Open Data and CKAN Data Catalogues
Open Data and CKAN Data CataloguesOpen Data and CKAN Data Catalogues
Open Data and CKAN Data Cataloguesdavid-read
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardITCamp
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderMat Kelly
 
Schema.org: What It Means For You and Your Library
Schema.org: What It Means For You and Your LibrarySchema.org: What It Means For You and Your Library
Schema.org: What It Means For You and Your LibraryRichard Wallis
 
Schema.org - An Extending Influence
Schema.org - An Extending InfluenceSchema.org - An Extending Influence
Schema.org - An Extending InfluenceRichard Wallis
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internetdrgath
 
Thesis Proposal: User Application Profiles for Publishing Linked Data in HTM...
Thesis Proposal: User Application Profiles for Publishing Linked Data in  HTM...Thesis Proposal: User Application Profiles for Publishing Linked Data in  HTM...
Thesis Proposal: User Application Profiles for Publishing Linked Data in HTM...Sean Petiya
 
SemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in PracticeSemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in PracticeDan Brickley
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Bingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman PresentationBingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman PresentationWARCnet
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked DataRichard Wallis
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internetdrgath
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Oscar Corcho
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kitehuguk
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic WebRoberto García
 
Linked Data - Exposing what we have
Linked Data - Exposing what we haveLinked Data - Exposing what we have
Linked Data - Exposing what we haveRichard Wallis
 
Open Data and CKAN Data Catalogues
Open Data and CKAN Data CataloguesOpen Data and CKAN Data Catalogues
Open Data and CKAN Data Cataloguesdavid-read
 
Linked data demystified:Practical efforts to transform CONTENTDM metadata int...
Linked data demystified:Practical efforts to transform CONTENTDM metadata int...Linked data demystified:Practical efforts to transform CONTENTDM metadata int...
Linked data demystified:Practical efforts to transform CONTENTDM metadata int...Cory Lampert
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 

Similar a Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Historian Use Case (20)

Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Open Data and CKAN Data Catalogues
Open Data and CKAN Data CataloguesOpen Data and CKAN Data Catalogues
Open Data and CKAN Data Catalogues
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer Header
 
Schema.org: What It Means For You and Your Library
Schema.org: What It Means For You and Your LibrarySchema.org: What It Means For You and Your Library
Schema.org: What It Means For You and Your Library
 
Schema.org - An Extending Influence
Schema.org - An Extending InfluenceSchema.org - An Extending Influence
Schema.org - An Extending Influence
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internet
 
Thesis Proposal: User Application Profiles for Publishing Linked Data in HTM...
Thesis Proposal: User Application Profiles for Publishing Linked Data in  HTM...Thesis Proposal: User Application Profiles for Publishing Linked Data in  HTM...
Thesis Proposal: User Application Profiles for Publishing Linked Data in HTM...
 
SemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in PracticeSemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in Practice
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Bingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman PresentationBingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman Presentation
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked Data
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internet
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
Linked Data - Exposing what we have
Linked Data - Exposing what we haveLinked Data - Exposing what we have
Linked Data - Exposing what we have
 
Open Data and CKAN Data Catalogues
Open Data and CKAN Data CataloguesOpen Data and CKAN Data Catalogues
Open Data and CKAN Data Catalogues
 
Linked data demystified:Practical efforts to transform CONTENTDM metadata int...
Linked data demystified:Practical efforts to transform CONTENTDM metadata int...Linked data demystified:Practical efforts to transform CONTENTDM metadata int...
Linked data demystified:Practical efforts to transform CONTENTDM metadata int...
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 

Más de Ian Milligan

Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...
Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...
Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...Ian Milligan
 
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...Ian Milligan
 
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...Ian Milligan
 
Congress text-mining-event
Congress text-mining-eventCongress text-mining-event
Congress text-mining-eventIan Milligan
 
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...Ian Milligan
 
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...Ian Milligan
 
Ruest and Milligan - The Great WARC Adventure
Ruest and Milligan - The Great WARC AdventureRuest and Milligan - The Great WARC Adventure
Ruest and Milligan - The Great WARC AdventureIan Milligan
 
International Internet Preservation Consortium Research Slides from Ian Milligan
International Internet Preservation Consortium Research Slides from Ian MilliganInternational Internet Preservation Consortium Research Slides from Ian Milligan
International Internet Preservation Consortium Research Slides from Ian MilliganIan Milligan
 
Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014Ian Milligan
 

Más de Ian Milligan (9)

Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...
Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...
Welcome to the GeoHood: Using the GeoCities Web Archive to Explore Virtual Co...
 
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv...
 
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori...
 
Congress text-mining-event
Congress text-mining-eventCongress text-mining-event
Congress text-mining-event
 
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr...
 
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ...
 
Ruest and Milligan - The Great WARC Adventure
Ruest and Milligan - The Great WARC AdventureRuest and Milligan - The Great WARC Adventure
Ruest and Milligan - The Great WARC Adventure
 
International Internet Preservation Consortium Research Slides from Ian Milligan
International Internet Preservation Consortium Research Slides from Ian MilliganInternational Internet Preservation Consortium Research Slides from Ian Milligan
International Internet Preservation Consortium Research Slides from Ian Milligan
 
Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014
 

Último

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxBipin Adhikari
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 

Último (20)

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptx
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 

Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Historian Use Case

  • 1. Warcbase Building a Scalable Platform on HBase and Hadoop Part Two: Historian Use Case Jimmy Lin University of Maryland College Park, MD Ian Milligan University of Waterloo Waterloo, ON Canada
  • 2. Why should a historian care? The sheer amount of social, cultural, and political information generated every day presents new opportunities for historians.
  • 3. Could one even study the 1990s and beyond without web archives?
  • 4. No. Historians need to do this now, or we’re going to be left behind.
  • 5. Nightmare Scenario • Wayback Machine won’t be enough. We won’t use that. • Historians rely uncritically on date-ordered keyword search results, putting them at mercy of search algorithms they do not understand; • Historians are completely left out of post-1996 research, letting everybody else do the work (a la Culturomics project/Nature magazine article); • Our profession gets left behind…
  • 6.
  • 7. Unlocking an Archive-It Collection • Archive-It has amazing collections of social, cultural, political, and economic records generated by everyday people, leaders, businesses, academics, and beyond. • Stories waiting to be hold. • The data is there, but the problem is access.
  • 8. Example Dataset • Archive-It Collection 227, Canadian Political Parties and Political Interest Groups (University of Toronto) • October 2005 - Present • All major and minor political parties, as well as organized political interest groups (Council of Canadians, Coalition to Oppose the Arms Trade Assembly of First Nations, etc.) • Started by now-retired librarian, hard to get details on seed list
  • 9. Two Main Approaches • Warcbase • Link extraction and analytics • Full-text extraction and analytics • Full-text faceted search • UK Web Archive’s Shine solr front end
  • 10. Using Warcbase to analyze links and full-text
  • 11. Basic Link Statistics • Count number of pages per domain • Count number of links for each crawl so they can be normalized (very important) • Run on command line using relatively simple pig scripts
  • 12. Example Script (counting number of links for each crawl) register  'target/warcbase-­‐0.1.0-­‐SNAPSHOT-­‐fatjar.jar';   DEFINE  ArcLoader  org.warcbase.pig.ArcLoader();   DEFINE  ExtractLinks   org.warcbase.pig.piggybank.ExtractLinks();   raw  =  load  '/shared/collections/CanadianPoliticalParties/ arc/'  using  ArcLoader  as      (url:  chararray,  date:  chararray,  mime:  chararray,   content:  bytearray);   a  =  filter  raw  by  mime  ==  'text/html'  and  date  is  not  null;   b  =  foreach  a  generate  SUBSTRING(date,  0,  6)  as  date,  url,   FLATTEN(ExtractLinks((chararray)  content,  url));   c  =  group  b  by  $0;   d  =  foreach  c  generate  group,  COUNT(b);
  • 13. Social Media Appearances - Twitter (20080611220246,http://creativecommons.org/,twitter)   (20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)   (20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)   (20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&;id=1814,twitter)   (20080930221618,http://www.ndp.ca/home,twitter)   (20080930221618,http://www.ndp.ca/home,twitter)   (20080930221638,http://www.liberal.ca/default_e.aspx,twitter)   (20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)   (20080930221714,http://www.liberal.ca/video_e.aspx,twitter)   (20080930221903,http://www.ndp.ca/page/5246,twitter)   (20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-­‐twitter.php? lang=en,twitter)   (20080930222049,http://greenparty.ca/en/action,twitter)   (20080930222124,http://www.ndp.ca/bloggingtools,twitter)   (20080930222825,http://greenparty.ca/en/campaign/35053,twitter)   (20080930223014,http://greenparty.ca/en/campaign/35068,twitter)   (20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)   (20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)   (20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)   (20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)   (20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
  • 14. Social Media Appearances - Facebook (20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)   (20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)   (20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)   (20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)   (20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)   (20070418141139,http://greenparty.ca/en/blog/431,facebook)   (20070418141930,http://greenparty.ca/en/blog?page=2,facebook)   (20070418143749,http://greenparty.ca/en/node/1280,facebook)   (20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)   (20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)   (20070418151727,http://www.equalvoice.ca/youth/,facebook)   (20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)   (20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)   (20070418153832,http://greenparty.ca/fr/node/1280,facebook)   (20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu? page=2,facebook)   (20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book? page=2,facebook)   (20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)   (20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)   (20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)   (20070518134941,http://www.ndp.ca/page/4733,facebook)
  • 15. Link Analysis • Extracting links by domain (tab-separated values): 200810  conservative.ca   digg.com   2325   200810  conservative.ca   facebook.com   2325   200810  conservative.ca   mycampaign.conservative.ca   7902   [..]   200902  liberal.ca  ctv.ca  16   200902  liberal.ca  del.icio.us   1118   200902  liberal.ca  digg.com   1118  
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22. Other Cases • Extracting all links to the mainstream media, or thinktanks, or other political parties
  • 23.
  • 25. Text Analysis register  'target/warcbase-­‐0.1.0-­‐SNAPSHOT-­‐fatjar.jar';   DEFINE  ArcLoader  org.warcbase.pig.ArcLoader();   DEFINE  ExtractRawText  org.warcbase.pig.piggybank.ExtractRawText();   DEFINE  ExtractTopLevelDomain   org.warcbase.pig.piggybank.ExtractTopLevelDomain();   raw  =  load  '/shared/collections/CanadianPoliticalParties/arc/'  using   ArcLoader  as      (url:  chararray,  date:  chararray,  mime:  chararray,  content:  bytearray);   a  =  filter  raw  by  mime  ==  'text/html'  and  date  is  not  null;   b  =  foreach  a  generate  SUBSTRING(date,  0,  6)  as  date,                                                REPLACE(ExtractTopLevelDomain(url),  '^s*www.',   '')  as  url,  content;   c  =  filter  b  by  url  ==  'greenparty.ca';   d  =  foreach  c  generate  date,  url,  ExtractRawText((chararray)  content)  as   text;   store  d  into  'cpp.text-­‐greenparty';
  • 26. Text Analysis • Now have circumscribed corpus for specified query (i.e. liberal.ca, or ndp.ca, or conservative.ca) • Can now use standard text analysis tools, etc. to extract meaning • LDA (topic modeling) • NER (named entity recognition)
  • 27. NER October  2005      62476  Stephen  Harper      30234  Michael  Chong      30109  Gwynne  Dyer      28011  ami  Entrez      26238  Paul  Martin      22303  Harper  
  • 28. NER November  2008        3188  Stéphane  Dion        2557  Stephen  Harper        2471  Stephen  HarperLaureen        2410  Dion        2356  Harper  
  • 30. Shine • UK Web Archive’s Shine (https://github.com/ukwa/ shine) • Indexing as bottleneck • ~ 250GB of WARCs takes ~ 5 days on a single machine • Hadoop indexer available if data in HFDS • ~ 90GB index size
  • 32. Shine • Advantages: accessible to the general public, easy to use, interactive trend diagram allows digging down for context, can move down to level of document itself. • Disadvantage: keyword searching requires you know what to look for; random sampling misleading when tens of thousands of records; etc. • Doesn’t take advantage of what makes web sources so powerful: hyperlinks
  • 34. Conclusions & Thanks Jimmy Lin University of Maryland College Park, MD Ian Milligan University of Waterloo Waterloo, ON Canada