ArchiveSpark:
Efficient Access, Extraction and
Derivation for Archival Collections
https://github.com/helgeho/ArchiveSpark
Helge Holzmann (holzmann@L3S.de)
https://github.com/helgeho/MHLonArchiveSpark
in cooperation with
What is ArchiveSpark?
• Expressive and efficient data access / processing framework
• Originally built for Web Archives, later extended to any collection
• Joint work with the Internet Archive
• Open source
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Modular, easily extensible
• More details in: (please cite)
• Helge Holzmann, Vinay Goel, Avishek Anand. ArchiveSpark: Efficient Web
Archive Access, Extraction and Derivation. JCDL 2016 (Best Paper Nominee)
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
20/02/2018 Helge Holzmann (holzmann@L3S.de)
The ArchiveSpark Approach
• Metadata first, content second, your corpus third
• Two-step loading approach for improved efficiency
• Filter as much as possible on metadata before touching the archive
• Enrich metadata instead of mapping / transforming the full records
[Figure: the two-step loading approach, in Web archives and generally (MHL)]
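The "metadata first, content second" idea can be illustrated with a small plain-Scala sketch. It is not the ArchiveSpark API: `Meta`, `fetchContent`, and the sample records are hypothetical stand-ins, and the counter only exists to show that the expensive content access happens after filtering.

```scala
// Plain-Scala model of two-step loading; not the ArchiveSpark API.
object TwoStepLoadingSketch {
  // Lightweight metadata record (hypothetical fields).
  final case class Meta(id: String, title: String, year: Int)

  // Counts how often the (expensive) archive is accessed.
  var fetches = 0

  // Stand-in for fetching the full text of one record from the archive.
  def fetchContent(meta: Meta): String = {
    fetches += 1
    s"full text of ${meta.id}"
  }

  // Made-up metadata, as it might come back from a search.
  val metadata: Seq[Meta] = Seq(
    Meta("a", "Polio report", 1916),
    Meta("b", "Annual proceedings", 1920),
    Meta("c", "Polio in children", 1931)
  )

  // Step 1: filter on metadata only. Step 2: fetch content for the survivors.
  def load(): Seq[(Meta, String)] =
    metadata
      .filter(_.title.toLowerCase.contains("polio"))
      .map(m => m -> fetchContent(m))

  def main(args: Array[String]): Unit = {
    val records = load()
    println(s"${records.size} records loaded with $fetches archive accesses")
  }
}
```

Only the records that survive the metadata filter trigger a content access, which is the efficiency argument made on this slide.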
ArchiveSpark and MHL
• MHL-specific Data Specifications are available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark
• The metadata source is MHL's advanced full-text search
• http://mhl.countway.harvard.edu/search
• The basic features are replicated by our tool
• More advanced filtering can be done on the retrieved metadata
• Full-text contents are fetched from the Internet Archive
• Seamlessly, abstracted away from the user
• Metadata records are enriched with requested contents
Simple and Expressive Interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative and writing scripts for
ArchiveSpark does not require deep knowledge about Spark / Scala
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)
val rdd = ArchiveSpark.load(MhlSearchSpec(query))
val enriched = rdd.enrich(Entities)
enriched.saveAsJson("enriched.json.gz")
Implicit Lineage Documentation
• Nested JSON output encodes lineage of applied enrichments
[Figure: nested JSON record with the fields title, text, entities, persons]
Getting Started
• We recommend interactive use with Jupyter
• http://jupyter.org with Apache Toree: https://toree.apache.org
• Commands can be run live, results are returned immediately
• Documentation with examples is available on GitHub
• https://github.com/helgeho/ArchiveSpark/blob/master/docs
• Everything is set up in a Docker container to get you started
• https://github.com/helgeho/ArchiveSpark-docker
• ArchiveSpark with Jupyter and Toree, pre-configured
• It’s just one command away, instructions are on GitHub
Run ArchiveSpark with Jupyter
• Start Docker (see https://github.com/helgeho/ArchiveSpark-docker)
• Add additional JAR files for MHL
• Download from https://github.com/helgeho/MHLonArchiveSpark/releases
• Copy to config/lib (the config path you specified with Docker)
Your First ArchiveSpark Jupyter Notebook
• Open Jupyter in your browser:
• Create a new Jupyter notebook with ArchiveSpark:
• You are now ready to write your first job:
• Press Ctrl+Enter to run a cell and you will immediately see the output:
Example: Polio Symptoms in MHL (1)
• What are the most frequently occurring symptoms and affected
body parts of Polio in journals of the Medical Heritage Library?
• The full example is available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/examples/MhlPolioSymptomsSearch.ipynb
• Details can be found in:
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
• http://www.helgeholzmann.de/papers/BIGDATA_2017.pdf
• More about the used Spark operations: Spark Programming Guide
• https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
Example: Polio Symptoms in MHL (2)
• Import the required modules / methods
• Specify the query and load the dataset
• Available options to specify the query can be found in the code
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/src/main/scala/edu/harvard/countway/mhl/archivespark/search/MhlSearchOptions.scala
Example: Polio Symptoms in MHL (3)
• Now the dataset can be filtered based on the available metadata
• At any time, peekJson lets you look at the data of the first record as JSON
Example: Polio Symptoms in MHL (4)
• We define a new enrich function, which extracts the symptoms
• This is based on the content in lower case (LowerCase Enrich Function)
• We specify a set of interesting symptoms and affected body parts
• For each record, this set is filtered down to the terms contained in the content
• This new Enrich Function is assigned to a variable called symptoms
• Finally, the dataset is enriched with the set of contained symptoms
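The extraction logic described above (lower-case the content, then keep the terms of interest that occur in it) can be sketched in plain Scala. This is not the ArchiveSpark Enrich Function itself: the term list is illustrative and may differ from the one in the original notebook, and matching is by simple substring.

```scala
// Plain-Scala sketch of the extraction logic; not an ArchiveSpark Enrich Function.
object SymptomExtractionSketch {
  // Illustrative terms of interest (an assumption, not the notebook's list).
  val terms: Set[String] = Set("fever", "headache", "paralysis", "leg", "arm", "spine")

  // Lower-case the content first, then keep only the terms that occur in it.
  def symptoms(content: String): Set[String] = {
    val lower = content.toLowerCase
    terms.filter(term => lower.contains(term))
  }

  def main(args: Array[String]): Unit = {
    println(symptoms("Fever and PARALYSIS of the left leg"))
  }
}
```

Substring matching is the simplest choice for a sketch; it also matches a term inside longer words, so a real analysis might tokenize the text first.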
Example: Polio Symptoms in MHL (5)
• We again print out the first record to check the result
• println is used to see the full output; Jupyter would truncate it otherwise
…
The JSON structure nicely reflects the
lineage. We can immediately see that
symptoms in this case were extracted
from the text, which was first
converted to lower case.
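To make the lineage idea concrete, here is a plain-Scala sketch of a record in which every derived value is nested under the value it was derived from. It does not reproduce ArchiveSpark's actual JSON layout: the `Node` type and the key names (`text`, `lowercase`, `symptoms`) are assumptions chosen to mirror the description above.

```scala
// Plain-Scala model of lineage through nesting; not ArchiveSpark's data model.
object LineageSketch {
  // A node holds an optional value plus the values derived from it.
  final case class Node(value: Option[String] = None,
                        children: Map[String, Node] = Map.empty) {

    // Attach a derived value below the node found at `path`.
    def enrich(path: List[String], name: String, derived: String): Node = path match {
      case Nil =>
        copy(children = children + (name -> Node(Some(derived))))
      case head :: tail =>
        val parent = children.getOrElse(head, Node())
        copy(children = children + (head -> parent.enrich(tail, name, derived)))
    }

    // Dotted paths of all values; each path spells out how a value was derived.
    def paths(prefix: String = ""): List[String] =
      children.toList.sortBy(_._1).flatMap { case (name, node) =>
        val path = if (prefix.isEmpty) name else s"$prefix.$name"
        path :: node.paths(path)
      }
  }

  // Made-up record: symptoms derived from the lower-cased text.
  val record: Node = Node()
    .enrich(Nil, "title", "A report on polio")
    .enrich(Nil, "text", "Paralysis of the LEGS")
    .enrich(List("text"), "lowercase", "paralysis of the legs")
    .enrich(List("text", "lowercase"), "symptoms", "paralysis, legs")

  def main(args: Array[String]): Unit =
    record.paths().foreach(println)
}
```

The path `text.lowercase.symptoms` reads as "symptoms, extracted from the lower-cased text", which is the kind of information the nested output carries without any extra documentation.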
Example: Polio Symptoms in MHL (6)
• Finally, we can count the contained symptoms
• Our symptoms Enrich Function can be used as a pointer to the values here
• We flat-map these values, i.e., we create a flat list of all symptoms in the
dataset, each occurring once per record it is contained in
• To see the results, we print each line of the computed counts
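The counting step can be sketched with plain Scala collections. The per-record sets below are made-up data, and the code runs on local collections rather than on the enriched dataset. On a Spark RDD the same shape is commonly written as `flatMap` followed by `countByValue()` or `reduceByKey`.

```scala
// Plain-Scala sketch of flat-mapping and counting; runs on local collections.
object SymptomCountSketch {
  // One set of extracted symptoms per record (made-up data).
  val perRecord: Seq[Set[String]] = Seq(
    Set("fever", "paralysis"),
    Set("paralysis", "leg"),
    Set("paralysis")
  )

  // Flatten to one entry per (record, symptom) pair, then count per symptom.
  def counts(records: Seq[Set[String]]): Seq[(String, Int)] =
    records
      .flatMap(_.toSeq)
      .groupBy(identity)
      .map { case (symptom, occurrences) => symptom -> occurrences.size }
      .toSeq
      .sortBy { case (symptom, n) => (-n, symptom) } // most frequent first

  def main(args: Array[String]): Unit =
    counts(perRecord).foreach { case (symptom, n) => println(s"$symptom: $n") }
}
```

Because each record contributes a set, a symptom is counted at most once per record, no matter how often it appears in that record's text.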
Example: Polio Symptoms in MHL (7)
• Alternatively, we could save all filtered and enriched records as JSON
• The JSON format is widely supported by many third-party tools, so
that the resulting dataset can easily be post-processed
• To post-process it with Spark, records can be mapped to their raw values
• value (or valueOrElse with a default) can be used to access an enriched value of a record, e.g.:
val mapped = enriched.map(r => (r.title, r.valueOrElse(symptoms, Seq.empty)))
• Hint: to access the raw text of an MHL document use Text:
E.g., Text.map("length") {txt: String => txt.length} or Entities.on(Text)
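The pattern of mapping records to raw values with a fallback can be illustrated in plain Scala. `Record` is a hypothetical stand-in rather than an ArchiveSpark type, and `Option.getOrElse` plays the role that `valueOrElse` plays in the snippet above.

```scala
// Plain-Scala illustration; Record is a hypothetical stand-in, not an ArchiveSpark type.
object RawValuesSketch {
  // A record whose text and enrichment may be missing.
  final case class Record(title: String,
                          text: Option[String],
                          symptoms: Option[Seq[String]])

  // Pair each title with its symptoms, falling back to an empty list.
  def titleWithSymptoms(records: Seq[Record]): Seq[(String, Seq[String])] =
    records.map(r => (r.title, r.symptoms.getOrElse(Seq.empty[String])))

  // Derive a new value from the raw text, here its length.
  def textLengths(records: Seq[Record]): Seq[Option[Int]] =
    records.map(_.text.map(_.length))

  // Made-up sample data.
  val sample: Seq[Record] = Seq(
    Record("Report A", Some("fever and paralysis"), Some(Seq("fever", "paralysis"))),
    Record("Report B", None, None)
  )

  def main(args: Array[String]): Unit = {
    titleWithSymptoms(sample).foreach(println)
    textLengths(sample).foreach(println)
  }
}
```

The fallback keeps records without an enrichment in the result instead of failing on them, which is usually what a later aggregation expects.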
For Advanced Users
• Some Enrich Functions, such as Entities, need additional JAR files
• Entities requires the Stanford CoreNLP library and models
• http://central.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.4.1/
• These need to be added to the classpath (your config/lib directory if you use Docker)
• ArchiveSpark is also available on Maven Central
• https://mvnrepository.com/artifact/com.github.helgeho/archivespark
• To be used as library / API to access archival collections programmatically
• New Enrich Functions and DataSpecs are easy to create
• All required base classes are provided with the core project
• https://github.com/helgeho/ArchiveSpark/blob/master/docs/Contribute.md
• Please share yours!
That’s all Folks!
• Happy coding and please share your insights
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Feedback is very welcome…
• Visit us
• https://www.L3S.de
• https://archive.org
• http://alexandria-project.eu
• http://www.medicalheritage.org

Medical Heritage Library (MHL) on ArchiveSpark
