This presentation introduces ArchiveSpark and its recent extension that makes it usable with any archival collection. The slides demonstrate how to set it up and how to use it to analyze data from medical journals of the Medical Heritage Library (MHL).
Medical Heritage Library (MHL) on ArchiveSpark
1. ArchiveSpark:
Efficient Access, Extraction and
Derivation for Archival Collections
https://github.com/helgeho/ArchiveSpark
Helge Holzmann (holzmann@L3S.de)
https://github.com/helgeho/MHLonArchiveSpark
in cooperation with
2. What is ArchiveSpark?
• Expressive and efficient data access / processing framework
• Originally built for Web Archives, later extended to any collection
• Joint work with the Internet Archive
• Open source
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Modular, easily extensible
• More details in: (please cite)
• Helge Holzmann, Vinay Goel, Avishek Anand. ArchiveSpark: Efficient Web
Archive Access, Extraction and Derivation. JCDL 2016 (Best Paper Nominee)
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
20/02/2018 Helge Holzmann (holzmann@L3S.de)
3. The ArchiveSpark Approach
• Metadata first, content second, your corpus third
• Two step loading approach for improved efficiency
• Filter as much as possible on metadata before touching the archive
• Enrich metadata instead of mapping / transforming the full records
(Diagram: how the two-step approach applies in Web archives and generally, e.g. for MHL.)
4. ArchiveSpark and MHL
• MHL-specific Data Specifications are available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark
• The metadata source is MHL's advanced full-text search
• http://mhl.countway.harvard.edu/search
• The basic features are replicated by our tool
• More advanced filtering can be done on the retrieved metadata
• Full-text contents are fetched from the Internet Archive
• Seamlessly, abstracted away from the user
• Metadata records are enriched with requested contents
5. Simple and Expressive Interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative and writing scripts for
ArchiveSpark does not require deep knowledge about Spark / Scala
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
// Specify the search query against MHL's full-text search
val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)
// Load the matching records (only metadata at this point)
val rdd = ArchiveSpark.load(MhlSearchSpec(query))
// Enrich each record with named entities extracted from its full text
val enriched = rdd.enrich(Entities)
// Save the enriched records as gzipped JSON
enriched.saveAsJson("enriched.json.gz")
6. Implicit Lineage Documentation
• Nested JSON output encodes lineage of applied enrichments
(Figure: a JSON record with the fields title and text, where entities and, below them, persons are nested under text to encode their derivation.)
7. Getting Started
• We recommend the interactive use with Jupyter
• http://jupyter.org with Apache Toree: https://toree.apache.org
• Commands can be run live, results are returned immediately
• Documentation with examples is available on GitHub
• https://github.com/helgeho/ArchiveSpark/blob/master/docs
• Everything set up in a Docker container to get you started
• https://github.com/helgeho/ArchiveSpark-docker
• ArchiveSpark with Jupyter and Toree, pre-configured
• It’s just one command away, instructions are on GitHub
8. Run ArchiveSpark with Jupyter
• Start Docker (see https://github.com/helgeho/ArchiveSpark-docker)
• Add additional JAR files for MHL
• Download from https://github.com/helgeho/MHLonArchiveSpark/releases
• Copy to config/lib (the config path you specified with Docker)
9. Your First ArchiveSpark Jupyter Notebook
• Open Jupyter in your browser:
• Create a new Jupyter notebook with ArchiveSpark:
• You are now ready to write your first job:
• Press ctrl+enter to run a cell and you will immediately see the output:
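A minimal first cell might simply confirm that the kernel is working; this is a sketch, assuming the Toree kernel exposes the SparkContext as `sc` (as the pre-configured setup does):

```scala
// Minimal first cell: print the Spark version to confirm the kernel works.
// `sc` is the SparkContext provided by the Toree kernel.
println(sc.version)
```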
10. Example: Polio Symptoms in MHL (1)
• What are the most frequently occurring symptoms and affected
body parts of Polio in journals of the Medical Heritage Library?
• The full example is available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/examples/MhlPolioSymptomsSearch.ipynb
• Details can be found in:
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
• http://www.helgeholzmann.de/papers/BIGDATA_2017.pdf
• More about the Spark operations used: Spark Programming Guide
• https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
11. Example: Polio Symptoms in MHL (2)
• Import the required modules / methods
• Specify the query and load the dataset
• Available options to specify the query can be found in the code
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/src/main/scala/edu/harvard/countway/mhl/archivespark/search/MhlSearchOptions.scala
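These two steps might look as follows; the import paths are assumptions inferred from the two repositories, and the query reuses the example from slide 5:

```scala
// Import ArchiveSpark and the MHL-specific search classes.
// (Package names are assumptions based on the repository layouts.)
import de.l3s.archivespark._
import edu.harvard.countway.mhl.archivespark.search._

// Specify the query against MHL's full-text search ...
val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)

// ... and load the dataset (only metadata is loaded at this point)
val rdd = ArchiveSpark.load(MhlSearchSpec(query))
```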
12. Example: Polio Symptoms in MHL (3)
• Now the dataset can be filtered based on the available metadata
• At any time, peekJson lets you look at the data of the first record as JSON
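As a hypothetical example, one could restrict the dataset to a publication period before any content is fetched; the `year` accessor below is made up, and the actual metadata fields can be inspected with peekJson:

```scala
// Keep only records whose (hypothetical) year field falls into a period.
// This step operates on metadata only -- no full-text content is fetched.
val filtered = rdd.filter(r => r.year >= 1910 && r.year <= 1950)

// Inspect the first remaining record as JSON
filtered.peekJson
```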
13. Example: Polio Symptoms in MHL (4)
• We define a new enrich function, which extracts the symptoms
• This is based on the content in lower case (LowerCase Enrich Function)
• We specify a set of interesting symptoms and affected body parts
• For each record, this is filtered by the ones contained in the content
• This new Enrich Function is assigned to a variable called symptoms
• Finally, the dataset is enriched with the set of contained symptoms
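Under these descriptions, the Enrich Function might be sketched like this; the symptom list is illustrative, and the `.map` signature is an assumption based on the Text.map example on slide 16:

```scala
// An illustrative set of symptoms and affected body parts
val interesting = Set("fever", "paralysis", "stiffness", "spine", "limbs")

// Derive a "symptoms" field from the lower-cased content:
// keep only the terms that actually occur in the text.
val symptoms = LowerCase.map("symptoms") { text: String =>
  interesting.filter(text.contains).toSeq
}

// Enrich the dataset with the set of contained symptoms
val enriched = rdd.enrich(symptoms)
```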
14. Example: Polio Symptoms in MHL (5)
• We again print out the first record to check the result
• println is used to see the full output; Jupyter would truncate it otherwise
…
The JSON structure nicely reflects the
lineage. We can immediately see that
symptoms in this case were extracted
from the text, which was first
converted to lower case.
15. Example: Polio Symptoms in MHL (6)
• Eventually, we can count the contained symptoms
• Our symptoms Enrich Function can be used as a pointer to the values here
• We flat-map these values, i.e., we create a flat list of all symptoms in the
dataset, each occurring once per record it is contained in
• To see the results, we print each line of the computed counts
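The flat-map-and-count logic can be illustrated on plain Scala collections, independent of Spark; the per-record symptom sets below are made up:

```scala
// Each inner Seq holds the symptoms found in one record
// (made-up data for illustration).
val perRecord: Seq[Seq[String]] = Seq(
  Seq("fever", "paralysis"),
  Seq("fever"),
  Seq("paralysis", "stiffness")
)

// Flatten to one entry per symptom per record, then count occurrences
val counts: Map[String, Int] =
  perRecord.flatten.groupBy(identity).map { case (s, xs) => (s, xs.size) }

// Print each line of the computed counts
counts.foreach(println)
```

On the RDD, the shape is the same: flat-map the enriched values pointed to by the symptoms Enrich Function, count them, and print each line.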
16. Example: Polio Symptoms in MHL (7)
• Alternatively, we could save all filtered and enriched records as JSON
• The JSON format is widely supported by many third-party tools, so
that the resulting dataset can easily be post-processed
• To post-process it with Spark, records can be mapped to their raw values
• value can be used to access an enriched value of a record, e.g.:
val mapped = enriched.map(r => (r.title, r.valueOrElse(symptoms, Seq.empty)))
• Hint: to access the raw text of a MHL document use Text:
E.g., Text.map("length") { txt: String => txt.length } or Entities.on(Text)
17. For Advanced Users
• Some Enrich Functions, such as Entities, need additional JAR files
• Entities requires the Stanford CoreNLP library and models
• http://central.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.4.1/
• These need to be added to the classpath (your config/lib directory if you use Docker)
• ArchiveSpark is also available on Maven Central
• https://mvnrepository.com/artifact/com.github.helgeho/archivespark
• To be used as library / API to access archival collections programmatically
• New Enrich Functions and DataSpecs are easy to create
• All required base classes are provided with the core project
• https://github.com/helgeho/ArchiveSpark/blob/master/docs/Contribute.md
• Please share yours!
18. That’s all Folks!
• Happy coding and please share your insights
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Feedback is very welcome…
• Visit us
• https://www.L3S.de
• https://archive.org
• http://alexandria-project.eu
• http://www.medicalheritage.org