Text Analyzer uses natural language processing to let researchers upload a document and get relevant results, including content, topics, and subjects. With it, JSTOR pushed beyond traditional searching; this webinar shares the challenges and opportunities that emerged from the beta test of the new tool.
On Beyond Keyword Search: The Thinking Behind JSTOR Labs' Text Analyzer - NFAIS Webinar 2017
1. ON BEYOND KEYWORD SEARCH: THE THINKING BEHIND JSTOR LABS’ TEXT ANALYZER
NFAIS Webinar: Shifting Patterns in Search and Discovery
June 15, 2017
@abhumphreys
Alex Humphreys, JSTOR Labs
2. ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
JSTOR is a not-for-profit digital library of academic journals, books, and primary sources.
Ithaka S+R is a not-for-profit research and consulting service that helps academic, cultural, and publishing communities thrive in the digital environment.
Portico is a not-for-profit preservation service for digital publications, including electronic journals, books, and historical collections.
Artstor provides 2+ million high-quality images and digital asset management software to enhance scholarship and teaching.
3. JSTOR Labs works with partner publishers, libraries and
labs to create tools for researchers, teachers and students
that are immediately useful – and a little bit magical.
23. THREE STEPS FOR EACH SEARCH
1. Extract text
• From many textual formats (PDF, Word, HTML, etc.)
• OCR, if needed (e.g. a picture of a page in a magazine)
2. Identify terms
• Topics: JSTOR Thesaurus & an LDA Topic Model
• Entities: Alchemy (Watson), OpenCalais, Stanford, Apache
3. Generate results
• TF-IDF to select 5 terms
• “OR” search
• Relevance ranked based on “equalizer”
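The "generate results" step can be sketched roughly as follows. This is a minimal illustration, not JSTOR's implementation: the function names, the simple tokenizer, the smoothed IDF formula, and the reading of the "equalizer" as a set of per-term weights are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a stand-in for real term extraction).
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def top_terms_by_tfidf(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score."""
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    corpus_token_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for tokens in corpus_token_sets if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def build_or_query(terms):
    """Join the selected terms into a single OR query string."""
    return " OR ".join(f'"{t}"' for t in terms)


def rank_results(terms, weights, corpus):
    """Rank documents matching any term by a weighted sum of term counts.

    `weights` plays the role of hypothetical "equalizer" settings;
    terms without an explicit weight default to 1.0.
    """
    results = []
    for idx, doc in enumerate(corpus):
        counts = Counter(tokenize(doc))
        score = sum(weights.get(t, 1.0) * counts[t] for t in terms)
        if score > 0:  # the OR search keeps any document matching at least one term
            results.append((idx, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

Raising the weight of one term moves documents that mention it up the ranking without dropping documents that match only the other terms, which is the behavior an OR search with adjustable term weights would be expected to have.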
24. WHERE DO THE TOPICS COME FROM?
THE JSTOR THESAURUS
• A controlled vocabulary containing 40,000+ terms, representing concepts (no entities, currently) found in the JSTOR corpus
• Constructed from 20 thesauri obtained from various sources, including ERIC, MeSH, and NASA
• Developed in collaboration with Access Innovations
• Key branches in the thesaurus are reviewed and corrected by subject matter experts
26. WHY THESE TOPICS?
AND, WHERE DID THEY COME FROM?
Human-curated tagging rules have been developed for each concept in the JSTOR Thesaurus, enabling concepts to be extracted from unstructured text.
All documents in the JSTOR corpus have been tagged with thesaurus concepts using a rules-based indexer.
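A rules-based indexer of this kind can be pictured with the toy sketch below. The rule format (a concept mapped to trigger phrases), the sample rules, and the function name are hypothetical; the actual tagging rules are human-curated and presumably far richer.

```python
import re

# Hypothetical tagging rules: thesaurus concept -> phrases that trigger it.
RULES = {
    "Redistricting": ["redistricting", "redrawn districts", "reapportionment"],
    "Gerrymandering": ["gerrymander", "gerrymandering"],
    "Congressional districts": ["congressional district"],
}


def tag_concepts(text, rules=RULES):
    """Return the sorted list of concepts whose trigger phrases occur in `text`."""
    lowered = text.lower()
    matched = set()
    for concept, phrases in rules.items():
        for phrase in phrases:
            # Match whole words only, allowing a simple plural suffix.
            pattern = r"\b" + re.escape(phrase) + r"(s|es)?\b"
            if re.search(pattern, lowered):
                matched.add(concept)
                break  # one matching phrase is enough for this concept
    return sorted(matched)
```

Applied to every document in a corpus, a function like this yields the concept tags that later steps can use to pick training documents.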
28. WHY THESE TOPICS?
AND, WHERE DID THEY COME FROM?
This tagged corpus is then used to select training documents for building an LDA topic model.
The LDA topic model enables us to identify latent topics found in text in addition to those explicitly identified with the human-generated rules.
29. TOPIC MODEL
• Labeled LDA topic model
• Model trained using documents selected from the JSTOR corpus with tagged thesaurus concepts
• Built using the open-source MALLET tool
• Current version of the model includes approximately 11,000 topics
• Each topic is represented as a probability distribution over words
Top words from some sample topics:
Redistricting:
redistricting district congressional minority political majority house legislative racial
gerrymandering court republican plan electoral districting seat representative black voter
democrat partisan election democratic representation line supreme legislature drawn control
population voting drawing policy texas draw map claim boundary following commission outcome
shaw race census legal principle creation decision create finding elect lublin polarization optimal
elected composition affect member measure vote gain previous legislator geographic southern
section every approach controlled round note gerrymander reapportionment compactness
decennial bipartisan constitutional find substantive california roll competitive county competition
party requirement federal north post redrawn incumbent criterion consequence likely formal safe
delegation georgia justice influence shotts equal favor might scholar equality south power law
judicial bias king carolina call according voss baker panel professor rule mandate creating
increased determine constraint politics argue standard redis grofman reno cain redrawing margin
share ing tricting decrease congress geographical requires simple held critic empirical david
niemi perverse latino analyze examine debate rather impact next provides give balance affected
subsequent possible take practice community robbins constitution computer evenly fraction
constituent illinois supporter shape responsiveness typically various proposed despite either
focus conclusion african opportunity redistrict mcdonald white numerous test statewide percent
suggests thus choice largely develop decade conclude fact four reached
Congressional districts:
district congressional congress house representative member federal districting seat majority
plan representation population congressman apportionment elected court president washington
columbia legislative census party interest political gerrymandering redistricting home thomas
affect every black democrat dis foley carolina find reapportionment constituency supreme
constitution voting geographic active dinner responsiveness south force john gingrich legislature
equal membership neighborhood testimony north james service decennial constituent passed
boundary law creation firm charles spending congruent election politically addition april contact
proportion con assistant position following york land unconstitutional resident miller voter pledge
stephen city official minority respective mainland kentucky post clause better divisor perimeter
yao secretary republican senate moderate congruence map county grant senior drawing portion
speaker feature decision professor became gerrymander swain trict leapfrog federalist partisan
senator vote captain compelling lucas candidate race create harm require fourth shape you
traditional purpose shaped concern people shaw historical simply policy henry david allocation
vetoed arkansas smiley serra carl volunteer politician budget burden electoral leaf education
reduced principle proximity november significant just represented second gathered fiorina
representa gressional glazer apportion gerrymandered boris bronx issn rank redrawing twice
refused eliminates provincial jefferson returned witness campaign fletcher georgia empirically
personnel size maximize half reserve read demographic percent contrary required determining
throughout …
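Lists of top words like these fall out of sorting each topic's word probabilities. The sketch below illustrates that idea with two toy topics; the probabilities are invented for the example, the function names are hypothetical, and the scoring function is a crude unigram comparison rather than the LDA inference that a tool such as MALLET performs.

```python
import math

# Toy topics with made-up word probabilities (each sums to 1.0).
TOPICS = {
    "Redistricting": {
        "redistricting": 0.30, "district": 0.25, "gerrymandering": 0.20,
        "court": 0.15, "plan": 0.10,
    },
    "Congressional districts": {
        "district": 0.35, "congressional": 0.30, "congress": 0.15,
        "house": 0.12, "seat": 0.08,
    },
}


def top_words(word_probs, n=10):
    """Return the n most probable words of one topic, most probable first."""
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def best_topic(tokens, topics=TOPICS, floor=1e-6):
    """Pick the topic under which the tokens have the highest average log-probability.

    Words missing from a topic receive a small floor probability.
    """
    def avg_log_prob(word_probs):
        return sum(math.log(word_probs.get(t, floor)) for t in tokens) / len(tokens)

    return max(topics, key=lambda name: avg_log_prob(topics[name]))
```

Because "district" is likely under both toy topics, it is the less common words ("gerrymandering", "house") that decide which topic fits a text better, which mirrors how the two sample lists above overlap heavily yet remain distinguishable.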
30. Keyword searching is great, but it ain’t perfect. There’s more we can do for users.
31. THANKS, DESIGN THINKING!
FOCUS ON A USER’S GOALS…
This article needs to pass peer review.
I need more sources to back up my argument.
I need to make sure I’m not missing anything.
32. THANKS, DESIGN THINKING!
…AND WHAT’S STANDING IN THEIR WAY
This research touches on disciplines I’m new to. How do I know if I’m finding everything?
I know what I’m interested in, but the search terms I’m using aren’t working.
Blergh, Boolean search is too complicated.
33. THANKS, DESIGN THINKING!
UNDERSTAND THE USER’S CONTEXT
Hey, I’ve got my first draft right here.
At least I’ve found ONE article I can use.
All I have to work with is the assignment my teacher handed out.
I’m nowhere near my laptop.