This document discusses a new text analysis tool created by JSTOR Labs. Users upload a document, and the tool extracts its key terms, topics, and entities, then uses them to run a search. It extracts text with OCR where needed, identifies topics with the JSTOR Thesaurus and an LDA topic model, identifies entities with services such as the Alchemy API, and uses TF-IDF to select search terms. A demo of the tool is available online. JSTOR Labs is working to improve the algorithm and release an API. The document seeks feedback on how to change researcher behaviors and asks whether the tool is a feature, a product, or a service.
1. CNI Fall 2017
Creating a New Way to Search
Alex Humphreys, JSTOR Labs (@abhumphreys)
Barbara Rockenbach, Columbia Libraries (@wilderbach)
2. ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
JSTOR is a not-for-profit digital library of academic journals, books, and primary sources.
Ithaka S+R is a not-for-profit research and consulting service that helps academic, cultural, and publishing communities thrive in the digital environment.
Portico is a not-for-profit preservation service for digital publications, including electronic journals, books, and historical collections.
Artstor provides 2+ million high-quality images and digital asset management software to enhance scholarship and teaching.
3. JSTOR Labs works with partner publishers, libraries and labs to create tools for researchers, teachers and students that are immediately useful – and a little bit magical.
11. THREE STEPS FOR EACH SEARCH
Step 1: Extract text
• From many textual formats (PDF, Word, HTML, etc.)
• OCR, if needed (e.g. a picture of a page in a magazine)
Step 2: Identify terms
• Topics: JSTOR Thesaurus & an LDA topic model
• Entities: Alchemy (Watson), OpenCalais, Stanford, Apache
• TF-IDF to select 5 terms
Step 3: Generate results
• “OR” search
• Relevance ranked based on “equalizer”
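The term-selection and search steps above can be sketched in a few lines of Python. This is a minimal illustration, not the JSTOR Labs implementation: the function names (tokenize, top_terms, or_query, weighted_score), the small background corpus used for document frequencies, and the reading of the “equalizer” as a set of per-term weights are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(document, background_docs, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `background_docs` is a hypothetical reference corpus used only to
    estimate how common each term is.
    """
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    background_sets = [set(tokenize(d)) for d in background_docs]
    n_docs = len(background_docs) + 1  # the uploaded document counts too
    scores = {}
    for term, count in tf.items():
        df = 1 + sum(1 for s in background_sets if term in s)
        idf = math.log(n_docs / df)
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def or_query(terms):
    """Join the selected terms into a single "OR" search string."""
    return " OR ".join(f'"{t}"' for t in terms)


def weighted_score(candidate_text, weights):
    """Score a candidate result as a weighted sum of term counts.

    Assumption: the "equalizer" is modeled here as one weight per term.
    """
    counts = Counter(tokenize(candidate_text))
    return sum(w * counts[t] for t, w in weights.items())
```

A caller would pass the extracted text to top_terms, feed the result to or_query, and then sort candidate results by weighted_score. Raising or lowering a single term's weight changes the ranking without changing the query itself.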
12. TOPIC MODEL
• Labeled LDA topic model
• Model trained using documents selected from Wikipedia and JSTOR
• Built using the open-source (OSS) MALLET tool
• Current version of the model includes approximately 11,000 topics
• Each topic represents a distribution of word probabilities
redistricting district congressional minority political majority house legislative racial
gerrymandering court republican plan electoral districting seat representative black voter
democrat partisan election democratic representation line supreme legislature drawn control
population voting drawing policy texas draw map claim boundary following commission outcome
shaw race census legal principle creation decision create finding elect lublin polarization optimal
elected composition affect member measure vote gain previous legislator geographic southern
section every approach controlled round note gerrymander reapportionment compactness
decennial bipartisan constitutional find substantive california roll competitive county competition
party requirement federal north post redrawn incumbent criterion consequence likely formal safe
delegation georgia justice influence shotts equal favor might scholar equality south power law
judicial bias king carolina call according voss baker panel professor rule mandate creating
increased determine constraint politics argue standard redis grofman reno cain redrawing margin
share ing tricting decrease congress geographical requires simple held critic empirical david
niemi perverse latino analyze examine debate rather impact next provides give balance affected
subsequent possible take practice community robbins constitution computer evenly fraction
constituent illinois supporter shape responsiveness typically various proposed despite either
focus conclusion african opportunity redistrict mcdonald white numerous test statewide percent
suggests thus choice largely develop decade conclude fact four reached
Redistricting
district congressional congress house representative member federal districting seat majority
plan representation population congressman apportionment elected court president washington
columbia legislative census party interest political gerrymandering redistricting home thomas
affect every black democrat dis foley carolina find reapportionment constituency supreme
constitution voting geographic active dinner responsiveness south force john gingrich legislature
equal membership neighborhood testimony north james service decennial constituent passed
boundary law creation firm charles spending congruent election politically addition april contact
proportion con assistant position following york land unconstitutional resident miller voter pledge
stephen city official minority respective mainland kentucky post clause better divisor perimeter
yao secretary republican senate moderate congruence map county grant senior drawing portion
speaker feature decision professor became gerrymander swain trict leapfrog federalist partisan
senator vote captain compelling lucas candidate race create harm require fourth shape you
traditional purpose shaped concern people shaw historical simply policy henry david allocation
vetoed arkansas smiley serra carl volunteer politician budget burden electoral leaf education
reduced principle proximity november significant just represented second gathered fiorina
representa gressional glazer apportion gerrymandered boris bronx issn rank redrawing twice
refused eliminates provincial jefferson returned witness campaign fletcher georgia empirically
personnel size maximize half reserve read demographic percent contrary required determining
throughout …
Congressional districts
Top words from some sample topics
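To make "each topic represents a distribution of word probabilities" concrete, the sketch below scores a piece of text against two toy topics. The probability values are invented for illustration and are not taken from the JSTOR model. The scoring is a simple unigram log-likelihood, a deliberate simplification rather than the Labeled LDA inference that MALLET performs. The names TOPICS, FLOOR, topic_log_likelihood and best_topic are hypothetical.

```python
import math
import re

# Toy topics: each maps words to probabilities (illustrative values only).
TOPICS = {
    "Redistricting": {
        "redistricting": 0.05,
        "district": 0.04,
        "gerrymandering": 0.02,
        "court": 0.01,
    },
    "Congressional districts": {
        "district": 0.05,
        "congressional": 0.04,
        "congress": 0.03,
        "house": 0.02,
    },
}

# Small probability assigned to words a topic does not list (an assumption).
FLOOR = 1e-6


def topic_log_likelihood(text, word_probs, floor=FLOOR):
    """Sum the log-probability of every token in `text` under one topic."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(math.log(word_probs.get(t, floor)) for t in tokens)


def best_topic(text, topics=TOPICS):
    """Return the label of the topic under which `text` is most likely."""
    return max(topics, key=lambda name: topic_log_likelihood(text, topics[name]))
```

Both sample topics give weight to "district", so the words that separate them are the ones that appear in only one list, such as "gerrymandering" or "congress".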
13. WHAT’S NEXT?
• Ongoing improvements to algorithm
• API releasing this week to beta partners
• Article recommendations…
16. Is this a feature, a product or a service?*
* See: https://scholarlykitchen.sspnet.org/2015/01/27/when-is-a-feature-a-product-and-a-product-a-business/