Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Dan Sullivan
Big Data TechCon Boston 2015
*

*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations

*
* First commercial work in natural language
processing in late 1980s
* Document Warehousing and Text Mining, 2001
* Most recent and current text mining work in
life sciences area
* Classification
* Named Entity Recognition
* Event Extraction
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx

*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management

* Manual procedures are time consuming and costly
* Volume of literature continues to grow
* Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
* Popular tools have had some success but have limitations

*
* Analysis of tone or opinion of a
communication
* Polarity:
text → {positive, neutral, negative}
* Categorization:
text → {angry, pleased, confused …}
* Scale:
text → -10 … +10
* Metadata about context essential
* subject area
* communication medium

*
*Keywords
*Lexical Affinity
* Affective Norms for English Words (ANEW)
* Emotional Dimensions
* Arousal
* Dominance
* Valence
*Statistical Classification
*Semantic or Concept-based Classification
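
The keyword technique listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the two word sets are hypothetical stand-ins for a curated lexicon such as ANEW, and the function name `polarity` is an assumption of this sketch.

```python
import re

# Hypothetical mini-lexicon; real systems use curated resources such as ANEW.
POSITIVE = {"good", "great", "love", "excellent", "pleased"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "angry"}


def polarity(text):
    """Map text to 'positive', 'neutral', or 'negative' by counting keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Keyword counting ignores negation, sarcasm, and context, which is one reason the statistical and concept-based techniques in the list exist.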

*
* Use Cases
* Brand monitoring
* Competitive intelligence
* Demographic modeling
* Campaign analysis
* Tools
* RapidMiner
* ViralHeat Sentiment Analysis API
* Python NLTK
* Python TextBlob
* R sentiment package

*
* Technique for identifying dominant themes
in documents
* Does not require labeled training data
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Each document is about a mixture of topics
*Each word used in a document is attributable to
a topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/

[Figure: topic words for three articles on the New York Times business page]
* Debt, Law, Graduation
* Debt, EU, Greece, Euro
* EU, Greece, Negotiations, Varoufakis
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015

*
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W

Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
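
The topic-learning loop on this slide corresponds closely to collapsed Gibbs sampling for LDA. The sketch below is one possible reading of those steps, not the speaker's implementation. The smoothing hyperparameters `alpha` and `beta`, the iteration count, and the function name `lda_gibbs` are assumptions added so that the probabilities are never zero.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (assignments, doc_topic_counts). alpha and beta are smoothing
    hyperparameters assumed for this sketch.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]      # count of topic k in doc d
    topic_word = [dict() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                    # total words in topic k

    # Step 1: assign each word to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight each topic by P(topic|doc) * P(word|topic).
                # The P(topic|doc) denominator is the same for every topic,
                # so it is omitted from the unnormalized weight.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    return assignments, doc_topic
```

Dividing each row of the returned counts by the document length gives the per-document topic mixture, in the style of "Doc 2: 25% CPU, 30% memory, 45% I/O". For real corpora, the packages listed under Tools are a better choice than a hand-written sampler.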

*
* Use Cases
* Data exploration in large corpus
* Pre-classification analysis
* Identify dominant themes
* Tools
*Stanford Topic Modeling Toolbox
*Mallet (UMass Amherst)
*R package: topicmodels
*Python package: Gensim

* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of the angle between two vectors measures their similarity
* Algorithms
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
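
To make the three components concrete, here is a minimal sketch of one algorithm from the list, Naïve Bayes, trained on labeled token lists. It assumes a multinomial model with add-one smoothing; the function names and the data layout are choices made for this illustration only.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """Train multinomial Naive Bayes from (tokens, label) pairs."""
    class_docs = Counter()   # number of training documents per label
    word_counts = {}         # per-label word frequencies
    vocab = set()
    for tokens, label in examples:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((counts[t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Here the representation is raw word counts rather than TF-IDF, which keeps the example short; the other listed algorithms would slot into the same train-then-classify shape.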
*

*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/

* A Support Vector Machine (SVM) is a large-margin classifier
* Commonly used in text classification
* Initial results are based on a life sciences sentence classifier
Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*

*Term Frequency (TF)
tf(t,d) = # of occurrences of t in d
t is a term
d is a document
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is the set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term occurs often in the document but in few
documents overall
*small when the term appears in many documents
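
The definitions above translate almost directly into code. This is a minimal sketch using the natural logarithm and raw term counts; the function name `tf_idf` and the list-of-token-lists input are assumptions of the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d): occurrences of t in d
        weights.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return weights
```

A term that appears in every document gets weight 0 because log(N/N) = 0, which is the "small when the term appears in many documents" case.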
*

* Bag-of-words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of set of unique words in corpus
* Stemming used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if word
appears in sentence represented by
vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
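
A short sketch of this representation follows. For brevity it stores raw counts in V[i] rather than a corpus-weighted value, and it skips stemming; both simplifications, and all function names, are assumptions of the example.

```python
import math


def build_index(sentences):
    """Assign each unique word in the corpus an index in the vector V."""
    vocab = sorted({w for s in sentences for w in s})
    return {w: i for i, w in enumerate(vocab)}


def to_vector(sentence, index):
    """V[i] is the count of word i in the sentence; word order is ignored."""
    v = [0] * len(index)
    for w in sentence:
        if w in index:
            v[index[w]] += 1
    return v


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because structure is ignored, "dog bites man" and "man bites dog" map to the same vector, which illustrates the loss of syntactic information noted above.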
*

Non-VF, Predicted VF:
* “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of
EspB into the host cell.”
* “Data were log-transformed to correct for heterogeneity of the variances where
necessary.”
* “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the
PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption
in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
* “Here, it is reported that the pO157-encoded Type V-secreted serine protease
EspP influences the intestinal colonization of calves.”
* “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing
E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and
intestinal inflammation but no signs of HUS.”
* “The DsbLI system also comprises a functional redox pair”

* Adding more training examples is unlikely to substantially improve results, as the error curves show
[Figure: "All" panel; training error and validation error on a vertical axis from 0 to 0.5, horizontal axis from 0 to 10,000]

8 Alternative Algorithms
Select 10,000 most important features using chi-square
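
Chi-square feature selection scores each term by how strongly its presence is associated with the class label, then keeps the top k terms. The sketch below assumes a binary label and document-level presence counts; the function names and the tie-breaking rule (alphabetical) are choices made for this example.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score.

    docs is a list of token lists; labels is a parallel list of booleans.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    with_term_pos, with_term_neg = {}, {}
    for tokens, is_pos in zip(docs, labels):
        target = with_term_pos if is_pos else with_term_neg
        for t in set(tokens):
            target[t] = target.get(t, 0) + 1
    scores = {}
    for t in set(with_term_pos) | set(with_term_neg):
        a = with_term_pos.get(t, 0)
        b = with_term_neg.get(t, 0)
        scores[t] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

With k set to 10,000 this matches the selection step named on the slide, although the exact procedure used for the reported results is not given in the source.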

*
* SAS Text Miner
* IBM Text Analytics
* Smartlogic
* Python: scikit-learn
* R: RTextTools
* R: tm

*
* Process of identifying words and phrases that name objects
in specific categories. Also known as:
*Entity identification
*Entity extraction
*Chunking
* Two steps:
* Detect entities
* Classify entities
* Common classes of entities:
* Persons
* Organizations
* Geographic locations
* Dates
* Monetary amounts

*
* Four Broad Techniques
*Linguistic - utilize structure of sentence
* Statistical – detect patterns in training
examples
* Custom patterns – regular expressions
* Dictionaries
*Challenges
*Creating training corpus
*Granularity

*
*Use Cases
* Name normalization
* Entity correlation
*Quantified metrics based on texts
*Building block for event extraction
*Tools
* Stanford Core NLP
* OpenNLP
* Mallet
* Basis Technology
* Lexalytics
* NetOwl
* Cogito API

*
* Entities and relations between
entities
* Company A acquires Company B
* Engineer A filed patent application
on Topic B on Date C
*Politician P announces A on Twitter
on Date B
* Assign roles to entities
* Assign subtypes
* Link to semantic data
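
As a minimal sketch of assigning roles to entities, the pattern below handles only the "Company A acquires Company B" example. It is an assumption-heavy illustration: runs of capitalized words stand in for organization entities that a real NER step would supply, and the pattern, role names, and function name are all hypothetical.

```python
import re

# Hypothetical pattern for one event type; capitalized word runs stand in
# for organization entities that a real NER step would supply.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"
ACQUISITION = re.compile(
    rf"(?P<acquirer>{ENTITY}) acquires (?P<target>{ENTITY})"
)


def extract_acquisitions(text):
    """Return one event dict per match, with a role assigned to each entity."""
    return [
        {
            "type": "acquisition",
            "acquirer": m.group("acquirer"),
            "target": m.group("target"),
        }
        for m in ACQUISITION.finditer(text)
    ]
```

The event extraction systems listed under Tools learn such relations from annotated data instead of relying on a single hand-written trigger word.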

*
* Brenden’s Twitter NLP Tools -
https://github.com/aritter/twitter_nlp
* Alchemy API
* Turku BioNLP Event Extraction Software
* Stanford Biomedical Event Parser
Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/

*
* Document Collection
* Text Extraction
* Pre-processing
* Case conversion
* Punctuation removal
* Stemming
* Normalization
* N-gram analysis
* Analysis
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* NER and Entity Extraction
* Integration
* Link to Structured Data
* Augment with additional semantic information
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
Workflow stages: Collect → Extract & Pre-Process → Analyze → Integrate → Utilize
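
The pre-processing steps in this workflow can be sketched as a small pipeline. The suffix-stripping `crude_stem` function is a deliberately simple stand-in for a real stemmer such as Porter's, and all function names are assumptions of this example.

```python
import re


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Case conversion, punctuation removal, then stemming."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return [crude_stem(w) for w in text.split()]


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
```

The output of `preprocess` feeds the analysis stage, for example as input to a TF-IDF computation, and `ngrams` supports the N-gram analysis step.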

*
Source: https://uima.apache.org/

*
* Scalability
* Multiple language support
* Quality
*Precision
*Recall
* Algorithm selection
* Reliability and timeliness of sources
* Integration rules
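
Precision and recall, the two quality measures listed above, can be computed from parallel lists of predicted and actual labels. This is a minimal sketch for a binary classifier; returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for parallel lists of boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # true positives
    fp = sum(p and not a for p, a in pairs)    # false positives
    fn = sum(a and not p for p, a in pairs)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision is the share of predicted positives that are correct; recall is the share of actual positives that were found.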

* Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: Change Representation?
*

*TF-IDF
*Loss of syntactic and semantic information
*No relation between term index and meaning
*No support for disambiguation
*Feature engineering extends the vector representation or substitutes more general terms for specific ones, a crude way to capture semantic properties
*
* Ideal Representation
* Captures semantic similarity of words
* Does not require feature engineering
* Minimal pre-processing, e.g. no mapping to ontologies
* Improves precision and recall

*Words represented as set of
weights in vector
*Useful properties
* Semantically similar words in close
proximity
* Methods for capturing phrases, e.g.
“Secretion system”
* Captures some semantic features
*Trained with
* Skip-gram or CBOW algorithms
* Text, such as PubMed abstracts and
open access papers
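
The skip-gram algorithm trains a model to predict the words surrounding each target word. The sketch below shows only the first step, generating (target, context) training pairs from a token sequence; the window size and function name are assumptions, and the network training itself is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs
```

Words that occur in similar contexts produce similar training pairs, which is why their learned vectors end up in close proximity.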
*
T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf

*
* “Characterization of the Affective Norms for English Words by discrete emotional categories”
http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf
* “New Avenues in Opinion Mining and Sentiment Analysis”
http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf
* “Empirical Study of Topic Modeling in Twitter”
http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf
* “Open Domain Event Extraction from Twitter”
http://turing.cs.washington.edu/papers/kdd12-ritter.pdf
