Dan Sullivan
Big Data TechCon Boston 2015
*
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* First commercial work in natural language
processing in late 1980s
* Document Warehousing and Text Mining, 2001
* Most recent and current text mining work in
life sciences area
* Classification
* Named Entity Recognition
* Event Extraction
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
* Manual procedures are time-consuming and costly
* Volume of literature continues to grow
* Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
* Some success with popular tools, but with limitations
*
* Analysis of tone or opinion of a
communication
* Polarity:
text → {positive, neutral, negative}
* Categorization:
text → {angry, pleased, confused …}
* Scale:
text → -10 … +10
* Metadata about context essential
* subject area
* communication medium
*
*Keywords
*Lexical Affinity
* Affective Norms for English Words (ANEW)
* Emotional Dimensions
* Arousal
* Dominance
* Valence
*Statistical Classification
*Semantic or Concept-based Classification
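The keyword approach above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-made lexicon (the words and weights in `LEXICON` are hypothetical); real lexicons such as ANEW are far larger and rate words on several dimensions.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (an assumption for illustration).
LEXICON = {"great": 2, "love": 3, "good": 1, "slow": -1, "bad": -2, "terrible": -3}

def polarity(text, lexicon=LEXICON):
    """Return (score, label) by summing the lexicon weights of matching words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(w, 0) for w in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label
```

A keyword scorer like this ignores negation and context ("not good" scores as positive), which is one reason the statistical and concept-based approaches exist.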
*
* Use Cases
* Brand monitoring
* Competitive intelligence
* Demographic modeling
* Campaign analysis
* Tools
* RapidMiner
* ViralHeat Sentiment Analysis API
* Python NLTK
* Python TextBlob
* R sentiment package
*
* Technique for identifying dominant themes
in documents
* Does not require training
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Each document is about a mixture of topics
*Each word used in a document is attributable to
a topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
* Example topic labels:
* Debt, Law, Graduation
* Debt, EU, Greece, Euro
* EU, Greece, Negotiations, Varoufakis
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
*
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W
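The learning loop above can be sketched as a toy collapsed Gibbs sampler. This is an illustrative sketch, not the implementation used by any of the tools named in this talk; the smoothing parameters `alpha` and `beta`, the iteration count, and the function name `gibbs_lda` are assumptions introduced here.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # total words assigned to topic k
    assignments = []

    # Assign each word to a (random) topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Reassign each word with probability proportional to
    # P(topic|doc) * P(word|topic), both estimated from current counts.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the word's current assignment before resampling.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word
```

The returned `doc_topic` counts give each document's topic mixture after normalizing a row; `topic_word` gives the words that characterize each topic.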
TOPICS
Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
*
* Use Cases
* Data exploration in large corpus
* Pre-classification analysis
* Identify dominant themes
* Tools
*Stanford Topic Modeling Toolbox
*Mallet (UMass Amherst)
*R package: topicmodels
*Python package: Gensim
*
* Sentiment Analysis
* Topic Modeling
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of the angle between vectors as the measure of similarity
* Algorithms
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
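As one concrete instance of the supervised algorithms listed above, here is a minimal perceptron over bag-of-words features. It is a sketch under stated assumptions: labels are +1 and -1, each token is a binary feature, and the function names are hypothetical. It is not the classifier behind the results reported in this talk.

```python
from collections import defaultdict

def train_perceptron(examples, epochs=10):
    """examples: list of (tokens, label) pairs, with label +1 or -1."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for tokens, label in examples:
            score = bias + sum(weights[t] for t in tokens)
            if label * score <= 0:
                # Misclassified (or on the boundary): move weights toward the label.
                for t in tokens:
                    weights[t] += label
                bias += label
    return weights, bias

def predict(tokens, weights, bias):
    """Return +1 or -1 for a token list; unseen tokens contribute nothing."""
    score = bias + sum(weights.get(t, 0.0) for t in tokens)
    return 1 if score > 0 else -1
```

Usage: train on labeled token lists, then call `predict` on the tokens of a new sentence. In practice a library such as scikit-learn provides tuned versions of all the listed algorithms.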
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
A Support Vector Machine (SVM) is a large-
margin classifier
Commonly used in text classification
Initial results based on life sciences
sentence classifier
Image Source:http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*
*Term Frequency (TF)
tf(t,d) = number of occurrences of t in d
t is a term
d is a document
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is the set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term occurs often in the document but in few
documents of the corpus
*small when the term appears in many documents
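The definitions above translate directly into code. A minimal sketch, assuming documents are already tokenized into lists of terms (the function name `tf_idf` is illustrative); library implementations often add smoothing and normalization that are omitted here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts, tf(t, d)
        weights.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return weights
```

A term that appears in every document gets weight 0, since log(N / N) = 0, which matches the "small when the term appears in many documents" behavior.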
*
* Bag-of-words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of set of unique words in corpus
* Stemming used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if word
appears in sentence represented by
vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
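A minimal sketch of this representation, with two simplifying assumptions: raw counts are used as the non-zero values instead of a TF-IDF weight, and stemming is skipped. The function names are illustrative.

```python
import math

def build_index(sentences):
    """Assign each unique word in the corpus an index in the representation vector."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def to_vector(sentence, index):
    """V[i] is the count of word i in the sentence, 0 if the word is absent."""
    v = [0] * len(index)
    for w in sentence.lower().split():
        if w in index:
            v[index[w]] += 1
    return v

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because word order is ignored, "dog bites man" and "man bites dog" map to the same vector, which illustrates the loss of syntax noted above.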
*
Non-VF, Predicted VF:
* “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
* “Data were log-transformed to correct for heterogeneity of the variances where necessary.”
* “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
* “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
* “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
* “The DsbLI system also comprises a functional redox pair”
Adding more examples is not likely to substantially improve results, as seen in the error curve
[Figure: error curve titled "All", showing training error and validation error; y-axis from 0 to 0.5, x-axis from 0 to 10,000]
8 Alternative Algorithms
Select 10,000 most important features using chi-square
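Chi-square feature selection scores each term by how strongly its presence depends on the class label, then keeps the top-scoring terms. The sketch below uses the standard 2x2 contingency formula; the exact procedure behind the results in this talk is not given, so the function names and input layout are assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: positive docs containing the term    n10: negative docs containing it
    n01: positive docs without the term       n00: negative docs without it
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

def top_features(counts, k):
    """counts: {term: (n11, n10, n01, n00)}. Return the k highest-scoring terms."""
    return sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)[:k]
```

A term spread evenly across both classes scores 0 and is dropped first; a term that occurs only in one class scores highest.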
*
* SAS Text Miner
* IBM Text Analytics
* Smartlogic
* Python: scikit-learn
* R: RTextTools
* R: tm
*
* Process of identifying words and phrases that name objects
in specific categories. Also known as:
*Entity identification
*Entity extraction
*Chunking
* Two steps:
* Detect entities
* Classify entities
* Common classes of entities:
* Persons
* Organizations
* Geographic locations
* Dates
* Monetary amounts
*
*
* Four Broad Techniques
*Linguistic - utilize structure of sentence
* Statistical – detect patterns in training
examples
* Custom patterns – regular expressions
* Dictionaries
*Challenges
*Creating training corpus
*Granularity
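Two of the four techniques, dictionaries and custom patterns, are simple enough to sketch directly. The organization names and regular expressions below are hypothetical examples, not a production gazetteer; linguistic and statistical recognizers need a parser or a trained model instead.

```python
import re

# Hypothetical dictionary and patterns, for illustration only.
ORGANIZATIONS = {"Acme Corp", "Globex"}
PATTERNS = {
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?"),
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}

def find_entities(text):
    """Detect entities and classify them; returns sorted (start, text, type) tuples."""
    found = []
    # Dictionary lookup: exact matches against known organization names.
    for name in ORGANIZATIONS:
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.group(), "ORGANIZATION"))
    # Custom patterns: one regular expression per entity class.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(), label))
    return sorted(found)
```

Dictionaries give high precision but miss names they do not list; patterns work well for regular formats such as dates and monetary amounts.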
*
*
*
*Use Cases
* Name normalization
* Entity correlation
*Quantified metrics based on texts
*Building block for event extraction
*Tools
* Stanford Core NLP
* OpenNLP
* Mallet
* Basis Technology
* Lexalytics
* NetOwl
* Cogito API
*
* Entities and relations between
entities
* Company A acquires Company B
* Engineer A filed patent application
on Topic B on Date C
*Politician P announces A on Twitter
on Date B
* Assign roles to entities
* Assign subtypes
* Link to semantic data
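The "Company A acquires Company B" case can be sketched with a single pattern that finds the relation and assigns a role to each entity. This is a deliberately naive illustration: the pattern and role names are assumptions, and it treats any run of capitalized words as a company, so a capitalized word directly before the company name would be swept into the match. Real systems run named entity recognition first.

```python
import re

# Hypothetical pattern for one event type: "<Company> acquires <Company>".
ACQUISITION = re.compile(
    r"(?P<acquirer>[A-Z]\w*(?: [A-Z]\w*)*) (?:acquires|acquired|buys|bought) "
    r"(?P<target>[A-Z]\w*(?: [A-Z]\w*)*)"
)

def extract_acquisitions(text):
    """Return one event dict per match, with a role assigned to each entity."""
    return [
        {"type": "Acquisition", "acquirer": m.group("acquirer"), "target": m.group("target")}
        for m in ACQUISITION.finditer(text)
    ]
```

Each returned dict is a structured event record that could then be linked to semantic data about the companies involved.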
*
* Brenden’s Twitter NLP Tools -
https://github.com/aritter/twitter_nlp
* Alchemy API
* Turku BioNLP Event Extraction Software
* Stanford Biomedical Event Parser
Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
*
* Classification
* Named Entity Recognition
* Event Extraction
*
* Document Collection
* Text Extraction
* Pre-processing
* Case conversion
* Punctuation removal
* Stemming
* Normalization
* N-gram analysis
* Analysis
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* NER and Entity Extraction
* Integration
* Link to Structured Data
* Augment with additional semantic information
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
Collect → Extract & Pre-Process → Analyze → Integrate → Utilize
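The pre-processing steps in the workflow can be sketched with the standard library. The suffix stripper below is a crude stand-in for a real stemmer such as Porter's, and all function names are illustrative assumptions.

```python
import string

def preprocess(text):
    """Case conversion, punctuation removal, and whitespace tokenization."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

def crude_stem(token):
    """Very rough suffix stripping; keeps at least three characters of the stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Chaining the three, for example `ngrams([crude_stem(t) for t in preprocess(text)], 2)`, yields stemmed bigrams ready for the analysis stage.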
*
Source: https://uima.apache.org/
*
* Scalability
* Multiple language support
* Quality
*Precision
*Recall
* Algorithm selection
* Reliability and timeliness of sources
* Integration rules
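Precision and recall, the two quality measures named above, can be computed from parallel lists of predictions and true labels. A minimal sketch; the function name and boolean input format are assumptions.

```python
def precision_recall(predicted, actual):
    """predicted, actual: parallel lists of booleans (True = positive class)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # predicted positive, actually positive
    fp = sum(p and not a for p, a in pairs)    # predicted positive, actually negative
    fn = sum(a and not p for p, a in pairs)    # predicted negative, actually positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision answers "of the items flagged, how many were right?" and recall answers "of the items that should have been flagged, how many were found?". A classifier that flags everything has perfect recall but poor precision.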
* Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: Change Representation?
*
*TF-IDF
*Loss of syntactic and semantic information
*No relation between term index and meaning
*No support for disambiguation
*Feature engineering extends the vector representation or replaces specific terms with more general ones, a crude way to capture semantic properties
*
* Ideal Representation
* Captures semantic similarity of words
* Does not require feature engineering
* Minimal pre-processing, e.g. no mapping to ontologies
* Improves precision and recall
*Words represented as set of
weights in vector
*Useful properties
* Semantically similar words in close
proximity
* Methods for capturing phrases, e.g.
“Secretion system”
* Captures some semantic features
*Trained with
* Skip-gram or CBOW algorithms
* Text, such as PubMed abstracts and
open access papers
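Skip-gram training starts by pairing each word with the words around it; the model then learns vectors that predict context words from the center word. The sketch below shows only that pair-generation step, not the neural network itself, and the window size and function name are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

Words that share many context words end up with similar vectors, which is why semantically similar words land in close proximity.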
*
T. Mikolov, et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
*
*
* “Characterization of the Affective Norms for English Words by discrete emotional categories”
http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf
* “New Avenues in Opinion Mining and Sentiment Analysis”
http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf
* “Empirical Study of Topic Modeling in Twitter”
http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf
* “Open Domain Event Extraction from Twitter”
http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

  • 1. Dan Sullivan Big Data TechCon Boston 2015 *
  • 2. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 3. * * First commercial work in natural language processing in late 1980s * Document Warehousing and Text Mining, 2001 * Most recent and current text mining work in life sciences area * Classification * Named Entity Recognition * Event Extraction * Contact * dan@dsapptech.com * @dsapptech * Linkedin.com/in/dansullivanpdx
  • 4. * Discount Code: DATA35 • Available as book & eBook • FREE shipping in the U.S. • EPUB, PDF, and MOBI eBook formats provided Also available at booksellers and online retailers – 35% off discount only good at informit.com
  • 5. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows, Procedures and Governance * Performance Considerations
  • 6. * *Large volumes of accessible and relevant texts: *Social media *Email *Patents and research *Customer communications * Use Cases *Market research *Brand monitoring *e-Discovery *Intellectual property management
  • 7. Manual procedures are time consuming and costly Volume of literature continues to grow Commonly used search techniques, such as keyword, similarity searching, metadata filtering, etc. can still yield volumes of literature that are difficult to analyze manually Some success with popular tools but limitations
  • 8. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows *Performance Considerations
  • 9. * * Analysis of tone or opinion of a communication * Polarity: text → {positive, neutral, negative} * Categorization: text → {angry, pleased, confused …} * Scale: text → -10 … +10 * Metadata about context essential * subject area * communication medium
  • 10. * *Keywords *Lexical Affinity * Affective Norms for English Words (ANEW) * Emotional Dimensions * Arousal * Dominance * Valence *Statistical Classification *Semantic or Concept-based Classification
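As a rough illustration of the keyword approach from slide 10, applied to the polarity and scale mappings on slide 9, the sketch below sums per-word scores from a small lexicon. The lexicon, its scores, and the function names are hypothetical and made up for this example; a real system would draw on a resource such as ANEW, which uses its own rating scale.

```python
import re

# Hypothetical toy lexicon (word -> valence). These numbers are invented for
# illustration and are not taken from ANEW or any published resource.
TOY_LEXICON = {
    "great": 3, "love": 3, "good": 2,
    "slow": -1, "bad": -2, "terrible": -3,
}


def polarity_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word, clamped to the -10 .. +10 scale."""
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(lexicon.get(token, 0) for token in tokens)
    return max(-10, min(10, raw))


def polarity_label(text, lexicon=TOY_LEXICON):
    """Map the numeric score onto {positive, neutral, negative}."""
    score = polarity_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A keyword scorer like this ignores negation and context ("not good" still scores as positive), which is one reason the statistical and concept-based approaches are listed alongside it.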
  • 11. * * Use Cases * Brand monitoring * Competitive intelligence * Demographic modeling * Campaign analysis * Tools * RapidMiner * ViralHeat Sentiment Analysis API * Python NLTK * Python TextBlob * R sentiment package
  • 12. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows, Procedures and Governance * Performance Considerations
  • 13. * * Technique for identifying dominant themes in documents * Does not require labeled training data * Multiple algorithms * Probabilistic Latent Semantic Indexing (PLSI) * Latent Dirichlet Allocation (LDA) *Assumptions *Documents are about a mixture of topics *Words used in a document are attributable to topics Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
  • 14. Debt, Law, Graduation Debt, EU, Greece, Euro Source: http://www.nytimes.com/pages/business/index.html April 27, 2015 EU, Greece, Negotiations, Varoufakis
  • 15. * * Topics represented by words; documents about a set of topics *Doc 1: 50% politics, 50% presidential *Doc 2: 25% CPU, 30% memory, 45% I/O *Doc 3: 30% cholesterol, 40% arteries, 30% heart * Learning Topics *Assign each word to a topic *For each word and topic, compute * Probability of topic given a document P(topic|doc) * Probability of word given a topic P(word|topic) * Reassign word to new topic with probability P(topic|doc) * P(word|topic) * Reassignment based on probability that topic T generated use of word W
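The learning procedure on slide 15 resembles collapsed Gibbs sampling for LDA, so the sketch below follows that reading; this is an assumption, not a statement of what the author ran. The smoothing parameters `alpha` and `beta`, the iteration count, and all function and variable names are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of token lists.

    Returns one list of topic proportions per document.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d on topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics                    # all words on topic k
    assignments = []

    # Step 1: assign each word to a (random) topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Step 2: repeatedly reassign each word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                # Take the word out of the counts before re-sampling it.
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Weight proportional to P(topic|doc) * P(word|topic), smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    return [
        [(count + alpha) / (len(doc) + alpha * num_topics) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
```

Each returned row plays the role of the per-document percentages shown on the slide (for example "25% CPU, 30% memory, 45% I/O"). On a corpus this small the result is noisy; the toolkits listed on slide 17 are the practical choice.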
  • 16. Image Source: David Blei, “Probabilistic Topic Models” http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
  • 17. * * Use Cases * Data exploration in large corpus * Pre-classification analysis * Identify dominant themes * Tools *Stanford Topic Modeling Toolbox *Mallet (UMass Amherst) *R package: topicmodels *Python package: Gensim
  • 18. * * Sentiment Analysis * Topic Modeling
  • 19. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 20. * 3 Key Components * Data * Representation scheme * Algorithms * Data * Positive examples – Examples from representative corpus * Negative examples – Randomly selected from same publications * Representation * TF-IDF * Vector space representation * Cosine of vectors measure of similarity * Algorithms * Supervised learning * SVMs * Ridge Classifier * Perceptrons * kNN * SGD Classifier * Naïve Bayes * Random Forest * AdaBoost *
  • 21. *
  • 22. *
  • 23. * Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
  • 24. Support Vector Machine (SVM) is a large-margin classifier * Commonly used in text classification * Initial results based on life sciences sentence classifier Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png *
  • 25. *Term Frequency (TF) tf(t,d) = # of occurrences of t in d, where t is a term and d is a document *Inverse Document Frequency (IDF) idf(t,D) = log(N / |{d in D : t in d}|), where D is the set of documents and N is the number of documents *TF-IDF = tf(t,d) * idf(t,D) *TF-IDF is *large when the term is frequent in the document and appears in few documents overall *small when the term appears in many documents *
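The definitions on slide 25 translate almost directly into code. The following is a minimal sketch using only the standard library; the function name and the toy documents in the usage note are hypothetical, and production code would more likely use a library vectorizer.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # tf(t, d): raw count of t in d
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])  # tf * idf
            for term, count in term_freq.items()
        })
    return weights
```

With the toy corpus `[["gene", "protein", "gene"], ["protein", "cell"], ["protein", "toxin"]]`, the term "protein" appears in every document, so its weight is zero everywhere, while "gene" is weighted by its count of two times log(3).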
  • 26. * Bag-of-words model * Ignores structure (syntax) and meaning (semantics) of sentences * Representation vector length is the size of the set of unique words in the corpus * Stemming used to remove morphological differences * Each word is assigned an index in the representation vector, V * The value V[i] is non-zero if the word appears in the sentence represented by the vector * The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus *
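To make the bag-of-words representation on slide 26 concrete, here is a small sketch that assigns each word an index and builds a count vector, then compares vectors with the cosine measure mentioned on slide 20. It is a simplification under stated assumptions: it uses raw counts rather than TF-IDF weights, skips stemming, and all function names are hypothetical.

```python
import math


def build_vocabulary(sentences):
    """Assign each distinct word an index in the representation vector."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(sentence, vocab):
    """V[i] is non-zero only if word i appears in the sentence."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because word order is discarded, "the toxin enters the cell" and "the cell enters the toxin" map to the same vector, which is exactly the loss of syntactic information the slide points out.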
  • 27. Non-VF, Predicted VF: * “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.” * “Data were log-transformed to correct for heterogeneity of the variances where necessary.” * “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.” VF, Predicted Non-VF: * “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.” * “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.” * “The DsbLI system also comprises a functional redox pair”
  • 28. * Adding more examples is not likely to substantially improve results, as seen by the error curve [Figure: training error and validation error curves, panel "All"; horizontal axis 0 to 10,000, vertical axis 0 to 0.5]
  • 29. 8 Alternative Algorithms Select 10,000 most important features using chi-square
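Slide 29 mentions selecting the most important features with chi-square. The sketch below scores each term with the 2x2 contingency-table form of the statistic that is common in text classification; whether the author used this exact formulation is not stated, so treat it as an assumption. Function names and the toy data in the usage note are hypothetical.

```python
def chi_square(term, docs, labels):
    """Chi-square score of a term against a binary class label.

    docs is a list of token collections; labels is a parallel list of 0/1.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label:
            a += 1          # term present, positive class
        elif present:
            b += 1          # term present, negative class
        elif label:
            c += 1          # term absent, positive class
        else:
            d += 1          # term absent, negative class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def top_features(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocabulary = sorted({term for tokens in docs for term in tokens})
    ranked = sorted(vocabulary, key=lambda t: chi_square(t, docs, labels),
                    reverse=True)
    return ranked[:k]
```

For four toy documents `{"toxin","cell"}`, `{"toxin","host"}`, `{"data","cell"}`, `{"data","host"}` labeled 1, 1, 0, 0, the terms "toxin" and "data" separate the classes perfectly and score highest, while "cell" and "host" score zero.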
  • 30. * * SAS Text Miner * IBM Text Analytics * Smartlogic * Python: scikit-learn * R: RTextTools * R: tm
  • 31. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 32. * * Process of identifying words and phrases that refer to objects in specific categories. Also known as: *Entity identification *Entity extraction *Chunking * Two steps: * Detect entities * Classify entities * Common classes of entities: * Persons * Organizations * Geographic locations * Dates * Monetary amounts
  • 33. *
  • 34. * * Four Broad Techniques *Linguistic - utilize structure of sentence * Statistical – detect patterns in training examples * Custom patterns – regular expressions * Dictionaries *Challenges *Creating training corpus *Granularity
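Two of the four techniques on slide 34, custom patterns and dictionaries, are simple enough to sketch with the standard library. The example below detects and classifies monetary amounts and dates with regular expressions and organizations with a lookup set. The patterns cover only a narrow set of formats, and the organization names are hypothetical placeholders.

```python
import re

# Hypothetical dictionary of known organization names.
ORGANIZATIONS = {"Acme Corp", "Globex"}

# Custom patterns; deliberately narrow and for illustration only.
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September"
    r"|October|November|December)\s\d{1,2},\s\d{4}\b"
)


def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    # Pattern-based detection and classification.
    for pattern, label in ((MONEY, "MONEY"), (DATE, "DATE")):
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Dictionary-based detection.
    for name in ORGANIZATIONS:
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), "ORGANIZATION"))
    return [(entity, label) for _, entity, label in sorted(found)]
```

Applied to "Acme Corp acquired Globex for $35 million on April 27, 2015.", this yields the two organizations, the amount, and the date. Anything outside the dictionary or the patterns is missed, which is where the linguistic and statistical techniques come in.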
  • 35. *
  • 36. *
  • 37. * *Use Cases * Name normalization * Entity correlation *Quantified metrics based on texts *Building block for event extraction *Tools * Stanford CoreNLP * OpenNLP * Mallet * Basis Technology * Lexalytics * NetOwl * Cogito API
  • 38. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 39. * * Entities and relations between entities * Company A acquires Company B * Engineer A filed patent application on Topic B on Date C *Politician P announces A on Twitter on Date B * Assign roles to entities * Assign subtypes * Link to semantic data
  • 40. * * Brenden’s Twitter NLP Tools - https://github.com/aritter/twitter_nlp * Alchemy API * Turku BioNLP Event Extraction Software * Stanford Biomedical Event Parser Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
  • 41. * * Classification * Named Entity Recognition * Event Extraction
  • 42. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows *Performance Considerations
  • 43. * * Document Collection * Text Extraction * Pre-processing * Case conversion * Punctuation removal * Stemming * Normalization * N-gram analysis * Analysis * Term Frequency – Inverse Document Frequency * Conditional Probabilities and Topic Models * NER and Entity Extraction * Integration * Link to Structured Data * Augment with additional semantic information * Utilization * Improve information retrieval * Identify brand perception problems * Assess likelihood of customer churn * Predict likelihood of … [Workflow diagram: Collect → Extract & Pre-Process → Analyze → Integrate → Utilize]
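Some of the pre-processing steps on slide 43 can be sketched in a few lines. The example below covers case conversion, punctuation removal, and n-gram extraction with the standard library only; stemming and normalization are left out, and the function names are hypothetical.

```python
import re


def preprocess(text):
    """Lower-case the text, replace punctuation with spaces, and tokenize."""
    text = text.lower()                    # case conversion
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation removal
    return text.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `preprocess("Secretion System, Type-III!")` gives `["secretion", "system", "type", "iii"]`, and taking bigrams of that list produces pairs such as `("secretion", "system")`, one simple way to pick up phrases before the analysis stage.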
  • 45. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows *Performance Considerations
  • 46. * * Scalability * Multiple language support * Quality *Precision *Recall * Algorithm selection * Reliability and timeliness of sources * Integration rules
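Precision and recall, listed under quality on slide 46, follow directly from the two error types shown on slide 27: "Non-VF, Predicted VF" sentences are false positives and "VF, Predicted Non-VF" sentences are false negatives. A minimal sketch, with hypothetical function and variable names and 0/1 labels assumed:

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for parallel lists of 0/1 labels."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)

    # Precision: of everything predicted positive, how much was correct?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

With predictions `[1, 1, 0, 0, 1]` against actual labels `[1, 0, 1, 0, 1]` there are two true positives, one false positive, and one false negative, so both precision and recall come out to 2/3.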
  • 47. * Increase quantity of data (not always helpful; see error curves) * Improve quality of data * Utilize multiple supervised algorithms, ensemble and non-ensemble * Use unlabeled data and semi-supervised techniques * Feature Selection * Parameter Tuning * Feature Engineering * Given: * High quality data in sufficient quantity * State of the art machine learning algorithms * How to improve results: Change Representation? *
  • 48. *TF-IDF *Loss of syntactic and semantic information *No relation between term index and meaning *No support for disambiguation *Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties * Ideal Representation * Captures semantic similarity of words * Does not require feature engineering * Minimal pre-processing, e.g. no mapping to ontologies * Improves precision and recall
  • 49. *Words represented as a set of weights in a vector *Useful properties * Semantically similar words in close proximity * Methods for capturing phrases, e.g. “Secretion system” * Captures some semantic features *Trained with * Skip-gram or CBOW algorithms * Text, such as PubMed abstracts and open access papers * T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
  • 50. *
  • 51. * * “Characterization of the Affective Norms for English Words by discrete emotional categories” http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf * “New Avenues in Opinion Mining and Sentiment Analysis” http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf * “Empirical Study of Topic Modeling in Twitter” http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf * “Open Domain Event Extraction from Twitter” http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

Editor's Notes

  1. 1. – Process used in VF 2. – No idea why this is labeled as a 1 3. – Probably from a Methods section; refers to the resistance cassette 4.