Statistical text mining using R
Tom Liptrot
The Christie Hospital
Motivation
Example 1: Dickens to matrix
Example 2: Electronic patient records
Dickens to Matrix: a bag of words
IT WAS the best of times, it was the worst of times,
it was the age of wisdom, it was the age of
foolishness, it was the epoch of belief, it was the
epoch of incredulity, it was the season of Light, it
was the season of Darkness, it was the spring of
hope, it was the winter of despair, we had
everything before us, we had nothing before us, we
were all going direct to Heaven, we were all going
direct the other way- in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good or
for evil, in the superlative degree of comparison
only.
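The "bag of words" idea can be sketched in a few lines of base R before reaching for any package: lowercase the text, split on non-letters, and count occurrences, discarding word order entirely (here `txt` is just the opening of the passage above).

```r
# Minimal bag-of-words sketch in base R: tokenise the passage and
# count how often each word occurs, ignoring order entirely.
txt <- "IT WAS the best of times, it was the worst of times"
words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
words <- words[words != ""]          # drop any empty tokens
counts <- table(words)
counts[["times"]]  # 2
counts[["best"]]   # 1
```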
Dickens to Matrix: a matrix
Documents
Words
# Example matrix syntax: three ways to store the same 4 x 2 matrix
A <- matrix(c(1, rep(0, 6), 2), nrow = 4)                  # dense, base R
library(slam)
S <- simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2))      # sparse triplet (slam)
library(Matrix)
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))   # sparse (Matrix)
a11 a12 ... a1n
a21 a22 ... a2n
 :   :  ...  :
am1 am2 ... amn
Dickens to Matrix: tm package
library(tm) #load the tm package
corpus_1 <- Corpus(VectorSource(txt)) # creates a โ€˜corpusโ€™ from a vector
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity, it was the season of light, it was the season of
darkness, it was the spring of hope, it was the winter of despair, we
had everything before us, we had nothing before us, we were all going
direct to heaven, we were all going direct the other way- in short, the
period was so far like the present period, that some of its noisiest
authorities insisted on its being received, for good or for evil, in the
superlative degree of comparison only.
Dickens to Matrix: stopwords
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best times, worst times, age wisdom, age foolishness, epoch
belief, epoch incredulity, season light, season darkness,
spring hope, winter despair, everything us, nothing us, going
direct heaven, going direct way- short, period far like present
period, noisiest authorities insisted received, good evil,
superlative degree comparison .
Dickens to Matrix: punctuation
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best times worst times age wisdom age foolishness epoch
belief epoch incredulity season light season darkness spring
hope winter despair everything us nothing us going direct
heaven going direct way short period far like present period
noisiest authorities insisted received good evil superlative degree
comparison
Dickens to Matrix: stemming
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best time worst time age wisdom age foolish epoch
belief epoch incredul season light season dark spring hope
winter despair everyth us noth us go direct heaven go direct
way short period far like present period noisiest author insist
receiv good evil superl degre comparison
Dickens to Matrix: cleanup
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best time worst time age wisdom age foolish epoch belief epoch
incredul season light season dark spring hope winter despair everyth
us noth us go direct heaven go direct way short period far like present
period noisiest author insist receiv good evil superl degre comparison
Dickens to Matrix: Term Document Matrix
tdm <- TermDocumentMatrix(corpus_1)
<<TermDocumentMatrix (terms: 35, documents: 1)>>
Non-/sparse entries: 35/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
class(tdm)
[1] "TermDocumentMatrix"   "simple_triplet_matrix"
dim(tdm)
[1] 35  1
age 2 epoch 2 insist 1 short 1
author 1 everyth 1 light 1 spring 1
belief 1 evil 1 like 1 superl 1
best 1 far 1 noisiest 1 time 2
comparison 1 foolish 1 noth 1 way 1
dark 1 good 1 period 2 winter 1
degre 1 heaven 1 present 1 wisdom 1
despair 1 hope 1 receiv 1 worst 1
direct 2 incredul 1 season 2
Dickens to Matrix: Ngrams
library(RWeka)
four_gram_tokeniser <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 4))
}
tdm_4gram <- TermDocumentMatrix(corpus_1,
                                control = list(tokenize = four_gram_tokeniser))
dim(tdm_4gram)
[1] 163 1
age 2 author insist receiv good 1 dark 1
age foolish 1 belief 1 dark spring 1
age foolish epoch 1 belief epoch 1 dark spring hope 1
age foolish epoch belief 1 belief epoch incredul 1 dark spring hope winter 1
age wisdom 1 belief epoch incredul season 1 degre 1
age wisdom age 1 best 1 degre comparison 1
age wisdom age foolish 1 best time 1 despair 1
author 1 best time worst 1 despair everyth 1
author insist 1 best time worst time 1 despair everyth us 1
author insist receiv 1 comparison 1 despair everyth us noth 1
Electronic patient records: Gathering structured medical data
Doctor enters structured data directly
Trained staff extract structured data from typed notes
Electronic patient records: example text
Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
History: X year old lady who presented with progressive dysphagia since
X and was known at X Hospital. She underwent an endoscopy which
found a tumour which was biopsied and is a squamous cell carcinoma.
A staging CT scan picked up a left upper lobe nodule. She then went on
to have an EUS at X this was performed by Dr X and showed an early
T3 tumour at 35-40cm of 4 small 4-6mm para-oesophageal nodes,
between 35-40cm. There was a further 7.8mm node in the AP window at
27cm, the carina was measured at 28cm and aortic arch at 24cm, the
conclusion T3 N2 M0. A subsequent PET CT scan was arranged-see
below. She can manage a soft diet such as Weetabix, soft toast, mashed
potato and gets occasional food stuck. Has lost half a stone in weight
and is supplementing with 3 Fresubin supplements per day.
Performance score is 1.
Electronic patient records: targets
[Same clinic note as above, shown with the structured-data targets highlighted]
Electronic patient records: steps
1. Identify patients where we have both structured data and notes (c.20k)
2. Extract notes and structured data from SQL database
3. Make term document matrix (as shown previously) (60m x 20k)
4. Split data into training and development set
5. Train classification model using training set
6. Assess performance and tune model using development set
7. Evaluate system performance on independent dataset
8. Use system to extract structured data where we have none
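Step 4 of the pipeline above can be sketched in base R. The ~20k patient count comes from step 1; the 80/20 ratio and the use of `set.seed` for reproducibility are illustrative assumptions, and the resulting indices would subset the rows of the term document matrix.

```r
# Sketch of step 4: a reproducible random 80/20 split of document
# indices into a training set and a development set.
set.seed(42)                               # reproducible split
n_docs <- 20000                            # c.20k patients (step 1)
train_idx <- sample(n_docs, size = 0.8 * n_docs)
dev_idx   <- setdiff(seq_len(n_docs), train_idx)
length(train_idx)  # 16000
length(dev_idx)    # 4000
```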
Electronic patient records: predicting disease site using the elastic net
# fits an elastic net model, classifying into oesophagus or not,
# selecting lambda through cross-validation
library(glmnet)
dim(tdm) # 22,843 documents, 677,017 ngrams
# note: tdm must be either a matrix or a sparseMatrix, NOT a
# simple_triplet_matrix
mod_oeso <- cv.glmnet(x = tdm,
                      y = disease_site == 'Oesophagus',
                      family = "binomial")
β̂ = argmin_β ‖y − Xβ‖² + λ2 ‖β‖² + λ1 ‖β‖1
     (OLS)              (ridge)    (lasso)
Electronic patient records: The Elastic Net
#plots non-zero coefficients from elastic net model
coefs <- coef(mod_oeso, s = mod_oeso$lambda.1se)[,1]
coefs <- coefs[coefs != 0]
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]
barplot(coefs[-1], horiz = TRUE, col = 2)
P(site = โ€˜Oesophagusโ€™) = 0.03
Electronic patient records: classification performance: primary disease site
Training set = 20,000 patients
Test set = 4,000 patients
AUC = 90%
80% of patients can be classified with 95% accuracy (the remaining 20%
can be done by human abstractors)
Next step is full formal evaluation on an independent dataset
Working in combination with a rules-based approach from Manchester
University
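The "80% classified with 95% accuracy" figure suggests thresholding the model's predicted probabilities and abstaining on uncertain cases. A hypothetical sketch of such an abstention rule follows; the simulated probabilities and the 0.1/0.9 cut-offs are illustrative assumptions, not the real model's values.

```r
# Illustrative abstention rule: auto-classify only when the predicted
# probability is confidently near 0 or 1; refer the rest to humans.
set.seed(1)
p <- runif(1000)                 # stand-in predicted probabilities
confident <- p < 0.1 | p > 0.9   # hypothetical confidence cut-offs
auto_rate <- mean(confident)     # share of patients handled automatically
```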
Electronic patient records: Possible extensions
โ€ข Classification (hierarchical)
โ€ข Cluster analysis (KNN)
โ€ข Time
โ€ข Survival
โ€ข Drug toxicity
โ€ข Quality of life
Thanks
Tom.liptrot@christie.nhs.uk
Books example
library(RCurl)  # getURL
library(XML)    # htmlParse, xpathSApply, xmlGetAttr
library(plyr)   # alply, llply, aaply

get_links <- function(address, link_prefix = '', link_suffix = ''){
  page <- getURL(address)
  # Parse the HTML into an R document tree
  tree <- htmlParse(page)
  ## Get the href attribute of all link elements
  links <- xpathSApply(tree, path = "//*/a",
                       fun = xmlGetAttr, name = "href")
  ## Convert to vector
  links <- unlist(links)
  ## Add prefix and suffix
  paste0(link_prefix, links, link_suffix)
}
links_authors <- get_links("http://textfiles.com/etext/AUTHORS/",
                           link_prefix = 'http://textfiles.com/etext/AUTHORS/',
                           link_suffix = '/')
links_text <- alply(links_authors, 1, function(.x){
  get_links(.x, link_prefix = .x, link_suffix = '')
})
books <- llply(links_text, function(.x){
  aaply(.x, 1, getURL)
})
Principal components analysis
## Code to get the first n principal components from a large sparse
## term document matrix of class dgCMatrix
library(irlba)
n <- 5           # number of components to calculate
m <- nrow(tdm)   # 110703 terms in tdm matrix
xt.x <- crossprod(tdm)
x.means <- colMeans(tdm)
xt.x <- (xt.x - m * tcrossprod(x.means)) / (m - 1)
svd <- irlba(xt.x, nu = 0, nv = n, tol = 1e-10)
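The centring trick above (forming the covariance from X'X and the column means without ever densifying X) can be sanity-checked on a small dense matrix, where it should agree with R's built-in `cov`:

```r
# Check: (X'X - m * means means') / (m - 1) equals the sample covariance.
set.seed(7)
X <- matrix(rnorm(40), nrow = 8)
m <- nrow(X)
C <- (crossprod(X) - m * tcrossprod(colMeans(X))) / (m - 1)
max(abs(C - cov(X)))   # effectively 0 (floating-point noise)
```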
plot(svd$v[, c(2, 3)] + 1,
     col = books_df$author,
     log = 'xy',
     xlab = 'PC2',
     ylab = 'PC3')
[PCA plot: PC2 vs PC3 on log axes, one point per book, coloured by author:
ARISTOTLE, BURROUGHS, DICKENS, KANT, PLATO, SHAKESPEARE]
ย 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
ย 
Call Girls Indiranagar Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...
Call Girls Indiranagar Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...Call Girls Indiranagar Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...
Call Girls Indiranagar Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...
ย 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
ย 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
ย 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
ย 
CHEAP Call Girls in Saket (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
ย 
BDSMโšกCall Girls in Mandawali Delhi >เผ’8448380779 Escort Service
BDSMโšกCall Girls in Mandawali Delhi >เผ’8448380779 Escort ServiceBDSMโšกCall Girls in Mandawali Delhi >เผ’8448380779 Escort Service
BDSMโšกCall Girls in Mandawali Delhi >เผ’8448380779 Escort Service
ย 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
ย 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
ย 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
ย 

R user group presentation

  • 1. Statistical text mining using R, Tom Liptrot, The Christie Hospital
  • 3.
  • 4. Example 1: Dickens to matrix
    Example 2: Electronic patient records
  • 5. Dickens to Matrix: a bag of words
    IT WAS the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
  • 6. Dickens to Matrix: a matrix (rows: words, columns: documents)
    #Example matrix syntax
    A = matrix(c(1, rep(0, 6), 2), nrow = 4)
    library(slam)
    S = simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2))
    library(Matrix)
    M = sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))

    a11 a12 ... a1n
    a21 a22 ... a2n
     :   :      :
    am1 am2 ... amn
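The three representations on the slide hold the same values in different storage schemes. A minimal runnable check of that equivalence, using only the Matrix package (which ships with R; the slam simple_triplet_matrix is analogous):

```r
library(Matrix)

# Dense 4 x 2 matrix: stores all 8 cells, including the six zeros
A <- matrix(c(1, rep(0, 6), 2), nrow = 4)

# Sparse equivalent: stores only the non-zero cells (1,1) = 1 and (4,2) = 2
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))

all(as.matrix(M) == A)  # TRUE: same contents, different storage cost
```

For a real term-document matrix, where most cells are zero, the sparse form is the difference between fitting in memory and not.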
  • 7. Dickens to Matrix: tm package
    library(tm) #load the tm package
    corpus_1 <- Corpus(VectorSource(txt)) # creates a 'corpus' from a vector
    corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
    corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
    corpus_1 <- tm_map(corpus_1, removePunctuation)
    corpus_1 <- tm_map(corpus_1, stemDocument)
    corpus_1 <- tm_map(corpus_1, stripWhitespace)

    it was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
  • 8. Dickens to Matrix: stopwords
    (same pipeline code as slide 7; the text after the tolower step:)
    it was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
  • 9. Dickens to Matrix: stopwords
    (same pipeline code as slide 7; the text after the removeWords step:)
    best times, worst times, age wisdom, age foolishness, epoch belief, epoch incredulity, season light, season darkness, spring hope, winter despair, everything us, nothing us, going direct heaven, going direct way- short, period far like present period, noisiest authorities insisted received, good evil, superlative degree comparison .
  • 10. Dickens to Matrix: punctuation
    (same pipeline code as slide 7; the text after the removePunctuation step:)
    best times worst times age wisdom age foolishness epoch belief epoch incredulity season light season darkness spring hope winter despair everything us nothing us going direct heaven going direct way short period far like present period noisiest authorities insisted received good evil superlative degree comparison
  • 11. Dickens to Matrix: stemming
    (same pipeline code as slide 7; the text after the stemDocument step:)
    best time worst time age wisdom age foolish epoch belief epoch incredul season light season dark spring hope winter despair everyth us noth us go direct heaven go direct way short period far like present period noisiest author insist receiv good evil superl degre comparison
  • 12. Dickens to Matrix: cleanup
    (same pipeline code as slide 7; the text after the stripWhitespace step:)
    best time worst time age wisdom age foolish epoch belief epoch incredul season light season dark spring hope winter despair everyth us noth us go direct heaven go direct way short period far like present period noisiest author insist receiv good evil superl degre comparison
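For readers without tm installed, the lower-casing, punctuation and stopword steps can be sketched in base R (stemming is omitted, since tm delegates it to SnowballC, and the stopword list here is a tiny illustrative subset of stopwords("english")):

```r
txt <- "IT WAS the best of times, it was the worst of times."
stop_words <- c("it", "was", "the", "of")      # illustrative subset

x <- tolower(txt)                              # tolower step
x <- gsub("[[:punct:]]", "", x)                # removePunctuation step
words <- strsplit(x, "\\s+")[[1]]              # split on whitespace
words <- words[!words %in% stop_words]         # removeWords step
cleaned <- paste(words, collapse = " ")        # rejoin, single spaces
cleaned  # "best times worst times"
```

The tm pipeline does the same transformations lazily over a whole corpus rather than one string at a time.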
  • 13. Dickens to Matrix: Term Document Matrix
    tdm <- TermDocumentMatrix(corpus_1)
    <<TermDocumentMatrix (terms: 35, documents: 1)>>
    Non-/sparse entries: 35/0
    Sparsity: 0%
    Maximal term length: 10
    Weighting: term frequency (tf)

    class(tdm)
    [1] "TermDocumentMatrix" "simple_triplet_matrix"
    dim(tdm)
    [1] 35 1

    Term frequencies: age 2, author 1, belief 1, best 1, comparison 1, dark 1, degre 1, despair 1, direct 2, epoch 2, everyth 1, evil 1, far 1, foolish 1, good 1, heaven 1, hope 1, incredul 1, insist 1, light 1, like 1, noisiest 1, noth 1, period 2, present 1, receiv 1, season 2, short 1, spring 1, superl 1, time 2, way 1, winter 1, wisdom 1, worst 1
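With a single document, a term-document matrix reduces to a term-frequency table, which base R's table() sketches directly (the tokens below are a hand-picked subset of the stemmed text, not the full 35 terms):

```r
# A few stemmed tokens standing in for the full cleaned document
tokens <- c("age", "wisdom", "age", "foolish", "epoch", "belief", "epoch")

tf <- table(tokens)   # named counts, one entry per distinct term
tf[["age"]]    # 2
tf[["epoch"]]  # 2
```

Each column of a TermDocumentMatrix is exactly such a count vector, one per document.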
  • 15. Dickens to Matrix: Ngrams
    library(RWeka)
    four_gram_tokeniser <- function(x) {
      NGramTokenizer(x, Weka_control(min = 1, max = 4))
    }
    tdm_4gram <- TermDocumentMatrix(corpus_1, control = list(tokenize = four_gram_tokeniser))
    dim(tdm_4gram)
    [1] 163 1

    Example ngram counts: age 2, age foolish 1, age foolish epoch 1, age foolish epoch belief 1, age wisdom 1, age wisdom age 1, age wisdom age foolish 1, author 1, author insist 1, author insist receiv 1, author insist receiv good 1, belief 1, belief epoch 1, belief epoch incredul 1, belief epoch incredul season 1, best 1, best time 1, best time worst 1, best time worst time 1, comparison 1, dark 1, dark spring 1, dark spring hope 1, dark spring hope winter 1, degre 1, degre comparison 1, despair 1, despair everyth 1, despair everyth us 1, despair everyth us noth 1
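The effect of RWeka's NGramTokenizer can be approximated in a few lines of base R; this sketch (the function name and interface are mine, not RWeka's) returns every 1- to n_max-gram of a token vector:

```r
ngram_tokens <- function(tokens, n_max = 4) {
  out <- character(0)
  for (n in seq_len(n_max)) {
    if (length(tokens) < n) break
    # slide a window of width n along the token vector
    for (i in seq_len(length(tokens) - n + 1)) {
      out <- c(out, paste(tokens[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

ngram_tokens(c("age", "wisdom", "age"), n_max = 2)
# "age" "wisdom" "age" "age wisdom" "wisdom age"
```

This makes clear why the ngram matrix on the slide has 163 rows against 35 for single terms: every window position of every width becomes its own row.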
  • 16. Electronic patient records: Gathering structured medical data Doctor enters structured data directly
  • 17. Electronic patient records: Gathering structured medical data
    Trained staff extract structured data from typed notes
    Doctor enters structured data directly
  • 18. Electronic patient records: example text
    Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
    History: X year old lady who presented with progressive dysphagia since X and was known at X Hospital. She underwent an endoscopy which found a tumour which was biopsied and is a squamous cell carcinoma. A staging CT scan picked up a left upper lobe nodule. She then went on to have an EUS at X this was performed by Dr X and showed an early T3 tumour at 35-40cm of 4 small 4-6mm para-oesophageal nodes, between 35-40cm. There was a further 7.8mm node in the AP window at 27cm, the carina was measured at 28cm and aortic arch at 24cm, the conclusion T3 N2 M0. A subsequent PET CT scan was arranged-see below. She can manage a soft diet such as Weetabix, soft toast, mashed potato and gets occasional food stuck. Has lost half a stone in weight and is supplementing with 3 Fresubin supplements per day. Performance score is 1.
  • 19. Electronic patient records: targets
    (the same example text as slide 18, with the extraction targets highlighted in the original slide)
  • 20. Electronic patient records: steps
    1. Identify patients where we have both structured data and notes (c. 20k)
    2. Extract notes and structured data from SQL database
    3. Make term document matrix (as shown previously) (60m x 20k)
    4. Split data into training and development sets
    5. Train classification model using training set
    6. Assess performance and tune model using development set
    7. Evaluate system performance on independent dataset
    8. Use system to extract structured data where we have none
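Step 4, the train/development split, can be sketched in base R on a made-up document index (the 80/20 proportion here is illustrative, not the talk's actual split):

```r
set.seed(1)                                     # reproducible split
n_docs <- 100                                   # made-up corpus size
train_idx <- sample(n_docs, size = 0.8 * n_docs)  # 80% for training
dev_idx   <- setdiff(seq_len(n_docs), train_idx)  # the rest for tuning

length(train_idx)  # 80
length(dev_idx)    # 20
```

Keeping the development set untouched during training is what makes the tuning in step 6 an honest estimate.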
  • 21. Electronic patient records: predicting disease site using the elastic net
    #fits an elastic net model, classifying into oesophagus or not,
    #selecting lambda through cross validation
    library(glmnet)
    dim(tdm) #22,843 documents, 677,017 Ngrams
    #note tdm must either be a matrix or a sparseMatrix, NOT a simple_triplet_matrix
    mod_oeso <- cv.glmnet(
      x = tdm,
      y = disease_site == 'Oesophagus',
      family = "binomial")

    beta_hat = argmin_beta ||y - X beta||^2 + lambda_2 ||beta||^2 + lambda_1 ||beta||_1
               (OLS)                        + (RIDGE)             + (LASSO)
  • 22. Electronic patient records: The Elastic Net
    #plots non-zero coefficients from elastic net model
    coefs <- coef(mod_oeso, s = mod_oeso$lambda.1se)[, 1]
    coefs <- coefs[coefs != 0]
    coefs <- coefs[order(abs(coefs), decreasing = TRUE)]
    barplot(coefs[-1], horiz = TRUE, col = 2)

    P(site = 'Oesophagus') = 0.03
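The coefficient-filtering idiom on the slide works on any named numeric vector, so it can be tried without a fitted model; here made-up coefficients stand in for coef(mod_oeso, ...):

```r
# Made-up coefficients standing in for coef(mod_oeso, s = ...)[, 1]
coefs <- c("(Intercept)" = -3.2, oesophag = 1.8, breast = -1.1,
           dysphagia = 0.9, lung = 0)

coefs <- coefs[coefs != 0]                            # drop terms the lasso zeroed out
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]  # biggest effects first

names(coefs)  # "(Intercept)" "oesophag" "breast" "dysphagia"
```

Ordering by absolute value, then dropping the intercept with coefs[-1] for plotting, puts the most informative ngrams at the top of the barplot.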
  • 23. Electronic patient records: classification performance: primary disease site
    Training set = 20,000 patients; test set = 4,000 patients; AUC = 90%.
    80% of patients can be classified with 95% accuracy (the remaining 20% can be done by human abstractors).
    Next step is a full formal evaluation on an independent dataset.
    Working in combination with a rules-based approach from Manchester University.
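Accuracy of the kind quoted on the slide is simply the proportion of predictions that match the truth; a toy check in base R (labels and counts are made up, not the study's data):

```r
pred  <- c("Oesophagus", "Oesophagus", "Other", "Other", "Oesophagus")
truth <- c("Oesophagus", "Other",      "Other", "Other", "Oesophagus")

accuracy <- mean(pred == truth)   # proportion of agreeing labels
accuracy                          # 0.8
table(pred, truth)                # 2 x 2 confusion matrix
```

AUC, by contrast, is computed from the predicted probabilities rather than the hard labels, which is why the slide reports both.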
  • 24. Electronic patient records: Possible extensions
    - Classification (hierarchical)
    - Cluster analysis (KNN)
    - Time
    - Survival
    - Drug toxicity
    - Quality of life
  • 26. Books example
    library(RCurl) # getURL
    library(XML)   # htmlParse, xpathSApply, xmlGetAttr
    library(plyr)  # alply, llply, aaply

    get_links <- function(address, link_prefix = '', link_suffix = ''){
      page <- getURL(address)
      ## Convert to R tree
      tree <- htmlParse(page)
      ## Get all link elements
      links <- xpathSApply(tree, path = "//*/a", fun = xmlGetAttr, name = "href")
      ## Convert to vector
      links <- unlist(links)
      ## add prefix and suffix
      paste0(link_prefix, links, link_suffix)
    }

    links_authors <- get_links("http://textfiles.com/etext/AUTHORS/", '/',
                               link_prefix = 'http://textfiles.com/etext/AUTHORS/')
    links_text <- alply(links_authors, 1, function(.x){
      get_links(.x, link_prefix = .x, link_suffix = '')
    })
    books <- llply(links_text, function(.x){
      aaply(.x, 1, getURL)
    })
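The final step of get_links relies on paste0 being vectorised over the scraped links; a self-contained check with made-up relative links:

```r
links <- c("plato.txt", "kant.txt")   # made-up relative links
full <- paste0("http://textfiles.com/etext/AUTHORS/", links)
full
# "http://textfiles.com/etext/AUTHORS/plato.txt" "http://textfiles.com/etext/AUTHORS/kant.txt"
```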
  • 27.
  • 28. Principal components analysis
    ## Code to get the first n principal components from a large sparse
    ## term document matrix of class dgCMatrix
    library(irlba)
    n <- 5              # number of components to calculate
    m <- nrow(tdm)      # 110703 terms in tdm matrix
    xt.x <- crossprod(tdm)
    x.means <- colMeans(tdm)
    xt.x <- (xt.x - m * tcrossprod(x.means)) / (m - 1)
    svd <- irlba(xt.x, nu = 0, nv = n, tol = 1e-10)
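irlba computes the same leading singular vectors as base R's svd(), just without forming the full decomposition, which is what makes it viable on a 110k-term matrix. On a tiny dense matrix the base function shows the idea:

```r
# A small diagonal matrix whose singular values are known by construction
X <- matrix(c(2, 0, 0,
              0, 3, 0,
              0, 0, 1), nrow = 3, byrow = TRUE)

s <- svd(X)
s$d  # singular values, largest first: 3 2 1
```

For PCA, svd$v from the slide's call plays the role of s$v here: its columns are the principal directions of the centred cross-product matrix.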
  • 29. PCA plot
    plot(svd$v[i, c(2, 3)] + 1, col = books_df$author,
         log = 'xy', xlab = 'PC2', ylab = 'PC3')
    [scatter plot of PC2 vs PC3 on log axes (roughly 0.8 to 1.4 on each), points coloured by author: ARISTOTLE, BURROUGHS, DICKENS, KANT, PLATO, SHAKESPEARE]

Editor's Notes

  1. Aims: make people want to try this themselves; get feedback or suggestions from those who have done this already.