HANDS ON:
TEXT MINING WITH R
Jahnab Kumar Deka
Introduction
• To learn from collections of text documents like books,
newspapers, emails, etc.
Important Terms:
• Tokenization
• Tagging (Noun/Verb/…)
• Chunking (Noun Phrase)
• Stemming (-ing/-s/-ed)
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of
transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for
plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
Corpus
• Collection of text
• Each corpus will have separate articles, stories, volumes,
each treated as a separate entity or record.
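• Other file formats can be converted to plain text files before building the corpus, e.g.:
• PDF to text file
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk \"$f\"; done")
• Word document to text file (antiword prints to standard output, so redirect it to a file)
• system("for f in *.doc; do antiword \"$f\" > \"${f%.doc}.txt\"; done")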
Corpus
• Consider the folder corpus/txt
• List some of the file names
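The folder listing on this slide could be sketched as below. The temporary directory and the two file names are hypothetical stand-ins for corpus/txt, created only so the snippet runs anywhere; with a real corpus, point cname at the existing folder instead.

```r
# Hypothetical stand-in for the corpus/txt folder, built in a temp directory.
cname <- file.path(tempdir(), "corpus", "txt")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Text mining with R is fun.", file.path(cname, "doc1.txt"))
writeLines("Data mining finds patterns in data.", file.path(cname, "doc2.txt"))

length(dir(cname))   # number of files in the corpus folder
head(dir(cname))     # list some of the file names
```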
Loading Corpus
• Loading Corpus
** Using DirSource() the source object is passed on to Corpus() which loads the documents.
• In case of PDF Documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
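** readPDF() relies on an external PDF extraction tool such as xpdf, which needs to be installed
• In case of Word Documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC(AntiwordOptions="-r -s")))
** readDOC() passes these options to antiword, which needs to be installed
** -r requests that text removed by revisioning be included in the output
** -s requests that text hidden by Word be included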
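For plain text files, the loading step described above might look like the following sketch. The folder and file contents are hypothetical, written to a temporary directory so the example is self-contained.

```r
library(tm)

# Hypothetical two-document corpus written to a temporary folder.
cname <- file.path(tempdir(), "corpus_load")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Text mining with R.", file.path(cname, "a.txt"))
writeLines("Mining data for patterns.", file.path(cname, "b.txt"))

# DirSource() describes where the files are; Corpus() reads them.
docs <- Corpus(DirSource(cname))

length(docs)              # number of documents loaded
as.character(docs[[1]])   # text of the first document
```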
Exploration of Corpus
• inspect(docs)
• Preparing the corpus
• Transformation types: getTransformations() lists the built-in ones
• tm_map() is used to apply one of these transformations to every document in the corpus
• Other transformations can be implemented using R functions wrapped within content_transformer()
Transformation Example
• replace “/”, “@” and “|” with a space
• Alternate method
• Conversion to lower case
• Remove Numbers
• Remove Punctuation
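The transformations listed on this slide could be applied as in the sketch below. The two sample sentences and the helper name toSpace are assumptions made for illustration; VCorpus(VectorSource()) stands in for the corpus loaded earlier.

```r
library(tm)

# Hypothetical mini-corpus standing in for the loaded documents.
docs <- VCorpus(VectorSource(c("Mail me@example.com or call 555/1234 | NOW!",
                               "R and/or Python: 2 tools.")))

# Custom transformation: replace every match of a pattern with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/|@|\\|")

docs <- tm_map(docs, content_transformer(tolower))  # lower case
docs <- tm_map(docs, removeNumbers)                 # drop digits
docs <- tm_map(docs, removePunctuation)             # drop punctuation

as.character(docs[[2]])
```

Base R functions such as tolower() are wrapped in content_transformer() so that tm_map() returns documents rather than bare character vectors.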
Contd...
• Remove English Stop Words
• Remove Own Stop Words
• Strip Whitespace
• Specific Transformations
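A minimal sketch of these four steps follows. The sample sentences, the custom stop words ("department", "email") and the helper replaceText are hypothetical choices for illustration, not taken from the slides.

```r
library(tm)

# Hypothetical mini-corpus with extra whitespace.
docs <- VCorpus(VectorSource(c("the department of  data   mining",
                               "an email about the   text mining course")))

docs <- tm_map(docs, removeWords, stopwords("english"))      # English stop words
docs <- tm_map(docs, removeWords, c("department", "email"))  # own stop words
docs <- tm_map(docs, stripWhitespace)                        # collapse repeated spaces

# Specific transformation: rewrite a phrase with a custom function.
replaceText <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replaceText, "text mining", "textmining")

as.character(docs[[2]])
```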
Contd...
• Stemming
• Creating a Document Term Matrix
A matrix with documents as the rows, terms as the columns, and the frequency counts of the words as the cells.
• Term frequency
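Stemming, the document term matrix and the term frequencies could be computed as sketched here. The two sample documents are hypothetical, and the name freq is chosen to match the variable used on the following slides.

```r
library(tm)
library(SnowballC)   # stemming backend used by stemDocument()

# Hypothetical mini-corpus with several inflections of the same word.
docs <- VCorpus(VectorSource(c("mining mined mines data",
                               "data analysis data")))
docs <- tm_map(docs, stemDocument)   # reduce words to their stems

# Documents as rows, terms as columns, counts as cells.
dtm <- DocumentTermMatrix(docs)
dim(dtm)

# Term frequency: total count of each term across all documents.
freq <- colSums(as.matrix(dtm))
freq
```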
Contd...
• Order the terms by frequency
• ord <- order(freq)
• Least frequent terms
• freq[head(ord)]
• Most frequent terms
• freq[tail(ord)]
• Document term matrix to CSV
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
Contd...
• Removing Sparse Terms
• dtms <- removeSparseTerms(dtm, 0.1) # sparse factor
• The resulting matrix contains only terms whose sparsity (the share of documents in which the term does not occur) is below the sparse factor.
• Frequent items and association
** findFreqTerms() with lowfreq = 1000 returns the terms that occur at least 1000 times
• Association of a word with other words, subject to a correlation limit
• E.g. the association of “data” with other words
• If two words always appear together, their correlation is 1.0
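The three calls described on this slide can be tried on a toy corpus, as in the sketch below. The four documents and the thresholds (0.6, 2, 0.5) are assumptions scaled down from the slide's values so that a tiny example gives non-empty results.

```r
library(tm)

# Hypothetical four-document corpus.
docs <- VCorpus(VectorSource(c("data mining data", "data mining text",
                               "text analysis", "plain words")))
dtm <- DocumentTermMatrix(docs)

# Keep only terms that are absent from fewer than 60% of the documents.
dtms <- removeSparseTerms(dtm, 0.6)
Terms(dtms)

# Terms occurring at least twice in the whole corpus.
findFreqTerms(dtm, lowfreq = 2)

# Terms whose correlation with "data" is at least 0.5.
findAssocs(dtm, "data", corlimit = 0.5)

# "plain" and "words" always appear together, so their correlation is 1.
findAssocs(dtm, "plain", corlimit = 0.5)
```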
Correlation
• 50 of the more frequent words
• With a minimum correlation of 0.5
• Words that occur at least 100 times
• By default
• 20 random terms
• With a minimum correlation of 0.7
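A correlation plot with these settings might be produced as sketched below. This assumes the plot method that tm provides for document term matrices, which needs the Bioconductor package Rgraphviz; the toy corpus and the lowered frequency threshold of 2 are hypothetical, and the plot is skipped when Rgraphviz is not installed.

```r
library(tm)

# Hypothetical four-document corpus.
docs <- VCorpus(VectorSource(c("data mining data", "data mining text",
                               "text analysis", "plain words")))
dtm <- DocumentTermMatrix(docs)

# Up to 50 terms occurring at least twice (the slide uses 100 on a large corpus).
terms <- head(findFreqTerms(dtm, lowfreq = 2), 50)

# Draw edges between terms whose correlation is at least 0.5.
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  plot(dtm, terms = terms, corThreshold = 0.5)
}
```

Calling plot(dtm) without further arguments falls back to the defaults listed above: 20 randomly sampled terms and a correlation threshold of 0.7.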
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• The plot shows the words that occur at least 500 times in the corpus
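One way to draw such a frequency plot with ggplot2 is sketched below. The term counts are hypothetical stand-ins for colSums(as.matrix(dtm)), and the chart styling is an assumption rather than the slide's exact figure.

```r
library(ggplot2)

# Hypothetical term counts standing in for colSums(as.matrix(dtm)).
freq <- sort(c(data = 900, mining = 650, text = 520, cloud = 40), decreasing = TRUE)
wf   <- data.frame(word = names(freq), freq = freq)

# Keep the words that occur at least 500 times.
wf500 <- subset(wf, freq >= 500)

# Bar chart with the most frequent word first.
p <- ggplot(wf500, aes(x = reorder(word, -freq), y = freq)) +
  geom_col() +
  labs(x = "Word", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```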
Word cloud
Size of Word & Frequency
• library(wordcloud) # Provides wordcloud().
• Limit the number of words
• wordcloud(names(freq), freq, max.words=100)
• For term frequency limitation
• wordcloud(names(freq), freq, min.freq=100)
• Adding Color
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters
• Generate frequencies and percentages
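The two steps above can be sketched in base R as follows. This is a stand-in for the qdap helpers the slide refers to, not their actual output format; the small matrix m is a hypothetical replacement for as.matrix(dtm), and tabulating word lengths is an assumption about what the frequencies describe.

```r
# Hypothetical document term matrix as a plain matrix.
m <- matrix(c(2, 1, 0, 1, 0, 3), nrow = 2,
            dimnames = list(c("doc1", "doc2"),
                            c("data", "mining", "supercalifragilisticexpialidocious")))

# The terms are the column names; keep those shorter than 20 characters.
words <- colnames(m)
words <- words[nchar(words) < 20]

# Frequencies and percentages of the word lengths.
len_tab <- table(nchar(words))
len_df  <- data.frame(length  = as.integer(names(len_tab)),
                      freq    = as.vector(len_tab),
                      percent = round(100 * as.vector(prop.table(len_tab)), 2))
len_df
```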
Contd...
• Word Length Counts
** vertical line = Mean length of words
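A word length histogram with the mean marked by a vertical line could be drawn as in this sketch. The word list is hypothetical and the styling is an assumption; only the idea of a vertical line at the mean length comes from the slide.

```r
library(ggplot2)

# Hypothetical terms standing in for the retained column names.
words <- c("data", "mining", "text", "corpus", "r", "analysis")
lens  <- data.frame(nletters = nchar(words))

# Histogram of word lengths with a vertical line at the mean length.
p <- ggplot(lens, aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(lens$nletters), colour = "green") +
  labs(x = "Number of letters", y = "Number of words")
print(p)
```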
Letter and Position Heatmap
