Data Science 101: Using R Language
to get Big Insights
Satnam Singh,
Senior Chief Engineer,
Samsung Research India – Bangalore
[ Twitter - @satnam74s]
India Software Developers Conference, Bangalore
March 16, 2013
Motivation: Using Data to get Business Insights
[Figure: three groups of databases and clusters, each leading to the question "Insights?"]
Ref. [kaggle.com]
Data Science Programming Languages
Why R?
• Popular, Free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language
Ref [http://www.r-project.org]
R Language Basics
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
Simple Operations
Vector Operations
Function Calls
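The three labels above can be illustrated with a minimal sketch. The variable names (y_plus_one, v, v_doubled, v_mean, v_sum) are made up for this example and are not from the slides.

```r
# Simple operation: assignment and arithmetic on a single value
y <- 21
y_plus_one <- y + 1

# Vector operation: arithmetic is applied element-wise, no explicit loop
v <- c(1, 2, 3, 4)
v_doubled <- v * 2

# Function calls: built-in summaries take the whole vector as input
v_mean <- mean(v)
v_sum <- sum(v)

print(v_doubled)
print(c(mean = v_mean, sum = v_sum))
```

Vectorization (listed under "Why R?") is what lets v * 2 work without a loop.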
R Language: Data Structures Examples
• Data frame
• Matrix
• List
> MyFamilyage <- c(5, 6, 40, 38)
> MyFamilyName <- c("Sat", "Veera", "Minu", "Dummy")
> MyFamilyweight <- c(72, 70, 12, 40)
> MyFamily <- data.frame(MyFamilyName, MyFamilyage, MyFamilyweight)
> MyMatrix <- as.matrix(MyFamilyage)
> Mydataframe <- as.data.frame(MyMatrix)
> MyList <- as.list(Mydataframe)
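A minimal sketch of how the three structures differ in practice. The object and column names (family, name, age, weight, adults) are illustrative choices for this example, not from the slides.

```r
# Data frame: columns may have different types (character and numeric here)
family <- data.frame(
  name   = c("Sat", "Veera", "Minu", "Dummy"),
  age    = c(5, 6, 40, 38),
  weight = c(72, 70, 12, 40),
  stringsAsFactors = FALSE
)

# Row filtering with a logical condition on one column
adults <- family[family$age >= 18, ]

# Matrix: a single type for all cells, so keep only the numeric columns
m_numeric <- as.matrix(family[, c("age", "weight")])

# Converting the whole data frame forces every cell to character
m_mixed <- as.matrix(family)

# List: one element per column of the data frame
family_list <- as.list(family)

print(dim(m_numeric))
print(names(family_list))
```

Converting a mixed-type data frame to a matrix silently turns the numbers into strings, which is why the numeric columns are selected first.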
Case Study: Activity Recognition
• Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
[Figure: example of accelerometer data from a smartphone's accelerometer sensor]
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
Data Analysis - Steps
• Time Series Data: 200 samples (10 sec)
• Feature Extraction: 43 Features
  • Mean for each acc. axis (3)
  • Std. dev. for each acc. axis (3)
  • Avg. abs. diff. from mean for each acc. axis (3)
  • Avg. resultant acc. (1)
  • Histogram (30)
• Classifiers: classify the activity
  • CART: Decision Tree
  • RF: Random Forest
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
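The feature extraction step can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the accelerometer window is synthetic, the function name extract_features is hypothetical, and the histogram is assumed to use 10 equal-width bins per axis. The feature groups listed on the slide add up to 40 of the 43 features, so the sketch produces 40 values.

```r
set.seed(42)   # reproducible synthetic data
n <- 200       # one window: 200 samples over 10 sec, i.e. 20 samples per second

# Synthetic tri-axial accelerometer window (assumption: real data would come from the phone)
acc <- data.frame(x = rnorm(n, 0, 2), y = rnorm(n, 9.8, 3), z = rnorm(n, 0, 1))

extract_features <- function(w, bins = 10) {
  avg    <- sapply(w, mean)                                  # mean per axis (3)
  stddev <- sapply(w, sd)                                    # std. dev. per axis (3)
  absdev <- sapply(w, function(a) mean(abs(a - mean(a))))    # avg. abs. diff. from mean (3)
  resultant <- mean(sqrt(w$x^2 + w$y^2 + w$z^2))             # avg. resultant acceleration (1)
  # Fraction of samples in each of `bins` equal-width bins, per axis (30)
  hist_frac <- unlist(lapply(w, function(a) {
    breaks <- seq(min(a), max(a), length.out = bins + 1)
    as.numeric(table(cut(a, breaks, include.lowest = TRUE))) / length(a)
  }))
  c(avg = avg, stddev = stddev, absdev = absdev,
    resultant = resultant, hist = hist_frac)
}

features <- extract_features(acc)
print(length(features))
print(round(features[1:10], 3))
```

Each 10-second window becomes one row of features, and those rows are what the classifiers are trained on.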
Data Visualization – Activity (Class Variable)
[Ref] Rattle R Data Mining Tool
# barplot2() comes from the gplots package, rainbow_hcl() from colorspace
require(gplots, quietly=TRUE)
require(colorspace, quietly=TRUE)
# One row of class counts for the whole dataset, then one row per activity
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
# Order the classes by overall frequency
ord <- order(ds[1,], decreasing=TRUE)
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
               ylim=c(0, 2497), col=rainbow_hcl(7))
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
         xlab="Frequency", ylab="class", pch=c(1:6, 19))
[Figures: Bar Plot and Dot Plot of the class frequencies]
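For a quick look at the class distribution without the Rattle-generated code, base R's table() gives the same kind of counts. A minimal sketch, assuming a small hypothetical factor named activity in place of crs$dataset$class:

```r
# Hypothetical activity labels standing in for the class column of the dataset
activity <- factor(c("Walking", "Jogging", "Walking", "Sitting", "Upstairs",
                     "Walking", "Jogging", "Standing", "Downstairs", "Walking"))

# Count each class and sort from most to least frequent
freq <- sort(table(activity), decreasing = TRUE)
print(freq)

# Draw the bar plot only in an interactive session
if (interactive()) {
  barplot(freq, ylab = "Frequency", xlab = "class")
}
```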
Data Visualization Example – Variable Yavg.
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class",
              ylab="YAVG", varwidth=TRUE, notch=TRUE)
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
           col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
[Ref] Rattle R Data Mining Tool
Correlation Plot
• Easy to interpret
• Blue: Positive correlation
• Red: Negative correlation
[Ref] Rattle R Data Mining Tool
require(ellipse, quietly=TRUE)
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
Data Science R Packages
Category     Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
             kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
             rpart         rpart         Recursive partitioning and regression trees
             ksvm          kernlab       Support Vector Machine
             apriori       arules        Rule based classification
Ensemble     ada           ada           Stochastic boosting
             randomForest  randomForest  Random Forests classification and regression
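As an illustration of the two clustering functions from the stats package, here is a minimal sketch. It uses R's built-in iris data as a stand-in, since the smartphone dataset itself is not part of the slides; the choice of 3 clusters and of "complete" linkage is an assumption made for the example.

```r
set.seed(1)                      # kmeans starts from random centers

# Standardize the four numeric columns so no single variable dominates the distance
x <- scale(iris[, 1:4])

# K-means clustering with 3 centers and 10 random restarts
km <- kmeans(x, centers = 3, nstart = 10)

# Hierarchical clustering on Euclidean distances, then cut the tree into 3 groups
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Compare the cluster assignments with the known species labels
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```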
Decision Tree - Visualization
[Ref] Rattle R Data Mining Tool
• Decision Tree Model Results:
n= 3792
1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
Root node error: 2364/3792 = 0.62342
Decision Tree
rpart(formula = class ~ ., data = smartphone_data, method = "class",
      parms = list(split = "information"),
      control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
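The rpart call above can be tried end to end with a minimal sketch. Since smartphone_data is not available here, the built-in iris data is used as a stand-in; the 100/50 train/test split and the names train_idx, fit, pred and accuracy are assumptions made for this example.

```r
library(rpart)

set.seed(7)                               # reproducible train/test split
train_idx <- sample(nrow(iris), 100)      # 100 rows for training, 50 held out

# Same settings as the call on the slide, with Species as the class variable
fit <- rpart(Species ~ ., data = iris[train_idx, ], method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

# Printing the fit gives the node listing format shown in the model results
print(fit)

# Predict the class of the held-out rows and measure the accuracy
pred <- predict(fit, newdata = iris[-train_idx, ], type = "class")
accuracy <- mean(pred == iris$Species[-train_idx])
print(accuracy)
```

In the printed tree, each node line shows the split, the number of observations, the number misclassified, the predicted class and the class probabilities; a trailing * marks a leaf.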
Random Forest: Ensemble of Trees
[Ref] Rattle R Data Mining Tool
[Figure: Tree 1, Tree 2, …, Tree n are combined (Σ) into the Random Forest output]
• Random Forest Model Results:
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
             Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs          204       7       0        1       64      97  0.45308311
Jogging               6    1117       0        0        8       7  0.01845343
Sitting               0       0     209        5        1       0  0.02790698
Standing              4       0       0      177        4       0  0.04324324
Upstairs             48      31       1        0      276      97  0.39072848
Walking              20       1       1        1       15    1390  0.02661064
Random Forest Package in R
randomForest(formula = class ~ ., data = smartphone_data, ntree = 300,
             mtry = 6, importance = TRUE, replace = FALSE,
             na.action = na.roughfix)
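The class.error column and the OOB error rate follow directly from the confusion matrix. A short base R sketch of the arithmetic, with the counts copied from the results above; the object names conf, class_error and oob_error are chosen for this example.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Rows are the actual classes, columns are the predicted classes
conf <- matrix(c(204,    7,   0,   1,  64,   97,
                   6, 1117,   0,   0,   8,    7,
                   0,    0, 209,   5,   1,    0,
                   4,    0,   0, 177,   4,    0,
                  48,   31,   1,   0, 276,   97,
                  20,    1,   1,   1,  15, 1390),
               nrow = 6, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was predicted as something else
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: share of all observations that fall off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

The counts show that Downstairs and Upstairs are mostly confused with each other and with Walking, which explains their much higher class errors.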
Summary
• Fusion of data science and domain knowledge enables big insights from the data
• The R language provides a platform to rapidly build prototypes and test ideas
• Getting data insights is the outcome of an intense team effort between various stakeholders
References
• R Project: http://www.r-project.org
• Activity Recognition Dataset: "The Impact of Personalization on Smartphone-Based Activity Recognition", Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
• "Activity and Gait Recognition with Time-Delay Embeddings", Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
• R wiki: http://rwiki.sciviews.org/doku.php
• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
• Rattle, R Data Mining Tool: http://rattle.togaware.com/
• Sensor Platforms: http://www.sensorplatforms.com/context-aware/
• Movea: http://www.movea.com/
• Alohar: https://www.alohar.com


Editor's notes

  1. The R statistical programming language is a free, open source package based on the S language developed by Bell Labs. The language is very powerful for writing programs. Many statistical functions are already built in. Contributed packages expand the functionality to cutting-edge research. Since it is a programming language, generating computer code to complete tasks is required. It implements many common statistical procedures. It has a large collection of intermediate tools for data analysis. It has excellent graphical facilities for data analysis and display, either on-screen or on hardcopy. It is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities. Versions of R exist for Windows, MacOS, Linux and various other Unix flavors. It has a vibrant worldwide community.
  2. Command c creates a vector that is assigned to object a.
  3. Data frame: a table where columns can contain numeric and string values. Matrix: all columns must contain either numeric or string values, but these cannot be combined. Data frame d is converted into a matrix e. R: f<-as.data.frame(e). Matrix e is converted into a data frame f.
  4. A smartphone has a tri-axial accelerometer that measures acceleration in all three spatial dimensions. Accuracy is ~75% for the general model and >95% for a personalized model using 10 seconds of training for each activity. The accelerometer is a low-power sensor, so it can be used for the whole day.
  5. The 'randomForest' package provides the 'randomForest' function. The 'party' package provides conditional random forests. 'randomForest' can be used for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.