India software developers conference 2013 Bangalore
1. Data Science 101: Using R Language
to get Big Insights
Satnam Singh,
Senior Chief Engineer,
Samsung Research India – Bangalore
[ Twitter - @satnam74s]
India Software Developers Conference, Bangalore
March 16, 2013
2. Motivation: Using Data to get Business Insights
[Figure: several databases and clusters, each prompting the question "Insights?"]
3. Data Science Programming Languages
Ref. [kaggle.com]
Why R?
• Popular, Free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language
Ref [http://www.r-project.org]
4. R Language Basics
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
Simple Operations
Vector Operations
Function Calls
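The three labels above (simple operations, vector operations, function calls) can be illustrated with a minimal sketch. The variable names v, total, doubled and shifted are not from the slide; they are introduced here only so that the vector does not overwrite y.

```r
# Simple operations: a scalar in R is a vector of length one
y <- 21
z <- 233
total <- y + z                     # arithmetic on two scalars

# Vector operations: arithmetic is applied element by element (vectorization)
v <- c(1, 2, 3, 4)
doubled <- v * 2                   # each element multiplied by 2
shifted <- v + c(10, 20, 30, 40)   # element-wise sum of two vectors

# Function calls: built-in functions accept whole vectors as arguments
v_mean <- mean(v)
v_sum  <- sum(v)

print(total)
print(doubled)
print(shifted)
print(v_mean)
```

Because the operations are vectorized, no explicit loop over the elements of v is needed.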
6. Case Study: Activity Recognition
• Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
[Figure: example of accelerometer data from a smartphone's accelerometer sensor]
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham
University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
7. Data Analysis - Steps
Time Series Data: 200 samples (10 sec)
Feature Extraction: 43 Features
• Mean for each acc. axis (3)
• Std. dev. for each acc. axis (3)
• Avg. abs. diff. from mean for each acc. axis (3)
• Avg. resultant acc. (1)
• Histogram (30)
Classifiers
• CART: Decision Tree
• RF: Random Forest
Classify the Activity
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
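The feature extraction step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the talk: the helper extract_features, the column names x, y, z and the synthetic window are all hypothetical, and the histogram is assumed to be 10 equal-width bins per axis over the observed range. The feature groups listed on the slide add up to 40 values, so the sketch returns 40; the remaining features of the 43 are not listed on the slide and are not reproduced here.

```r
# Hypothetical helper: summarise one window of tri-axial accelerometer data.
# `window` is a data frame with numeric columns x, y, z (one row per sample).
extract_features <- function(window, n_bins = 10) {
  feats <- c()
  for (a in c("x", "y", "z")) {
    v <- window[[a]]
    m <- mean(v)
    # Assumed binning: equal-width bins over the observed range of this axis,
    # stored as the fraction of samples that fall into each bin
    breaks <- seq(min(v), max(v), length.out = n_bins + 1)
    bins <- hist(v, breaks = breaks, plot = FALSE)$counts / length(v)
    names(bins) <- paste0(a, "_bin", seq_len(n_bins))
    feats <- c(feats,
               setNames(m, paste0(a, "_mean")),                    # mean
               setNames(sd(v), paste0(a, "_sd")),                  # std. dev.
               setNames(mean(abs(v - m)), paste0(a, "_absdev")),   # avg. abs. diff. from mean
               bins)
  }
  # Average resultant acceleration: mean of sqrt(x^2 + y^2 + z^2)
  resultant <- mean(sqrt(window$x^2 + window$y^2 + window$z^2))
  c(feats, resultant = resultant)
}

# Synthetic stand-in for one 10-second window of 200 samples
set.seed(1)
window <- data.frame(x = rnorm(200), y = rnorm(200), z = rnorm(200, mean = 9.8))
features <- extract_features(window)
print(length(features))
```

Applying such a helper to every window would give one feature row per window, which is the shape the classifiers expect.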
8. Data Visualization – Activity (Class Variable)
[Ref] Rattle R Data Mining Tool
library(gplots)      # provides barplot2()
library(colorspace)  # provides rainbow_hcl()
# crs is the environment in which Rattle keeps the loaded dataset.
# Class counts: first row for the full dataset, then one row per class subset.
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
# Order the classes by overall frequency
ord <- order(ds[1,], decreasing=TRUE)
# Bar Plot
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
               ylim=c(0, 2497), col=rainbow_hcl(7))
# Dot Plot
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
         xlab="Frequency", ylab="class", pch=c(1:6, 19))
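Outside Rattle, a similar class-frequency view can be sketched with base R only. The labels below are randomly generated stand-ins for the class column, so the counts are not those of the real dataset; class_labels and class_counts are hypothetical names.

```r
# Hypothetical stand-in for the class column: a factor of activity labels
set.seed(42)
activities <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")
class_labels <- factor(sample(activities, size = 500, replace = TRUE),
                       levels = activities)

# table() counts the observations per class; sort() orders them for plotting
class_counts <- sort(table(class_labels), decreasing = TRUE)
print(class_counts)

# Draw the plots only when running in an interactive session
if (interactive()) {
  barplot(class_counts, ylab = "Frequency", xlab = "class")
  dotchart(as.numeric(rev(class_counts)), labels = names(rev(class_counts)),
           xlab = "Frequency")
}
```

A strongly unbalanced class distribution is worth spotting at this stage, since it affects how per-class error rates should be read later.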
14. Random Forest: Ensemble of Trees
[Ref] Rattle R Data Mining Tool
[Figure: Tree 1, Tree 2, …, Tree n; the outputs of the individual trees are combined (Σ) into the Random Forest prediction]
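The ensemble idea can be sketched as plain bagging: many trees are fitted on bootstrap samples and their predictions are combined by majority vote. This is a simplified illustration, not the random forest algorithm itself, which additionally samples a subset of features at each split. It assumes the rpart package is installed and uses the built-in iris data as a stand-in for the activity dataset; n_trees, votes and ensemble_pred are hypothetical names.

```r
library(rpart)  # decision trees; a recommended package in standard R installs

set.seed(7)
n_trees <- 25
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Each tree votes: votes is a (test rows x trees) character matrix
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, test, type = "class"))
})

# The ensemble prediction is the most frequent vote in each row
ensemble_pred <- apply(votes, 1, function(row) names(which.max(table(row))))

accuracy <- mean(ensemble_pred == as.character(test$Species))
print(accuracy)
```

Averaging over many trees fitted on different samples reduces the variance of a single decision tree, which is the motivation for the ensemble.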
15. Random Forest Model Results
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
           Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs        204       7       0        1       64      97  0.45308311
Jogging             6    1117       0        0        8       7  0.01845343
Sitting             0       0     209        5        1       0  0.02790698
Standing            4       0       0      177        4       0  0.04324324
Upstairs           48      31       1        0      276      97  0.39072848
Walking            20       1       1        1       15    1390  0.02661064
Random Forest Package in R
randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
             importance = TRUE, replace = FALSE, na.action = na.roughfix)
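The reported error rates follow directly from the confusion matrix, where rows are the actual classes and columns the predicted ones. The sketch below re-enters the counts from the slide and recomputes the per-class error and the overall OOB error; the variable names are chosen here for illustration.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
confusion <- matrix(c(204,    7,   0,   1,  64,   97,
                        6, 1117,   0,   0,   8,    7,
                        0,    0, 209,   5,   1,    0,
                        4,    0,   0, 177,   4,    0,
                       48,   31,   1,   0, 276,   97,
                       20,    1,   1,   1,  15, 1390),
                    nrow = 6, byrow = TRUE,
                    dimnames = list(actual = classes, predicted = classes))

n_obs <- sum(confusion)                                  # total observations
class_error <- 1 - diag(confusion) / rowSums(confusion)  # per-class error rate
oob_error <- 1 - sum(diag(confusion)) / n_obs            # overall error rate

print(n_obs)                       # 3792
print(round(class_error, 4))
print(round(100 * oob_error, 2))   # 11.05
```

The per-class values show that most of the error comes from Downstairs and Upstairs, which are mainly confused with each other and with Walking.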
16. Summary
• Fusion of data science and domain knowledge enables big insights from the data
• The R language provides a platform to rapidly build prototypes and test ideas
• Getting data insights is the outcome of intense team effort among various stakeholders
17. References
• R Project: http://www.r-project.org
• Activity Recognition Dataset: "The Impact of Personalization on Smartphone-Based Activity Recognition", Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
• "Activity and Gait Recognition with Time-Delay Embeddings", Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
• R wiki: http://rwiki.sciviews.org/doku.php
• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
• Rattle, R Data Mining Tool: http://rattle.togaware.com/
• Sensor Platforms: http://www.sensorplatforms.com/context-aware/
• Movea: http://www.movea.com/
• Alohar: https://www.alohar.com
Editor's notes
The R statistical programming language is a free open source package based on the S language developed by Bell Labs.
• The language is very powerful for writing programs.
• Many statistical functions are already built in.
• Contributed packages expand the functionality to cutting-edge research.
• Since it is a programming language, writing code to complete tasks is required.
• It implements many common statistical procedures.
• It has a large collection of intermediate tools for data analysis.
• Excellent graphical facilities for data analysis and display, either on-screen or on hardcopy.
• A well-developed, simple and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.
• Versions of R exist for Windows, MacOS, Linux and various other Unix flavors.
• A vibrant worldwide community.
Command c creates a vector that is assigned to object a
Data frame: a table where columns can contain numeric and string values.
Matrix: all columns must contain either numeric or string values, but these cannot be combined.
Data frame d is converted into a matrix e.
R: f<-as.data.frame(e)
Matrix e is converted into a data frame f.
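The difference between the two structures can be seen in a short sketch. The values in d are made up for illustration, and as.matrix() is assumed as the data-frame-to-matrix conversion, since the note only shows the reverse direction.

```r
# Data frame d: one string column and one numeric column (made-up values)
d <- data.frame(activity = c("Walking", "Jogging", "Sitting"),
                mean_acc = c(9.9, 12.4, 9.8))
str(d)

# Assumed conversion: a matrix holds a single type, so with a string column
# present every cell becomes a character string
e <- as.matrix(d)
print(is.character(e))

# Matrix e is converted back into a data frame f, as in the note
f <- as.data.frame(e)
print(class(f))
```

After the round trip the numeric column of f is no longer numeric, which is why mixed-type data is usually kept in a data frame.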
A smartphone has a tri-axial accelerometer that measures acceleration in all three spatial dimensions. Accuracy is about 75% for a general model and above 95% for a personalized model using 10 seconds of training data for each activity. The accelerometer is a low-power sensor, so it can be used for the whole day.
The 'randomForest' package provides the 'randomForest' function. The 'party' package provides conditional random forests. 'randomForest' can be used for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
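A minimal usage sketch of the 'randomForest' function for classification follows, assuming the randomForest package is installed. The wrapper fit_activity_forest is hypothetical and only repeats the arguments of the call shown on the slide; the built-in iris data stands in for the smartphone dataset, with mtry lowered to 2 because iris has only four predictors.

```r
# Hypothetical wrapper around the randomForest() call shown on the slide.
# `data` must contain a factor column named `class`.
fit_activity_forest <- function(data, ntree = 300, mtry = 6) {
  randomForest::randomForest(class ~ ., data = data, ntree = ntree, mtry = mtry,
                             importance = TRUE, replace = FALSE,
                             na.action = randomForest::na.roughfix)
}

# Run the demo only if the package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  demo <- iris
  names(demo)[names(demo) == "Species"] <- "class"   # match the formula's response
  set.seed(123)
  model <- fit_activity_forest(demo, ntree = 100, mtry = 2)
  print(model$confusion)                   # OOB confusion matrix with class.error
  print(randomForest::importance(model))   # variable importance measures
}
```

With importance = TRUE the fitted model also stores permutation-based importance, which helps to see which extracted features drive the classification.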