R Workshop I
Get to know the NYC open data portal and start to use R
Vivian Zhang for NYC-open-data meetup
http://www.meetup.com/NYC-Open-Data/
R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1
1 of 27 6/13/14, 1:50 PM

Overview
· NYC open data portal
· RStudio
· R
· GitHub
· hack time

Advantages of using RStudio
· Ease of use
- install and load R packages
- keep track of the R development version
- download GitHub repositories
- debug faster

diamonds subsetting example
require(ggplot2)
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

diamonds subsetting example
head(diamonds[-1, ])
## carat cut color clarity depth table price x y z
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47

diamonds subsetting example
head(diamonds[, -1])
## cut color clarity depth table price x y z
## 1 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

diamonds subsetting example
head(diamonds[c(1, 2), ])
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31

diamonds subsetting example
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
head(diamonds[, c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)])
## carat cut price
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
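
Typing one logical value per column becomes error-prone as the number of columns grows. As a minimal sketch, the logical index can be built from the column names with `%in%`. The data frame `df` below is made up for illustration and only stands in for diamonds.

```r
# Small made-up data frame standing in for diamonds (values are illustrative only)
df <- data.frame(
  carat = c(0.23, 0.21, 0.23),
  cut   = c("Ideal", "Premium", "Good"),
  color = c("E", "E", "E"),
  price = c(326, 326, 327)
)

# Build the logical index from the column names instead of typing it by hand
keep <- names(df) %in% c("carat", "cut", "price")
keep

# Columns where keep is TRUE are retained
subset_df <- df[, keep]
names(subset_df)
```

The vector `keep` has one entry per column, so it stays correct if columns are later reordered.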

diamonds subsetting example
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
head(diamonds$carat)
## [1] 0.23 0.21 0.23 0.29 0.31 0.24
diamonds[diamonds$price == max(diamonds$price), ]
## carat cut color clarity depth table price x y z
## 27750 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
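
The row filter above works because the comparison returns one TRUE or FALSE per row. A short sketch of the same idea on a made-up data frame (the values are invented, not taken from diamonds), with `which.max()` shown as an alternative that returns only the first maximum:

```r
# Made-up data for illustration
df <- data.frame(
  carat = c(0.23, 0.21, 1.50, 0.29),
  price = c(326, 326, 9000, 334)
)

# The comparison yields one TRUE/FALSE per row
is_max <- df$price == max(df$price)
is_max

# Rows where the condition is TRUE (every tied row would be returned)
top_rows <- df[is_max, ]

# which.max() gives the position of the first maximum only
top_first <- df[which.max(df$price), ]

# Conditions can be combined with & (and) and | (or)
cheap_small <- df[df$price < 330 & df$carat < 0.23, ]
```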

reading and subsetting data in R
· blank
- include all
· integer
- +: include; -: exclude
· logical
- include TRUEs
· character
- lookup by name
Source: Hadley Wickham
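
The four index types can be tried side by side. This is a minimal sketch on a made-up data frame; the name `df` and its values are assumptions for illustration.

```r
# Made-up data frame for illustration (not the diamonds data)
df <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = c("Ideal", "Premium", "Good"),
  price = c(326, 326, 334)
)

all_rows   <- df[, ]                     # blank: include everything
first_two  <- df[c(1, 2), ]              # positive integers: include these rows
drop_first <- df[-1, ]                   # negative integers: exclude these rows
pricey     <- df[df$price > 330, ]       # logical: keep rows where TRUE
by_name    <- df[, c("carat", "price")]  # character: look up columns by name
```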

data structure in R
Source: Hadley Wickham

read in the open data
· read.table()
· read.csv()
rodent1year <- read.csv("C:UserszhangsGoogle DriveR codeRworkshop311_Service_Requests_from_2010_
header = TRUE, sep = ",")
dim(rodent1year)
summary(rodent1year)
table(rodent1year$Borough)
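
The same steps can be tried without the 311 file. The sketch below writes a tiny made-up CSV to a temporary file and reads it back; only the column name `Borough` mirrors the slide, while the other names and all rows are invented.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Complaint.Type,Borough",
  "Rodent,BROOKLYN",
  "Rodent,MANHATTAN",
  "Rodent,BROOKLYN"
), csv_path)

# On Windows, write paths with forward slashes or doubled backslashes,
# because a single backslash starts an escape sequence in an R string.
rodents <- read.csv(csv_path, header = TRUE, sep = ",")

dim(rodents)                      # number of rows and columns
summary(rodents)                  # per-column summary
counts <- table(rodents$Borough)  # number of records per borough
counts
```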

plot diamonds
with() is a generic function that evaluates an expression in a local environment constructed from data.
with(diamonds, plot(carat, price))
In ggplot2, "aes" stands for "aesthetics" and "geom" for geometric object; geom_point() draws a scatterplot.
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
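
What with() saves is the repeated data frame prefix. A small sketch on made-up numbers (the data frame `df` and the price-per-carat calculation are illustrative assumptions):

```r
# Made-up data for illustration
df <- data.frame(carat = c(0.2, 0.5, 1.0), price = c(300, 1500, 5000))

# Without with(): every column needs the df$ prefix
ppc_long <- df$price / df$carat

# With with(): column names are looked up inside df first
ppc_short <- with(df, price / carat)

# Both give the same result
identical(ppc_long, ppc_short)
```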

plot diamonds
ggplot2 generates more sophisticated graphs than the traditional graphics package. Let us play with some color.
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point()

plot diamonds
Instead of fitting a linear relation, we try a log-log relation.
log(price) is quite linear in log(carat). Bingo!
ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point()
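
A straight line on the log-log scale corresponds to a power law on the original scale. The sketch below checks this with lm() on synthetic data; the constants 4000 and 1.7 are made up and are not estimates from the diamonds data.

```r
# Synthetic data following price = 4000 * carat^1.7 with multiplicative noise.
# The constants 4000 and 1.7 are invented for illustration.
set.seed(1)
carat <- runif(200, min = 0.2, max = 3)
price <- 4000 * carat^1.7 * exp(rnorm(200, sd = 0.1))

# Fit a straight line on the log-log scale
fit <- lm(log(price) ~ log(carat))

# The intercept estimates log(4000) and the slope estimates the exponent 1.7
coef(fit)
```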

plot diamonds
As the color letter goes from D to J, the diamond becomes more and more yellow. The clarity codes describe internal imperfections (inclusions): "SI" is slightly included, "VS" is very slightly included, and "VVS" is very, very slightly included; the number after the code ranks grades within it (1 is better than 2). "IF" is internally flawless.
ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point() +
facet_grid(clarity ~ color)

plot diamonds
Let us go back to the original scale. The bottom-left panel shows price vs. carat for the whitest, internally flawless diamonds. The upper-right panel shows price vs. carat for the yellowest, most flawed diamonds.
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() + facet_grid(clarity ~
color)

plot diamonds
As we would expect, for diamonds at the same level of clarity (read along a row), the price rises faster with carat for white stones (bottom left) than for yellow stones (bottom right). For diamonds of the same color (read down a column), the price rises faster with carat for clear stones (bottom left) than for flawed stones (upper left).

plot diamonds
We facet the plot by one of these factor variables: clarity.
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() + facet_grid(clarity ~
.)

good tip to generate plots
The same type of graph is used over and over again while each new component of ggplot2 is introduced and interpreted. This is a very effective way to display complex relationships in large, high-dimensional data. Remember, the key is to bring in only one change at a time.
Source: http://gettinggeneticsdone.blogspot.com/2010/01/ggplot2-tutorial-scatterplots-in-series.html
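
One way to follow this tip is to keep each stage of the plot in its own object and add a single component per step. This is a sketch, assuming ggplot2 is installed; the names `p0` to `p3` are arbitrary.

```r
library(ggplot2)

# Start from one base plot and add a single change per step
p0 <- ggplot(diamonds, aes(x = carat, y = price))
p1 <- p0 + geom_point()             # step 1: scatterplot
p2 <- p1 + aes(colour = cut)        # step 2: colour the points by cut
p3 <- p2 + facet_grid(clarity ~ .)  # step 3: one panel per clarity level

# print(p3) draws the final plot; p1 and p2 can be printed the same way to compare
```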

plot diamonds
Last, we fit a line to the original data and to the log-transformed data. The relation is close to linear for the log-transformed data if we ignore the few points at the two ends of the distribution.
ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()
ggplot(diamonds, aes(x = log(carat), y = log(price))) + geo

amazing NYTimes sample
http://timelyportfolio.github.io/rCharts_512paths/
Source: Timely Portfolio and NYTimes

why do we use R
Dirk's example about the elegance and efficiency of R
Source: Dirk Eddelbuettel

hack time
· download an open dataset using a filter
· read it into RStudio
· check the dimensions of the dataset
· decide which columns you will use
· plot it!
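
The hack time steps can be sketched end to end as follows. A made-up CSV stands in for the file downloaded from the portal, and every column name (`id`, `borough`, `days_open`, `note`) is hypothetical.

```r
# Step 1 stand-in: a made-up CSV takes the place of a downloaded dataset
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    id        = 1:4,
    borough   = c("BRONX", "QUEENS", "BRONX", "BRONX"),
    days_open = c(3, 10, 1, 7),
    note      = c("a", "b", "c", "d")
  ),
  path,
  row.names = FALSE
)

dat <- read.csv(path)                      # step 2: read it in
dim(dat)                                   # step 3: rows and columns
small <- dat[, c("borough", "days_open")]  # step 4: keep the columns of interest

# Step 5: plot it (a bar chart of records per borough)
counts <- table(small$borough)
barplot(counts, main = "Records per borough")
```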

Resources
R in a Nutshell - Joseph Adler
The Art of R Programming - Norman Matloff
ggplot2 - Elegant Graphics for Data Analysis - Hadley Wickham
26/27
R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1
26 of 27 6/13/14, 1:50 PM
27/27
R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1
27 of 27 6/13/14, 1:50 PM

Más contenido relacionado

Destacado

A Workshop on R
A Workshop on RA Workshop on R
A Workshop on RAjay Ohri
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformSyracuse University
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Ryan Rosario
 
Distribución binomial
Distribución binomialDistribución binomial
Distribución binomialJulio Leal
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Ryan Rosario
 

Destacado (7)

A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
R programming
R programmingR programming
R programming
 
Distribución binomial
Distribución binomialDistribución binomial
Distribución binomial
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
 

Similar a R workshop i r basic (4th time)

R Programming: Comparing Objects In R
R Programming: Comparing Objects In RR Programming: Comparing Objects In R
R Programming: Comparing Objects In RRsquared Academy
 
useR!2010 matome
useR!2010 matomeuseR!2010 matome
useR!2010 matomeybenjo
 
Database managment System Relational Algebra
Database managment System  Relational AlgebraDatabase managment System  Relational Algebra
Database managment System Relational AlgebraUttara University
 
R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RRsquared Academy
 
And Now You Have Two Problems
And Now You Have Two ProblemsAnd Now You Have Two Problems
And Now You Have Two ProblemsLuca Mearelli
 
Refatoração + Design Patterns em Ruby
Refatoração + Design Patterns em RubyRefatoração + Design Patterns em Ruby
Refatoração + Design Patterns em RubyCássio Marques
 
R 語言上手篇
R 語言上手篇R 語言上手篇
R 語言上手篇Wush Wu
 
Out with Regex, In with Tokens
Out with Regex, In with TokensOut with Regex, In with Tokens
Out with Regex, In with Tokensscoates
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciencesalexstorer
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RRsquared Academy
 
introtorandrstudio.ppt
introtorandrstudio.pptintrotorandrstudio.ppt
introtorandrstudio.pptMalkaParveen3
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Aslak Hellesøy
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Aslak Hellesøy
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Aslak Hellesøy
 
Propel your Performance: AgensGraph, the multi-model database
Propel your Performance: AgensGraph, the multi-model databasePropel your Performance: AgensGraph, the multi-model database
Propel your Performance: AgensGraph, the multi-model databaseJoshua Bae
 
And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...Codemotion
 
Making presentations with LaTeX: Workshop Day 4
Making presentations with LaTeX: Workshop Day 4Making presentations with LaTeX: Workshop Day 4
Making presentations with LaTeX: Workshop Day 4Suddhasheel GHOSH, PhD
 

Similar a R workshop i r basic (4th time) (20)

R Programming: Comparing Objects In R
R Programming: Comparing Objects In RR Programming: Comparing Objects In R
R Programming: Comparing Objects In R
 
useR!2010 matome
useR!2010 matomeuseR!2010 matome
useR!2010 matome
 
Database managment System Relational Algebra
Database managment System  Relational AlgebraDatabase managment System  Relational Algebra
Database managment System Relational Algebra
 
R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In R
 
Tsukubar8
Tsukubar8Tsukubar8
Tsukubar8
 
And Now You Have Two Problems
And Now You Have Two ProblemsAnd Now You Have Two Problems
And Now You Have Two Problems
 
Refatoração + Design Patterns em Ruby
Refatoração + Design Patterns em RubyRefatoração + Design Patterns em Ruby
Refatoração + Design Patterns em Ruby
 
R 語言上手篇
R 語言上手篇R 語言上手篇
R 語言上手篇
 
Out with Regex, In with Tokens
Out with Regex, In with TokensOut with Regex, In with Tokens
Out with Regex, In with Tokens
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciences
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In R
 
introtorandrstudio.ppt
introtorandrstudio.pptintrotorandrstudio.ppt
introtorandrstudio.ppt
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 
Propel your Performance: AgensGraph, the multi-model database
Propel your Performance: AgensGraph, the multi-model databasePropel your Performance: AgensGraph, the multi-model database
Propel your Performance: AgensGraph, the multi-model database
 
And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...
 
Making presentations with LaTeX: Workshop Day 4
Making presentations with LaTeX: Workshop Day 4Making presentations with LaTeX: Workshop Day 4
Making presentations with LaTeX: Workshop Day 4
 
R basics
R basicsR basics
R basics
 
Vine shortest example
Vine shortest exampleVine shortest example
Vine shortest example
 

Más de Vivian S. Zhang

Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger RenVivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide bookVivian S. Zhang
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentationVivian S. Zhang
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataVivian S. Zhang
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data Vivian S. Zhang
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Vivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataVivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningVivian S. Zhang
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rVivian S. Zhang
 

Más de Vivian S. Zhang (20)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Xgboost
XgboostXgboost
Xgboost
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Xgboost
XgboostXgboost
Xgboost
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 

Último

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Último (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education

R Workshop I: R basics (4th time)

  • 1. R Workshop I: get to know the NYC open data portal and start to use R. Vivian Zhang for NYC-open-data meetup http://www.meetup.com/NYC-Open-Data/ R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1 1 of 27 6/13/14, 1:50 PM
  • 2. Overview: NYC open data portal; RStudio; R; GitHub; hack time
  • 3. Advantages of using RStudio. Ease of use: install and load R packages; keep track of the R dev version; download GitHub repositories; debug faster
  • 4. diamonds subsetting example
       require(ggplot2)
       head(diamonds)
       ##   carat       cut color clarity depth table price    x    y    z
       ## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
       ## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
       ## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
       ## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
       ## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
       ## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
  • 5. diamonds subsetting example
       head(diamonds[-1, ])
       ##   carat       cut color clarity depth table price    x    y    z
       ## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
       ## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
       ## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
       ## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
       ## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
       ## 7  0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
  • 6. diamonds subsetting example
       head(diamonds[, -1])
       ##         cut color clarity depth table price    x    y    z
       ## 1     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
       ## 2   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
       ## 3      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
       ## 4   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
       ## 5      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
       ## 6 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
  • 7. diamonds subsetting example
       head(diamonds[c(1, 2), ])
       ##   carat     cut color clarity depth table price    x    y    z
       ## 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
       ## 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
  • 8. diamonds subsetting example
       names(diamonds)
       ## [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
       ## [8] "x"       "y"       "z"
       head(diamonds[, c(T, T, F, F, F, F, T, F, F, F)])
       ##   carat       cut price
       ## 1  0.23     Ideal   326
       ## 2  0.21   Premium   326
       ## 3  0.23      Good   327
       ## 4  0.29   Premium   334
       ## 5  0.31      Good   335
       ## 6  0.24 Very Good   336
  • 9. diamonds subsetting example
       names(diamonds)
       ## [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
       ## [8] "x"       "y"       "z"
       head(diamonds$carat)
       ## [1] 0.23 0.21 0.23 0.29 0.31 0.24
       diamonds[diamonds$price == max(diamonds$price), ]
       ##       carat     cut color clarity depth table price   x    y    z
       ## 27750  2.29 Premium     I     VS2  60.8    60 18823 8.5 8.47 5.16
  • 10. reading and subsetting data in R. Four kinds of index:
       blank: include all
       integer: +: include; -: exclude
       logical: include TRUEs
       character: lookup by name
       Source: Hadley Wickham
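The four index types on this slide can be tried side by side. The following is a minimal sketch on a small made-up data frame (`df` and its values are hypothetical stand-ins, not the diamonds data), so it runs without ggplot2 installed.

```r
# Small stand-in data frame (hypothetical values, not the diamonds data).
df <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = c("Ideal", "Premium", "Good"),
  price = c(326, 326, 334),
  stringsAsFactors = FALSE
)

all_rows  <- df[, ]                     # blank: include everything
first_two <- df[c(1, 2), ]              # positive integers: include rows 1 and 2
no_first  <- df[-1, ]                   # negative integer: exclude row 1
pricey    <- df[df$price > 326, ]       # logical: keep rows where the test is TRUE
by_name   <- df[, c("carat", "price")]  # character: look up columns by name
```

The same index types work in the column position, which is how the earlier slides drop the first column with `diamonds[, -1]`.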
  • 11. data structure in R (figure). Source: Hadley Wickham
  • 12. read in the open data: read.table(), read.csv()
       rodent1year <- read.csv("C:UserszhangsGoogle DriveR codeRworkshop311_Service_Requests_from_2010_ header = TRUE, sep = ",")
       dim(rodent1year)
       summary(rodent1year)
       table(rodent1year$Borough)
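The file path on this slide is truncated and appears to have lost its backslashes, so it cannot be run as shown. As a hedged, self-contained sketch of the same read-and-inspect steps, the code below writes a tiny made-up CSV to a temporary file and reads it back; the file contents, the column names and the variable `requests` are assumptions for illustration, not the workshop's 311 data. In R string literals, a Windows path needs forward slashes or doubled backslashes.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical columns),
# so the example runs without the workshop's 311 download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Complaint.Type,Borough",
  "Rodent,BROOKLYN",
  "Rodent,MANHATTAN",
  "Rodent,BROOKLYN"
), csv_path)

# read.csv() already defaults to header = TRUE and sep = ",";
# they are spelled out here to mirror the slide.
requests <- read.csv(csv_path, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)

dim(requests)                              # number of rows and columns
summary(requests)                          # per-column summary
borough_counts <- table(requests$Borough)  # number of requests per borough
print(borough_counts)
```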
  • 13. plot diamonds. with() is a generic function that evaluates an expression in a local environment constructed from data. In ggplot2, aes() stands for "aesthetics", and geom_point() is the geom used to create scatterplots.
       with(diamonds, plot(carat, price))
       ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
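To see what with() saves, compare the two spellings below. This is a small sketch on a made-up data frame (`df` and its values are hypothetical), not the diamonds data.

```r
# Hypothetical stand-in data.
df <- data.frame(carat = c(0.2, 0.5, 1.0), price = c(300, 1500, 5000))

# Without with(): every column is prefixed with the data frame name.
ratio_a <- df$price / df$carat

# With with(): the expression is evaluated with df's columns in scope.
ratio_b <- with(df, price / carat)

identical(ratio_a, ratio_b)
```

The call `with(diamonds, plot(carat, price))` on the slide works the same way: `carat` and `price` are looked up inside `diamonds`.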
  • 14. plot diamonds. ggplot2 generates more sophisticated graphs than the traditional graphics package. Let us play with some color.
       ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point()
  • 15. plot diamonds. Instead of fitting a linear relation, we try a log-linear relation. log(price) is quite linear in log(carat). Bingo!
       ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point()
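One way to see why the log-log plot straightens out: if price follows a power law, price = a * carat^b, then log(price) = log(a) + b * log(carat), which is a straight line with slope b. The sketch below checks this on simulated data; the constants 4000 and 1.7 and the noise level are arbitrary assumptions, not estimates from the diamonds dataset.

```r
set.seed(1)  # reproducible stand-in data; not the diamonds dataset
carat <- runif(200, min = 0.2, max = 3)
# Assumed power law with multiplicative noise.
price <- 4000 * carat^1.7 * exp(rnorm(200, sd = 0.1))

# Fit a straight line on the log-log scale.
fit <- lm(log(price) ~ log(carat))

slope     <- unname(coef(fit)[2])    # estimates the exponent b
r_squared <- summary(fit)$r.squared  # share of variance explained by the line
print(c(slope = slope, r_squared = r_squared))
```

On the real data, the same call with `data = diamonds` would give the fitted slope for diamonds.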
  • 16. plot diamonds. As the color letters go from D to J, the diamond becomes more and more yellow. The clarity grades "SI" (slightly included), "VS" (very slightly included) and "VVS" (very, very slightly included) describe the size of internal imperfections in the diamonds, and the number after the letters ranks stones within a grade. "IF" is internally flawless.
       ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point() + facet_grid(clarity ~ color)
  • 17. plot diamonds. Let us look back at the normal scale. The bottom-left panel shows price vs. carat for the whitest, internally flawless diamonds. The upper-right panel shows price vs. carat for the least pure (dirtiest) and most flawed diamonds.
       ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() + facet_grid(clarity ~ color)
  • 18. plot diamonds. As we would expect, for diamonds at the same level of purity (read along a row), the price per carat increases faster for white stones (bottom left) than for yellow stones (bottom right). And for diamonds of the same color (read down a column), the price per carat increases faster for pure stones (bottom left) than for dirty stones (upper left).
  • 19. plot diamonds. We facet the plot by one of these factor variables: clarity.
       ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() + facet_grid(clarity ~ .)
  • 20. good tip to generate plots. The same type of graph is used over and over again while one new component of ggplot2 at a time is introduced and interpreted. It is a very effective way to display complex relationships in large, high-dimensional data. Remember, the key is to bring in only one change each time. Source: http://gettinggeneticsdone.blogspot.com/2010/01/ggplot2-tutorial-scatterplots-in-series.html
  • 21. plot diamonds. Last, we fit a line to the original data and to the log-transformed data. The linear fit is nearly perfect for the log-transformed data if we ignore the few points at the two ends of the distribution.
       ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()
       ggplot(diamonds, aes(x = log(carat), y = log(price))) + geo
  • 22. amazing NYTimes sample: http://timelyportfolio.github.io/rCharts_512paths/ Source: Timely Portfolio and NYTimes
  • 23. why do we use R. Dirk's example about the elegance and efficiency of R. Source: Dirk Eddelbuettel
  • 24. why do we use R. Dirk's example about the elegance and efficiency of R. Source: Dirk Eddelbuettel
  • 25. hack time: download an open dataset using the filter; read it into RStudio; check the dimensions of the dataset; decide which columns you will use; plot it!
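The last three hack-time steps (check dimensions, pick columns, plot) might look like the sketch below. It assumes the download and read-in steps are already done and uses R's built-in `airquality` data frame as a stand-in for the downloaded file; the chosen columns and the name `aq` are illustrative only.

```r
# Stand-in for a downloaded open dataset: R's built-in airquality data
# (daily New York air quality measurements, May to September 1973).
aq <- airquality

dim(aq)    # check the dimensions: rows and columns
names(aq)  # see which columns are available

# Decide which columns to use, and drop rows with missing values.
aq_sub <- aq[, c("Temp", "Ozone")]
aq_sub <- aq_sub[complete.cases(aq_sub), ]

# Plot it. pdf(NULL) opens a null device so nothing is written to disk;
# in an interactive RStudio session, call plot() directly instead.
pdf(NULL)
with(aq_sub, plot(Temp, Ozone))
invisible(dev.off())
```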
  • 26. Resources: R in a Nutshell, Joseph Adler; The Art of R Programming, Norman Matloff; ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham