SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
Introduction
Enter base
base:
statistical functions for large datasets
Jan Wijels  Edwin de Jonge:
jwijels@bnosac.be  edwindjonge@gmail.com
BNOSAC - Belgium Network of Open Source Analytical Consultants: www.bnosac.be
 Centraal Bureau voor Statistiek: www.cbs.nl
July 10 2013, useR! 2013
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Overview
Introduction
Who are we
Large data
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
About Jan)
Founder of www.bnosac.be and very recent father of Midas
(2013-06-19) and therefore not present at useR! 2013.
Providing consultancy services in open source analytical engineering
Poor man's BI:
Python/PostgreSQL/Pentaho/R/Hadoop/Sencha/ExtJS...
...
Expertise in predictive data mining, biostatistics, geostats, python
programming, GUI building, articial intelligence, process
automation, analytical web development
R implementations  R application maintenance
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
About Edwin
Working at Statistics Netherlands, that produces all dutch ocial
statistics. Co-author of several R packages:
editrules
tabplot
whisker
ffbase
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Large data
Statistical data is becoming larger.
If nrow(data)  106 everything works ne in R
For larger data.frame's you run quickly into memory problems
# create an integer of length = 1 billion
integer(1e+09)
## Error: cannot allocate vector of size 3.7 Gb
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
If it ts in memory plain R is just ne!
If it doesn't t on your hard disk, you'll need some big data stu
(Hadoop, rmr, RHadoop)
For data sizes range of 106 - 109 several options provided by external
packages.
Popular one is ff1
1Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel, Walter Zucchini
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Popularity based on RStudio download log les
0
100
200
300
jan feb mrt apr mei jun jul
#uniqueip
ff
bigmemory
filehash
ffbase
colbycol
mmap
pbdBASE
Rstudio logs 2013, downloads/week
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
ff package provides:
numerical, integer, boolean and factor variables of type ff stored on
disk (via memory mapping)
a sort of data.frame of type ffdf
ecient indexing, retrieval and sorting of ff vectors and ffdf.
ecient chunk-wise retrieval of data.
ff-based matrix storage.
vectors up-to length of 2 · 109.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
What's the catch?
ff is nice, but:
Handling ff vectors often results in non-standard R code
It oers no statistical functions on ff and ffdf objects
No mean, max, min, sd, etc. on ff vectors
Requires that you process the data chunkwise.
Has no support for character vectors.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Typical chunkwise code for ff
library(ff)
x - ff(0, length = 1e+08)
# calculating the max value of x
m - -Inf
for (i in chunk(x)) {
m - max(x[i], m, na.rm = T)
}
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Value vs reference
Note that out-of-memory objects have issues with value vs reference
semantics.
x - ff(0, length = 1e+07)
y - x
y[1] - 100
print(x[1])
## [1] 100
This is not what a normal R vector would do! Trade-of between copying
large object and side-eects.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ffbase2 attempts to
add basic statistical functions to 
make code as standard R as possible
make working with ff more pleasant.
connect ff with big* methods.
2de Jonge/Wijels/van der Laan
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Basic operations
Most methods works via S3 dispatch (but not all...)
mean,min,max,range, sum, all, cumsum, cumprod, quantile,
tabulate.ff, table.ff,
cut, c, unique, duplicated, Math.Ops
x - 1:10
x_ff - ff(x)
mean(x)
## [1] 5.5
mean(x_ff)
## [1] 5.5
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Idiosyncratic R
Use S3 dispatch (works only for generic methods...)
ffbase adds subset3, with, within and transform to ff
Makes code interchangeable with normal R code:
iris_ff - as.ffdf(iris)
iris_ff - transform( iris_ff
, Sepal.Ratio = Sepal.Width/Sepal.Length
)
str(iris_ff[,])
## 'data.frame': 150 obs. of 6 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2
## $ Species : Factor w/ 3 levels setosa,versicolor,
## $ Sepal.Ratio : num 0.686 0.612 0.681 0.674 0.72 ...
3This creates a copy of all selected data
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Under the hood
ffbase rewrites the expression into chunked expression
transform( iris_ff
, Sepal.Ratio = Sepal.Width/Sepal.Length
)
into4
iris_ff2 - iris_ff
for (.i in chunk(iris_ff2)){
iris_ff2[.i,] - transform( iris_ff2[.i,]
, Sepal.Ratio=Sepal.Width/Sepal.Length
)
}
return(iris_ff2)
4greatly simplied
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ltering: ffwhich
Often we need an index for a subselection, but even this may be to big
for memory.
ffwhich(dat, expression) returns a ff index vector
result can be used to index ffdf data.frame.
idx - ffwhich(iris_ff, Sepal.Width  2)
iris_ff[idx, ]
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Importing data
Package ff: read.table.ffdf, read.csv.ffdf etcetera
Package ETLutils: read.dbi.ffdf, read.odbc.ffdf,
read.jdbc.ffdf SQL Databases (SQLite / PostgreSQL / Oracle /
MySQL / SQL Server (read.odbc.df) / Hive (read.jdbc.df) / ...)
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ffbase
ffbase adds
laf_to_ffdf using LaF for importing large csv and fwf les
ffappend for appending vectors to an existing ff object
x - ffappend(x, 1:10)
ffdfappend for appending data.frame's to an existing ffdf
dat - ffdfappend(dat, iris)
# Note the pattern of assigning the result of the function ap
# ff object to itself
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
 storage
When data is in ff format processing is fast!
Time bottle neck can be loading the data into ff
Keeping data in ff format can save time
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ff stores all ff vectors on disk, however lenames are not user-friendly.
basename(filename(iris_ff$Sepal.Length))
## [1] ffdf1ad05b442183.ff
Furthermore each ff vector stores the absolute path.
Makes moving data around more dicult
ff provides: ffsave and ffload, which archives and unarchives 
vectors and df data.frames into a zip le.
Note that this still can be time-consuming.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ffbase has:
save.ffdf and load.ffdf that store and load ffdf data.frames
into a directory with sensible R names.
save.ffdf(iris_ff)
basename(filename(iris_ff$Sepal.Length))
## [1] iris_ff$Sepal.Length.ff
pack.ffdf and unpack.ffdf that do the same but zip/unzip the
result.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Several methods for ff
ffbase allows statistical models directly on  data.5
Classication + Regression with bigglm.ffdf (biglm + base
packages)
Least angle regression with biglars.fit (biglars + base
packages)
Randomforest classication with bigrfc (bigrf + base packages)
Clustering with clustering features of package stream
5These work on large data but you can also sample from your df to build your
stat model and predict chunkwise.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Example of bigglm
Using biglm (with bigglm.df from base)
class(iris_ff)
## [1] ffdf
nrow(iris_ff) # that is 1e7!
## [1] 1000000
mymodel - bigglm(Sepal.Length ~ Petal.Length, data = iris_ff)
coef(mymodel)
## (Intercept) Petal.Length
## 4.3051 0.4093
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Thank you!
Questions?
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets

Más contenido relacionado

La actualidad más candente

Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Simplilearn
 

La actualidad más candente (20)

Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
binary log と 2PC と Group Commit
binary log と 2PC と Group Commitbinary log と 2PC と Group Commit
binary log と 2PC と Group Commit
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
 
Big data landscape
Big data landscapeBig data landscape
Big data landscape
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
MapReduce
MapReduceMapReduce
MapReduce
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Mysql security 5.7
Mysql security 5.7 Mysql security 5.7
Mysql security 5.7
 
データベース入門1
データベース入門1データベース入門1
データベース入門1
 
NoSQL MeetUp
NoSQL MeetUpNoSQL MeetUp
NoSQL MeetUp
 
Building a data warehouse with Pentaho and Docker
Building a data warehouse with Pentaho and DockerBuilding a data warehouse with Pentaho and Docker
Building a data warehouse with Pentaho and Docker
 
Bases de Données non relationnelles, NoSQL (Introduction) 1er cours
Bases de Données non relationnelles, NoSQL (Introduction) 1er coursBases de Données non relationnelles, NoSQL (Introduction) 1er cours
Bases de Données non relationnelles, NoSQL (Introduction) 1er cours
 
NoSQL Database: Classification, Characteristics and Comparison
NoSQL Database: Classification, Characteristics and ComparisonNoSQL Database: Classification, Characteristics and Comparison
NoSQL Database: Classification, Characteristics and Comparison
 
Mongo DB: Operational Big Data Database
Mongo DB: Operational Big Data DatabaseMongo DB: Operational Big Data Database
Mongo DB: Operational Big Data Database
 
12 Lean Memes
12 Lean Memes12 Lean Memes
12 Lean Memes
 
Search User Interface Design
Search User Interface DesignSearch User Interface Design
Search User Interface Design
 
Introduction to Data Mining and Data Warehousing
Introduction to Data Mining and Data WarehousingIntroduction to Data Mining and Data Warehousing
Introduction to Data Mining and Data Warehousing
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data Mining
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 

Destacado

Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
Ajay Ohri
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Ryan Rosario
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Revolution Analytics
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Ryan Rosario
 

Destacado (15)

Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R
 
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig WangDecember 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
 
Zurich R user group presentation May 2016
Zurich R user group presentation May 2016Zurich R user group presentation May 2016
Zurich R user group presentation May 2016
 
Heatmaps best practices Strata Hadoop
Heatmaps best practices Strata HadoopHeatmaps best practices Strata Hadoop
Heatmaps best practices Strata Hadoop
 
Accessing R from Python using RPy2
Accessing R from Python using RPy2Accessing R from Python using RPy2
Accessing R from Python using RPy2
 
Parallel Computing with R
Parallel Computing with RParallel Computing with R
Parallel Computing with R
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
Combining R With Java For Data Analysis (Devoxx UK 2015 Session)Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
 

Similar a ffbase, statistical functions for large datasets

Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
Dataiku
 
Question 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docxQuestion 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docx
IRESH3
 
Data scientist enablement dse 400 - week 1 roadmap
Data scientist enablement   dse 400 - week 1 roadmapData scientist enablement   dse 400 - week 1 roadmap
Data scientist enablement dse 400 - week 1 roadmap
Dr. Mohan K. Bavirisetty
 

Similar a ffbase, statistical functions for large datasets (20)

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
Abstract.DOCX
Abstract.DOCXAbstract.DOCX
Abstract.DOCX
 
Big data and you
Big data and you Big data and you
Big data and you
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Tf itpptbo
Tf itpptboTf itpptbo
Tf itpptbo
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?
 
Big data PPT
Big data PPT Big data PPT
Big data PPT
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Bigdataanalytics
BigdataanalyticsBigdataanalytics
Bigdataanalytics
 
Suneel manne
Suneel  manneSuneel  manne
Suneel manne
 
Suneel manne
Suneel  manneSuneel  manne
Suneel manne
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Data science and Machine learning Booklet
Data science and Machine learning BookletData science and Machine learning Booklet
Data science and Machine learning Booklet
 
Question 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docxQuestion 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docx
 
Data scientist enablement dse 400 - week 1 roadmap
Data scientist enablement   dse 400 - week 1 roadmapData scientist enablement   dse 400 - week 1 roadmap
Data scientist enablement dse 400 - week 1 roadmap
 
Make your data great now
Make your data great nowMake your data great now
Make your data great now
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 

Más de Edwin de Jonge

StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)
Edwin de Jonge
 

Más de Edwin de Jonge (14)

sdcSpatial user!2019
sdcSpatial user!2019sdcSpatial user!2019
sdcSpatial user!2019
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rules
 
Data error! But where?
Data error! But where?Data error! But where?
Data error! But where?
 
Daff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frameDaff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frame
 
Uncertainty visualisation
Uncertainty visualisationUncertainty visualisation
Uncertainty visualisation
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Big data experiments
Big data experimentsBig data experiments
Big data experiments
 
StatMine
StatMineStatMine
StatMine
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
Tabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large dataTabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large data
 
Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statistics
 
Statmine, Visuele dataexploratie
Statmine, Visuele dataexploratieStatmine, Visuele dataexploratie
Statmine, Visuele dataexploratie
 
StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)
 
StatMine, visual exploration of output data
StatMine, visual exploration of output dataStatMine, visual exploration of output data
StatMine, visual exploration of output data
 

Último

Último (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

ffbase, statistical functions for large datasets

  • 1. Introduction Enter base base: statistical functions for large datasets Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.com BNOSAC - Belgium Network of Open Source Analytical Consultants: www.bnosac.be Centraal Bureau voor Statistiek: www.cbs.nl July 10 2013, useR! 2013 Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 2. Introduction Enter base Overview Introduction Who are we Large data Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 3. Introduction Enter base Who are we Large data About Jan) Founder of www.bnosac.be and very recent father of Midas (2013-06-19) and therefore not present at useR! 2013. Providing consultancy services in open source analytical engineering Poor man's BI: Python/PostgreSQL/Pentaho/R/Hadoop/Sencha/ExtJS... ... Expertise in predictive data mining, biostatistics, geostats, python programming, GUI building, articial intelligence, process automation, analytical web development R implementations R application maintenance Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 4. Introduction Enter base Who are we Large data About Edwin Working at Statistics Netherlands, that produces all dutch ocial statistics. Co-author of several R packages: editrules tabplot whisker ffbase Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 5. Introduction Enter base Who are we Large data Large data Statistical data is becoming larger. If nrow(data) 106 everything works ne in R For larger data.frame's you run quickly into memory problems # create an integer of length = 1 billion integer(1e+09) ## Error: cannot allocate vector of size 3.7 Gb Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 6. Introduction Enter base Who are we Large data If it ts in memory plain R is just ne! If it doesn't t on your hard disk, you'll need some big data stu (Hadoop, rmr, RHadoop) For data sizes range of 106 - 109 several options provided by external packages. Popular one is ff1 1Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel, Walter Zucchini Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 7. Introduction Enter base Who are we Large data Popularity based on RStudio download log les 0 100 200 300 jan feb mrt apr mei jun jul #uniqueip ff bigmemory filehash ffbase colbycol mmap pbdBASE Rstudio logs 2013, downloads/week Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 8. Introduction Enter base Who are we Large data ff package provides: numerical, integer, boolean and factor variables of type ff stored on disk (via memory mapping) a sort of data.frame of type ffdf ecient indexing, retrieval and sorting of ff vectors and ffdf. ecient chunk-wise retrieval of data. ff-based matrix storage. vectors up-to length of 2 · 109. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 9. Introduction Enter base Who are we Large data What's the catch? ff is nice, but: Handling ff vectors often results in non-standard R code It oers no statistical functions on ff and ffdf objects No mean, max, min, sd, etc. on ff vectors Requires that you process the data chunkwise. Has no support for character vectors. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 10. Introduction Enter base Who are we Large data Typical chunkwise code for ff library(ff) x - ff(0, length = 1e+08) # calculating the max value of x m - -Inf for (i in chunk(x)) { m - max(x[i], m, na.rm = T) } Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 11. Introduction Enter base Who are we Large data Value vs reference Note that out-of-memory objects have issues with value vs reference semantics. x - ff(0, length = 1e+07) y - x y[1] - 100 print(x[1]) ## [1] 100 This is not what a normal R vector would do! Trade-of between copying large object and side-eects. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 12. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ffbase2 attempts to add basic statistical functions to make code as standard R as possible make working with ff more pleasant. connect ff with big* methods. 2de Jonge/Wijels/van der Laan Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 13. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Basic operations Most methods works via S3 dispatch (but not all...) mean,min,max,range, sum, all, cumsum, cumprod, quantile, tabulate.ff, table.ff, cut, c, unique, duplicated, Math.Ops x - 1:10 x_ff - ff(x) mean(x) ## [1] 5.5 mean(x_ff) ## [1] 5.5 Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 14. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Idiosyncratic R Use S3 dispatch (works only for generic methods...) ffbase adds subset3, with, within and transform to ff Makes code interchangeable with normal R code: iris_ff - as.ffdf(iris) iris_ff - transform( iris_ff , Sepal.Ratio = Sepal.Width/Sepal.Length ) str(iris_ff[,]) ## 'data.frame': 150 obs. of 6 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3. ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ## $ Species : Factor w/ 3 levels setosa,versicolor, ## $ Sepal.Ratio : num 0.686 0.612 0.681 0.674 0.72 ... 3This creates a copy of all selected data Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 15. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Under the hood ffbase rewrites the expression into chunked expression transform( iris_ff , Sepal.Ratio = Sepal.Width/Sepal.Length ) into4 iris_ff2 - iris_ff for (.i in chunk(iris_ff2)){ iris_ff2[.i,] - transform( iris_ff2[.i,] , Sepal.Ratio=Sepal.Width/Sepal.Length ) } return(iris_ff2) 4greatly simplied Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 16. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ltering: ffwhich Often we need an index for a subselection, but even this may be to big for memory. ffwhich(dat, expression) returns a ff index vector result can be used to index ffdf data.frame. idx - ffwhich(iris_ff, Sepal.Width 2) iris_ff[idx, ] Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 17. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Importing data Package ff: read.table.ffdf, read.csv.ffdf etcetera Package ETLutils: read.dbi.ffdf, read.odbc.ffdf, read.jdbc.ffdf SQL Databases (SQLite / PostgreSQL / Oracle / MySQL / SQL Server (read.odbc.df) / Hive (read.jdbc.df) / ...) Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 18. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ffbase ffbase adds laf_to_ffdf using LaF for importing large csv and fwf les ffappend for appending vectors to an existing ff object x - ffappend(x, 1:10) ffdfappend for appending data.frame's to an existing ffdf dat - ffdfappend(dat, iris) # Note the pattern of assigning the result of the function ap # ff object to itself Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 19. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets storage When data is in ff format processing is fast! Time bottle neck can be loading the data into ff Keeping data in ff format can save time Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 20. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ff stores all ff vectors on disk, however lenames are not user-friendly. basename(filename(iris_ff$Sepal.Length)) ## [1] ffdf1ad05b442183.ff Furthermore each ff vector stores the absolute path. Makes moving data around more dicult ff provides: ffsave and ffload, which archives and unarchives vectors and df data.frames into a zip le. Note that this still can be time-consuming. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 21. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ffbase has: save.ffdf and load.ffdf that store and load ffdf data.frames into a directory with sensible R names. save.ffdf(iris_ff) basename(filename(iris_ff$Sepal.Length)) ## [1] iris_ff$Sepal.Length.ff pack.ffdf and unpack.ffdf that do the same but zip/unzip the result. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 22. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Several methods for ff ffbase allows statistical models directly on data.5 Classication + Regression with bigglm.ffdf (biglm + base packages) Least angle regression with biglars.fit (biglars + base packages) Randomforest classication with bigrfc (bigrf + base packages) Clustering with clustering features of package stream 5These work on large data but you can also sample from your df to build your stat model and predict chunkwise. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 23. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Example of bigglm Using biglm (with bigglm.df from base) class(iris_ff) ## [1] ffdf nrow(iris_ff) # that is 1e7! ## [1] 1000000 mymodel - bigglm(Sepal.Length ~ Petal.Length, data = iris_ff) coef(mymodel) ## (Intercept) Petal.Length ## 4.3051 0.4093 Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 24. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Thank you! Questions? Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets