Open Source in Analytics
Introduction
IIML, DCE
Founder Decisionstats.com
Author R for Business Analytics
Brief History of Analytics
SAS and SPSS led from the 1970s to the early 2000s
SAS leads the market but is very expensive
IBM bought SPSS, but it is still not open source
R, Python and Hadoop challenged this
Analytics Sub Components
Data Storage
Data Querying
Data Summarization
Data Visualization
Statistical Routines
Analytics Sub Components
Component              Proprietary                  Open Source
Data Storage           Oracle DBMS, SQL Server      MySQL, NoSQL, Hadoop
Data Querying          Business Objects, SAP        Pentaho, Jaspersoft
Data Summarization     SQL, SAS, Crystal Reports    Still SQL, Pig, Hive
Data Visualization     Tableau                      R, Python, JavaScript
Statistical Routines   SAS, SPSS                    R, Python, RapidMiner
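The five sub components above can be walked through end to end with free tools. The following is a minimal sketch that uses only the Python standard library: an in-memory SQLite database stands in for the storage layer, plain SQL does the querying and summarization, a text bar chart stands in for visualization, and the `statistics` module supplies one statistical routine. The sales records and the function names `summarize`, `text_bars` and `spread` are made up for illustration and do not come from the slides.

```python
import sqlite3
import statistics

# Hypothetical sales records: (region, amount). Invented for illustration only.
ROWS = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 200.0),
    ("south", 100.0),
    ("south", 150.0),
]


def summarize(rows):
    """Store rows, query them with SQL, and return per-region summaries."""
    # Data storage: an in-memory SQLite database stands in for MySQL or Hadoop.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    # Data querying and summarization: a plain SQL GROUP BY.
    cursor = conn.execute(
        "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
        "FROM sales GROUP BY region ORDER BY region"
    )
    summary = {
        region: {"n": n, "total": total, "mean": mean}
        for region, n, total, mean in cursor
    }
    conn.close()
    return summary


def text_bars(summary, scale=50.0):
    """Data visualization: one '#' per `scale` units of each region's total."""
    return [
        f"{region:<6}{'#' * int(stats['total'] // scale)}"
        for region, stats in summary.items()
    ]


def spread(rows):
    """Statistical routine: sample standard deviation of all amounts."""
    return statistics.stdev([amount for _, amount in rows])


if __name__ == "__main__":
    summary = summarize(ROWS)
    for line in text_bars(summary):
        print(line)
    print("standard deviation:", round(spread(ROWS), 2))
```

In practice each stage would be replaced by one of the tools in the table, for example MySQL or Hadoop for storage and R or Python plotting libraries for visualization; the point of the sketch is only that every stage has a no-cost option.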
Analytics using Python
● pandas http://pandas.pydata.org/ High-performance, easy-to-use data structures and data analysis tools
● scikit-learn http://scikit-learn.org/stable/ Simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib
● NumPy http://www.numpy.org/
● SciPy http://www.scipy.org/scipylib/index.html
● matplotlib http://matplotlib.org/
● statsmodels http://statsmodels.sourceforge.net/# A Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available
● IPython http://ipython.org/ Interactive computing
Analytics using R
http://www.r-project.org/
● RStudio and Revolution Analytics
● sqldf https://code.google.com/p/sqldf/ and RODBC http://cran.r-project.org/web/packages/RODBC/index.html
● ggplot2 http://ggplot2.org/ and ggmap and shiny
● RHadoop et al https://github.com/RevolutionAnalytics/RHadoop
● car, stats, forecast, sna, tm
● rattle and R Commander (with plugins)
More at http://rforanalytics.wordpress.com/
Analytics using R
http://www.revolutionanalytics.com/
Analytics using R
<blatant self promotion>
http://www.amazon.com/R-Business-Analytics-A-Ohri/dp/1461443423
R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R.
</blatant self promotion>
Analytics using Rapid Miner
Early adopter of open source analytics
Recently moved from Germany to the USA following a PE infusion
One of the first marketplaces for analytics extensions http://marketplace.rapid-i.com/UpdateServer/
One of the best GUIs - drag and drop using flows
Analytics using other languages
Julia - faster than R http://julialang.org/
Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. The library, largely written in Julia itself, also integrates mature, best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing.
IJulia !!
Analytics using other languages
Clojure - for the JVM http://clojure.org/
Clojure is a dynamic programming language that targets the Java Virtual Machine. It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure is a dialect of Lisp.
Analytics using other languages
bigml.com (using Clojure)
https://bigml.com/gallery/models
Analytics using other languages
Scala- for big data analytics http://www.scala-lang.org/
● A Scalable language
● Object-Oriented
● Functional
● Seamless Java Interop
● Functions are Objects
● Future-Proof
● Fun
Analytics using Jaspersoft
OLAP
BIG DATA
(offered through cloud, mobile)
Analytics using Pentaho
Basically Weka
Reporting as well
Complete BI and Analytics Stack
Weka
Hadoop
http://hadoop.apache.org/
Hadoop- evolving ecosystem
R
http://www.r-project.org/
Open Source
Free
5000+ Packages
Growing Faster
>2 million users
RAM constraints??
R
http://www.r-project.org/
Object Oriented
has GUI and IDE
has Commercial offerings
R - Rattle- Data Mining GUI
R - R Commander
R - RStudio
R - Revolution Analytics
Free for academics worldwide!
RevoScaleR package for Big Data
Recommended Install -
http://info.revolutionanalytics.com/free-academic.html
R -Big Data Packages
http://cran.r-project.org/web/views/HighPerformanceComputing.html
● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data.
● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework.
● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon.
● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks.
● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of histograms such as those created by Mappers in a MapReduce application.
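The Divide and Recombine and Map/Reduce ideas named in the list above, including the merging of histograms built by separate mappers, can be illustrated without a cluster. This is a minimal sketch in plain Python, not the API of RHIPE, rmr or HistogramTools: the function names `split`, `mapper`, `reducer` and `histogram` are hypothetical, and the "mappers" run one after another in a single process rather than on Hadoop nodes.

```python
from collections import Counter
from functools import reduce


def split(data, n_chunks):
    """Divide: cut the data into roughly equal chunks, one per mapper."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def mapper(chunk, bin_width):
    """Map: build a local histogram (bin start -> count) for one chunk."""
    return Counter((x // bin_width) * bin_width for x in chunk)


def reducer(left, right):
    """Reduce: merge two partial histograms by adding counts bin by bin."""
    return left + right


def histogram(data, n_chunks=3, bin_width=10):
    """Recombine: merge the per-chunk histograms into one result."""
    partials = [mapper(chunk, bin_width) for chunk in split(data, n_chunks)]
    return dict(reduce(reducer, partials, Counter()))


if __name__ == "__main__":
    values = [3, 12, 15, 27, 8, 21, 14, 9, 30]  # made-up sample values
    print(histogram(values))
```

Because each mapper only sees its own chunk and the reducer only adds counts, the same result is obtained however the data is split, which is what lets the work be spread across machines.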
Terrific Data Mining using R GUI
Great Data Visualization using R GUI
So many packages - CRAN Views to the rescue
http://cran.r-project.org/web/views/
Bayesian Bayesian Inference
ChemPhys Chemometrics and Computational Physics
ClinicalTrials Clinical Trial Design, Monitoring, and Analysis
Cluster Cluster Analysis & Finite Mixture Models
DifferentialEquations Differential Equations
Distributions Probability Distributions
Econometrics Computational Econometrics
Environmetrics Analysis of Ecological and Environmental Data
ExperimentalDesign Design of Experiments (DoE) & Analysis of Experimental Data
Finance Empirical Finance
Genetics Statistical Genetics
Graphics Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing High-Performance and Parallel Computing with R
MachineLearning Machine Learning & Statistical Learning
MedicalImaging Medical Image Analysis
MetaAnalysis Meta-Analysis
Multivariate Multivariate Statistics
NaturalLanguageProcessing Natural Language Processing
NumericalMathematics Numerical Mathematics
OfficialStatistics Official Statistics & Survey Methodology
Optimization Optimization and Mathematical Programming
Pharmacokinetics Analysis of Pharmacokinetic Data
Phylogenetics Phylogenetics, Especially Comparative Methods
Psychometrics Psychometric Models and Methods
ReproducibleResearch Reproducible Research
Robust Robust Statistical Methods
SocialSciences Statistics for the Social Sciences
Spatial Analysis of Spatial Data
SpatioTemporal Handling and Analyzing Spatio-Temporal Data
Survival Survival Analysis
TimeSeries Time Series Analysis
WebTechnologies Web Technologies and Services
gR gRaphical Models in R
R in the Browser
http://www.r-fiddle.org/#/
http://statace.com/
http://www.rstudio.com/ide/server/
R -Hadoop Packages
https://github.com/RevolutionAnalytics/RHadoop/wiki
● plyrmr - higher level plyr-like data processing for structured data, powered by rmr
● rmr - functions providing Hadoop MapReduce functionality in R
● rhdfs - functions providing file management of the HDFS from within R
● rhbase - functions providing database management for the HBase distributed database from within R
http://amplab-extras.github.io/SparkR-pkg/
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
https://github.com/nexr/RHive
RHive is an R extension facilitating distributed computing via HIVE query. RHive allows easy usage of HQL (Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.
R - Cloud Computing
http://cran.r-project.org/web/views/WebTechnologies.html
R -Big Data Packages
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Large memory and out-of-memory data
● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory.
● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table
● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.
● The speedglm package fits (generalised) linear models to large data.
● The biglars package by Seligman et al. can use the ff package to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression.
● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory.
● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
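The idea of incremental computation mentioned for biglm above, fitting a model to data that never sits in memory all at once, can be shown with a one-predictor regression. This is a minimal sketch in plain Python that keeps running sums and solves the normal equations at the end; it is not a reproduction of biglm's own numerical method, and the class name `RunningLine` and the generator `chunks` are hypothetical.

```python
class RunningLine:
    """Fit y = intercept + slope * x from chunks, keeping only running totals."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, chunk):
        """Fold one chunk of (x, y) pairs into the running totals."""
        for x, y in chunk:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (intercept, slope) from the accumulated totals."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def chunks():
    """Hypothetical data source: yields small chunks, as if read from disk."""
    yield [(0, 1.0), (1, 3.0)]
    yield [(2, 5.0), (3, 7.0)]
    yield [(4, 9.0)]


if __name__ == "__main__":
    model = RunningLine()
    for chunk in chunks():
        model.update(chunk)  # only one chunk is held in memory at a time
    print(model.coefficients())
```

Memory use stays constant however many chunks arrive, since only the count and four sums are kept. Running sums of squares can lose precision on badly scaled data, so a production tool would likely use a more stable update.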
Data Scientist Tool Kit
● web scraping
● visualization
● machine learning
● data mining
● modeling
● social network analysis (sna)
● social media analytics
● web analytics
● reproducible research
● time series forecasting
● spatial analysis
● data storage
● data querying
Data Scientist Programming Skills
Java http://www.learnjavaonline.org/
Python http://www.codecademy.com/tracks/python
SQL http://www.w3schools.com/sql/
R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/
http://www.statmethods.net/
Hadoop http://hortonworks.com/hadoop-training/
Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh
Other places to learn
MOOCs 1 https://www.edx.org/ 2 https://www.coursera.org/ 3 https://www.udacity.com/ 4 https://www.udemy.com/
Books
Courses
Workshops
Summary
Open source has greatly cut the cost of software in analytics
The benefits of analytics continue to be many
Add Big Data, the cloud and MOOCs, and the total cost to geeks is much lower!
Thanks
Contact and Feedback-
ohri2007@gmail.com via http://linkedin.com/in/ajayohri

Más contenido relacionado

La actualidad más candente

Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoopguest27e6764
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with RTechsparks
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for HadoopWilly Marroquin (WillyDevNET)
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Revolution Analytics
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopGhassan Al-Yafie
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asiaMuhammad Rifqi
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 

La actualidad más candente (20)

Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoop
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with R
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
R for data analytics
R for data analyticsR for data analytics
R for data analytics
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Executive Intro to R
Executive Intro to RExecutive Intro to R
Executive Intro to R
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoop
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 

Similar a Open source analytics

High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
R as supporting tool for analytics and simulation
R as supporting tool for analytics and simulationR as supporting tool for analytics and simulation
R as supporting tool for analytics and simulationAlvaro Gil
 
Use of Open Source Software Enhancing Curriculum | Developing Opportunities
Use of Open Source Software Enhancing Curriculum | Developing OpportunitiesUse of Open Source Software Enhancing Curriculum | Developing Opportunities
Use of Open Source Software Enhancing Curriculum | Developing OpportunitiesMaurice Dawson
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R StudioRupak Roy
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R ProgrammingIRJET Journal
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 

Similar a Open source analytics (20)

High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
R as supporting tool for analytics and simulation
R as supporting tool for analytics and simulationR as supporting tool for analytics and simulation
R as supporting tool for analytics and simulation
 
Use of Open Source Software Enhancing Curriculum | Developing Opportunities
Use of Open Source Software Enhancing Curriculum | Developing OpportunitiesUse of Open Source Software Enhancing Curriculum | Developing Opportunities
Use of Open Source Software Enhancing Curriculum | Developing Opportunities
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
R_L1-Aug-2022.pptx
R_L1-Aug-2022.pptxR_L1-Aug-2022.pptx
R_L1-Aug-2022.pptx
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
CSB_community
CSB_communityCSB_community
CSB_community
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R Studio
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
tools
toolstools
tools
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 

Más de Ajay Ohri

Introduction to R
Introduction to RIntroduction to R
Introduction to RAjay Ohri
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionAjay Ohri
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for freeAjay Ohri
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10Ajay Ohri
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri ResumeAjay Ohri
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...Ajay Ohri
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessAjay Ohri
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data ScientistsAjay Ohri
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in PythonAjay Ohri
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen OomsAjay Ohri
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsAjay Ohri
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha Ajay Ohri
 
Analyze this
Analyze thisAnalyze this
Analyze thisAjay Ohri
 
Summer school python in spanish
Summer school python in spanishSummer school python in spanish
Summer school python in spanishAjay Ohri
 
Introduction to sas in spanish
Introduction to sas in spanishIntroduction to sas in spanish
Introduction to sas in spanishAjay Ohri
 

Más de Ajay Ohri (20)

Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 
Summer school python in spanish
Summer school python in spanishSummer school python in spanish
Summer school python in spanish
 
Introduction to sas in spanish
Introduction to sas in spanishIntroduction to sas in spanish
Introduction to sas in spanish
 



Open source analytics

  • 1. Open Source in Analytics
  • 3. Brief History of Analytics: SAS and SPSS led from the 1970s to the early 2000s. SAS leads the market but is very expensive. IBM bought SPSS, but it is still not open source. R, Python and Hadoop challenged this.
  • 4. Analytics Sub Components: Data Storage, Data Querying, Data Summarization, Data Visualization, Statistical Routines
  • 5. Analytics Sub Components: Proprietary vs. Open Source
      ● Data Storage. Proprietary: OracleDBMS, SQL Server
      ● Data Querying. Proprietary: Business Objects, SAP
      ● Data Summarization. Proprietary: SQL, SAS, Crystal Reports
      ● Data Visualization. Proprietary: Tableau
      ● Statistical Routines. Proprietary: SAS, SPSS
  • 6. Analytics Sub Components: Proprietary vs. Open Source
      ● Data Storage. Proprietary: OracleDBMS, SQL Server. Open Source: MySQL, NoSQL, Hadoop
      ● Data Querying. Proprietary: Business Objects, SAP. Open Source: Pentaho, Jaspersoft
      ● Data Summarization. Proprietary: SQL, SAS, Crystal Reports. Open Source: Still SQL, Pig, Hive
      ● Data Visualization. Proprietary: Tableau. Open Source: R, Python, Javascript
      ● Statistical Routines. Proprietary: SAS, SPSS. Open Source: R, Python, RapidMiner
  • 7. Analytics using Python
      ● pandas http://pandas.pydata.org/ High-performance, easy-to-use data structures and data analysis tools
      ● scikit-learn http://scikit-learn.org/stable/ Simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib
      ● NumPy http://www.numpy.org/
      ● SciPy http://www.scipy.org/scipylib/index.html
      ● matplotlib http://matplotlib.org/
      ● statsmodels http://statsmodels.sourceforge.net/# A Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available
      ● IPython http://ipython.org/ Interactive computing
  • 8. Analytics using R http://www.r-project.org/
      ● RStudio and Revolution Analytics
      ● sqldf https://code.google.com/p/sqldf/ and RODBC http://cran.r-project.org/web/packages/RODBC/index.html
      ● ggplot2 http://ggplot2.org/ and ggmap and shiny
      ● RHadoop et al. https://github.com/RevolutionAnalytics/RHadoop
      ● car, stats, forecast, sna, tm
      ● rattle and Rcommander (with plugins)
      More at http://rforanalytics.wordpress.com/
  • 11. Analytics using R <blatant self promotion> http://www.amazon.com/R-Business-Analytics-A-Ohri/dp/1461443423 R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its packages. With this information the reader can select the packages that help process analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve of R. </blatant self promotion>
  • 12. Analytics using RapidMiner: Early adopter of open source analytics. Recently moved from Germany to the USA following a PE infusion. One of the first marketplaces for analytics extensions: http://marketplace.rapid-i.com/UpdateServer/ One of the best GUIs: drag and drop using flows.
  • 15. Analytics using other languages: Julia, faster than R http://julialang.org/ Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. The library, largely written in Julia itself, also integrates mature, best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing.
  • 18. Analytics using other languages: Clojure, for the JVM http://clojure.org/ Clojure is a dynamic programming language that targets the Java Virtual Machine. It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language: it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure is a dialect of Lisp. https://bigml.com/gallery/models
  • 19. Analytics using other languages: bigml.com (using Clojure) https://bigml.com/gallery/models
  • 20. Analytics using other languages: Scala, for big data analytics http://www.scala-lang.org/
      ● A Scalable language
      ● Object-Oriented
      ● Functional
      ● Seamless Java Interop
      ● Functions are Objects
      ● Future-Proof
      ● Fun
  • 21. Analytics using Jaspersoft: OLAP, Big Data (offered through cloud, mobile)
  • 22. Analytics using Pentaho: Basically Weka. Reporting as well. Complete BI and analytics stack.
  • 23. Weka
  • 28. R http://www.r-project.org/ Open source. Free. 5000+ packages. Growing faster. >2 million users. RAM constraints??
  • 29. R http://www.r-project.org/ Object oriented. Has GUI and IDE. Has commercial offerings.
  • 30. R http://www.r-project.org/ Object oriented. Has GUI and IDE. Has commercial offerings.
  • 31. R - Rattle - Data Mining GUI http://www.r-project.org/ Object oriented. Has GUI and IDE. Has commercial offerings.
  • 32. R - R Commander http://www.r-project.org/ Object oriented. Has GUI and IDE. Has commercial offerings.
  • 34. R - Revolution Analytics: Free for academics worldwide! RevoScaleR package for Big Data. Recommended install: http://info.revolutionanalytics.com/free-academic.html
  • 35. R - Revolution Analytics: Free for academics worldwide! RevoScaleR package for Big Data.
  • 36. R - Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html
      ● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data.
      ● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework.
      ● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon.
      ● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks.
      ● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of histograms such as those created by Mappers in a MapReduce application.
  • 37. Terrific Data Mining using R GUI
  • 39. So many packages: CRAN Views to the rescue http://cran.r-project.org/web/views/ Bayesian: Bayesian Inference; ChemPhys: Chemometrics and Computational Physics; ClinicalTrials: Clinical Trial Design, Monitoring, and Analysis; Cluster: Cluster Analysis & Finite Mixture Models; DifferentialEquations: Differential Equations; Distributions: Probability Distributions; Econometrics: Computational Econometrics; Environmetrics: Analysis of Ecological and Environmental Data; ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data; Finance: Empirical Finance; Genetics: Statistical Genetics; Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization; HighPerformanceComputing: High-Performance and Parallel Computing with R; MachineLearning: Machine Learning & Statistical Learning; MedicalImaging: Medical Image Analysis; MetaAnalysis: Meta-Analysis; Multivariate: Multivariate Statistics; NaturalLanguageProcessing: Natural Language Processing
  • 40. So many packages: CRAN Views to the rescue http://cran.r-project.org/web/views/ NumericalMathematics: Numerical Mathematics; OfficialStatistics: Official Statistics & Survey Methodology; Optimization: Optimization and Mathematical Programming; Pharmacokinetics: Analysis of Pharmacokinetic Data; Phylogenetics: Phylogenetics, Especially Comparative Methods; Psychometrics: Psychometric Models and Methods; ReproducibleResearch: Reproducible Research; Robust: Robust Statistical Methods; SocialSciences: Statistics for the Social Sciences; Spatial: Analysis of Spatial Data; SpatioTemporal: Handling and Analyzing Spatio-Temporal Data; Survival: Survival Analysis; TimeSeries: Time Series Analysis; WebTechnologies: Web Technologies and Services; gR: gRaphical Models in R
  • 41. R in the Browser http://www.r-fiddle.org/#/ http://statace.com/ http://www.rstudio.com/ide/server/
  • 42. R - Hadoop Packages https://github.com/RevolutionAnalytics/RHadoop/wiki
      ● plyrmr - higher level plyr-like data processing for structured data, powered by rmr
      ● rmr - functions providing Hadoop MapReduce functionality in R
      ● rhdfs - functions providing file management of the HDFS from within R
      ● rhbase - functions providing database management for the HBase distributed database from within R
      http://amplab-extras.github.io/SparkR-pkg/ SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
      https://github.com/nexr/RHive RHive is an R extension facilitating distributed computing via Hive queries. RHive allows easy usage of HQL (Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.
  • 43. R - Cloud Computing http://cran.r-project.org/web/views/WebTechnologies.html
  • 44. R - Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html Large memory and out-of-memory data
      ● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory.
      ● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
      ● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
      ● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table)
      ● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.
      ● The speedglm package fits (generalised) linear models to large data.
      ● The biglars package by Seligman et al. can use the ff package to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression.
      ● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory.
      ● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
  • 45. Data Scientist Tool Kit ● web scraping ● visualization ● machine learning ● data mining ● modeling ● sna ● social media analytics ● web analytics ● reproducible research ● TS forecasting ● spatial analysis ● data storage ● data querying
  • 46. Data Scientist Programming Skills: Java http://www.learnjavaonline.org/ ; Python http://www.codecademy.com/tracks/python ; SQL http://www.w3schools.com/sql/ ; R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/ and http://www.statmethods.net/ ; Hadoop http://hortonworks.com/hadoop-training/ ; Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh
  • 47. Other places to learn: MOOCs (1 https://www.edx.org/ 2 https://www.coursera.org/ 3 https://www.udacity.com/ 4 https://www.udemy.com/), books, courses, workshops
  • 48. Summary: Open source has greatly helped cut the cost of analytics software. The benefits of analytics continue to be many. Combined with Big Data, cloud and MOOCs, the total cost to geeks is much lower!
  • 49. Thanks Contact and Feedback- ohri2007@gmail.com via http://linkedin.com/in/ajayohri
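
Slides 4 to 6 split analytics into storage, querying, summarization, visualization and statistical routines. As a rough illustration of how the first three and the last fit together on an open source stack, here is a minimal sketch that uses only the Python standard library (sqlite3 and statistics). The table name, column names and sample rows are invented for illustration and do not come from the slides; visualization is left out.

```python
import sqlite3
import statistics

# Hypothetical sample data: (region, amount) rows invented for illustration.
ROWS = [("north", 120.0), ("north", 80.0), ("south", 200.0), ("south", 100.0)]


def build_store(rows):
    """Data storage: load the rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def summarize(conn):
    """Data querying and summarization: total and mean amount per region."""
    cursor = conn.execute(
        "SELECT region, SUM(amount), AVG(amount) FROM sales "
        "GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


def spread(conn):
    """Statistical routine: sample standard deviation of all amounts."""
    amounts = [row[0] for row in conn.execute("SELECT amount FROM sales")]
    return statistics.stdev(amounts)


conn = build_store(ROWS)
summary = summarize(conn)
overall_sd = spread(conn)
print(summary)
print(round(overall_sd, 2))
```

In practice the storage layer would more likely be MySQL or Hadoop and the statistics would come from R or pandas, as the slides suggest; the standard library is used here only so the sketch runs without extra installs.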
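
Slide 36 mentions histograms created by Mappers in a MapReduce application and the Divide and Recombine approach. The sketch below shows that general pattern in plain Python, under the assumption that each chunk of data is binned independently (map) and the partial histograms are then merged (reduce). It runs in one process, does not use Hadoop, and is not the API of RHIPE, rmr or HistogramTools; the function names, sample values and bin width are hypothetical.

```python
from collections import Counter
from functools import reduce


def chunked(values, size):
    """Divide: yield successive slices of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def map_histogram(chunk, bin_width):
    """Map step: count the values of one chunk per bin (keyed by lower edge)."""
    return Counter((value // bin_width) * bin_width for value in chunk)


def reduce_histograms(left, right):
    """Reduce step: merge two partial histograms by adding their counts."""
    return left + right


# Hypothetical data set, split into chunks of four values each.
VALUES = [1, 3, 7, 12, 14, 15, 21, 22, 29, 30]
partials = [map_histogram(chunk, 10) for chunk in chunked(VALUES, 4)]

# Recombine: fold the partial histograms into a single result.
histogram = reduce(reduce_histograms, partials, Counter())
print(sorted(histogram.items()))
```

Because adding counts is associative, the partial histograms can be merged in any order, which is what lets the map step run on separate machines.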
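
Slide 44 says that biglm uses incremental computations to fit models on data stored outside main memory. The sketch below illustrates the idea with a deliberately simpler technique: running sums for a one-variable least-squares fit, updated one chunk at a time so that only the sums stay in memory. It is not biglm's algorithm or interface; the class name, method names and data are hypothetical.

```python
class IncrementalOLS:
    """Fit y = intercept + slope * x from running sums, chunk by chunk."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, chunk):
        """Fold one chunk of (x, y) pairs into the running sums."""
        for x, y in chunk:
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y

    def coefficients(self):
        """Return (intercept, slope) from the accumulated sums."""
        denominator = self.n * self.sum_xx - self.sum_x ** 2
        if denominator == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denominator
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope


# Hypothetical data following y = 2 + 3x, delivered in three chunks
# as if each chunk had been read from disk separately.
chunks = [
    [(0.0, 2.0), (1.0, 5.0)],
    [(2.0, 8.0), (3.0, 11.0)],
    [(4.0, 14.0), (5.0, 17.0)],
]

model = IncrementalOLS()
for chunk in chunks:
    model.update(chunk)

intercept, slope = model.coefficients()
print(intercept, slope)
```

Running sums are easy to read but can lose precision on badly scaled data, so a production tool would likely prefer a numerically safer update scheme.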