SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Apache Spark and R
A (big data) love story?
Mark Sellors - Technical Architect @ Mango Solutions
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
About me.
• Technical Architect
• Design and deploy analytic computing
environments
• Not really an R user but have broad knowledge of
the analytic computing ecosystem
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Overview
• The rise of big data
• Barriers to big data
• Big Data vs R
• Spark
• Spark and Hadoop
• SparkR
• Is it a love story?
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
The rise of ‘big data’
• Storage prices
• Commodity
• compute
infrastructure
• Volume of data
• Hadoop ties the
two together
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Barriers to ‘big data’
• Hadoop is complex ecosystem
• Primary programming paradigm,
Map/Reduce jobs, largely written in Java
• Map/Reduce unsuited to exploratory,
interactive analysis
• Map/Reduce is slow
• RHadoop is built on top of Map/Reduce
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Some problems with ‘big data’
• Many hadoop deployments do not
achieve an appreciable ROI.
• Hard to find the staff with crossover skills
• infrastructure
• analysts
• Existing business processes not fit for
purpose
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Hadoop
• Until recently limited to batch based
operations
• MASSIVE data sets
• easy to add storage/compute capacity
But…
• Map/Reduce operations can be quite slow
• Hard to find/deploy appropriate talent
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
R
• Interactive
• fast
• great for exploratory or batch
But…
• Single threaded
• Limited by available memory
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
The value of your
data is in what you
do with it.
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What is it?
• Open source cluster computing
framework
• Relies heavily on in memory processing
• One of the most contributed-to big data
projects of the past year
• Started in the AMPLab at UC Berkeley in
2009
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What problem does it solve
• In memory makes for very fast data
processing
• minimal disk IO
• High level programming abstraction
reduces the amount of code
• In turn makes it more suitable for
exploratory work.
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
How does it do it?
• Provides a core programming abstraction
called RDD
• The RDD API has been extended to
include DataFrames
• Can deploy ad-hoc processing clusters as
well as integrate with HDFS,
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
“Will Spark replace Hadoop?”
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Hadoop is an
Ecosystem!
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Spark and Hadoop
• Very Complimentary.
• Spark already comes with all the major
Hadoop distributions
• easier to use and faster than map/reduce
• suitable for exploratory work, which
previously was difficult in hadoop
deployments
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
How does this fit with R
• Originally supported languages Scala,
Java and Python
• SparkR was a separate project
• Integrated into Spark as of v1.4
• Support is still evolving - v1.5 released
last week
• MASSIVE data frames
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
SparkR Features
• Designed to be familiar
• Massive DataFrames
• SQL operations on those DataFrames
• Fitting of GLM’s
• Works on top of Hadoop or as a stand
alone cluster
• Load data from a variety of sources
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Spark SQL
• Arbitrary SQL operations on massive in-
memory data frames
• Treats the data frame as though it were a
database table
• Useful for exploring your data set
• Also great for creating subsets
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
# Create the DataFrame
df <- createDataFrame(sqlContext, iris)
# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family =
"gaussian")
# Model coefficients are returned in a similar format to R's native glm().
summary(model)
##$coefficients
## Estimate
##(Intercept) 2.2513930
##Sepal_Width 0.8035609
##Species_versicolor 1.4587432
##Species_virginica 1.9468169
# Make predictions based on the model.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
## Sepal_Length prediction
##1 5.1 5.063856
##2 4.9 4.662076
##3 4.7 4.822788
##4 4.6 4.742432
##5 5.0 5.144212
##6 5.4 5.385281 Source: http://spark.apache.org
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Lowering the barrier to adoption
• Hadoop can be tricky to
get started with.
• Spark can run locally on
your laptop
• Can build ad-hoc
processing clusters
• Supports pulling data
from a variety of sources
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What’s in it for me?
• Currently supports:
• DataFrames
• SparkSQL
• limited subset of MLlib
• Is missing any native R support for:
• Spark Streaming
• GraphX
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Is it a love story?
• It wasn’t originally, but things are heating
up
• Data not getting any smaller
• Dramatically lowers the barrier to entry
• Evolving rapidly
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
@MangoTheCat / @sellorm

Más contenido relacionado

La actualidad más candente

Optimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScaleOptimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScaleRightScale
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataLightbend
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalVMware Tanzu Korea
 
AWS Outage Analysis
AWS Outage AnalysisAWS Outage Analysis
AWS Outage AnalysisThousandEyes
 
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud StrategyMulti-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud StrategyThousandEyes
 
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...Amazon Web Services
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediTrevor Parsons
 
Cloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an EcosystemCloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an EcosystemDatabricks
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLTriNimbus
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data PlatformsChris Kernaghan
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpJoseph Arriola
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDatabricks
 
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...Amazon Web Services
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudOVHcloud
 
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks ThousandEyes
 
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure ThousandEyes
 
Informatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both WorldsInformatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both WorldsAhmed Tayeh
 
Lambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB TestingLambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB TestingTrieu Nguyen
 
Complex Data Transformations Made Easy
Complex Data Transformations Made EasyComplex Data Transformations Made Easy
Complex Data Transformations Made EasyData Con LA
 

La actualidad más candente (20)

Optimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScaleOptimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScale
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big Data
 
IT HealthCheck
IT HealthCheckIT HealthCheck
IT HealthCheck
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
AWS Outage Analysis
AWS Outage AnalysisAWS Outage Analysis
AWS Outage Analysis
 
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud StrategyMulti-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
 
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a Jedi
 
Cloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an EcosystemCloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an Ecosystem
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging Technologies
 
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public Cloud
 
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
 
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
 
Informatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both WorldsInformatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both Worlds
 
Lambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB TestingLambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB Testing
 
Complex Data Transformations Made Easy
Complex Data Transformations Made EasyComplex Data Transformations Made Easy
Complex Data Transformations Made Easy
 

Similar a Apache Spark and R: A (Big Data) Love Story?

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure applicationCodecamp Romania
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesAlice Zheng
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataWeCloudData
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudJames Serra
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldKaren Lopez
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingDATAVERSITY
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleDatabricks
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data ModelingAdam Doyle
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureVenu Anuganti
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabasesAdi Challa
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceSense Corp
 
Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?Andreas Buckenhofer
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraVishal Puri
 

Similar a Apache Spark and R: A (Big Data) Love Story? (20)

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data Modeling
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
 
Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital era
 

Último

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 

Último (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 

Apache Spark and R: A (Big Data) Love Story?

  • 1. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Apache Spark and R A (big data) love story? Mark Sellors - Technical Architect @ Mango Solutions
  • 2. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com About me. • Technical Architect • Design and deploy analytic computing environments • Not really an R user but have broad knowledge of the analytic computing ecosystem
  • 3. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 4. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Overview • The rise of big data • Barriers to big data • Big Data vs R • Spark • Spark and Hadoop • SparkR • Is it a love story?
  • 5. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com The rise of ‘big data’ • Storage prices • Commodity • compute infrastructure • Volume of data • Hadoop ties the two together
  • 6. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Barriers to ‘big data’ • Hadoop is complex ecosystem • Primary programming paradigm, Map/Reduce jobs, largely written in Java • Map/Reduce unsuited to exploratory, interactive analysis • Map/Reduce is slow • RHadoop is built on top of Map/Reduce
  • 7. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Some problems with ‘big data’ • Many hadoop deployments do not achieve an appreciable ROI. • Hard to find the staff with crossover skills • infrastructure • analysts • Existing business processes not fit for purpose
  • 8. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 9. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Hadoop • Until recently limited to batch based operations • MASSIVE data sets • easy to add storage/compute capacity But… • Map/Reduce operations can be quite slow • Hard to find/deploy appropriate talent
  • 10. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com R • Interactive • fast • great for exploratory or batch But… • Single threaded • Limited by available memory
  • 11. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com The value of your data is in what you do with it.
  • 12. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 13. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What is it? • Open source cluster computing framework • Relies heavily on in memory processing • One of the most contributed-to big data projects of the past year • Started in the AMPLab at UC Berkeley in 2009
  • 14. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What problem does it solve • In memory makes for very fast data processing • minimal disk IO • High level programming abstraction reduces the amount of code • In turn makes it more suitable for exploratory work.
  • 15. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com How does it do it? • Provides a core programming abstraction called RDD • The RDD API has been extended to include DataFrames • Can deploy ad-hoc processing clusters as well as integrate with HDFS,
  • 16. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com “Will Spark replace Hadoop?”
  • 17. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Hadoop is an Ecosystem!
  • 18. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Spark and Hadoop • Very Complimentary. • Spark already comes with all the major Hadoop distributions • easier to use and faster than map/reduce • suitable for exploratory work, which previously was difficult in hadoop deployments
  • 19. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com How does this fit with R • Originally supported languages Scala, Java and Python • SparkR was a separate project • Integrated into Spark as of v1.4 • Support is still evolving - v1.5 released last week • MASSIVE data frames
  • 20. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com SparkR Features • Designed to be familiar • Massive DataFrames • SQL operations on those DataFrames • Fitting of GLM’s • Works on top of Hadoop or as a stand alone cluster • Load data from a variety of sources
  • 21. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Spark SQL • Arbitrary SQL operations on massive in- memory data frames • Treats the data frame as though it were a database table • Useful for exploring your data set • Also great for creating subsets
  • 22. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com # Create the DataFrame df <- createDataFrame(sqlContext, iris) # Fit a linear model over the dataset. model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") # Model coefficients are returned in a similar format to R's native glm(). summary(model) ##$coefficients ## Estimate ##(Intercept) 2.2513930 ##Sepal_Width 0.8035609 ##Species_versicolor 1.4587432 ##Species_virginica 1.9468169 # Make predictions based on the model. predictions <- predict(model, newData = df) head(select(predictions, "Sepal_Length", "prediction")) ## Sepal_Length prediction ##1 5.1 5.063856 ##2 4.9 4.662076 ##3 4.7 4.822788 ##4 4.6 4.742432 ##5 5.0 5.144212 ##6 5.4 5.385281 Source: http://spark.apache.org
  • 23. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Lowering the barrier to adoption • Hadoop can be tricky to get started with. • Spark can run locally on your laptop • Can build ad-hoc processing clusters • Supports pulling data from a variety of sources
  • 24. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What’s in it for me? • Currently supports: • DataFrames • SparkSQL • limited subset of MLlib • Is missing any native R support for: • Spark Streaming • GraphX
  • 25. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Is it a love story? • It wasn’t originally, but things are heating up • Data not getting any smaller • Dramatically lowers the barrier to entry • Evolving rapidly
  • 26. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com @MangoTheCat / @sellorm