SlideShare una empresa de Scribd logo
1 de 21
Descargar para leer sin conexión
Road to
Analytics
Contents
1
2
3
SAS vs Spark
SAS Proc SQL vs Spark SQL
Advantage Analytics
1. SAS vs Spark
OVERVIEW
SAS
○ The largest independent
vendor in “advanced analytics”
○ 1976 foundation of the SAS
Institute, Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for
large-scale data processing
○ Started in 2009 as a research
project in the UC Berkeley,
AMPLab
○ Open source
CODE
SAS
Basic programming model
consists of code blocks:
○ SAS Data Step
■ generation of data
■ concatenation of data
○ SAS PROCedures
■ special functionalities
SPARK
“Line based” programming
Native Language is Scala, but
flexible programming model:
○ Scala
○ Java
○ Python
○ R
DATA
SAS: DATASET
○ Computed in memory (RAM)
○ A data set contains:
● observations: organized in
rows
● variables: organized in
columns
SPARK: DATAFRAME
○ A distributed collection of data
organized into named columns.
○ It is conceptually equivalent to:
table in a relational database and
dataframe in R/Python
○ It is a programming abstraction
IMMUTABLE,
PARTITIONED,
DISTRIBUTED
DATA STRUCTURE
Transformations like: map,
filter, union, join, group
by… results in an other
dataset
SAS:
data sasData
set sasData;
Fare2 = Fare + 2;
run;
Python Pandas:
pandasDF['Fare2'] = pandasDF['Fare']+2
Spark:
sparkDF = sparkDF
.withColumn('Fare2',sparkDF['Fare']+2)
NOTEBOOK
IMMUTABLE,
PARTITIONED,
DISTRIBUTED
DATA STRUCTURE
READ SAS DATASETS
The SAS-FILE (sas7bdat) is a file with special structure
created by SAS and binary stored
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
○
2. SAS Proc SQL vs
Spark SQL
SQL sentences
SAS ProC SQL
SAS Procedure that combines the
functionality of DATA and PROC
steps. It can sort, summarize,
subset, join, concatenate
datasets, create new variables...
Spark SQL
○ Spark’s interface for working
with structured and
semi-structured data, query
using SQL
○ Load data from JSON, Hive,
Parquet
○ Evaluated “lazily”
SQL sentences
SAS ProC SQL
PROC SQL;
CREATE TABLE newTable AS
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns;
QUIT;
Spark SQL
sqlContext = new
org.apache.spark.sql.SQLContext(sc)
newTable = sqlContext.sql(“
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns”)
NOTEBOOK
AGGREGATE FUNCTION
IN SPARK SQL
sum, avg, mean, count,
max, min, first, last,
sttedev, variance,
skewness, kurtosis…
After aggregation
Act on each group of
data, return a single
value as a result
WINDOW FUNCTION IN
SPARK SQL
Ranking: rank, dense_rank,
percent_rank, ntile,
row_number
Analytics: cume_dist, lag,
first_value, last_value, lead
Aggregate: aggregate funcs
Calculate a return value
over a set of rows called
window that are
somehow related to the
current
NOTEBOOK
EXTEND SPARK SQL
Standard functions are over 100 functions
(pyspark)
from pyspark.sql.functions import *
BUILT-IN FUNCTIONS,
UDFs
“User Defined Function”
Define new Column-based
functions that extend the
vocabulary of Spark
Act on a single row as an
input, single return value for
every input row
NOTEBOOK
TIPS
○ Not thinking in sorted data. In parallel process we can’t acces per row.
○ Cache tables/DFs when they are used more than once
○ Merge doesn’t need ordered data as SAS
○ Use functions already defined instead of creating your own UDF
○ Save data in columnar format as Parquet
○ Avoid collecting data when you are working with Big Data, take a sample
3. Advantage Analytics
ADVANTAGE ANALYTICS
SAS Stats
Traditional Add-on package to
SAS for Statistics
○ Analysis of variance
○ Bayesian analysis
○ Categorical data analysis
○ Distribution analysis
○ Mixed models
○ Predictive modeling...
Spark MLlib
Scalable machine learning library
○ Basic statistics
○ Classification and regression
○ Collaborative filtering
○ Clustering
○ Dimensionality reduction
○ Feature extraction and
transformation...
BIBLIOGRAPHY
SPARK DOCUMENTATION:
https://spark.apache.org/docs/2.0.0/
PYSPARK API:
https://spark.apache.org/docs/2.0.0/api/python/index.html
PYSPARK FUNCTIONS:
https://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/functions.html
THANKS!
Any questions?
@datiobd
maguilar@datiobd.com
datio-big-data

Más contenido relacionado

La actualidad más candente

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 

La actualidad más candente (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark etl
Spark etlSpark etl
Spark etl
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 

Destacado

DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDatio Big Data
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by DatioDatio Big Data
 
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpotHubSpot
 
PDP Your personal development plan
PDP Your personal development planPDP Your personal development plan
PDP Your personal development planDatio Big Data
 
How to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHow to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHubSpot
 
25 Discovery Call Questions
25 Discovery Call Questions25 Discovery Call Questions
25 Discovery Call QuestionsHubSpot
 
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...HubSpot
 
Class 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessClass 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessHubSpot
 
Behind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoBehind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoHubSpot
 
HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot
 
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...HubSpot
 
What is Inbound Recruiting?
What is Inbound Recruiting?What is Inbound Recruiting?
What is Inbound Recruiting?HubSpot
 
3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful CompaniesHubSpot
 
Add the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonAdd the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonHubSpot
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 

Destacado (17)

Del Mono al QA
Del Mono al QADel Mono al QA
Del Mono al QA
 
DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern apps
 
Security&Governance
Security&GovernanceSecurity&Governance
Security&Governance
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by Datio
 
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 
PDP Your personal development plan
PDP Your personal development planPDP Your personal development plan
PDP Your personal development plan
 
How to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHow to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's Buyer
 
25 Discovery Call Questions
25 Discovery Call Questions25 Discovery Call Questions
25 Discovery Call Questions
 
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
 
Class 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessClass 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your Business
 
Behind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoBehind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot Tokyo
 
HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot Diversity Data 2016
HubSpot Diversity Data 2016
 
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
 
What is Inbound Recruiting?
What is Inbound Recruiting?What is Inbound Recruiting?
What is Inbound Recruiting?
 
3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies
 
Add the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonAdd the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-Thon
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 

Similar a Road to Analytics

Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxshivani22y
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache sparkAmine Sagaama
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cRachelBarker26
 

Similar a Road to Analytics (20)

Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptx
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Spark core
Spark coreSpark core
Spark core
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19c
 

Más de Datio Big Data

Descubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDescubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDatio Big Data
 
Learning Python. Level 0
Learning Python. Level 0Learning Python. Level 0
Learning Python. Level 0Datio Big Data
 
How to document without dying in the attempt
How to document without dying in the attemptHow to document without dying in the attempt
How to document without dying in the attemptDatio Big Data
 
Ceph: The Storage System of the Future
Ceph: The Storage System of the FutureCeph: The Storage System of the Future
Ceph: The Storage System of the FutureDatio Big Data
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through MesosDatio Big Data
 
Quality Assurance Glossary
Quality Assurance GlossaryQuality Assurance Glossary
Quality Assurance GlossaryDatio Big Data
 
Gamification: from buzzword to reality
Gamification: from buzzword to realityGamification: from buzzword to reality
Gamification: from buzzword to realityDatio Big Data
 
Pandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationPandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationDatio Big Data
 

Más de Datio Big Data (13)

Búsqueda IA
Búsqueda IABúsqueda IA
Búsqueda IA
 
Descubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDescubriendo la Inteligencia Artificial
Descubriendo la Inteligencia Artificial
 
Learning Python. Level 0
Learning Python. Level 0Learning Python. Level 0
Learning Python. Level 0
 
Learn Python
Learn PythonLearn Python
Learn Python
 
How to document without dying in the attempt
How to document without dying in the attemptHow to document without dying in the attempt
How to document without dying in the attempt
 
Developers on test
Developers on testDevelopers on test
Developers on test
 
Ceph: The Storage System of the Future
Ceph: The Storage System of the FutureCeph: The Storage System of the Future
Ceph: The Storage System of the Future
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
 
Datio OpenStack
Datio OpenStackDatio OpenStack
Datio OpenStack
 
Quality Assurance Glossary
Quality Assurance GlossaryQuality Assurance Glossary
Quality Assurance Glossary
 
Data Integration
Data IntegrationData Integration
Data Integration
 
Gamification: from buzzword to reality
Gamification: from buzzword to realityGamification: from buzzword to reality
Gamification: from buzzword to reality
 
Pandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationPandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data Manipulation
 

Último

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 

Último (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 

Road to Analytics

  • 2. Contents 1 2 3 SAS vs Spark SAS Proc SQL vs Spark SQL Advantage Analytics
  • 3. 1. SAS vs Spark
  • 4. OVERVIEW SAS ○ The largest independent vendor in “advanced analytics” ○ 1976 foundation of the SAS Institute, Cary, North Carolina ○ Commercial software product SPARK ○ A fast and general engine for large-scale data processing ○ Started in 2009 as a research project in the UC Berkeley, AMPLab ○ Open source
  • 5. CODE SAS Basic programming model consists of code blocks: ○ SAS Data Step ■ generation of data ■ concatenation of data ○ SAS PROCedures ■ special functionalities SPARK “Line based” programming Native Language is Scala, but flexible programming model: ○ Scala ○ Java ○ Python ○ R
  • 6. DATA SAS: DATASET ○ Computed in memory (RAM) ○ A data set contains: ● observations: organized in rows ● variables: organized in columns SPARK: DATAFRAME ○ A distributed collection of data organized into named columns. ○ It is conceptually equivalent to: table in a relational database and dataframe in R/Python ○ It is a programming abstraction
  • 7. IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE Transformations like: map, filter, union, join, group by… results in an other dataset
  • 8. SAS: data sasData set sasData; Fare2 = Fare + 2; run; Python Pandas: pandasDF['Fare2'] = pandasDF['Fare']+2 Spark: sparkDF = sparkDF .withColumn('Fare2',sparkDF['Fare']+2) NOTEBOOK IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
  • 9. READ SAS DATASETS The SAS-FILE (sas7bdat) is a file with special structure created by SAS and binary stored ● PYTHON: SAS7BDAT PACKAGE ● R: HAVEN LIBRARY ○
  • 10. 2. SAS Proc SQL vs Spark SQL
  • 11. SQL sentences SAS ProC SQL SAS Procedure that combines the functionality of DATA and PROC steps. It can sort, summarize, subset, join, concatenate datasets, create new variables... Spark SQL ○ Spark’s interface for working with structured and semi-structured data, query using SQL ○ Load data from JSON, Hive, Parquet ○ Evaluated “lazily”
  • 12. SQL sentences SAS ProC SQL PROC SQL; CREATE TABLE newTable AS SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns; QUIT; Spark SQL sqlContext = new org.apache.spark.sql.SQLContext(sc) newTable = sqlContext.sql(“ SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns”) NOTEBOOK
  • 13. AGGREGATE FUNCTION IN SPARK SQL sum, avg, mean, count, max, min, first, last, sttedev, variance, skewness, kurtosis… After aggregation Act on each group of data, return a single value as a result
  • 14. WINDOW FUNCTION IN SPARK SQL Ranking: rank, dense_rank, percent_rank, ntile, row_number Analytics: cume_dist, lag, first_value, last_value, lead Aggregate: aggregate funcs Calculate a return value over a set of rows called window that are somehow related to the current NOTEBOOK
  • 15. EXTEND SPARK SQL Standard functions are over 100 functions (pyspark) from pyspark.sql.functions import *
  • 16. BUILT-IN FUNCTIONS, UDFs “User Defined Function” Define new Column-based functions that extend the vocabulary of Spark Act on a single row as an input, single return value for every input row NOTEBOOK
  • 17. TIPS ○ Not thinking in sorted data. In parallel process we can’t acces per row. ○ Cache tables/DFs when they are used more than once ○ Merge doesn’t need ordered data as SAS ○ Use functions already defined instead of creating your own UDF ○ Save data in columnar format as Parquet ○ Avoid collecting data when you are working with Big Data, take a sample
  • 19. ADVANTAGE ANALYTICS SAS Stats Traditional Add-on package to SAS for Statistics ○ Analysis of variance ○ Bayesian analysis ○ Categorical data analysis ○ Distribution analysis ○ Mixed models ○ Predictive modeling... Spark MLlib Scalable machine learning library ○ Basic statistics ○ Classification and regression ○ Collaborative filtering ○ Clustering ○ Dimensionality reduction ○ Feature extraction and transformation...