Analytics pipelines with
Jupyter and Spark
Who we are
● NETOPIA
● mobilPay
● mobilPay Wallet
● web2sms
● btko.in
● kartela.ro
● mobilender.mx
Challenges
Three dimensional problem
● Time: Past events or crystal ball?
● Profile: Who is looking at the data?
● Quantity: How much data is there to look at?
Profile
● Data Scientist
● Data Engineer
Quantity
● Hundreds of MB to a few GB
● Up to a million events/records
vs.
● GB to TB to PB
● Hundreds of millions to billions and beyond
events/records
Also
● Computing vs. Storage
● Vertical vs. Horizontal scalability
● Distributed/ML libraries
● Dependency hell
Time
● Past: Analytics
● NOW
● Future: Forecasting (a.k.a. Prediction)
“Classic” Approach
● Data Engineer
○ Small Data: grep, sed, awk
○ Big Data: Java, Scala, Python, Pig, Hadoop, lately Spark & others
● Data Scientist
○ Small Data: R/RStudio
○ Big Data: No way, José!
New Approach
● Data Engineer and Data Scientist, for both Small Data and Big Data:
○ Notebook technologies: Jupyter (most used), Zeppelin, and lesser-known ones (Rodeo, Beaker)
Data analysis with
Jupyter, Pandas and Spark
Outline
About the data:
● Set of mobile transactions
● Set (separate) of retail transactions
About the tools: Jupyter, Pandas and Spark
Our experience
Future work
Datasets
● Mobile transactions
○ Elements of analysis: Transactions
○ We know: Transaction value, User identifier, Merchant
○ We don’t know: What product was bought
○ Size: Hundreds of thousands of entries
○ Status: Building prediction models
● Retail data
○ Elements of analysis: Transactions, Products, Stock data
○ We know: Transaction value, Sold products, Merchant
○ We don’t know: Who the user is
○ Size: Hundreds of millions of entries
○ Status: Gathering data
Mobile transactions data
Mobile data: Environment
[Diagram: Jupyter notebooks in a Docker container with Anaconda]
● Inputs: SQL database, CSV files, pickle files, other input sources
● Preprocessing notebooks: diagnostics, cleaning, feature building (from raw data)
● Analysis and model testing notebooks: Pandas, scikit-learn, R (with rpy2), custom code
● Outputs: models, visualizations
Docker image
… with Anaconda
● Anaconda: package manager
for data science
● Using docker-compose for
setting up container
parameters
● Many available images
● Our base image:
○ pyspark from Jupyter Docker
Stacks
○ Extended with required libraries
● Libraries are added or
updated with docker build:
○ Self-contained
○ Easy versioning
Jupyter Notebook
(1)
Web application for creating
documents with live code,
explanations and visualizations
● Initially, part of IPython
● Narrative with live code
● Protocol for interactive
exploration
○ Run blocks of code
○ Embedded JS
● Executable documents
○ Code
○ HTML and Markdown
○ Metadata
● Kernels for multiple
languages
○ Python
○ R
○ Scala
○ Bash
● Internal format: JSON
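To make the "Internal format: JSON" point concrete, here is a minimal sketch of a notebook document built by hand with the standard `json` module. The top-level keys and cell fields follow the version 4 notebook layout; the cell contents and the `summarize` helper are illustrative assumptions, not something taken from the slides.

```python
import json

# A minimal notebook in the nbformat 4 layout: metadata plus a list of cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        # Narrative cell (Markdown).
        {"cell_type": "markdown", "metadata": {}, "source": ["# Retention analysis\n"]},
        # Code cell: not yet executed, so no execution count and no outputs.
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')\n"],
        },
    ],
}


def summarize(nb_json: str) -> dict:
    """Hypothetical helper: count cells by type in a serialized notebook."""
    nb = json.loads(nb_json)
    counts = {}
    for cell in nb["cells"]:
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return counts


serialized = json.dumps(notebook, indent=1)
print(summarize(serialized))
```

Because code, narrative, outputs and metadata all live in one JSON file, line-based diffs of a notebook tend to be noisy.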
Jupyter Notebook
(2)
● Plugins and widgets
● Easy to share (formats:
Notebook, PDF, HTML, …)
● Large ecosystem
○ Jupyter Lab / Jupyter Hub
○ GitHub visualizations
○ Blog integration
○ Education: teaching, evaluation
○ Microsoft, Google, Bloomberg,
IBM, O'Reilly
○ Executable books
● Versioning is complicated
Pandas
● DataFrame objects
○ Tabular data structures
○ Each column has one data type
● Based on numpy (fast)
● Processing is (mostly) done in
memory
● Data manipulation:
○ Hierarchical indexing
○ Reshaping, pivoting, grouping
○ String operations
○ Time series operations
● Reading / writing from / to
many formats (CSV, JSON,
HDF5, …)
● Visualization: matplotlib,
Seaborn, Bokeh, …
Python library for data
manipulation and analysis
rpy2
Interface between Python and
R
● Translates DataFrames
between Python and R
● In a Python notebook in Jupyter: use the %%R cell magic
● Direct access to R objects
(rpy2.robjects)
Jupyter, Pandas and R
[Screenshot: a notebook mixing Python cells, R cells (with rpy2), and HTML and Markdown]
Mobile data: User retention
Active users:
● Classic: 1+ transactions in a given period
● Rolling: 1+ transactions in a given or
subsequent period
Plots:
● X: period (day, week, month)
● Y (cohort): period or another type of
segment
● By transaction criteria (merchant,
product, etc.)
Results:
● Response to campaigns
● Activity recurrence
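The two definitions of an active user can be sketched in plain Python (used here instead of Pandas so the example has no dependencies). The sample transactions are made up, and treating a user's cohort as the period of their first transaction is an assumption; the slides also allow other kinds of segments as cohorts.

```python
from collections import defaultdict

# Hypothetical sample: one (user_id, period) pair per transaction.
transactions = [
    ("u1", 1), ("u1", 2), ("u1", 4),
    ("u2", 1), ("u2", 3),
    ("u3", 2), ("u3", 3),
]


def active_periods(txns):
    """Map each user to the set of periods in which they transacted."""
    periods = defaultdict(set)
    for user, period in txns:
        periods[user].add(period)
    return periods


def retention(txns, rolling=False):
    """Return {cohort: {period: active_user_count}}.

    Cohort (assumption): the period of the user's first transaction.
    Classic: active in a period if they transacted in that period.
    Rolling: active in a period if they transacted in that or any later period.
    """
    table = defaultdict(lambda: defaultdict(int))
    for user, periods in active_periods(txns).items():
        cohort, last = min(periods), max(periods)
        for p in range(cohort, last + 1):
            # Every p up to the last transaction counts as active under rolling.
            if rolling or p in periods:
                table[cohort][p] += 1
    return {c: dict(ps) for c, ps in table.items()}


classic = retention(transactions)
rolling = retention(transactions, rolling=True)
print(classic)
print(rolling)
```

Each inner dictionary is one row of the cohort plot: the cohort on Y, the period on X, and the count as the cell value.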
[Figure: retention plot, cohorts (Y) by periods (X)]
Mobile data: Correlations
Features:
● How similar are two features?
Merchants:
● Which merchants have common users?
Products:
● Which products are sold together?
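One way to answer "which merchants have common users?" is a set-overlap score per merchant pair. The sketch below uses Jaccard similarity, which is an assumption: the slides do not say which measure was used. The sample data is invented. The same function would work for products sold together if the pairs were (basket_id, product).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample: (user_id, merchant) pairs.
visits = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "C"),
]


def users_by_merchant(pairs):
    """Group user ids by merchant."""
    groups = defaultdict(set)
    for user, merchant in pairs:
        groups[merchant].add(user)
    return groups


def jaccard_similarity(pairs):
    """Return {(m1, m2): shared users / all users of either merchant}."""
    groups = users_by_merchant(pairs)
    result = {}
    for m1, m2 in combinations(sorted(groups), 2):
        shared = groups[m1] & groups[m2]
        union = groups[m1] | groups[m2]
        result[(m1, m2)] = len(shared) / len(union)
    return result


similarity = jaccard_similarity(visits)
for pair, value in sorted(similarity.items()):
    print(pair, round(value, 3))
```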
Mobile data: Clusters
● Group users by behavior
● Identify outliers
● Future: automatic cluster labeling
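As an illustration of grouping users by behavior and flagging outliers, here is a small k-means sketch in plain Python. The slides do not name the clustering algorithm, so k-means is an assumption, as are the two features (transactions per month, average value), the sample points, the starting centroids and the distance threshold. Features are left unscaled to keep the sketch short.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) from given starting centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # empty cluster: keep centroid in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def outliers(points, labels, centroids, threshold):
    """Points farther than threshold from their own cluster centroid."""
    return [p for p, label in zip(points, labels) if math.dist(p, centroids[label]) > threshold]


# Hypothetical users: (transactions per month, average transaction value).
users = [(1.0, 10.0), (2.0, 12.0), (1.5, 11.0), (20.0, 200.0), (22.0, 210.0), (21.0, 190.0)]
labels, centers = kmeans(users, [users[0], users[3]])
print(labels, centers)
print(outliers(users, labels, centers, threshold=5.0))
```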
Retail transactions data
Retail data: Our experience
First try: Out-of-core processing with HDF5
● Data did not fit in memory
● HDF5: format for large data
● Pandas + HDF5, Blaze, Dask, Odo
● Easy to use functions
● Library incompatibilities
● Slow queries, use indexes
● Occasional runtime errors
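The out-of-core idea is to stream the data in chunks and keep only a running aggregate in memory. The sketch below shows that pattern with the standard `csv` module rather than HDF5, Blaze or Dask, so it is a stand-in for the technique and not the setup described above. The column names, sample rows and chunk size are assumptions.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def chunks(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def value_by_merchant(lines, chunk_size=2):
    """Sum transaction value per merchant, holding one chunk in memory at a time."""
    totals = defaultdict(float)
    for chunk in chunks(csv.DictReader(lines), chunk_size):
        for row in chunk:
            totals[row["merchant"]] += float(row["value"])
    return dict(totals)


# Hypothetical data; in practice `lines` would be an open file handle.
sample = io.StringIO("merchant,value\nA,10.0\nB,5.5\nA,2.5\nC,1.0\nB,4.5\n")
totals = value_by_merchant(sample)
print(totals)
```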
Cassandra
Retail data: Environment
[Diagram: Jupyter notebooks in Docker containers with Spark and Anaconda]
● Inputs: CSV files, Apache Parquet, Cassandra, other input sources
● Preprocessing notebooks: diagnostics, cleaning, feature building (from raw data)
● Analysis and model testing notebooks
○ Large data: Spark ML + scikit-learn
○ Small (selection) data: Pandas, scikit-learn and R
● Outputs: models, visualizations
● Diagram label: In progress
Spark
Engine for big data processing
● DataFrames
○ Built on top of RDDs
○ Similar to Pandas and R
○ SQL queries
○ Automatic query optimization
through query plan
○ String, date-time and statistics functions
○ Group by, filters
● Jupyter integration: work in
progress
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Spark
Machine Learning
MLlib and ML
● MLlib
○ Uses RDDs
○ Summaries, correlations,
sampling
○ SVMs, logistic regression,
decision trees, ensembles and
Naive Bayes
○ Clustering
○ Feature transformation
● ML
○ Works with DataFrames
○ Many wrappers for MLlib
○ Pipelines:
■ Transformers, Estimators,
Parameters
■ labelCol, featuresCol,
predictionCol, ...
○ R formulas (y ~ x1 + x2)
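The Transformer/Estimator split can be illustrated without Spark. In the toy sketch below, an estimator's `fit` returns a fitted model that is itself a transformer, and a pipeline chains stages by column name. This is plain Python, not the pyspark API: the classes `Scaler`, `ScalerModel`, `Threshold`, `Pipeline` and `PipelineModel` are hypothetical, and rows are simple dictionaries instead of DataFrames.

```python
class Scaler:
    """Estimator: learns the column maximum, then returns a fitted transformer."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        peak = max(row[self.input_col] for row in rows)
        return ScalerModel(self.input_col, self.output_col, peak)


class ScalerModel:
    """Transformer produced by Scaler.fit: divides the input column by the learned maximum."""

    def __init__(self, input_col, output_col, peak):
        self.input_col, self.output_col, self.peak = input_col, output_col, peak

    def transform(self, rows):
        return [{**row, self.output_col: row[self.input_col] / self.peak} for row in rows]


class Threshold:
    """Transformer with nothing to learn: writes 1 or 0 to the output column."""

    def __init__(self, input_col, output_col, cutoff):
        self.input_col, self.output_col, self.cutoff = input_col, output_col, cutoff

    def transform(self, rows):
        return [{**row, self.output_col: int(row[self.input_col] >= self.cutoff)} for row in rows]


class Pipeline:
    """Estimator that fits each stage in order, feeding transformed rows forward."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Transformer: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = [{"value": 10.0}, {"value": 40.0}]
pipeline = Pipeline([Scaler("value", "scaled"), Threshold("scaled", "prediction", 0.5)])
model = pipeline.fit(training)
result = model.transform([{"value": 30.0}])
print(result)
```

Passing column names as parameters plays the same role here as labelCol, featuresCol and predictionCol do in the list above: stages stay independent and are wired together only by the names they read and write.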
Retail data: Our experience
Current: Spark + Docker
● No issues at current size (several GBs)
● Docker Compose for creating master, workers and Jupyter container
(driver)
● ML libraries are easy to work with
● Incomplete Python API for ML (e.g., summaries)
● Documentation needs improvement
● Model diagnostics
○ Some metrics are available
○ Supplement with scikit-learn (example: build ROC curves)
● scikit-learn or R on top of Spark
○ Parallelize parameter search (e.g., grid search)
○ Spark sklearn (github.com/databricks/spark-sklearn): Grid Search
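Parameter search parallelizes well because each candidate is evaluated independently. The sketch below shows the pattern with a local thread pool from the standard library; it is not the spark-sklearn API, which distributes the candidates over a Spark cluster instead. The parameter names and the `score` function are hypothetical stand-ins for training a model and returning a validation score.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def grid(param_grid):
    """Expand {name: [values]} into a list of parameter dictionaries."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]


def score(params):
    """Hypothetical stand-in for: train with params, return a validation score."""
    return -((params["depth"] - 4) ** 2) - (params["rate"] - 0.1) ** 2


def grid_search(param_grid, score_fn, workers=4):
    """Evaluate every candidate concurrently and return the best (params, score)."""
    candidates = grid(param_grid)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score_fn, candidates))  # map preserves candidate order
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


param_grid = {"depth": [2, 4, 8], "rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(param_grid, score)
print(best_params, best_score)
```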
Future work
Mobile wallet transactions:
● Data fits in memory
● Use Spark for distributing workload
ERP transactions:
● Some data fits in memory, after processing
● Build a web app for data exploration
● Forecast
○ Sales
○ Inventory requirements
● Try Spark Streaming
http://xkcd.com/1425/

 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Analytics pipelines with Jupyter and Spark

  • 2. Who we are ● NETOPIA ● mobilPay ● mobilPay Wallet ● web2sms ● btko.in ● kartela.ro ● mobilender.mx
  • 4. Three-dimensional problem ● Time: Past events or crystal ball? ● Profile: Who is looking at the data? ● Quantity: How much data is there to look at?
  • 6. Quantity ● Hundreds of MB to a few GB ● Up to a million events/records vs. ● GB to TB to PB ● Hundreds of millions to billions of events/records and beyond
  • 7. Also ● Computing vs. Storage ● Vertical vs. Horizontal scalability ● Distributed/ML libraries ● Dependency hell
  • 9. “Classic” Approach ● Data Engineer ○ Small Data: grep, sed, awk ○ Big Data: Java, Scala, Python, Pig, Hadoop, lately Spark & others ● Data Scientist ○ Small Data: R/RStudio ○ Big Data: No way, José!
  • 10. New Approach ● Small Data and Big Data, for both Data Engineer and Data Scientist: notebook technologies ○ Jupyter (most used) ○ Zeppelin ○ Lesser-known ones (Rodeo, Beaker)
  • 11. Data analysis with Jupyter, Pandas and Spark
  • 12. Outline About the data: ● Set of mobile transactions ● Set (separate) of retail transactions About the tools: Jupyter, Pandas and Spark Our experience Future work
  • 13. Datasets ● Mobile transactions ○ Elements of analysis: Transactions ○ We know: Transaction value, User identifier, Merchant ○ We don’t know: What product was bought ○ Size: Hundreds of thousands of entries ○ Status: Building prediction models ● Retail data ○ Elements of analysis: Transactions, Products, Stock data ○ We know: Transaction value, Sold products, Merchant ○ We don’t know: Who the user is ○ Size: Hundreds of millions of entries ○ Status: Gathering data
  • 15. SQL Database Mobile data: Environment Preprocessing notebooks Analysis and model testing notebooks Pandas R (with rpy2) scikit-learn Custom code CSV files pickle files Other input sources Jupyter notebooks in Docker container with Anaconda Diagnostics Cleaning Feature building Raw data Models Visualizations
  • 16. Docker image … with Anaconda ● Anaconda: package manager for data science ● Using docker-compose for setting up container parameters ● Many available images ● Our base image: ○ pyspark from Jupyter Docker Stacks ○ Extended with required libraries ● Libraries are added or updated with docker build: ○ Self-contained ○ Easy versioning
  • 17. Jupyter Notebook (1) Web application for creating documents with live code, explanations and visualizations ● Initially, part of IPython ● Narrative with live code ● Protocol for interactive exploration ○ Run blocks of code ○ Embedded JS ● Executable documents ○ Code ○ HTML and Markdown ○ Metadata ● Kernels for multiple languages ○ Python ○ R ○ Scala ○ Bash ● Internal format: JSON
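Since the internal format is JSON, a notebook can be inspected with ordinary tools. The sketch below builds a minimal notebook as a Python dict and reads its cells back with the standard `json` module. The field names (`cells`, `cell_type`, `source`, `metadata`, `nbformat`) follow the nbformat 4 layout as commonly documented; the cell contents and the `code_sources` helper are made up for illustration.

```python
import json

# A hand-written, minimal notebook (illustrative content, nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Retention analysis"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["total = 1 + 1\n", "print(total)"],
        },
    ],
}

# Serialize and parse again, as a tool reading an .ipynb file would.
raw = json.dumps(notebook, indent=1)
loaded = json.loads(raw)


def code_sources(nb):
    """Hypothetical helper: return the source text of every code cell."""
    return ["".join(cell["source"]) for cell in nb["cells"] if cell["cell_type"] == "code"]


cell_types = [cell["cell_type"] for cell in loaded["cells"]]
sources = code_sources(loaded)
print(cell_types)
print(sources[0])
```

The same approach is one way to soften the versioning problem: outputs and execution counts can be stripped from the JSON before committing.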
  • 18. Jupyter Notebook (2) Web application for creating documents with live code, explanations and visualizations ● Plugins and widgets ● Easy to share (formats: Notebook, PDF, HTML, …) ● Large ecosystem ○ JupyterLab / JupyterHub ○ GitHub visualizations ○ Blog integration ○ Education: teaching, evaluation ○ Microsoft, Google, Bloomberg, IBM, O'Reilly ○ Executable books ● Versioning is complicated
  • 19. Pandas Python library for data manipulation and analysis ● DataFrame objects ○ Tabular data structures ○ Each column has one data type ● Based on numpy (fast) ● Processing is (mostly) done in memory ● Data manipulation: ○ Hierarchical indexing ○ Reshaping, pivoting, grouping ○ String operations ○ Time series operations ● Reading / writing from / to many formats (CSV, JSON, HDF5, …) ● Visualization: matplotlib, Seaborn, Bokeh, …
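A small sketch of the grouping and pivoting operations listed on the Pandas slide. The transactions and the column names (`user`, `merchant`, `value`, `date`) are invented for illustration and are not the schema of the datasets described in the talk.

```python
import pandas as pd

# Made-up transactions; column names are illustrative only.
df = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u3", "u3"],
        "merchant": ["A", "B", "A", "A", "B"],
        "value": [10.0, 5.0, 7.5, 2.5, 20.0],
        "date": pd.to_datetime(
            ["2016-01-05", "2016-02-10", "2016-01-20", "2016-02-01", "2016-02-15"]
        ),
    }
)

# Grouping: total value per merchant.
per_merchant = df.groupby("merchant")["value"].sum()

# Time series operation: derive a year-month label from the timestamp.
df["month"] = df["date"].dt.strftime("%Y-%m")

# Pivoting: merchants as rows, months as columns, summed values in the cells.
pivot = df.pivot_table(
    index="merchant", columns="month", values="value", aggfunc="sum", fill_value=0
)

print(per_merchant)
print(pivot)
```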
  • 20. rpy2 Interface between Python and R ● Translates DataFrames between Python and R ● Python in Jupyter: use %%R ● Direct access to R objects (rpy2.robjects)
  • 21. Jupyter, Pandas and R ● Notebook combining: R with rpy2, Python, HTML and Markdown
  • 22. Mobile data: User retention Active users: ● Classic: 1+ transactions in a given period ● Rolling: 1+ transactions in a given or subsequent period Plots: ● X: period (day, week, month) ● Y (cohort): period or another type of segment ● By transaction criteria (merchant, product, etc.) Results: ● Response to campaigns ● Activity recurrence
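The two definitions of an active user can be sketched with Pandas as follows. This is one possible reading, not the code used in the talk: the cohort is assumed to be the month of a user's first transaction, and "rolling" is interpreted as counting a user in every period from the first transaction up to the last one. The data and column names are invented.

```python
import pandas as pd

# Made-up transactions: one row per (user, month) with at least one transaction.
tx = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u2", "u2", "u3"],
        "month": ["2016-01", "2016-02", "2016-03", "2016-01", "2016-03", "2016-02"],
    }
)

# Assumption: cohort = first month in which the user transacted.
first = tx.groupby("user")["month"].min()
last = tx.groupby("user")["month"].max()
tx["cohort"] = tx["user"].map(first)

# Classic: distinct users of each cohort with 1+ transactions in the month itself.
classic = tx.groupby(["cohort", "month"])["user"].nunique().unstack(fill_value=0)

# Rolling: a user also counts in a month if they transact in a later month,
# i.e. in every month between their first and last transaction.
months = sorted(tx["month"].unique())
rows = [
    {"cohort": first[user], "month": month, "user": user}
    for user in first.index
    for month in months
    if first[user] <= month <= last[user]
]
rolling = (
    pd.DataFrame(rows).groupby(["cohort", "month"])["user"].nunique().unstack(fill_value=0)
)

print(classic)
print(rolling)
```

In this toy data, user `u2` skips February but returns in March, so the January cohort shows a dip under the classic definition and none under the rolling one.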
  • 23. Mobile data: Correlations Features: ● How similar are two features? Merchants: ● Which merchants have common users? Products: ● Which products are sold together?
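For the question of which merchants have common users, one simple measure is the Jaccard similarity between the sets of users of two merchants. The slides do not say which measure was used, so this choice, the sample data, and the `jaccard` helper are assumptions for illustration.

```python
from itertools import combinations

# Made-up (user, merchant) pairs.
transactions = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u2", "B"), ("u3", "B"), ("u4", "B"),
    ("u5", "C"),
]

# Collect the set of distinct users seen at each merchant.
users_by_merchant = {}
for user, merchant in transactions:
    users_by_merchant.setdefault(merchant, set()).add(user)


def jaccard(a, b):
    """Share of users common to both sets, relative to all users in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Similarity for every unordered pair of merchants.
overlap = {
    (m1, m2): jaccard(users_by_merchant[m1], users_by_merchant[m2])
    for m1, m2 in combinations(sorted(users_by_merchant), 2)
}

for pair, score in overlap.items():
    print(pair, score)
```

The same structure applies to products sold together, with transactions as the sets instead of users.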
  • 24. Mobile data: Clusters ● Group users by behavior ● Identify outliers ● Future: automatic cluster labeling
  • 26. Retail data: Our experience First try: Out-of-core processing with HDF5 ● Data did not fit in memory ● HDF5: format for large data ● Pandas + HDF5, Blaze, Dask, Odo ● Easy-to-use functions ● Library incompatibilities ● Slow queries, use indexes ● Occasional runtime errors
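The general idea behind out-of-core processing is to stream the data in pieces and combine partial results, so that only one piece is in memory at a time. The sketch below shows this with `pandas.read_csv` and its `chunksize` parameter on an in-memory CSV. It is a deliberately simpler stand-in, not the HDF5, Blaze or Dask setup mentioned on the slide, and the data is invented.

```python
import io

import pandas as pd

# Stand-in for a file too large to load at once (made-up data).
csv_data = io.StringIO(
    "merchant,value\n"
    "A,10\n"
    "B,5\n"
    "A,7\n"
    "C,1\n"
    "B,20\n"
)

totals = {}
# With chunksize set, read_csv yields DataFrames of at most 2 rows each.
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Aggregate within the chunk, then fold the partial sums into the running totals.
    partial = chunk.groupby("merchant")["value"].sum()
    for merchant, value in partial.items():
        totals[merchant] = totals.get(merchant, 0) + int(value)

print(totals)
```

This works for aggregations that can be combined from partial results (sums, counts); operations such as medians or joins need more care, which is part of what the dedicated libraries handle.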
  • 27. Cassandra Retail data: Environment Preprocessing notebooks Analysis and model testing notebooks Large data: Spark ML + scikit-learn Small (selection) data: Pandas, scikit-learn and R CSV files Apache Parquet Cassandra Other input sources Jupyter notebooks in Docker containers with Spark and Anaconda Diagnostics Cleaning Feature building Raw data Models Visualizations In progress
  • 28. Spark Engine for big data processing ● DataFrames ○ Built on top of RDDs ○ Similar to Pandas and R ○ SQL queries ○ Automatic query optimization through query plan ○ String, date-time and statistics functions ○ Group by, filters ● Jupyter integration: work in progress https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
  • 29. Spark Machine Learning MLlib and ML ● MLlib ○ Uses RDDs ○ Summaries, correlations, sampling ○ SVMs, logistic regression, decision trees, ensembles and Naive Bayes ○ Clustering ○ Feature transformation ● ML ○ Works with DataFrames ○ Many wrappers for MLlib ○ Pipelines: ■ Transformers, Estimators, Parameters ■ labelCol, featuresCol, predictionCol, ... ○ R formulas (y ~ x1 + x2)
  • 30. Retail data: Our experience Current: Spark + Docker ● No issues at current size (several GBs) ● Docker Compose for creating master, workers and Jupyter container (driver) ● ML libraries are easy to work with ● Incomplete Python API for ML (e.g., summaries) ● Documentation needs improvement ● Model diagnostics ○ Some metrics are available ○ Supplement with scikit-learn (example: build ROC curves) ● scikit-learn or R on top of Spark ○ Parallelize parameter search (e.g., grid search) ○ Spark sklearn (github.com/databricks/spark-sklearn): Grid Search
  • 31. Future work Mobile wallet transactions: ● Data fits in memory ● Use Spark for distributing workload ERP transactions: ● Some data fits in memory, after processing ● Build a web app for data exploration ● Forecast ○ Sales ○ Inventory requirements ● Try Spark Streaming http://xkcd.com/1425/