SlideShare una empresa de Scribd logo
1 de 18
Analytic Processes at Precima
Overview
 Analytics pipeline design considerations
 The old way
 The current way
 Looking to the future
Pipeline Design Considerations
 Product pipeline is easy to test, debug and monitor. There are clear solutions for
replaying, rerunning and interrupting tasks or dataflow in production ready pipeline.
 There are several teams involved in product pipeline ( e.g., security, development,
support and etc.) ; however , there is a clear chain of responsibility and protocol for
when things go wrong.
 The pipeline design are reviewed under business/stakeholder use cases and our
pipeline are designed to be highly configurable and scalable.
2 Years Ago – Legacy Stack in a Data Center
cron
SAS Enterprise Guide and Scripting
Shell Scripting and Crontab Scheduling
Current Analytics Stack in AWS
Amazon S3
Control-M
Luigi and Control M
Control-M Scheduler
 Coordinate dependencies between disparate
servers and platforms
 Central dashboard of execution status
 We have gone from a handful of servers to
hundreds
 Understand what runs when and for how long
 Comparison of jobs to historical runtimes
Spot Fleet
 Run hundreds of
independent jobs
concurrently
 Each job gets it’s own server
 Compute cost is about
$0.10/hr
 Shared storage
 Servers automatically
shutdown when jobs
complete
Redshift
Pros
 Flexibility over Data Center
 Very quick to onboard new clients
 Can provide very fast query times
over large datasets
Cons
 Concurrency issues – Leader node
 Inconsistent job runtimes based
on overall workloads
 Need to scale for largest expected
workload
 Storage coupled to compute
 Not quick to scale
 AWS Only
Future Precima Analytics Stack
Amazon S3
Control-M
Databricks and Snowflake
 Databricks for Data Pipelines and Data Science
 Snowflake for high performance data warehouse queries
Benefits
 Decouple compute from storage
 Jobs don’t interfere with each other
 Virtually unlimited compute scaling
 Virtually unlimited low cost storage
 Spot pricing for nodes
 Time Travel features allow for repeatable fast dry runs on live or nearly live data
 Notebook interface including Python, SQL, Scala, R and Markdown for comments
 Multi-cloud support
Vision for the Future
 ETL with Databricks Spark jobs built using Object Oriented Python
 Take advantage of inheritance and configuration
 Quickly map new data feeds to our standard data model for our Precima products
 Built-in validation and conversion for data fields
 DRY – Don’t repeat yourself
 Data Science pipeline using Databricks notebook workflow
 Notebook Workflows allow user to include another notebook within a notebook. Users can
concatenate various notebooks that represent key ETL steps, Spark analysis steps, or ad-hoc
exploration. However, it lacks the ability to build more complex data pipelines.
 Airflow provides tight integration between Databricks and Airflow. Luigi also provides an interface to
accommodate Apache spark jobs
Pipeline Design Considerations
 Product pipeline is easy to test, debug and monitor. There are clear solutions for
replaying, rerunning and interrupting tasks or dataflow in production ready pipeline.
 Workflow management frameworks helped us to achieve most of the desired feature
for data pipeline
 There are several teams involved in product pipeline ( e.g., security, development,
support and etc.) ; however , there is a clear chain of responsibility and protocol for
when things go wrong.
 The pipeline design are reviewed under business/stakeholder use cases and our
pipeline are designed to be highly configurable and scalable.
 Move to AWS unlocked our ability to scale
 Moving toward options that decouple storage from compute in order to scale efficiently
 Have made good progress on embracing configuration
 Moving toward fully configurable
Appendix: Qualities of Ideal Data Pipelines
The desired quality of data pipeline include
 Idempotent with state handling
 Scalable and resilient
 Replaceable or programmable
 Testable and traceable
 Documented and automated

Más contenido relacionado

La actualidad más candente

Dan Querimit - BI Portfolio
Dan Querimit - BI PortfolioDan Querimit - BI Portfolio
Dan Querimit - BI Portfolio
querimit
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
Databricks
 
How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
VMware Tanzu
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application Development
Peter Haase
 

La actualidad más candente (20)

Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
Transform Your Mainframe Data for the Cloud with Precisely and Apache Kafka
Transform Your Mainframe Data for the Cloud with Precisely and Apache KafkaTransform Your Mainframe Data for the Cloud with Precisely and Apache Kafka
Transform Your Mainframe Data for the Cloud with Precisely and Apache Kafka
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Dan Querimit - BI Portfolio
Dan Querimit - BI PortfolioDan Querimit - BI Portfolio
Dan Querimit - BI Portfolio
 
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application Development
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
 
Deploying ETL to Cloud
Deploying ETL to CloudDeploying ETL to Cloud
Deploying ETL to Cloud
 
Alteryx Architecture
Alteryx ArchitectureAlteryx Architecture
Alteryx Architecture
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
 
Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
 

Similar a Tordatasci meetup-precima-retail-analytics-201901

A complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migrationA complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migration
bindu1512
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
Databricks
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
Databricks
 

Similar a Tordatasci meetup-precima-retail-analytics-201901 (20)

Qubole Pipeline Services - A Complete Stream Processing Service - Data Sheets
Qubole Pipeline Services - A Complete Stream Processing Service - Data SheetsQubole Pipeline Services - A Complete Stream Processing Service - Data Sheets
Qubole Pipeline Services - A Complete Stream Processing Service - Data Sheets
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
A complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migrationA complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migration
 
Azure SQL DB Managed Instances Built to easily modernize application data layer
Azure SQL DB Managed Instances Built to easily modernize application data layerAzure SQL DB Managed Instances Built to easily modernize application data layer
Azure SQL DB Managed Instances Built to easily modernize application data layer
 
Airflow techtonic template
Airflow   techtonic templateAirflow   techtonic template
Airflow techtonic template
 
Webinar How to Achieve True Scalability in SaaS Applications
Webinar How to Achieve True Scalability in SaaS ApplicationsWebinar How to Achieve True Scalability in SaaS Applications
Webinar How to Achieve True Scalability in SaaS Applications
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
Data Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxData Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptx
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWS
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
 

Más de WeCloudData

Más de WeCloudData (16)

Data Engineer Intro - WeCloudData
Data Engineer Intro - WeCloudDataData Engineer Intro - WeCloudData
Data Engineer Intro - WeCloudData
 
AWS Well Architected-Info Session WeCloudData
AWS Well Architected-Info Session WeCloudDataAWS Well Architected-Info Session WeCloudData
AWS Well Architected-Info Session WeCloudData
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
 
Machine learning in Healthcare - WeCloudData
Machine learning in Healthcare - WeCloudDataMachine learning in Healthcare - WeCloudData
Machine learning in Healthcare - WeCloudData
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudData
 
Big Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataBig Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudData
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
 
Data Science with Python - WeCloudData
Data Science with Python - WeCloudDataData Science with Python - WeCloudData
Data Science with Python - WeCloudData
 
SQL for Data Science
SQL for Data ScienceSQL for Data Science
SQL for Data Science
 
Introduction to Python by WeCloudData
Introduction to Python by WeCloudDataIntroduction to Python by WeCloudData
Introduction to Python by WeCloudData
 
Data Science Career Insights by WeCloudData
Data Science Career Insights by WeCloudDataData Science Career Insights by WeCloudData
Data Science Career Insights by WeCloudData
 
Web scraping project aritza-compressed
Web scraping project   aritza-compressedWeb scraping project   aritza-compressed
Web scraping project aritza-compressed
 
Applied Machine Learning Course - Jodie Zhu (WeCloudData)
Applied Machine Learning Course - Jodie Zhu (WeCloudData)Applied Machine Learning Course - Jodie Zhu (WeCloudData)
Applied Machine Learning Course - Jodie Zhu (WeCloudData)
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
 
Big Data for Data Scientists - Info Session
Big Data for Data Scientists - Info SessionBig Data for Data Scientists - Info Session
Big Data for Data Scientists - Info Session
 
WeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew ReyesWeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew Reyes
 

Último

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 

Último (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 

Tordatasci meetup-precima-retail-analytics-201901

  • 1.
  • 3. Overview  Analytics pipeline design considerations  The old way  The current way  Looking to the future
  • 4. Pipeline Design Considerations  Product pipeline is easy to test, debug and monitor. There are clear solutions for replaying, rerunning and interrupting tasks or dataflow in production ready pipeline.  There are several teams involved in product pipeline ( e.g., security, development, support and etc.) ; however , there is a clear chain of responsibility and protocol for when things go wrong.  The pipeline design are reviewed under business/stakeholder use cases and our pipeline are designed to be highly configurable and scalable.
  • 5. 2 Years Ago – Legacy Stack in a Data Center cron
  • 6. SAS Enterprise Guide and Scripting
  • 7. Shell Scripting and Crontab Scheduling
  • 8. Current Analytics Stack in AWS Amazon S3 Control-M
  • 10. Control-M Scheduler  Coordinate dependencies between disparate servers and platforms  Central dashboard of execution status  We have gone from a handful of servers to hundreds  Understand what runs when and for how long  Comparison of jobs to historical runtimes
  • 11. Spot Fleet  Run hundreds of independent jobs concurrently  Each job gets it’s own server  Compute cost is about $0.10/hr  Shared storage  Servers automatically shutdown when jobs complete
  • 12. Redshift Pros  Flexibility over Data Center  Very quick to onboard new clients  Can provide very fast query times over large datasets Cons  Concurrency issues – Leader node  Inconsistent job runtimes based on overall workloads  Need to scale for largest expected workload  Storage coupled to compute  Not quick to scale  AWS Only
  • 13. Future Precima Analytics Stack Amazon S3 Control-M
  • 14. Databricks and Snowflake  Databricks for Data Pipelines and Data Science  Snowflake for high performance data warehouse queries Benefits  Decouple compute from storage  Jobs don’t interfere with each other  Virtually unlimited compute scaling  Virtually unlimited low cost storage  Spot pricing for nodes  Time Travel features allow for repeatable fast dry runs on live or nearly live data  Notebook interface including Python, SQL, Scala, R and Markdown for comments  Multi-cloud support
  • 15. Vision for the Future  ETL with Databricks Spark jobs built using Object Oriented Python  Take advantage of inheritance and configuration  Quickly map new data feeds to our standard data model for our Precima products  Built-in validation and conversion for data fields  DRY – Don’t repeat yourself  Data Science pipeline using Databricks notebook workflow  Notebook Workflows allow user to include another notebook within a notebook. Users can concatenate various notebooks that represent key ETL steps, Spark analysis steps, or ad-hoc exploration. However, it lacks the ability to build more complex data pipelines.  Airflow provides tight integration between Databricks and Airflow. Luigi also provides an interface to accommodate Apache spark jobs
  • 16. Pipeline Design Considerations  Product pipeline is easy to test, debug and monitor. There are clear solutions for replaying, rerunning and interrupting tasks or dataflow in production ready pipeline.  Workflow management frameworks helped us to achieve most of the desired feature for data pipeline  There are several teams involved in product pipeline ( e.g., security, development, support and etc.) ; however , there is a clear chain of responsibility and protocol for when things go wrong.  The pipeline design are reviewed under business/stakeholder use cases and our pipeline are designed to be highly configurable and scalable.  Move to AWS unlocked our ability to scale  Moving toward options that decouple storage from compute in order to scale efficiently  Have made good progress on embracing configuration  Moving toward fully configurable
  • 17.
  • 18. Appendix: Qualities of Ideal Data Pipelines The desired quality of data pipeline include  Idempotent with state handling  Scalable and resilient  Replaceable or programmable  Testable and traceable  Documented and automated

Notas del editor

  1. Partly Reference from SlideShare material https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos
  2. Partly Reference from SlideShare material https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos
  3. Reference from SlideShare material https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos