SlideShare una empresa de Scribd logo
1 de 23
Descargar para leer sin conexión
Towards Data Operations
Dr. Andrea Monacchi
Streaming data, models and code as first-class citizens
Riservato Nome personalizzato dell'azienda Versione 1.0
1. About Myself
2. Big Data Architectures
3. Stream-based computing
4. Integrating the Data Science workflow
Summary
About
● BS Computer Science (2010)
● MS Computer Science (2012)
● PhD Information Technology (2016)
● Consultancy experience @ Reply DE (2016-2018)
● Independent Consultant
Big Data
Architectures
Big Data: why are we here?
Loads:
● Transactional (OLTP)
○ all operations
○ ACID properties: atomicity, consistency, isolation,
and durability
● Analytical (OLAP)
○ append-heavy loads
○ aggregations and explorative queries (analytics)
○ hierarchical indexing (OLAP hyper-cubes)
Scalability:
● ACID properties costly
● CAP Theorem
○ impossible to simultaneously distributed data load
and fulfill 3 properties: consistency, availability,
partition tolerance (tolerance to communication
errors)
○ CA are classic RDBMS - vertical scaling only
○ CP (e.g. quorum-based) and AP are NoSQL DBs
○ e.g. Dynamo DB (eventual consistency, AP)
● NoSQL databases
○ relax ACID properties to achieve horizontal scaling
Scalability
● Key/Value Storage (DHT)
○ decentralised, scalable, fault-tolerant
○ easy to partition and distribute data by key
■ e.g. p = hash(k) % num_partitions
○ replication (partition redundancy)
■ 2: error detection
■ 3: error recovery
● Parallel collections
○ e.g. Spark RDD based on underlying HDFS blocks
○ master-slave (or driver-worker) coordination
○ no shared variables (accumulators, broadcast vars)
Parallel computation:
1. Split dataset into multiple partitions/shards
2. Independently process each partition
3. Combine partitions into result
● MapReduce (Functional Programming)
● Split-apply-combine
● Google’s MapReduce
● Cluster Manager
○ Yarn, Mesos, K8s
○ resource isolation and scheduling
○ security, fault-tolerance, monitoring
● Data Serialization formats
○ Text: CSV, JSON, XML,..
○ Binary: SeqFile, Avro, Parquet, ORC, ..
● Batch Processing
○ Hadoop MapReduce variants
■ Pig, Hive
○ Apache Spark
○ Python Dask
● Workflow management tools
○ Fault-tolerant task coordination
○ Oozie, Airflow, Argo (K8s)
Architectures for Data Analytics
● Stages
○ Ingestion (with retention)
○ (re)-processing
○ Presentation/Serving (indexed data, OLAP)
● Lambda Vs. Kappa architecture
○ batch for correctness, streaming for speed
○ mix of technologies (CAP theorem)
○ complexity & operational costs
Stream-based
computing
1st phase - Ingestion: MQTT
● Pub/Sub
○ Totally uncoupled clients
○ messages can be explicitly retained (flag)
● QoS
○ multilevel (0: at most once, 1: at least once, 2: exactly once)
● Persistent sessions
○ broker keeps further info of clients to speed up reconnections
○ queuing of messages in QoS 1 and 2 (disconnections)
○ queuing of all acks for QoS 2
● Last will and Testament messages (LWT)
○ kept until proper client disconnect is sent to the broker
● fault-tolerant pub/sub system with message retention
● exactly once semantics (from v0.11.0) using transactional.id flag (and acks)
● topic as an append-only file log
○ ordering by arrival time (offset)
○ stores changes on the data source (deltas)
○ topic/log consists of multiple partitions
■ partitioning instructed by producer
■ guaranteed message ordering within partition
■ based on message key
● hash(k) % num_partitions
● if k == null, then round robin is used
■ distribution and replication by partition
● 1 elected active (leader) and n replicas
● Zookeeper
1st phase - Ingestion: Kafka
P
Broker2
Broker1
Partition0
Partition1
Partition2
Partition3
Broker3
Partition4
Partition5
● topic maps to local directory
○ 1 file created per each topic partition
○ log rolling: upon size or time limit partition file is rolled
and new one is created
■ log.roll.ms or log.roll.hours = 168
○ log retention:
■ save rolled log to segment files
■ older segments deleted log.retention.hours
or log.retention.bytes
○ log compaction:
■ deletion of older records by key (latest value)
■ log.cleanup.policy=compact on topic
● number of topic partitions (howto)
○ more data -> more partitions
○ proportional to desired throughput
○ overhead for open file handles and TCP connections
■ n partitions, each with m replicas, n*m
■ total partitions on a broker < 100 *
num_brokers * replication_factor, (divide for
all topics)
1st phase - Ingestion: Kafka
● APIs
○ consumer / producer (single thread)
○ Kafka connect
○ KSQL
● Producer (Stateless wrt Broker)
○ ProducerRecord: (topic : String, partition : int,
timestamp : long, key : K, value : V)
○ partition id and time are optional (for manual setup)
● Kafka Connect
○ API for connectors (Source Vs Sink)
○ automatic offset management (commit and restore)
○ at least once delivery
○ exactly once only for certain connectors
○ standalone Vs. distributed modes
○ connectors configurable via REST interface
1st phase - Ingestion: Kafka
● Consumer
○ stateful (own offset wrt topic’s)
○ earliest Vs. latest recovery
○ offset committing: manual, periodical (default 5 secs)
○ consumers can be organized in load-balanced groups (per partition)
○ ideally: consumer’s threads = topic partitions
2nd phase - Stream processing
● streams
○ bounded - can be ingested and batch processed
○ unbounded - processed per event
● stream partitions
○ unit of parallelism
● stream and table duality
○ log changes <-> table
○ Declarative SQL APIs (e.g. KSQL, TableAPI)
● time
○ event time, ingestion time, processing time
○ late message handling (pre-buffering)
● windowing
○ bounded aggregations
○ time or data driven (e.g. count) windows
○ tumbling, sliding and session windows
● stateful operations/transformations
○ state: intermediate result, in-memory key-value store
used across windows or microbatches
○ e.g. RocksDB (Flink+KafkaStreams), LSM-tree
○ e.g. changeLogging back to Kafka topic (KafkaStreams)
● checkpointing
○ save app status for failure recovery (stream replay)
● frameworks
○ Spark Streaming (u-batching), Kafka Streams and Flink
○ Apache Storm, Apache Samza
Code Examples
Flink
● processing webserver logs for frauds
● count downloaded assets per user
● https://github.com/pilillo/flink-quickstart
● Deploy to Kubernetes
SparkStreaming & Kafka
● https://github.com/pilillo/sparkstreaming-quickstart
Kafka Streams
● project skeleton
Integrating the
Data Science
workflow
Data Science workflow
Technical gaps potentially resulting from this process!
● Data Analytics projects
○ stream of research questions and feedback
● Data forked for exploration
○ Data versioning
○ Periodic data quality assessment
● Team misalignment
○ scientists working aside dev team
○ information gaps w/ team & stakeholders
○ unexpected behaviors upon changes
● Results may not be reproducible
● Releases are not frequent
○ Value misalignment (waste of resources)
● CICD only used for data preparation
Data Operations (DataOps)
● DevOps approaches
○ lean product development (continuous feedback and value delivery) using CICD approaches
○ cross-functional teams with mix of development & operations skillset
● DataOps
○ devops for streaming data and analytics as a manufacturing process - Manifesto, Cookbook
○ mix of data engineering and data science skill set
○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
Data Operations workflow
Data Science workflow
CICD activities
CICD for ML
● Continuous Data Quality Assessment
○ data versioning (e.g. apache pachyderm)
○ syntax (e.g. confluent schema registry)
○ semantic (e.g. apache griffin) - accuracy, completeness, timeliness, uniqueness, validity, consistency
● Model Tuning
○ hyperparameters tuning - black-box optimization
○ autoML - model selection
○ continuous performance evaluation (wrt newer input data)
○ stakeholder/user performance (e.g. AB testing)
● Model Deployment
○ TF-serve, Seldon
● ML workflow management
○ Amazon Sagemaker, Google ML Engine, MLflow, Kubeflow, Polyaxon
CICD for ML code
See also: https://github.com/EthicalML/awesome-machine-learning-operations
Data-Mill project
● Based on Kubernetes
○ open and scalable
○ seamless integration of bare-metal and cloud-provided clusters
● Enforcing DataOps principles
○ continuous asset monitoring (code, data, models)
○ open-source tools to reproduce and serve models
● Flavour-based organization of components
○ flavour = cluster_spec + SW_components
● Built-in exploration environments (dashboarding tools, jupyter notebooks with DS libraries)
https://data-mill-cloud.github.io/data-mill/
Thank you!

Más contenido relacionado

La actualidad más candente

Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Anant Corporation
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and ReuseVasia Kalavri
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Hajira Jabeen
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastoreKishore Gopalakrishna
 
Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Tim Allison
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRoverChristoph Matthies
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems researchVasia Kalavri
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
Nikhil summer internship 2016
Nikhil   summer internship 2016Nikhil   summer internship 2016
Nikhil summer internship 2016Nikhil Shekhar
 

La actualidad más candente (20)

Apache flink
Apache flinkApache flink
Apache flink
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastore
 
Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRover
 
ISNCC 2017
ISNCC 2017ISNCC 2017
ISNCC 2017
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems research
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Nikhil summer internship 2016
Nikhil   summer internship 2016Nikhil   summer internship 2016
Nikhil summer internship 2016
 

Similar a Towards Data Operations

Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationHao Xu
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
 
Introduction to Postrges-XC
Introduction to Postrges-XCIntroduction to Postrges-XC
Introduction to Postrges-XCAshutosh Bapat
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...confluent
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxXinliShang1
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Mobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameter
Mobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameterMobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameter
Mobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiametertelestax
 
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...Hisham Mardam-Bey
 
Gluster dev session #6 understanding gluster's network communication layer
Gluster dev session #6  understanding gluster's network   communication layerGluster dev session #6  understanding gluster's network   communication layer
Gluster dev session #6 understanding gluster's network communication layerPranith Karampuri
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with KinesisMark Harrison
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
JConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with JavaJConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with JavaTimothy Spann
 
Concurrency, Parallelism And IO
Concurrency,  Parallelism And IOConcurrency,  Parallelism And IO
Concurrency, Parallelism And IOPiyush Katariya
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKevin Lynch
 
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...confluent
 

Similar a Towards Data Operations (20)

Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Introduction to Postrges-XC
Introduction to Postrges-XCIntroduction to Postrges-XC
Introduction to Postrges-XC
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Mobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameter
Mobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameterMobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameter
Mobicents Summit 2012 - Alexandre Mendonca - Mobicents jDiameter
 
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
 
Gluster dev session #6 understanding gluster's network communication layer
Gluster dev session #6  understanding gluster's network   communication layerGluster dev session #6  understanding gluster's network   communication layer
Gluster dev session #6 understanding gluster's network communication layer
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
JConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with JavaJConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with Java
 
Concurrency, Parallelism And IO
Concurrency,  Parallelism And IOConcurrency,  Parallelism And IO
Concurrency, Parallelism And IO
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
 
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
 

Más de Andrea Monacchi

Coordination in distributed systems
Coordination in distributed systemsCoordination in distributed systems
Coordination in distributed systemsAndrea Monacchi
 
Anomaly detection on wind turbine data
Anomaly detection on wind turbine dataAnomaly detection on wind turbine data
Anomaly detection on wind turbine dataAndrea Monacchi
 
Welcome to Load Disaggregation and Building Energy Management
Welcome to Load Disaggregation and Building Energy ManagementWelcome to Load Disaggregation and Building Energy Management
Welcome to Load Disaggregation and Building Energy ManagementAndrea Monacchi
 
An Early Warning System for Ambient Assisted Living
An Early Warning System for Ambient Assisted LivingAn Early Warning System for Ambient Assisted Living
An Early Warning System for Ambient Assisted LivingAndrea Monacchi
 
Assisting Energy Management in Smart Buildings and Microgrids
Assisting Energy Management in Smart Buildings and MicrogridsAssisting Energy Management in Smart Buildings and Microgrids
Assisting Energy Management in Smart Buildings and MicrogridsAndrea Monacchi
 
Analytics as value added service for energy utilities
Analytics as value added service for energy utilitiesAnalytics as value added service for energy utilities
Analytics as value added service for energy utilitiesAndrea Monacchi
 
HEMS: A Home Energy Market Simulator
HEMS: A Home Energy Market SimulatorHEMS: A Home Energy Market Simulator
HEMS: A Home Energy Market SimulatorAndrea Monacchi
 
GREEND: An energy consumption dataset of households in Austria and Italy
GREEND: An energy consumption dataset of households in Austria and ItalyGREEND: An energy consumption dataset of households in Austria and Italy
GREEND: An energy consumption dataset of households in Austria and ItalyAndrea Monacchi
 

Más de Andrea Monacchi (9)

Coordination in distributed systems
Coordination in distributed systemsCoordination in distributed systems
Coordination in distributed systems
 
Introduction to istio
Introduction to istioIntroduction to istio
Introduction to istio
 
Anomaly detection on wind turbine data
Anomaly detection on wind turbine dataAnomaly detection on wind turbine data
Anomaly detection on wind turbine data
 
Welcome to Load Disaggregation and Building Energy Management
Welcome to Load Disaggregation and Building Energy ManagementWelcome to Load Disaggregation and Building Energy Management
Welcome to Load Disaggregation and Building Energy Management
 
An Early Warning System for Ambient Assisted Living
An Early Warning System for Ambient Assisted LivingAn Early Warning System for Ambient Assisted Living
An Early Warning System for Ambient Assisted Living
 
Assisting Energy Management in Smart Buildings and Microgrids
Assisting Energy Management in Smart Buildings and MicrogridsAssisting Energy Management in Smart Buildings and Microgrids
Assisting Energy Management in Smart Buildings and Microgrids
 
Analytics as value added service for energy utilities
Analytics as value added service for energy utilitiesAnalytics as value added service for energy utilities
Analytics as value added service for energy utilities
 
HEMS: A Home Energy Market Simulator
HEMS: A Home Energy Market SimulatorHEMS: A Home Energy Market Simulator
HEMS: A Home Energy Market Simulator
 
GREEND: An energy consumption dataset of households in Austria and Italy
GREEND: An energy consumption dataset of households in Austria and ItalyGREEND: An energy consumption dataset of households in Austria and Italy
GREEND: An energy consumption dataset of households in Austria and Italy
 

Último

Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 

Último (20)

Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 

Towards Data Operations

  • 1. Towards Data Operations Dr. Andrea Monacchi Streaming data, models and code as first-class citizens
  • 2. Riservato Nome personalizzato dell'azienda Versione 1.0 1. About Myself 2. Big Data Architectures 3. Stream-based computing 4. Integrating the Data Science workflow Summary
  • 3. About ● BS Computer Science (2010) ● MS Computer Science (2012) ● PhD Information Technology (2016) ● Consultancy experience @ Reply DE (2016-2018) ● Independent Consultant
  • 5. Big Data: why are we here? Loads: ● Transactional (OLTP) ○ all operations ○ ACID properties: atomicity, consistency, isolation, and durability ● Analytical (OLAP) ○ append-heavy loads ○ aggregations and explorative queries (analytics) ○ hierarchical indexing (OLAP hyper-cubes) Scalability: ● ACID properties costly ● CAP Theorem ○ impossible to simultaneously distributed data load and fulfill 3 properties: consistency, availability, partition tolerance (tolerance to communication errors) ○ CA are classic RDBMS - vertical scaling only ○ CP (e.g. quorum-based) and AP are NoSQL DBs ○ e.g. Dynamo DB (eventual consistency, AP) ● NoSQL databases ○ relax ACID properties to achieve horizontal scaling
  • 6. Scalability ● Key/Value Storage (DHT) ○ decentralised, scalable, fault-tolerant ○ easy to partition and distribute data by key ■ e.g. p = hash(k) % num_partitions ○ replication (partition redundancy) ■ 2: error detection ■ 3: error recovery ● Parallel collections ○ e.g. Spark RDD based on underlying HDFS blocks ○ master-slave (or driver-worker) coordination ○ no shared variables (accumulators, broadcast vars) Parallel computation: 1. Split dataset into multiple partitions/shards 2. Independently process each partition 3. Combine partitions into result ● MapReduce (Functional Programming) ● Split-apply-combine ● Google’s MapReduce
  • 7. ● Cluster Manager ○ Yarn, Mesos, K8s ○ resource isolation and scheduling ○ security, fault-tolerance, monitoring ● Data Serialization formats ○ Text: CSV, JSON, XML,.. ○ Binary: SeqFile, Avro, Parquet, ORC, .. ● Batch Processing ○ Hadoop MapReduce variants ■ Pig, Hive ○ Apache Spark ○ Python Dask ● Workflow management tools ○ Fault-tolerant task coordination ○ Oozie, Airflow, Argo (K8s) Architectures for Data Analytics ● Stages ○ Ingestion (with retention) ○ (re)-processing ○ Presentation/Serving (indexed data, OLAP) ● Lambda Vs. Kappa architecture ○ batch for correctness, streaming for speed ○ mix of technologies (CAP theorem) ○ complexity & operational costs
  • 9. 1st phase - Ingestion: MQTT ● Pub/Sub ○ Totally uncoupled clients ○ messages can be explicitly retained (flag) ● QoS ○ multilevel (0: at most once, 1: at least once, 2: exactly once) ● Persistent sessions ○ broker keeps further info of clients to speed up reconnections ○ queuing of messages in QoS 1 and 2 (disconnections) ○ queuing of all acks for QoS 2 ● Last will and Testament messages (LWT) ○ kept until proper client disconnect is sent to the broker
  • 10. ● fault-tolerant pub/sub system with message retention ● exactly once semantics (from v0.11.0) using transactional.id flag (and acks) ● topic as an append-only file log ○ ordering by arrival time (offset) ○ stores changes on the data source (deltas) ○ topic/log consists of multiple partitions ■ partitioning instructed by producer ■ guaranteed message ordering within partition ■ based on message key ● hash(k) % num_partitions ● if k == null, then round robin is used ■ distribution and replication by partition ● 1 elected active (leader) and n replicas ● Zookeeper 1st phase - Ingestion: Kafka P Broker2 Broker1 Partition0 Partition1 Partition2 Partition3 Broker3 Partition4 Partition5
  • 11. ● topic maps to local directory ○ 1 file created per each topic partition ○ log rolling: upon size or time limit partition file is rolled and new one is created ■ log.roll.ms or log.roll.hours = 168 ○ log retention: ■ save rolled log to segment files ■ older segments deleted log.retention.hours or log.retention.bytes ○ log compaction: ■ deletion of older records by key (latest value) ■ log.cleanup.policy=compact on topic ● number of topic partitions (howto) ○ more data -> more partitions ○ proportional to desired throughput ○ overhead for open file handles and TCP connections ■ n partitions, each with m replicas, n*m ■ total partitions on a broker < 100 * num_brokers * replication_factor, (divide for all topics) 1st phase - Ingestion: Kafka
  • 12. ● APIs ○ consumer / producer (single thread) ○ Kafka connect ○ KSQL ● Producer (Stateless wrt Broker) ○ ProducerRecord: (topic : String, partition : int, timestamp : long, key : K, value : V) ○ partition id and time are optional (for manual setup) ● Kafka Connect ○ API for connectors (Source Vs Sink) ○ automatic offset management (commit and restore) ○ at least once delivery ○ exactly once only for certain connectors ○ standalone Vs. distributed modes ○ connectors configurable via REST interface 1st phase - Ingestion: Kafka ● Consumer ○ stateful (own offset wrt topic’s) ○ earliest Vs. latest recovery ○ offset committing: manual, periodical (default 5 secs) ○ consumers can be organized in load-balanced groups (per partition) ○ ideally: consumer’s threads = topic partitions
  • 13. 2nd phase - Stream processing ● streams ○ bounded - can be ingested and batch processed ○ unbounded - processed per event ● stream partitions ○ unit of parallelism ● stream and table duality ○ log changes <-> table ○ Declarative SQL APIs (e.g. KSQL, TableAPI) ● time ○ event time, ingestion time, processing time ○ late message handling (pre-buffering) ● windowing ○ bounded aggregations ○ time or data driven (e.g. count) windows ○ tumbling, sliding and session windows ● stateful operations/transformations ○ state: intermediate result, in-memory key-value store used across windows or microbatches ○ e.g. RocksDB (Flink+KafkaStreams), LSM-tree ○ e.g. changeLogging back to Kafka topic (KafkaStreams) ● checkpointing ○ save app status for failure recovery (stream replay) ● frameworks ○ Spark Streaming (u-batching), Kafka Streams and Flink ○ Apache Storm, Apache Samza
  • 14. Code Examples Flink ● processing webserver logs for frauds ● count downloaded assets per user ● https://github.com/pilillo/flink-quickstart ● Deploy to Kubernetes SparkStreaming & Kafka ● https://github.com/pilillo/sparkstreaming-quickstart Kafka Streams ● project skeleton
  • 16. Data Science workflow Technical gaps potentially resulting from this process! ● Data Analytics projects ○ stream of research questions and feedback ● Data forked for exploration ○ Data versioning ○ Periodic data quality assessment ● Team misalignment ○ scientists working aside dev team ○ information gaps w/ team & stakeholders ○ unexpected behaviors upon changes ● Results may not be reproducible ● Releases are not frequent ○ Value misalignment (waste of resources) ● CICD only used for data preparation
  • 17. Data Operations (DataOps) ● DevOps approaches ○ lean product development (continuous feedback and value delivery) using CICD approaches ○ cross-functional teams with mix of development & operations skillset ● DataOps ○ devops for streaming data and analytics as a manufacturing process - Manifesto, Cookbook ○ mix of data engineering and data science skill set ○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
  • 18. Data Operations workflow Data Science workflow
  • 21. ● Continuous Data Quality Assessment ○ data versioning (e.g. apache pachyderm) ○ syntax (e.g. confluent schema registry) ○ semantic (e.g. apache griffin) - accuracy, completeness, timeliness, uniqueness, validity, consistency ● Model Tuning ○ hyperparameters tuning - black-box optimization ○ autoML - model selection ○ continuous performance evaluation (wrt newer input data) ○ stakeholder/user performance (e.g. AB testing) ● Model Deployment ○ TF-serve, Seldon ● ML workflow management ○ Amazon Sagemaker, Google ML Engine, MLflow, Kubeflow, Polyaxon CICD for ML code See also: https://github.com/EthicalML/awesome-machine-learning-operations
  • 22. Data-Mill project ● Based on Kubernetes ○ open and scalable ○ seamless integration of bare-metal and cloud-provided clusters ● Enforcing DataOps principles ○ continuous asset monitoring (code, data, models) ○ open-source tools to reproduce and serve models ● Flavour-based organization of components ○ flavour = cluster_spec + SW_components ● Built-in exploration environments (dashboarding tools, jupyter notebooks with DS libraries) https://data-mill-cloud.github.io/data-mill/