Cloud Native Data Pipelines


Presented at AI NEXTCon Seattle 1/17-20, 2018
http://aisea18.xnextcon.com

Cloud Native Data Pipelines

  1. 1. Cloud Native Data Pipelines. Sid Anand (@r39132), Cloud Data Next 2017
  2. 2. About Me: Work [ed | s] @ [logos]; Committer & PPMC on Apache Airflow; Father of 2; Co-Chair for [logo]
  3. 3. Agari: What We Do!
  4. 4. Agari : What We Do
  5. 5. Agari : What We Do
  6. 6. Agari : What We Do
  7. 7. Agari : What We Do
  8. 8. Agari : What We Do
  9. 9. Agari : What We Do. Agari’s Previous EP Version (Batch): Enterprise Customers, email metadata, apply trust models, email md + trust score
  10. 10. Agari : What We Do. Agari’s Current EP Version (Near-real time): Enterprise Customers, email metadata, apply trust models, email md + trust score; Quarantine, Label, PassThrough
  11. 11. Motivation: Cloud Native Data Pipelines
  12. 12. Cloud Native Data Pipelines: Big Data companies like LinkedIn, Facebook, Twitter, & Google have large teams to manage their data pipelines (100s of engineers). Most start-ups have small teams (10s of engineers) & run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
  13. 13. Cloud Native Data Pipelines: Cloud Native Techniques + Open Source Technologies ~ Data Pipelines seen in Big Data companies
  14. 14. Design Goals: Desirable Qualities of a Resilient Data Pipeline
  15. 15. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost
  16. 16. Desirable Qualities of a Resilient Data Pipeline. Correctness: • Data Integrity (no loss, etc.) • Expected data distributions. Timeliness: • All output within time-bound SLAs. Operability: • Minimize Operational Fatigue / Automate Everything • Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs • Quick Recoverability. Cost: • Pay-as-you-go
  17. 17. Quickly Recoverable: • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • Optimize for MTTR
  18. 18. Predictive Analytics @ Agari: Use Cases
  19. 19. Use Cases: Apply trust models (message scoring): batch + near real time. Build trust models: batch. (Enterprise Protect) Focus of this talk
  20. 20. Use-Case : Message Scoring (batch). Batch Pipeline Architecture
  21. 21. Use-Case : Message Scoring. S3 uploads: an Avro file every 15 minutes. (Diagram: enterprise A, enterprise B, enterprise C, S3)
  22. 22. Use-Case : Message Scoring. Airflow kicks off a Spark message scoring job every hour (EMR). (Diagram: enterprise A, enterprise B, enterprise C, S3)
  23. 23. Use-Case : Message Scoring. Spark job writes scored messages and stats to another S3 bucket. (Diagram: enterprise A, enterprise B, enterprise C, S3, S3)
  24. 24. Use-Case : Message Scoring. This triggers SNS/SQS message events. (Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS)
  25. 25. Use-Case : Message Scoring. An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages. (Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers ASG)
  26. 26. Use-Case : Message Scoring. The importers rapidly ingest scored messages and aggregate statistics into the DB. (Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers ASG, DB)
  27. 27. Use-Case : Message Scoring. Users receive alerts of untrusted emails & can review them in the web app. (Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers ASG, DB)
  28. 28. Use-Case : Message Scoring. Airflow manages the entire process. (Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers ASG, DB)
  29. 29. Architectural Components (Role: Uses; Salient Features; Operability Model). Data Lake (S3): all data stored in S3, all processing uses S3; Scalable, Available, Performant; Serverless. Messaging (SNS, SQS): Reliable, Transactional, Pub/Sub; Scalable, Available, Performant; Serverless. General Processing (ASG): used for importing, data cleansing, business logic; Scalable, Available, Performant; Managed. Data Science Processing: Aggregation, Model Building, Scoring; nice programming model at the cost of debugging complexity; We Operate. Workflow Engine: coordinates all Spark Jobs & complex flows; Lightweight, DAGs as Code, Steep learning curve; We Operate. Persistence for WebApp (DB): holds subset of data needed for Web App; Rails + Postgres, ‘nuff said; We Operate.
  30. 30. Tackling Cost & Timeliness: Leveraging the AWS Cloud
  31. 31. Tackling Cost. (Charts: Between Daily Runs, During Daily Runs) When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR.
  32. 32. Tackling Cost. (Charts: Between Hourly Runs, During Hourly Runs) When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR. This does not help when runs are hourly, since AWS charges at an hourly rate for EC2 instances!
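
The cost argument on slides 31 and 32 comes down to simple arithmetic. The sketch below assumes, as slide 32 states, that usage is billed in whole instance-hours; the 20-minute hourly run length and the function name are hypothetical values chosen only for illustration.

```python
import math


def billed_hours_per_day(runs_per_day: int, minutes_per_run: float) -> int:
    """Instance-hours billed per day when each run is rounded up to whole hours."""
    hours_per_run = math.ceil(minutes_per_run / 60)
    return runs_per_day * hours_per_run


ALWAYS_ON = 24  # hours billed per day for an instance that never shuts down

# One daily run that fits in an hour: 23 of 24 hours are not billed.
daily = billed_hours_per_day(runs_per_day=1, minutes_per_run=60)

# Hypothetical 20-minute runs, once per hour: every hour is still billed in full.
hourly = billed_hours_per_day(runs_per_day=24, minutes_per_run=20)

print(f"daily run:   {daily} of {ALWAYS_ON} hours billed")
print(f"hourly runs: {hourly} of {ALWAYS_ON} hours billed")
```

Under hourly rounding, shutting instances down between hourly runs bills the same number of hours as leaving them running, which is the point slide 32 makes.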
  33. 33. Tackling Timeliness: Auto Scaling Group (ASG)
  34. 34. ASG - Overview. What is it? A means to automatically scale out/in clusters to handle variable load/traffic. A means to keep a cluster/service of a fixed size always up.
  35. 35. ASG - Data Pipeline. (Diagram: SQS, Importer ASG of importers with scale out/in, DB)
  36. 36. ASG : CPU-based. CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant. (Chart: Sent, CPU, ACKd/Recvd)
  37. 37. ASG : CPU-based. Premature Scale-in: • The CPU drops to noise levels before all messages are consumed • This causes scale-in to occur while the last few messages are still being committed. (Chart: Sent, CPU, Recv, Premature Scale-in)
  38. 38. ASG : Queue-based. Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow. Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
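
The queue-based rule on slide 38 can be written as a small decision function. This is a minimal sketch, not Agari's implementation: the `QueueStats` type, the `"hold"` state, and the choice to check scale-out before scale-in are assumptions. With SQS, the two inputs would most likely come from the `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible` CloudWatch metrics.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    visible: int    # messages waiting to be received (queue depth)
    in_flight: int  # messages received but not yet ACK'd ("invisible")


def scaling_action(stats: QueueStats) -> str:
    """Apply the slide's rule; scale-out is checked first (an assumption)."""
    if stats.visible > 0:
        return "scale-out"   # work is waiting: grow the ASG
    if stats.in_flight == 0:
        return "scale-in"    # last in-flight message ACK'd: shrink the ASG
    return "hold"            # queue drained, but commits are still in progress


for stats in (QueueStats(120, 8), QueueStats(0, 3), QueueStats(0, 0)):
    print(stats, "->", scaling_action(stats))
```

The `"hold"` branch is what avoids the premature scale-in from slide 37: an empty queue with messages still in flight does not shrink the group.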
  39. 39. Auto Scaling Groups: Build & Deploy
  40. 40. ASG - Build & Deploy (Component: Role; Details). Role: Spins up Cloud Resources. Details: spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them using Terraform. Role: Sets up an EC2 instance. Details: a better version of Chef & Puppet; agentless, idempotent, & declarative tool to set up EC2 instances, by installing & configuring packages, and more. Role: Spins up an EC2 instance for the purposes of building an AMI! Details: can be used with Ansible & Terraform to bake AMIs & launch Auto Scaling Groups.
  41. 41. ASG - Build & Deploy. Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
  42. 42. ASG - Build & Deploy. Step 1 : Packer spins up a temporary EC2 node - a blank canvas! Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
  43. 43. ASG - Build & Deploy. Step 1 : Packer spins up a temporary EC2 node - a blank canvas! Step 2 : Packer runs an Ansible role against the EC2 node to set it up. Step 3 : Snapshots the machine & registers the AMI.
  44. 44. ASG - Build & Deploy. Step 1 : Packer spins up a temporary EC2 node - a blank canvas! Step 2 : Packer runs an Ansible role against the EC2 node to set it up. Step 3 : Snapshots the machine & registers the AMI. Step 4 : Terminates the EC2 instance!
  45. 45. ASG - Build & Deploy. Step 1 : Packer spins up a temporary EC2 node - a blank canvas! Step 2 : Packer runs an Ansible role against the EC2 node to set it up. Step 3 : Snapshots the machine & registers the AMI. Step 4 : Terminates the EC2 instance! Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG).
  46. 46. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost. • ASG • EMR Spark Daily • ASG • EMR Spark Hourly ASG • No Cost Savings
  47. 47. Tackling Operability & Correctness: Leveraging Tooling
  48. 48. Tackling Operability : Requirements. A simple way to author, configure, manage workflows. Provides visual insight into the state & performance of workflow runs. Integrates with our alerting and monitoring tools.
  49. 49. Apache Airflow: Workflow Automation & Scheduling
  50. 50. Apache Airflow - Authoring DAGs. Airflow: Author DAGs in Python! No need to bundle many config files!
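
The "DAGs as code" idea on slide 50 can be illustrated without Airflow itself. The sketch below deliberately does not use Airflow's API; it uses only the standard library's `graphlib` (Python 3.9+) to declare tasks and their dependencies in Python and run them in dependency order. The task names are hypothetical, loosely following the batch pipeline on slides 22 to 27.

```python
from graphlib import TopologicalSorter

completed = []  # records the order in which tasks actually ran


def make_task(name):
    """Build a stand-in task that only records that it ran."""
    def task():
        completed.append(name)
    return task


# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "score_messages": set(),
    "write_scored_output": {"score_messages"},
    "import_into_db": {"write_scored_output"},
    "send_alerts": {"import_into_db"},
}
tasks = {name: make_task(name) for name in dependencies}

# static_order() yields every task after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
for name in run_order:
    tasks[name]()

print(completed)
```

A workflow engine adds scheduling, retries, and the visual views shown on the following slides; the dependency graph itself is ordinary Python, which is the point the slide makes about avoiding bundles of config files.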
  51. 51. Apache Airflow - Authoring DAGs. Airflow: Visualizing a DAG
  52. 52. Apache Airflow - Perf. Insights. Airflow: Gantt chart view reveals the slowest tasks for a run!
  53. 53. Apache Airflow - Perf. Insights. Airflow: Task Duration chart view shows task completion time trends!
  54. 54. Apache Airflow - Alerting. Airflow: …And easy to integrate with Ops tools!
  55. 55. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost
  56. 56. Use-Case : Message Scoring (near-real time). NRT Pipeline Architecture
  57. 57. Use-Case : Message Scoring. Kinesis batch put every second. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis)
  58. 58. Use-Case : Message Scoring. An ASG of scorers is scaled up to one process per core per Kinesis shard. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG)
  59. 59. Use-Case : Message Scoring. Scorers apply the trust model and send scored messages downstream. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG, Kinesis)
  60. 60. Use-Case : Message Scoring. An ASG of importers is scaled up to rapidly import messages. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG, Kinesis, Importers ASG, DB)
  61. 61. Use-Case : Message Scoring. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG, Kinesis, Importers ASG, DB, Kinesis, Alerters ASG)
  62. 62. Use-Case : Message Scoring. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG, Kinesis, Importers ASG, DB, Kinesis, Alerters ASG, Quarantine Email)
  63. 63. Stream Processing Architecture (Role: Details; Pros; Operability Model). Data Lake (S3): all data stored in S3 via Kinesis Firehose; Scalable, Available, Performant, Serverless; Serverless. Messaging (Kinesis): streaming transport modeled on Kafka; Scalable, Available, Serverless; Serverless. General Processing: ASG replacement except for Rails apps; Scalable, Available, Serverless; Serverless. General Processing (ASG): used for importing, data cleansing, business logic; Scalable, Available, Managed; Managed. Data Science Processing: Model Building; We Operate. Workflow Engine: nightly model builds + some classic Ops cron workloads; Lightweight, DAGs as Code; We Operate. Persistence for WebApp (DB): holds smaller subset of data needed for Web App; Rails + Postgres, ‘nuff said; We Operate. Persistence for WebApp: Aggregation + Search moved from DB to ES, Model Building queries moved to Elasticache Redis; faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence); Managed.
  64. 64. Innovations: NRT Pipeline Architecture
  65. 65. Apache Avro: What is Avro?
  66. 66. What is Avro? Avro is a self-describing serialization format that supports: primitive data types (int, long, boolean, float, string, bytes, etc.); complex data types (records, arrays, unions, maps, enums, etc.); many language bindings (Java, Scala, Python, Ruby, etc.).
  67. 67. What is Avro? Avro is a self-describing serialization format that supports: primitive data types (int, long, boolean, float, string, bytes, etc.); complex data types (records, arrays, unions, maps, enums, etc.); many language bindings (Java, Scala, Python, Ruby, etc.). The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc. Supports Schema Evolution!
  68. 68. Apache Avro: Why is it useful?
  69. 69. Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Agari SAAS in AWS)
  70. 70. Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3, Kinesis, Agari SAAS in AWS)
  71. 71. Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. These Sensors might send different format versions of the data! (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3, Kinesis, Agari SAAS in AWS)
  72. 72. Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. These Sensors might send different format versions of the data! (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3, Kinesis, Agari SAAS in AWS : v4)
  73. 73. Why is Avro Useful? Avro allows Agari to seamlessly handle different IoT data format versions: datum_reader = DatumReader(writers_schema = writers_schema, readers_schema = readers_schema). Requirements: • Schemas are backward-compatible. (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3, Kinesis, Agari SAAS in AWS : v4)
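
What passing both a writer's and a reader's schema to `DatumReader` on slide 73 buys can be shown with a simplified, standard-library-only sketch of record resolution. This is not the Avro library's code: real Avro resolution also covers type promotion, aliases, and nested types. The `resolve` helper and the two example schemas are hypothetical.

```python
def resolve(record: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]      # field known to both sides
        elif "default" in field:
            resolved[name] = field["default"]  # new field: use the reader's default
        else:
            raise ValueError(f"no default for reader field {name!r}")
    return resolved  # writer-only fields are dropped


# Hypothetical v1 schema used by an older sensor.
writer_v1 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
]}
# Hypothetical v2 schema used by the consumer: one field removed, one added.
reader_v2 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_color", "type": ["null", "string"], "default": None},
]}

old_record = {"name": "Ada", "favorite_number": 7}
new_view = resolve(old_record, writer_v1, reader_v2)
print(new_view)
```

The backward-compatibility requirement on the slide corresponds to the `ValueError` branch: a field added to the reader's schema needs a default, otherwise data written by older sensors cannot be read.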
  74. 74. Why is Avro Useful? Avro Everywhere! Avro is so useful, we don’t just use it to communicate between our Sensors & our SAAS infrastructure. We also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment. (Diagram: Agari SAAS in AWS, S1, S2, S3, s3, Spark)
  75. 75. Why is Avro Useful? Avro Everywhere! Good Language Bindings : Data Pipelines services are written in Java, Ruby, & Python. (Diagram: Agari SAAS in AWS, S1, S2, S3, s3, Spark)
  76. 76. Apache Avro: By Example
  77. 77. Avro Schema Example. {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } complex type (record); Schema name : User; 3 fields in the record: 1 required, 2 optional
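
The "1 required, 2 optional" reading of the schema on slide 77 can be checked mechanically. In the sketch below, a field counts as optional when its type is a union that includes `"null"`; the `is_optional` helper name is hypothetical and only the standard `json` module is used.

```python
import json

SCHEMA_JSON = """
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""

schema = json.loads(SCHEMA_JSON)


def is_optional(field: dict) -> bool:
    """A field is optional when its type is a union (JSON list) containing "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


required = [f["name"] for f in schema["fields"] if not is_optional(f)]
optional = [f["name"] for f in schema["fields"] if is_optional(f)]

print("record:", schema["namespace"] + "." + schema["name"])
print("required:", required)
print("optional:", optional)
```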
  78. 78. Avro Schema Data File Example. {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Schema: 0.0001 %. Data x 1,000,000,000: 99.999 %.
  79. 79. Avro Schema Streaming Example. {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Schema: 99 %. Binary Data block: 1 %.
  80. 80. Avro Schema Streaming Example. {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Schema: 99 %. Binary Data block: 1 %. OVERHEAD!!
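
The overhead contrast between slides 78 and 80 is a ratio that is easy to reproduce. The sketch below assumes a hypothetical 20-byte encoded record and measures the schema as compact JSON; the exact percentages therefore differ from the slides' rounded figures, but the orders of magnitude match. The last case, a 16-byte identifier sent in place of the schema, is also an assumption.

```python
import json

schema = {"namespace": "agari", "type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]},
]}

SCHEMA_BYTES = len(json.dumps(schema, separators=(",", ":")).encode("utf-8"))
RECORD_BYTES = 20  # hypothetical size of one binary-encoded record
UUID_BYTES = 16    # size of a raw UUID, if an ID is sent instead of the schema


def header_share(header_bytes: int, record_bytes: int, num_records: int) -> float:
    """Fraction of the total payload taken up by the header (schema or ID)."""
    return header_bytes / (header_bytes + record_bytes * num_records)


file_share = header_share(SCHEMA_BYTES, RECORD_BYTES, 1_000_000_000)  # data file
stream_share = header_share(SCHEMA_BYTES, RECORD_BYTES, 1)            # one message
uuid_share = header_share(UUID_BYTES, RECORD_BYTES, 1)                # ID per message

print(f"schema share in a 1e9-record file: {file_share:.10%}")
print(f"schema share in a single message:  {stream_share:.1%}")
print(f"UUID share in a single message:    {uuid_share:.1%}")
```

In a large file the schema is amortized over a billion records; in a stream it dominates every message, which motivates sending a short schema identifier instead.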
  81. 81. Apache Avro: Schema Registry
  82. 82. Avro Schema Registry. Message Producer (P) calls register_schema on the Schema Registry (Lambda) with: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
  83. 83. Avro Schema Registry. register_schema returns a UUID. (Diagram: Message Producer (P), Schema Registry (Lambda))
  84. 84. Avro Schema Registry. Message Producer sends UUID + Data. (Diagram: Message Producer (P), Message Consumer (C), Schema Registry (Lambda))
  85. 85. Avro Schema Registry. Message Consumer (C) calls getSchemaById (UUID). (Diagram: Message Producer (P), Data, Message Consumer (C), Schema Registry (Lambda))
  86. 86. Avro Schema Registry. getSchemaById (UUID) returns: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } (Diagram: Message Producer (P), Data, Message Consumer (C), Schema Registry (Lambda))
  87. 87. Avro Schema Registry. getSchemaById (UUID) returns: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Message Consumers • download & cache the schema • then decode the data. (Diagram: Message Producer (P), Message Consumer (C), Schema Registry (Lambda))
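
The registry flow on slides 82 to 87 (register a schema, get a UUID back, send UUID + data, look the schema up once, cache it) can be sketched in memory. This is an illustration, not Agari's Lambda service: the class names, the Python-style `get_schema_by_id` spelling of the slide's getSchemaById, and the plain list used as a stand-in for the Avro binary payload are all assumptions.

```python
import json
import uuid


class SchemaRegistry:
    """In-memory stand-in for the registry shown on the slides."""

    def __init__(self):
        self._schemas = {}

    def register_schema(self, schema: dict) -> str:
        schema_id = str(uuid.uuid4())
        self._schemas[schema_id] = json.dumps(schema)
        return schema_id  # the producer attaches this ID to every message

    def get_schema_by_id(self, schema_id: str) -> dict:
        return json.loads(self._schemas[schema_id])


class Consumer:
    def __init__(self, registry: SchemaRegistry):
        self._registry = registry
        self._cache = {}
        self.registry_lookups = 0

    def consume(self, message: dict) -> dict:
        schema_id = message["schema_id"]
        if schema_id not in self._cache:  # download & cache the schema
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
            self.registry_lookups += 1
        names = [f["name"] for f in self._cache[schema_id]["fields"]]
        return dict(zip(names, message["payload"]))  # then decode the data


registry = SchemaRegistry()
user_schema = {"namespace": "agari", "type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]},
]}
schema_id = registry.register_schema(user_schema)

consumer = Consumer(registry)
first = consumer.consume({"schema_id": schema_id, "payload": ["Ada", 7, None]})
second = consumer.consume({"schema_id": schema_id, "payload": ["Lin", None, "blue"]})
print(first, second, consumer.registry_lookups)
```

Because the consumer caches by ID, the registry is contacted once per schema version rather than once per message.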
  88. 88. Avro Schema Registry. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG, Kinesis, Importers ASG, DB, Kinesis, Alerters ASG, SR, SR, SR)
  89. 89. Avro Schema Registry. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C, Kinesis, Scorers ASG, Kinesis, Importers ASG, DB, Kinesis, Alerters ASG, SR, SR, SR)
  90. 90. Acknowledgments. None of this work would be possible without the essential contributions of the team below: • Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Chris Buchanan • Neil Chapin • Wil Collins • Don Spencer • Scot Kennedy • Natia Chachkhiani • Patrick Cockwell • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle • Gabriel Poon • Spencer Sun • Nathan Bryant
  91. 91. Questions? (@r39132)
