Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Building a Streaming Microservices Architecture - Data + AI Summit EU 2020

Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!

  • Inicia sesión para ver los comentarios

  • Sé el primero en recomendar esto

Building a Streaming Microservices Architecture - Data + AI Summit EU 2020

  1. 1. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Building a Streaming Microservices Architecture With Apache Spark Structured Streaming & Friends Scott Haines Senior Principal Software Engineer, Twilio @newfront
  2. 2. © 2019 TWILIO INC. ALL RIGHTS RESERVED. A little about me. • I work at Twilio building massive data systems • I run a bi-weekly internal Spark Office Hours where I offer training and guidance to teams at the company • >12 years working on large distributed analytics systems @newfront
  3. 3. © 2019 TWILIO INC. ALL RIGHTS RESERVED. A little about me. • I work at Twilio building massive data systems • I run a bi-weekly internal Spark Office Hours where I offer training and guidance to teams at the company • >12 years working on large distributed analytics systems • Published work on Distributed Analytics Systems @newfront
  4. 4. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Voice Insights Accountable observability data and interactive analytics and insights for the voice business and customers.
  5. 5. VIRGINIA, USA DUBLIN, IRELAND SINGAPORE SYDNEY, AUSTRALIA TOKYO, JAPAN SAO PAULO, BRAZIL DATA CENT ERS © 2019 TWILIO INC. ALL RIGHTS RESERVED. Events per Second >1MIL
  6. 6. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Let’s Build a Reliable Data Architecture
  7. 7. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Goal: Reliable E2E Streaming Data Pipeline GRPC Client GRPC Server GRPC Server GRPC Server 1 2 3 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 HDFS S39 HTTP /2 @newfront
  8. 8. Strong and Reliable Data starts at Ingest © 2019 TWILIO INC. ALL RIGHTS RESERVED. @newfront
  9. 9. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Of people love JSON*. >95% { "type": "CallEvent", “call_sid”: "CA123", "attributes": [ { “account_sid”: “AC123” “start_ms": 123, “end_ms": "435" } ] } • JSON has Structure. • But JSON isn't strictly Structured Data. Structured Data @newfront @twilio
  10. 10. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Of people saddened by bad data* 100% { "type": "CallEvent", “call_sid": “CA234”, “attributes":"oops" } • JSON has poor runtime guarantees due to its flexible nature. Optimize for compile time guarantees. • Debugging corrupt data in a large distributed system ruins hopes and dreams. Structured Data @newfront @twilio
  11. 11. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Protocol Buffers message CallEvent { uint64 created_ms = 1; string call_sid = 2; uint64 account_sid = 3; EventType event_type = 4; Region region = 5; } • Well Defined Events tell their own Story • Type-Safety Rules • Versioning your API / Pipeline / Data now just means sticking to a version of your schema • Rely on Releases for versioning • Interoperable with most major languages (java/scala/c++/go/obj-c/node-js/ python/...)
  12. 12. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Protocol Buffers message CallEvent { uint64 created_ms = 1; string call_sid = 2; uint64 account_sid = 3; EventType event_type = 4; Region region = 5; } • Data Accountability • Lightning Fast Serialization / Deserialization • Plays with nicely gRPC • Interoperable with Spark SQL • Like “JSON with Guard Rails” Of people like when things work between releases! 100% Take Aways
  13. 13. Data Engineers love gRPC © 2019 TWILIO INC. ALL RIGHTS RESERVED. *technically not universally accepted @newfront
  14. 14. GRPC Client GRPC Server GRPC Server GRPC Server HTTP /2 gRPC | saves time © 2019 TWILIO INC. ALL RIGHTS RESERVED.
  15. 15. GRPC Client GRPC Server GRPC Server GRPC Server HTTP /2 gRPC | saves time © 2019 TWILIO INC. ALL RIGHTS RESERVED. val time = “$$$”
  16. 16. GRPC Client GRPC Server GRPC Server GRPC Server HTTP /2 © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. GRPC // nutshell • RPC = remote procedure call. “G” stands for generic or Google • Great for Internal Services • High Performance • Compact Binary Exchange Format • Compile Idiomatic API Definitions • Capable of Bi-Directional Streaming • Pluggable HTTP/2 transport
  17. 17. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto)
  18. 18. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto) • 2: Define your services (service.proto) @newfront
  19. 19. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto) • 2: Define your services (service.proto) • 3: Compile your messages and service stubs. sbt clean compile package publishLocal
  20. 20. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto) • 2: Define your services (service.proto) • 3: Compile your messages and service stubs. • 4: Implement Traits and Run!
  21. 21. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • Client SDKs essentially write themselves. • JSON <-> Protobuf is still possible for maintaining customer facing APIs
  22. 22. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Protocol Gatewaymessage T<:Event { … } protobuf @ version Kafka Broker gRPC Common Pattern Emerges Protocol Streams
  23. 23. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Still Building… GRPC Client GRPC Server GRPC Server GRPC Server 1 2 3 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 HDFS S39 HTTP /2 @newfront
  24. 24. GRPC Client GRPC Server GRPC Server GRPC Server 1 2 3 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 9 HTTP /2 © 2019 TWILIO INC. ALL RIGHTS RESERVED. Kafka + Protobuf + Spark
  25. 25. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Kafka. • Solve for common Data Pipeline Problems • Partition Keys are important. Use them to your advantage • Spark AQE - adaptive query execution can handle hot spots. Non-spark services not- so-much. Topic: CallEvents key: call_sid partitions*: n * number of partitions: factor of producer records/s * avg(record.bytesize)
  26. 26. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • Use ScalaPB’s ExpressionEncoders to natively convert protobuf to Catalyst Optimizable DataFrames • Marry this with Sparks Kafka DataSource Reader/Writer
  27. 27. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • Behind the Scenes… • Spark is doing magic
  28. 28. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • End to End Tests ensure you can press the release button anywhere in the pipeline. • Can be automated to ensure your Spark Apps can continue working with any changes to the upstream gRPC • Can use for Canary testing updates
  29. 29. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • Use ScalaPB to read local streams • Ensure your Spark Apps can continue working with any new protobuf dependencies • Test simple reads and complex aggregations locally, deploy globally @newfront
  30. 30. GRPC Client GRPC Server GRPC Server GRPC Server 1 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 HDFS S39 TP /2 © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark & Beyond @newfront
  31. 31. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark + Friends. • Proto to Catalyst DataFrame • Conversion to Parquet in Delta Table • Partitioned by Date @newfront
  32. 32. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark + Friends. • Now it is easy for Downstream applications to pick up using Parquet Protocol Streams @newfront
  33. 33. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Rinse and Repeat @newfront Just keep adding new Flows. 1. Define your Structured Data 2. Define your service definitions 3. Emit Data / Enqueue to Kafka 4. Read and Drop in HDFS (delta) or pass along to a new Topic Kafka Topic Kafka Topic Spark Application Spark Application Spark Application Kafka Topic Data Table Data Table Spark Application GRPC Server
  34. 34. THANK YOU @newfront

×