
Samza tech talk_2015 - huawei

Stream presentation

Published in: Software

  1. Stream Processing @Scale in LinkedIn
     Yi Pan, Data Infrastructure, Samza Team @LinkedIn
     Databus
  2. Overview
     • What is Stream Processing?
     • What is Samza?
     • Samza Programming API
     • Stream Processing @LinkedIn
     • Upcoming features
  3. Stream Processing
     • What’s stream processing
       – Input: an unbounded sequence of events
         • E.g. web server logs, user activity tracking events, database changelogs, etc.
       – Latency: near real-time
         • From milliseconds to minutes, instead of hours to days
       – Output: an unbounded sequence of changes to the derived dataset
         • The derived dataset is usually the final or partial analytic result, held either in another stream or in a serving data store
  4. Stream Processing
     [Figure: response-latency spectrum. Labels: 0 ms; Synchronous; Milliseconds to minutes; Later. Possibly much later.]
  5. Stream Processing
     • What are the application requirements?
       – Scalable, fast, stateful stream processing
       – What scale should we operate at?
         • Traffic volume: 1.4 trillion events/day
         • Intermediate state size: multi TB / colo (*)
       – Why is it expensive to run stream processing at scale?
         • The intermediate data set needs to be stored to allow low-latency processing
         • A large volume of data needs to be pulled and pushed over the network
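
To put the traffic figure on slide 5 in per-second terms, here is a rough conversion. It assumes the load is spread evenly over the day, which the slides do not claim; real peaks would be higher than this average.

```latex
\[
\frac{1.4 \times 10^{12}\ \text{events/day}}{86\,400\ \text{s/day}}
\approx 1.6 \times 10^{7}\ \text{events/s}
\]
```
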
  7. What’s Samza
     • Samza is a distributed Turing machine
       – Single Task Samza Job is a stateful Turing machine
     [Diagram labels: Samza Task; Input stream; Output stream; State; changelog; checkpoint]
  8. What’s Samza
     – Scaling a Samza job: partition the streams
     [Diagram labels: Input stream A; partition 0, partition 1, partition 2, partition 3, partition n; Samza Task; State]
  9. What’s Samza
     – Scaling a Samza job: partition the streams
     [Diagram labels: Input stream B; partition 0, partition 1, partition 2, partition 3, partition n; Samza Task; State]
  10. What’s Samza
      – Scaling a Samza job: replicating the state machine
      [Diagram labels: shared checkpoint; Job]
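
The scaling idea on slides 8 to 10 (partition the input, run one stateful task per partition) can be sketched in plain Java. This is not Samza code: `PartitionedCounter`, `partitionFor`, and the hash-modulo routing are hypothetical names and choices made for illustration, and the real partitioner used by a given messaging system may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Samza code: key-based partitioning with one
// independent state map per partition, standing in for one task per partition.
class PartitionedCounter {
    private final int numPartitions;
    private final List<Map<String, Integer>> taskState = new ArrayList<>();

    PartitionedCounter(int numPartitions) {
        this.numPartitions = numPartitions;
        for (int i = 0; i < numPartitions; i++) {
            taskState.add(new HashMap<>());
        }
    }

    // Map a key to a partition; floorMod keeps the index non-negative
    // even when hashCode() is negative.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Route the event to the task owning its partition and update only
    // that task's state. Returns the new count for the key.
    int process(String key) {
        Map<String, Integer> state = taskState.get(partitionFor(key));
        return state.merge(key, 1, Integer::sum);
    }

    // Read a key's count from one specific partition's state (0 if absent).
    int countInPartition(int partition, String key) {
        return taskState.get(partition).getOrDefault(key, 0);
    }
}

class PartitionDemo {
    public static void main(String[] args) {
        PartitionedCounter job = new PartitionedCounter(4);
        for (String key : new String[] {"home", "jobs", "home", "profile", "home"}) {
            int count = job.process(key);
            System.out.println(key + " -> partition " + job.partitionFor(key) + ", count " + count);
        }
    }
}
```

Because every event for a given key lands in the same partition, each task can keep its counts locally without coordinating with the other tasks.
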
  11. What’s Samza
      • Samza Execution in Yarn
      [Diagram labels: Deploy Samza job; Host 1, Host 2, Host 3; Application Master; Samza container (three shown)]
  14. What’s Samza
      • States in Samza
        – Checkpoints
          • Offsets per input stream partition
        – State stores
          • In-memory or on-disk (RocksDB) derived data set
      [Diagram labels: Samza Task; Output stream partitions; State; changelog partitions; checkpoint; Host 1]
  15. What’s Samza
      • States in Samza
        – Checkpoints and local state stores are backed by distributed logs
      [Diagram labels: Samza Task; Output stream partitions; State; changelog partitions; checkpoint; Host 1]
  17. What’s Samza
      • States in Samza
        – Checkpoints and local state stores are backed by distributed logs
      [Diagram labels: Samza Task; Output stream partitions; State; changelog partitions; checkpoint; Host 2]
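
Slides 15 to 17 show a task's state being backed by a distributed log and then reappearing on a different host. A minimal sketch of that recovery pattern, under stated assumptions: `ChangelogBackedStore` and `Changelog` are hypothetical stand-ins (an in-memory list plays the role of the durable log), and none of this is Samza's actual storage API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Samza code: a local key-value store whose
// writes are mirrored to a changelog that outlives the host.
class ChangelogBackedStore {
    // Stands in for a replicated log partition; here just an in-memory list.
    static final class Changelog {
        final List<String[]> entries = new ArrayList<>();

        void append(String key, String value) {
            entries.add(new String[] {key, value});
        }
    }

    private final Map<String, String> local = new HashMap<>();
    private final Changelog changelog;

    ChangelogBackedStore(Changelog changelog) {
        this.changelog = changelog;
    }

    // Every write goes to the local store and to the changelog.
    void put(String key, String value) {
        local.put(key, value);
        changelog.append(key, value);
    }

    String get(String key) {
        return local.get(key);
    }

    // Rebuild the store on a new host by replaying the log in order.
    // Later entries overwrite earlier ones, so the latest value wins.
    static ChangelogBackedStore restore(Changelog changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        for (String[] entry : changelog.entries) {
            store.local.put(entry[0], entry[1]);
        }
        return store;
    }
}

class RecoveryDemo {
    public static void main(String[] args) {
        ChangelogBackedStore.Changelog log = new ChangelogBackedStore.Changelog();
        ChangelogBackedStore host1 = new ChangelogBackedStore(log);
        host1.put("home", "3");
        host1.put("jobs", "1");
        host1.put("home", "4");

        // Host 1 is lost; a replacement task on another host replays the log.
        ChangelogBackedStore host2 = ChangelogBackedStore.restore(log);
        System.out.println("home=" + host2.get("home") + ", jobs=" + host2.get("jobs"));
    }
}
```

A checkpoint can be treated the same way: the last processed offset per input partition is written to a log, so a restarted task knows where to resume reading.
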
  18. What’s Samza
      • Multiple Jobs in a Dataflow
      [Diagram labels: Stream A, Stream B, Stream C, Stream D, Stream E, Stream F; Job 1, Job 2, Job 3]
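  20. Samza Programming API
      Partition 0

          import java.util.*;
          import java.util.concurrent.atomic.AtomicInteger;
          import org.apache.avro.generic.GenericRecord;
          import org.apache.samza.system.*;
          import org.apache.samza.task.*;

          class PageKeyViewsCounterTask implements StreamTask {
            // Assumed declarations: the slide omits them; system and stream names are placeholders.
            private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
            private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

            public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                TaskCoordinator coordinator) {
              GenericRecord record = (GenericRecord) envelope.getMessage();
              String pageKey = record.get("page-key").toString();
              // Create the counter on first sight of a page key, then increment it.
              int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
              collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
            }
          }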
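
To try the per-message callback pattern from the `PageKeyViewsCounterTask` slide without a Samza cluster, the same counting logic can be driven by a small loop. Everything below is a hypothetical stand-in: `SimpleStreamTask`, `SimpleCollector`, and the `page-key-counts` stream name are invented for this sketch and are not Samza's interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Samza's interfaces.
interface SimpleCollector {
    void send(String stream, String key, int value);
}

interface SimpleStreamTask {
    void process(Map<String, String> message, SimpleCollector collector);
}

// Same counting logic as the slide, with the state map made explicit.
class PageKeyCounter implements SimpleStreamTask {
    private final Map<String, Integer> pageKeyViews = new HashMap<>();

    @Override
    public void process(Map<String, String> message, SimpleCollector collector) {
        String pageKey = message.get("page-key");
        // merge() starts a missing key at 1 and increments an existing one.
        int newCount = pageKeyViews.merge(pageKey, 1, Integer::sum);
        collector.send("page-key-counts", pageKey, newCount);
    }
}

class ProcessLoopDemo {
    // Calls process() once per incoming message and records what the task emits.
    static List<String> run(List<String> pageKeys) {
        List<String> output = new ArrayList<>();
        SimpleStreamTask task = new PageKeyCounter();
        SimpleCollector collector =
            (stream, key, value) -> output.add(stream + ":" + key + "=" + value);
        for (String pageKey : pageKeys) {
            Map<String, String> message = new HashMap<>();
            message.put("page-key", pageKey);
            task.process(message, collector);
        }
        return output;
    }

    public static void main(String[] args) {
        run(Arrays.asList("home", "jobs", "home")).forEach(System.out::println);
    }
}
```

The task never pulls data itself; it only reacts to one message at a time and emits through the collector, which is what lets the framework own partition assignment, checkpointing, and restarts.
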
  31. Stream Processing @ LinkedIn
      [Diagram labels: WebServers (repeated); Monitor Servers; Oracle; Espresso; Kafka; Databus; Tracking events; Metrics; changelog; Samza Jobs (repeated); bootstrap; Voldemort; Derived Data]
  32. Stream Processing @ LinkedIn
      • Tracking aggregate/analysis (ACG)
  33. Stream Processing @ LinkedIn
      • Content standardization w/ adjunct data set
      [Diagram labels: Member Profile DB; Bootstrap Job; Databus; Kafka; Content Standardization; Kafka; Kafka]
  34. Stream Processing @ LinkedIn
      • Kafka deployment
        – 1.1 trillion messages / day
      • Databus deployment
        – 300 billion messages / day
      • Samza deployment
        – Multiple colos
        – 10+ Yarn clusters
        – 200+ nodes
        – 100+ jobs in production
  36. Upcoming Features
      • New features
        – Local state store improvements
          • RocksDB TTL support
          • Fast recovery
        – Dynamic configuration
        – Easier deployment w/ standalone jobs
        – High-level query language for faster development
  37. Contact Us / Get Involved
      • Open Source
        – Documentation: samza.apache.org
        – Mailing list: dev@samza.apache.org
        – JIRA: https://issues.apache.org/jira/browse/SAMZA
