Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

Apache Beam (incubating)

Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio

Eche un vistazo a continuación

1 de 74 Anuncio

Apache Beam (incubating)

Descargar para leer sin conexión

Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member

Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.

Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member

Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.

Anuncio
Anuncio

Más Contenido Relacionado

Presentaciones para usted (20)

A los espectadores también les gustó (20)

Anuncio

Similares a Apache Beam (incubating) (20)

Más de Apache Apex (20)

Anuncio

Más reciente (20)

Apache Beam (incubating)

  1. 1. Apache Beam (incubating) Kenneth Knowles klk@google.com @KennKnowles Apache Apex Meetup, 2016-06-27 https://goo.gl/LTLjKt
  2. 2. Motivation Beam Model Beam Project / Technical Vision Agenda 1 2 3 2
  3. 3. 3 Motivation1
  4. 4. https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg 4
  5. 5. 5 Unbounded, delayed, out of order 9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00 5 8:00 8:008:00
  6. 6. Incoming! Score per user? 6
  7. 7. Organizing the stream 7 8:00 8:00 8:00
  8. 8. Completeness Latency Cost $$$ Data Processing Tradeoffs 8
  9. 9. What is important for your application? Completeness Low Latency Low Cost Important Not Important $$$ 9
  10. 10. Monthly Billing Completeness Low Latency Low Cost Important Not Important $$$ 10
  11. 11. Billing estimate Completeness Low Latency Low Cost Important Not Important $$$ 11
  12. 12. Abuse Detection Completeness Low Latency Low Cost Important Not Important $$$ 12
  13. 13. 13 The Beam Model2
  14. 14. The Beam Model Pipeline 14 PTransform PCollection
  15. 15. The Beam Vision (for users) Sum Per Key 15 input.apply( Sum.integersPerKey()) Java input | Sum.PerKey() Python Apache Flink Apache Spark Cloud Dataflow ⋮ ⋮ Apache Apex Apache Gearpump (incubating)
  16. 16. Pipeline p = Pipeline.create(options); p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*")) .apply(FlatMapElements.via(line → Arrays.asList(line.split("[^a-zA-Z']+")))) .apply(Filter.byPredicate(word → !word.isEmpty())) .apply(Count.perElement()) .apply(MapElements.via(count → count.getKey() + ": " + count.getValue()) .apply(TextIO.Write.to("gs://...")); p.run(); What your (Java) Code Looks Like 16
  17. 17. The Beam Model: Asking the Right Questions What are you computing? Where in event time? When in processing time are results produced? How do refinements relate? 17
  18. 18. The Beam Model: Asking the Right Questions What are you computing? Where in event time? When in processing time are results produced? How do refinements relate? 18 Aggregations, transformations, ...
  19. 19. The Beam Model: What are you computing? Sum Per User 19
  20. 20. The Beam Model: What are you computing? Sum Per Key 20 input.apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...)); Java input | Sum.PerKey() | Write(BigQuerySink(...)) Python http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
  21. 21. The Beam Model: Asking the Right Questions What are you computing? Where in event time? When in processing time are results produced? How do refinements relate? 21 Event time windowing
  22. 22. 22 The Beam Model: Where in Event Time? 8:00 8:00 8:00
  23. 23. Processing Time vs Event Time Event Time = Processing Time ?? 23
  24. 24. Processing Time vs Event Time 24 ProcessingTime
  25. 25. ProcessingTime Processing Time vs Event Time Realtime 25 This is not possible
  26. 26. Processing Time vs Event Time 26 Processing Delay ProcessingTime
  27. 27. Processing Time vs Event Time Very delayed 27 ProcessingTime Event Time
  28. 28. Processing Time windows (probably are not what you want) ProcessingTime Event Time 28
  29. 29. Event Time Windows 29 ProcessingTime Event Time
  30. 30. ProcessingTime Event Time Event Time Windows 30 (implementing processing time windows) Just throw away your data's timestamps and replace them with "now()"
  31. 31. input | WindowInto(FixedWindows(3600) | Sum.PerKey() | Write(BigQuerySink(...)) Python The Beam Model: Where in Event Time? Sum Per Key Window Into 31 input.apply( Window.into( FixedWindows.of( Duration.standardHours(1))) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...)) Java
  32. 32. So that's what and where... 32
  33. 33. The Beam Model: Asking the Right Questions What are you computing? Where in event time? When in processing time are results produced? How do refinements relate? 33 Watermarks & Triggers
  34. 34. Event time windows ProcessingTime 34 Event Time
  35. 35. Fixed cutoff (we can do better) ProcessingTime Event Time 35 Allowed delay Concurrent windows
  36. 36. Perfect watermark ProcessingTime 36 Event Time Check out Slava's slides from Strata London 2016 talk on watermarks: https://goo.gl/K4FnqQ
  37. 37. Heuristic Watermark ProcessingTime 37 Event Time
  38. 38. Heuristic Watermark ProcessingTime 38 Current processing time Event Time
  39. 39. Heuristic Watermark ProcessingTime 39 Current processing time Event Time
  40. 40. Heuristic Watermark ProcessingTime 40 Current processing time Late data Event Time
  41. 41. Watermarks measure completeness 41 $$$ $$$ $$ ? Running Total ✔ Monthly billing ? Abuse Detection
  42. 42. The Beam Model: When in Processing Time? Sum Per Key Window Into 42 input .apply(Window.into(FixedWindows.of(...)) .triggering( AfterWatermark.pastEndOfWindow())) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...)) Java input | WindowInto(FixedWindows(3600), trigger=AfterWatermark()) | Sum.PerKey() | Write(BigQuerySink(...)) Python Trigger after end of window
  43. 43. ProcessingTime Event Time AfterWatermark.pastEndOfWindow() 43
  44. 44. Current processing time ProcessingTime Event Time 44 AfterWatermark.pastEndOfWindow()
  45. 45. ProcessingTime Event Time Late data 45 Current processing time AfterWatermark.pastEndOfWindow()
  46. 46. ProcessingTime Event Time 46 High completeness Potentially high latency Low cost AfterWatermark.pastEndOfWindow() $$$
  47. 47. ProcessingTime Event Time Repeatedly.forever( AfterPane.elementCountAtLeast(2)) 47
  48. 48. ProcessingTime Event Time 48 Current processing time Repeatedly.forever( AfterPane.elementCountAtLeast(2))
  49. 49. Current processing time ProcessingTime Event Time 49 Repeatedly.forever( AfterPane.elementCountAtLeast(2))
  50. 50. ProcessingTime Event Time 50 Current processing time Repeatedly.forever( AfterPane.elementCountAtLeast(2))
  51. 51. Current processing time ProcessingTime Event Time 51 Repeatedly.forever( AfterPane.elementCountAtLeast(2))
  52. 52. ProcessingTime Event Time 52 Repeatedly.forever( AfterPane.elementCountAtLeast(2)) Low completeness Low latency Cost driven by input$$$
  53. 53. Build a finely tuned trigger for your use case AfterWatermark.pastEndOfWindow() .withEarlyFirings( AfterProcessingTime .pastFirstElementInPane() .plusDuration(Duration.standardMinutes(1)) .withLateFirings(AfterPane.elementCountAtLeast(1)) 53 Bill at end of month Near real-time estimates Immediate corrections
  54. 54. ProcessingTime Event Time 54 .withEarlyFirings(after 1 minute) .withLateFirings(ASAP after each element)
  55. 55. ProcessingTime Event Time 55 Current processing time .withEarlyFirings(after 1 minute) .withLateFirings(ASAP after each element)
  56. 56. ProcessingTime Event Time 56 Current processing time Low completeness Low latency Low cost, driven by time$$$ .withEarlyFirings(after 1 minute) .withLateFirings(ASAP after each element)
  57. 57. Current processing time ProcessingTime Event Time 57 .withEarlyFirings(after 1 minute) .withLateFirings(ASAP after each element)
  58. 58. Current processing time ProcessingTime Event Time Late output 58 .withEarlyFirings(after 1 minute) .withLateFirings(ASAP after each element)
  59. 59. ProcessingTime Event Time Late output 59 .withEarlyFirings(after 1 minute) .withLateFirings(ASAP after each element)
  60. 60. Trigger Catalogue Composite TriggersBasic Triggers 60 AfterEndOfWindow() AfterCount(n) AfterProcessingTimeDelay(Δ) AfterEndOfWindow() .withEarlyFirings(A) .withLateFirings(B) AfterAny(A, B) AfterAll(A, B) Repeat(A) Sequence(A, B)
  61. 61. The Beam Model: Asking the Right Questions What are you computing? Where in event time? When in processing time are results produced? How do refinements relate? 61 Accumulation Mode
  62. 62. The Beam Model: How do refinements relate? 62 input .apply(Window.into(...).triggering(...).discardingFiredPanes()) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...)) vs 1 3 7 4 10 5 1 3 7 4 10 15 discarding accumulating
  63. 63. The Beam Model: Asking the Right Questions What are you computing? Where in event time? When in processing time are results produced? How do refinements relate? 63
  64. 64. 64 Beam Project / Technical Vision3
  65. 65. 1. End users: who want to write pipelines in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to run Beam pipelines Beam Fn API: Invoke user-definable functions Apache Flink Apache Spark Beam Runner API: Build and submit a piepline Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution The Beam Vision Apache Apex Apache Gearpump (incubating)
  66. 66. Project Setup (vision meets code) GoogleCloudPlatform/DataflowJavaSDK cloudera/spark-dataflow dataArtisans/flink-dataflow apache/incubator-beam Direct (on your laptop) Google Cloud Dataflow Flink Spark In pull request: Apex, Gearpump Integration tests Runners Examples I/O Connectors sharing HDFS Kafka BigQuery Google Cloud Storage, Pubsub, Bigtable, Datastore In pull request: JMS, Cassandra Proposed: Sqoop, Parquet, JDBC, SocketStream, ... SDKs
  67. 67. Committers from Google, Data Artisans, Cloudera, Talend, Paypal ● ~40 commits/week ● Rigorous code review for every commit Contributors [with GitHub badges] from: Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here> ● Improvements to existing I/O connectors ● Improvements to Spark runner ● Utility classes for users ● Documentation fixes ● Bug diagnoses ● New I/O connectors ● Gearpump runner PoC ● Apex runner PoC! … and it has been awesome apache/incubator-beam
  68. 68. Java SDK: Transition from Dataflow Dataflow Java 1.x Apache Beam Java 0.x Apache Beam Java 2.x Bug Fix Feature Breaking Change We are here Feb 2016 Late 2016
  69. 69. Understanding: Capability Matrix http://beam.incubator.apache.org/capability-matrix/
  70. 70. Why Apache Beam? Unified - One model handles batch and streaming use cases. Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in. Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.
  71. 71. Why Apache Beam? http://data-artisans.com/why-apache-beam/ "We firmly believe that the Beam model is the correct programming model for streaming and batch data processing." - Kostas Tzoumas (Data Artisans) https://cloud.google.com/blog/big- data/2016/05/why-apache-beam-a-google- perspective "We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in." - Tyler Akidau (Google)
  72. 72. 72 Creating an Apache Beam Community Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors. Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem. We love contributions. Join us!
  73. 73. Apache Beam http://beam.incubator.apache.org/ Why Apache Beam? (from Data Artisans) Why Apache Beam? (from Google) Programming Model Overviews Streaming 101 Streaming 102 The Dataflow Beam Model Join the community! User discussions - user-subscribe@beam.incubator.apache.org Development discussions - dev-subscribe@beam.incubator.apache.org Follow @ApacheBeam on Twitter Learn More! 73
  74. 74. END 74

×