
Streaming process with Kafka Connect and Kafka Streams


Presented at:

http://2017.datacon.tw/agenda/
Apache Kafka is a distributed streaming storage system; its very high throughput and fault tolerance have made it increasingly widespread in the BigData field. This talk introduces two additional tools that Kafka provides, Kafka Connect and Kafka Streams: their development architecture and how to apply them to ETL.



  1. 1. Streaming process with Kafka Connect & Kafka Streams 鄭紹志@亦思科技 vito@is-land.com.tw 2017/09/30
  2. 2. About me ● 鄭紹志 Vito ● 亦思科技, R&D Director ● R&D work in the BigData field ● Enjoys Java / Scala development
  3. 3. High-level architecture: Data Source (Database, Filesystem, . . .) → KafkaConnect → Kafka (Broker) ⇄ Kafka Streams → Kafka (Broker) → KafkaConnect → Data Sink (Database, Filesystem, . . .), with Producer / Consumer clients talking to the brokers directly
  4. 4. Kafka Connect
  5. 5. Kafka Connect use case: ETL ● Move data from X (Source) into Kafka ○ Storage systems, ex: FileSystem, RDB, Cassandra, S3, ... ○ External applications, ex: Twitter, Github ● Move data from Kafka into Y (Sink) ○ Storage systems, ex: FileSystem, RDB, Cassandra, S3, ... ○ Search, ex: Elastic, Solr
  6. 6. Kafka Connect overview ● Apache Kafka 0.9+ ● A common framework for Kafka connectors ● Standalone and distributed mode ● REST interface(distributed mode) ● Automatic offset management ● Distributed and scalable by default ● Lightweight transformations https://kafka.apache.org/documentation/#connect_overview
  7. 7. Source & Sink: on the Source side, Database / File / ? feed connectors inside Kafka Connect, which write to Kafka (Broker); on the Sink side, connectors read from Kafka (Broker) and write to Database / Elastic / ?
  8. 8. Running Kafka Connect ● Standalone ● Distributed $ bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties]... $ bin/connect-distributed.sh config/connect-distributed.properties
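As a concrete sketch of the properties files passed to the standalone command above, the quickstart examples that ship with Apache Kafka use a worker config plus one config per connector (the file and topic names below are the stock quickstart values, shown here for illustration):

```properties
# config/connect-standalone.properties (worker settings)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets

# connector1.properties (a FileStreamSource connector)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
```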
  9. 9. Connector ● The Connector framework lets you implement custom requirements ● Shipped with Apache Kafka ○ FileStreamSourceConnector / FileStreamSinkConnector ● More connectors: https://www.confluent.io/product/connectors/
  10. 10. Worker ● Worker: a single Kafka Connect execution unit (JVM process) ● Runs connectors and their tasks ● Two types: Standalone / Distributed ● Automatic load balancing & failover (distributed mode)
  11. 11. Inside the worker ● Kafka Connect (Worker) is one JVM process; each task runs on its own thread ● Conn-1: Task 1, Task 2 (consuming Partition 1 / 2 / 3) ● Conn-2: Task 1, Task 2, Task 3 ● Max task config (per connector): tasks.max
  12. 12. Distributed mode: Worker cluster ● Initially Worker 1 runs Conn-1 with Task 1, Task 2, Task 3 ● After Worker 2 joins, the tasks rebalance: Worker 1 runs Conn-1 Task 1; Worker 2 runs Conn-1 Task 2 and Task 3
  13. 13. Kafka Streams
  14. 14. Overview & Concept
  15. 15. Streaming data ● Overloaded term ○ streaming data / data stream / event stream ... ○ event / message / log ● Common characteristics ○ Unbounded data (unlimited size) - no defined end ○ Immutable - never modified once produced ○ Time ordered - ordered in time ○ Replayable - can be replayed https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  16. 16. Kafka Streams overview ● Included with Apache Kafka v0.10+ (May 2016) ○ Not compatible with older Kafka brokers ● Just a Java library, no dedicated cluster required ● Realtime ● Highly scalable, fault-tolerant ● Stateful / stateless transformations
  17. 17. Time ● Event time ● Ingestion time (log append time) ● Processing time ● Message's timestamp in Kafka ○ 0.10+ adds timestamps to Kafka messages (KIP-32) ○ Depends on configuration ■ Event time → Producer Time → CreateTime ■ Ingestion time → Broker Time → LogAppendTime
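The CreateTime / LogAppendTime choice above is a broker- or topic-level setting; a minimal sketch of the relevant config keys (values shown are the default and the alternative):

```properties
# Broker-wide default for all topics (CreateTime = producer/event time)
log.message.timestamp.type=CreateTime

# Per-topic override to use broker ingestion time instead
message.timestamp.type=LogAppendTime
```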
  18. 18. State & State stores ● Stateful transformations need to maintain some state across records ● StateStore: ○ For cache: Memory (HashMap) ○ For persistence: RocksDB https://stackoverflow.com/a/40114039/3155650
  19. 19. Stream Processing Topology http://kafka.apache.org/0110/documentation/streams/core-concepts#streams_topology Building a topology: ● High level: DSL ● Low level: Processor API
  20. 20. Cluster!! Local state store: a Kafka Streams application https://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_your_app
  21. 21. Quick Sample (DSL)
  22. 22. Question: count the number of airports in each state "iata","airport","city","state","country" "L70","Agua Dulce Airpark", "Agua Dulce","CA","USA" "TPA","Tampa International ","Tampa","FL","USA" → push the csv rows into the 'airport' Topic. US airport data by state (csv): http://stat-computing.org/dataexpo/2009/
  23. 23. Input message from the 'airport' Topic → get the 'State' value (parse the csv message) → groupBy 'State' → count records → output message to the 'airport-counts' Topic
  24. 24. KStreamBuilder builder = new KStreamBuilder();
      KStream<String, String> textLines = builder.stream("airport");
      KTable<String, Long> airportCounts = textLines
          .mapValues(textLine -> {
              String state;
              try {
                  state = csvParser.parseLine(textLine)[3];  // 4th csv column: "state"
              } catch (Exception e) {
                  state = null;
              }
              return state;
          })
          .groupBy((key, state) -> state)
          .count("counts");
      airportCounts.to(Serdes.String(), Serdes.Long(), "airport-counts");
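The same parse / groupBy / count logic can be sketched in plain Java without Kafka (the `countByState` helper is illustrative; the naive `split(",")` assumes no commas inside quoted fields, unlike the csvParser used on the slide):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the DSL pipeline: parse each csv line, take the
// 4th column ("state"), then group and count per state.
public class AirportCount {

    static Map<String, Long> countByState(List<String> csvLines) {
        return csvLines.stream()
                .map(line -> line.split(",")[3].replace("\"", "").trim())
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"L70\",\"Agua Dulce Airpark\",\"Agua Dulce\",\"CA\",\"USA\"",
                "\"TPA\",\"Tampa International \",\"Tampa\",\"FL\",\"USA\"");
        System.out.println(countByState(lines));  // one count per state
    }
}
```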
  25. 25. Demo
  26. 26. Input message from the 'airport' Topic (KStream<String, String>) → get the 'State' value / parse the csv message (KStream<String, String>) → groupBy 'State' (KGroupedStream<String, String>) → count records (KTable<String, Long>) → output message to the 'airport-counts' Topic
  27. 27. airport Topic → create source stream → transform → transform → transform → airport-counts Topic (write stream back to Kafka)
  28. 28. Result (Key → Value): AS 3 | CT 15 | VT 13 | IN 65 | MT 71 | : : $ bin/kafka-console-consumer.sh --topic airport-counts --from-beginning --property print.key=true --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
  29. 29. Kafka Streams Application Reset ● Re-running a streaming computation requires resetting state ● Local reset ○ call KafkaStreams#cleanUp() ● Global reset ○ $ bin/kafka-streams-application-reset.sh ○ Resets offsets to zero for the input topics ○ Deletes all internal (auto-created) topics of the application ■ {application.id}-xxxx-repartition ■ {application.id}-xxxx-changelog
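A fuller invocation of the reset tool might look like the following sketch (the application id and topic name are illustrative, not from the slides):

```shell
$ bin/kafka-streams-application-reset.sh \
    --application-id my-streams-app \
    --bootstrap-servers localhost:9092 \
    --input-topics airport
```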
  30. 30. Kafka Streams DSL
  31. 31. Kafka Streams DSL overview ● KStream, KTable, GlobalKTable ● Stateless transformation ● Stateful transformation ○ State ○ Aggregation ○ Join ○ Window
  32. 32. KStream vs KTable |jack| Taipei| |vito|Hsinchu| |jack|Hsinchu| stream data (Person, City)
  33. 33. KStream vs KTable |jack| Taipei| |vito|Hsinchu| |jack|Hsinchu| stream data (Person, City) ● KStream - time1: "jack went to Taipei"; time2: "jack went to Taipei, then Hsinchu" ● KTable - time1: "jack lives in Taipei"; time2: "jack lives in Hsinchu"
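The two readings above can be simulated in plain Java without Kafka (class and method names are illustrative, not the Kafka Streams API):

```java
import java.util.*;

// Simplified simulation of the KStream/KTable duality: a stream keeps
// every record; a table keeps only the latest value per key (upsert).
public class StreamTableDuality {

    // "Stream" view: every (person, city) event, in arrival order.
    static List<String[]> streamView(List<String[]> records) {
        return new ArrayList<>(records);
    }

    // "Table" view: latest city per person, like a compacted changelog.
    static Map<String, String> tableView(List<String[]> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] r : records) table.put(r[0], r[1]);  // later records overwrite
        return table;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[]{"jack", "Taipei"},
                new String[]{"vito", "Hsinchu"},
                new String[]{"jack", "Hsinchu"});
        System.out.println("stream events: " + streamView(records).size());
        System.out.println("table: " + tableView(records));
    }
}
```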
  34. 34. Converting between KStream and KTable ● KStream → KStream ● KTable → KTable ● KStream → KTable ● KTable → KStream http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_duality
  35. 35. Stateless transformation ● filter(), filterNot() ● map(), mapValues() ● flatMap(), flatMapValues() ● foreach(), peek() Changing the key triggers re-partitioning!!
  36. 36. Stateful transformation ● Join ● Aggregation ● Window
  37. 37. Join operations https://docs.confluent.io/3.3.0/streams/developer-guide.html#joining ● Key-based ● Require co-partitioning of the input data
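Why the co-partitioning requirement above matters can be sketched in plain Java (a simplified stand-in for Kafka's default partitioner, which really uses murmur2 rather than `hashCode`):

```java
// A key-based join matches records landing on the same partition number,
// so both input topics must route equal keys identically -- which only
// holds when they have the SAME partition count.
public class CoPartitioning {

    // Simplified stand-in for the default partitioner.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        String key = "jack";
        // Same partition count on both topics: equal keys line up for the join.
        System.out.println(partitionFor(key, 4) == partitionFor(key, 4));
        // Different partition counts may scatter the same key.
        System.out.println(partitionFor(key, 4) + " vs " + partitionFor(key, 6));
    }
}
```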
  38. 38. Aggregation operations ● Key-based ● count() ● reduce() ● aggregate() ● Two type ○ Latest(rolling) aggregation ○ Windowed aggregation
  39. 39. Window ● Processing over a bounded time interval ● Tumbling window ● Hopping window ● Sliding window ● Session window
  40. 40. Tumbling Window Window size: 3 mins Window move: 3 mins (advance interval) | | | | | 0 3 6 9 12
      stream.map( /* do something */ )
            .groupByKey()
            .count(TimeWindows.of(3 * 60 * 1000L), "store");
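The bucketing arithmetic behind tumbling windows can be sketched in plain Java (an illustrative helper, not the Kafka Streams API):

```java
// Tumbling windows are aligned, non-overlapping buckets of `sizeMs`:
// each timestamp falls into exactly one window.
public class TumblingWindows {

    static long windowStart(long timestampMs, long sizeMs) {
        return (timestampMs / sizeMs) * sizeMs;  // align down to window boundary
    }

    public static void main(String[] args) {
        long size = 3 * 60 * 1000L;  // 3-minute windows, as on the slide
        // A record at minute 4 lands in the window starting at minute 3.
        System.out.println(windowStart(4 * 60 * 1000L, size));
    }
}
```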
  41. 41. Hopping Window | | | | | 0 3 6 9 12 Window size: 3 mins Window move: 2 mins (advance interval)
      stream.map( /* do something */ )
            .groupByKey()
            .count(TimeWindows.of(3 * 60 * 1000L)
                              .advanceBy(2 * 60 * 1000L), "store");
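Because hopping windows overlap, one record can belong to several windows; the membership arithmetic can be sketched in plain Java (illustrative helper, not the Kafka Streams API):

```java
import java.util.*;

public class HoppingWindows {

    // All hopping-window start times whose window contains `ts`: starts are
    // multiples of `advanceMs`, and [s, s + sizeMs) contains ts iff s > ts - sizeMs.
    static List<Long> windowStartsFor(long ts, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        for (long s = (ts / advanceMs) * advanceMs; s > ts - sizeMs && s >= 0; s -= advanceMs)
            starts.add(s);
        return starts;
    }

    public static void main(String[] args) {
        // 3-minute windows advancing every 2 minutes; a record at minute 4
        // falls in the windows starting at minute 4 and minute 2.
        System.out.println(windowStartsFor(4 * 60 * 1000L, 3 * 60 * 1000L, 2 * 60 * 1000L));
    }
}
```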
  42. 42. Sliding window ● moves on every record ● used only for join operations
  43. 43. Session window | | | | | 0 3 6 9 12
      final Long INACTIVITY_GAP = TimeUnit.MINUTES.toMillis(6);
      stream.map( /* do something */ )
            .groupByKey()
            .count(SessionWindows.with(INACTIVITY_GAP), "store");
  44. 44. Parallelism Model https://kafka.apache.org/documentation/streams/architecture ● Partition: Topic partitions / Stream partitions ● One thread runs multiple StreamTasks ● The number of partitions determines the number of StreamTasks ● Each partition is assigned to exactly one StreamTask ● Each StreamTask runs one Topology ● StreamConfig: num.stream.threads
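The partition → task → thread mapping above can be sketched in plain Java (a simplified round-robin illustration; the real assignment is computed by Kafka Streams' partition assignor):

```java
import java.util.*;

// Sketch of the parallelism model: one StreamTask per input partition,
// with tasks spread across the configured number of stream threads.
public class ParallelismModel {

    static Map<Integer, List<Integer>> assignTasksToThreads(int numPartitions, int numThreads) {
        Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();
        for (int t = 0; t < numThreads; t++) assignment.put(t, new ArrayList<>());
        for (int partition = 0; partition < numPartitions; partition++)
            assignment.get(partition % numThreads).add(partition);  // task id == partition id
        return assignment;
    }

    public static void main(String[] args) {
        // 4 partitions with num.stream.threads = 2: each thread runs 2 tasks.
        System.out.println(assignTasksToThreads(4, 2));
    }
}
```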
  45. 45. Parallelism Model https://kafka.apache.org/documentation/streams/architecture
  46. 46. Thank you !
