Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

ETL Is Dead, Long-live Streams

566 visualizaciones

Publicado el

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2gron5O.

Neha Narkhede talks about the experience at LinkedIn moving from batch-oriented ETL to real-time streams using Apache Kafka and how the design and implementation of Kafka was driven by this goal of acting as a real-time platform for event data. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at Linkedin, supporting thousands of engineers, etc. Filmed at qconsf.com.

Neha Narkhede is co-founder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s streaming infrastructure built on top of Apache Kafka and Apache Samza. She is one of the initial authors of Apache Kafka and a committer and PMC member on the project.

Publicado en: Tecnología
  • Sé el primero en comentar

ETL Is Dead, Long-live Streams

  1. 1. ETL is dead; long-live streams Neha Narkhede, Co-founder & CTO, Confluent
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ etl-streams
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. “Data and data systems have really changed in the past decade
  5. 5. Old world: Two popular locations for data Operational databases Relational data warehouse DB DB DB DB DWH
  6. 6. “Several recent data trends are driving a dramatic change in the ETL architecture
  7. 7. “#1: Single-server databases are replaced by a myriad of distributed data platforms that operate at company-wide scale
  8. 8. “#2: Many more types of data sources beyond transactional data - logs, sensors, metrics...
  9. 9. “#3: Stream data is increasingly ubiquitous; need for faster processing than daily
  10. 10. “The end result? This is what data integration ends up looking like in practice
  11. 11. App App App App search HadoopDWH monitoring security MQ MQ cache cache
  12. 12. A giant mess! App App App App search HadoopDWH monitoring security MQ MQ cache cache
  13. 13. “We will see how transitioning to streams cleans up this mess and works towards...
  14. 14. Streaming platform DWH Hadoop security App App App App search NoSQL monitor ing request-response messaging OR stream processing streaming data pipelines changelogs
  15. 15. A short history of data integration
  16. 16. “Surfaced in the 1990s in retail organizations for analyzing buyer trends
  17. 17. “Extract data from databases Transform into destination warehouse schema Load into a central data warehouse
  18. 18. “BUT … ETL tools have been around for a long time, data coverage in data warehouses is still low! WHY?
  19. 19. Etl has drawbacks
  20. 20. “#1: The need for a global schema
  21. 21. “#2: Data cleansing and curation is manual and fundamentally error-prone
  22. 22. “#3: Operational cost of ETL is high; it is slow; time and resource intensive
  23. 23. “#4: ETL tools were built to narrowly focus on connecting databases and the data warehouse in a batch fashion
  24. 24. “Early take on real-time ETL = Enterprise Application Integration (EAI)
  25. 25. “EAI: A different class of data integration technology for connecting applications in real-time
  26. 26. “EAI employed Enterprise Service Buses and MQs; weren’t scalable
  27. 27. ETL and EAI are outdated!
  28. 28. Old world: scale or timely data, pick one real-time scale batch EAI ETL real-time BUT not scalable scalable BUT batch
  29. 29. “Data integration and ETL in the modern world need a complete revamp
  30. 30. new world: streaming, real-time and scalable real-time scale EAI ETL Streaming Platform real-time BUT not scalable real-time AND scalable scalable BUT batch batch
  31. 31. “Modern streaming world has new set of requirements for data integration
  32. 32. “#1: Ability to process high-volume and high-diversity data
  33. 33. “#2 Real-time from the grounds up; a fundamental transition to event-centric thinking
  34. 34. Event-Centric Thinking Streaming Platform “A product was viewed” HadoopWeb app
  35. 35. Event-Centric Thinking Streaming Platform “A product was viewed” Hadoop Web app mobile app APIs
  36. 36. mobile app web app APIs Streaming Platform Hadoop Security Monitoring Rec engine “A product was viewed” Event-Centric Thinking
  37. 37. “Event-centric thinking, when applied at a company-wide scale, leads to this simplification ...
  38. 38. Streaming platform DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  39. 39. “#3: Enable forward-compatible data architecture; the ability to add more applications that need to process the same data … differently
  40. 40. “To enable forward compatibility, redefine the T in ETL: Clean data in; Clean data out
  41. 41. app logs app logs app logs app logs #1: Extract as unstructured text #2: Transform1 = data cleansing = “what is a product view” #4: Transform2 = drop PII fields” #3: Load into DWH DWH
  42. 42. #1: Extract as unstructured text #2: Transform1 = data cleansing = “what is a product view” #4: Transform2 = drop PII fields” DWH #2: Transform1 = data cleansing = “what is a product view” #4: Transform2 = drop PII fields” Cassandra #1: Extract as unstructured text again #3: Load cleansed data #3: Load cleansed data
  43. 43. #1: Extract as structured product view events #2: Transforms = drop PII fields” #4:.1 Load product view stream #4: Load filtered product View stream DWH Cassandra Streaming Platform #4.2 Load filtered product view stream
  44. 44. “To enable forward compatibility, redefine the T in ETL: Data transformations, not data cleansing!
  45. 45. #1: Extract once as structured product view events #2: Transform once = drop PII fields” and enrich with product metadata #4.1: Load product views stream #4: Load filtered and enriched product views stream DWH Cassandra Streaming Platform #4.2: Load filtered and enriched product views stream
  46. 46. “Forward compatibility = Extract clean-data once; Transform many different ways before Loading into respective destinations … as and when required
  47. 47. “In summary, needs of modern data integration solution? Scale, diversity, latency and forward compatibility
  48. 48. Requirements for a modern streaming data integration solution - Fault tolerance - Parallelism - Latency - Delivery semantics - Operations and monitoring - Schema management
  49. 49. Data integration: platform vs tool Central, reusable infrastructure for many use cases One-off, non-reusable solution for a particular use case
  50. 50. New shiny future of etl: a streaming platform NoSQL RDBMS Hadoop DWH Apps Apps Apps Search Monitoring RT analytics
  51. 51. “Streaming platform serves as the central nervous system for a company’s data in the following ways ...
  52. 52. “#1: Serves as the real-time, scalable messaging bus for applications; no EAI
  53. 53. “#2: Serves as the source-of-truth pipeline for feeding all data processing destinations; Hadoop, DWH, NoSQL systems and more
  54. 54. “#3: Serves as the building block for stateful stream processing microservices
  55. 55. “ Batch data integration Streaming
  56. 56. “ Batch ETL Streaming
  57. 57. a short history of data integration drawbacks of ETL needs and requirements for a streaming platform new, shiny future of ETL: a streaming platform What does a streaming platform look like and how it enables Streaming ETL?
  58. 58. Apache kafka: a distributed streaming platform
  59. 59. 57Confidential Apache kafka 6 years ago 57
  60. 60. 58Confidential > 1,400,000,000,000 messages processed / day 58
  61. 61. Now Adopted at 1000s of companies worldwide
  62. 62. “What role does Kafka play in the new shiny future for data integration?
  63. 63. “#1: Kafka is the de-facto storage of choice for stream data
  64. 64. The log 0 1 2 3 4 5 6 7 next write reader 1 reader 2
  65. 65. The log & pub-sub 0 1 2 3 4 5 6 7 publisher subscriber 1 subscriber 2
  66. 66. “#2: Kafka offers a scalable messaging backbone for application integration
  67. 67. Kafka messaging APIs: scalable eai app Messaging APIs produce(message) consume(message)
  68. 68. “#3: Kafka enables building streaming data pipelines (E & L in ETL)
  69. 69. Kafka’s Connect API: Streaming data ingestion app Messaging APIsMessaging APIs ConnectAPI ConnectAPI app source sink Extract Load
  70. 70. “#4: Kafka is the basis for stream processing and transformations
  71. 71. Kafka’s streams API: stream processing (transforms) Messaging API Streams API apps apps ConnectAPI ConnectAPI source sink Extract Load Transforms
  72. 72. Kafka’s connect API = E and L in Streaming ETL
  73. 73. Connectors! NoSQL RDBMS Hadoop DWH Search Monitoring RT analytics Apps Apps Apps
  74. 74. How to keep data centers in-sync?
  75. 75. Sources and sinks ConnectAPI ConnectAPI source sink Extract Load
  76. 76. changelogs
  77. 77. Transforming changelogs
  78. 78. Kafka’s Connect API = Connectors Made Easy! - Scalability: Leverages Kafka for scalability - Fault tolerance: Builds on Kafka’s fault tolerance model - Management and monitoring: One way of monitoring all connectors - Schemas: Offers an option for preserving schemas from source to sink
  79. 79. Kafka all the things! ConnectAPI
  80. 80. Kafka’s streams API = The T in STREAMING ETL
  81. 81. “Stream processing = transformations on stream data
  82. 82. 2 visions for stream processing Real-time Mapreduce Event-driven microservicesVS
  83. 83. 2 visions for stream processing Real-time Mapreduce Event-driven microservicesVS - Central cluster - Custom packaging, deployment & monitoring - Suitable for analytics-type use cases - Embedded library in any Java app - Just Kafka and your app - Makes stream processing accessible to any use case
  84. 84. Vision 1: real-time mapreduce
  85. 85. Vision 2: event-driven microservices => Kafka’s streams API Streams API microservice Transforms
  86. 86. “Kafka’s Streams API = Easiest way to do stream processing using Kafka
  87. 87. “#1: Powerful and lightweight Java library; need just Kafka and your app app
  88. 88. “#2: Convenient DSL with all sorts of operators: join(), map(), filter(), windowed aggregates etc
  89. 89. Word count program using Kafka’s streams API
  90. 90. “#3: True event-at-a-time stream processing; no microbatching
  91. 91. “#4: Dataflow-style windowing based on event-time; handles late-arriving data
  92. 92. “#5: Out-of-the-box support for local state; supports fast stateful processing
  93. 93. External state
  94. 94. local state
  95. 95. Fault-tolerant local state
  96. 96. “#6: Kafka’s Streams API allows reprocessing; useful to upgrade apps or do A/B testing
  97. 97. reprocessing
  98. 98. Real-time dashboard for security monitoring
  99. 99. Kafka’s streams api: simple is beautiful Vision 1 Vision 2
  100. 100. Logs unify batch and stream processing
  101. 101. Streams API app sinksource ConnectAPI ConnectAPI Transforms LoadExtract New shiny future of ETL: Kafka
  102. 102. A giant mess! App App App App search HadoopDWH monitoring security MQ MQ cache cache
  103. 103. All your data … everywhere … now Streaming platform DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  104. 104. VISION: All your data … everywhere … now Streaming platform DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  105. 105. Thank you! @nehanarkhede
  106. 106. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ etl-streams

×