
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware

Streaming and real-time data have high business value, but that value can decay rapidly if the data is not processed quickly. If the value is not realized within a certain window of time, it is lost, and the decision or action that was needed never occurs. Streaming data, whether from sensors, devices, applications, or events, needs special attention because a sudden price change, a critical threshold being met, a rapidly changing sensor reading, or a blip in a log file can all be of immense value, but only if the alert arrives in time.

Published in: Data & Analytics


  1. Trends in Streaming Analytics and Message-oriented Middleware • William McKnight • 214-514-1444 • @williammcknight
  2. The ETL Legacy • An ad hoc manner of connecting sources and destinations • ETL surfaced in the 1990s – Far fewer data platforms and types – Built for DW – Bottleneck in DW population – Time- and resource-intensive – Batch • Can be chaotic and unmanageable
  3. EAI • Then came EAI – Facilitates the exchange of business transaction messages between applications – Used Enterprise Service classes under the covers – Works for small-scale data – Not designed to handle the span of data required today, such as sensor data
  4. Modern Realities of Data Integration • Desire for consolidated methods for data integration • New types of data sources – Logs, sensors, etc. • We have more than OLTP and OLAP – Distributed data platforms • Desire for real-time data • High-velocity data increasingly needs integration • Traditional approaches, without stream processing, turn into ETL + custom scripts + middleware + MQ
  5. Streaming: Real-Time and Scalable • Streaming is forward-thinking • Real-time and scale are becoming the rule, not the exception [Figure labels: Batch; Real-time; Scalability; ETL; EAI; Streaming Platform]
  6. Point to Point • Old way • Add another database? Repeat the process [Figure labels: ERP; CRM; Financials; HR; Staging Tables; Physical OLAP Cubes; Physical Object; BI Tools; BI Tools/OLAP Clients]
  7. ETL Is Insufficient for This Combination • Data platforms operating at an enterprise-wide scale • A high variety of data sources • Real-time/streaming data • ETL forces a choice: real-time loading without scalability, or scalability with batch loading – Data produced by numerous sources is a torrent of flowing information that needs to be timestamped, dispatched, and even duplicated (to protect against data loss) – A postman is needed to deliver data from message senders to receivers at the right place and the right time
  8. Real-Time Data • A.k.a. messaging, live feeds, real-time, event-driven • Comes in continuously and often quickly, so we also call it streaming data • Needs special attention and can be of immense value, but only if we are alerted in time • Foundation for artificial intelligence excellence – Stream data forms the core of data for artificial intelligence
  9. Message Brokers • Message brokers decouple the sending and receiving services through the concept of publish and subscribe • Message brokers also queue or retain a message until the consumer picks it up • Streaming allows us to have both pub-sub and queuing features (historically, such brokers supported one or the other)
  10. Streaming Architecture [Figure labels: Apps; Streaming Platform; Change logs; Streaming data pipelines; Messaging / Stream processing; Request-Response; DW; Technical Support; Web Services API; Big Data Analysis; IDE / Developer GUI; Hadoop; Parallel Tools; Multi-Threaded; Math Libraries; Cluster support]
  11. All Data Can Be Represented as Streams [Figure labels: Streaming Platform; DW; Hadoop; RDBMS; NoSQL; Apps; Real-time Analytics; Search; Monitoring; Web Services API; Big Data Analysis; Parallel Tools; Multi-Threaded; Math Libraries; Cluster support]
  12. Streaming Data • Unbounded, continuous flow of real-time records • Stream APIs transform and enrich data • Millisecond latency • Stateless or stateful • Incorporate data into your applications; deploy anywhere, including containers
  13. Enter Message-Oriented Middleware, a.k.a. Streaming and Message Queuing Technology • Messages can be any kind of data wrapped in a neat package with a very simple header as a bow on top. • Messages are sent by “producers” (systems, sensors, or devices that generate the messages) toward a “broker.” • A broker does not process the messages; it routes them into queues according to the information in the message header or its own routing process. • “Consumers” then retrieve the messages from the queues to which they subscribe (although sometimes messages are pushed to consumers rather than pulled). • The consumers open the messages and perform some kind of action on them.
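The producer, broker, queue, and consumer roles described on this slide can be sketched in a few lines of Python. This is a toy in-memory illustration, not any product's API: the `Broker` class, the `routing_key` header field, and the queue names are all hypothetical. The only point it makes is that the broker routes on the header and never inspects the body.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-memory broker (hypothetical): routes by header, never reads the body."""

    def __init__(self):
        self.queues = defaultdict(deque)  # queue name -> pending messages

    def publish(self, message):
        # Routing uses only the header; the body stays opaque to the broker.
        queue_name = message["header"]["routing_key"]
        self.queues[queue_name].append(message)

    def pull(self, queue_name):
        # A consumer pulls from a queue it subscribes to; None when it is empty.
        queue = self.queues[queue_name]
        return queue.popleft() if queue else None


def make_message(routing_key, body):
    """Wrap arbitrary bytes in a package with a very simple header."""
    return {"header": {"routing_key": routing_key}, "body": body}


# Producers (here: simulated sensors) send messages toward the broker.
broker = Broker()
broker.publish(make_message("sensor.temperature", b"21.5"))
broker.publish(make_message("sensor.humidity", b"40"))
broker.publish(make_message("sensor.temperature", b"22.0"))

# A consumer retrieves messages from its queue, opens them, and acts on them.
readings = []
while (msg := broker.pull("sensor.temperature")) is not None:
    readings.append(float(msg["body"].decode()))
```

A push-style broker would instead call a consumer callback from `publish`; the pull loop is used here only because it keeps the sketch short.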
  14. Streaming Solutions • Intelligent data platform for fast data: connect, process, and store data in real time • …in a unified, flexible solution • …able to meet demanding SLAs even at scale • …without operational burdens and complexity
  15. Performance and Scalability in Streaming • Storage: ability to retain varying volumes of messages for varying lengths of time • Throughput: high, sustainable rate of message processing • Latency: fast, consistent responsiveness for publishing and consumption • Operations: minimal operational burden for scaling, tuning, and monitoring
  16. Comprehensive Capabilities • Stream-native functions: apply processing functions on data • Multi-tenancy: a single cluster can support many tenants and use cases • Durability: data replicated and synced to disk • Geo-replication: out-of-the-box support for geographically distributed applications • Unified messaging model: supports both topic and queue semantics in a single model • Delivery guarantees: at least once, at most once, and effectively once • Scalability: supports millions of topics in a single cluster
  17. Apache Kafka • Open source streaming platform developed at LinkedIn • A distributed publish-subscribe messaging system that maintains feeds of messages called topics – Publishers write data to topics and subscribers read from topics – Kafka topics are partitioned and replicated across multiple broker nodes in the cluster • Enables “source to sink” data pipelines • Kafka messages are simple byte arrays that can store objects in virtually any format, often JSON, with a key attached to each message • The E and L in ETL are handled through the Kafka Connect API • The T in ETL is handled through the Kafka Streams API • Fault-tolerant • DIY
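One way to picture how a keyed message ends up in a partition is to hash the key and take the result modulo the partition count. The sketch below assumes that scheme and uses CRC32 from the Python standard library as a stand-in hash; Kafka's own default partitioner uses a different hash function, and the function names here are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # CRC32 is a stand-in hash for illustration only (an assumption, not Kafka's).
    return zlib.crc32(key) % num_partitions


def assign(records, num_partitions):
    """Append each (key, value) record to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


# Messages are byte arrays; here the values happen to be JSON.
records = [
    (b"sensor-1", b'{"t": 20}'),
    (b"sensor-2", b'{"t": 31}'),
    (b"sensor-1", b'{"t": 21}'),
]
partitions = assign(records, num_partitions=3)
```

Because the partition depends only on the key, all records for `sensor-1` land in the same partition and keep their relative order there, which is why per-key ordering is available without ordering the whole topic.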
  18. Sources and Sinks [Figure labels: Source; Sink; Connect API; Connect API]
  19. Application Programming Interfaces • A ubiquitous method and de facto standard of communication among modern information technologies. • APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints. • With the popularity and proliferation of APIs and microservices, the need has arisen to manage the multitude of services a company relies on, both internal and external. • Organizations depend on these services to be properly managed, with high performance and availability.
  20. API & Microservices Ecosystem [Figure labels: Public; Private - External; Private - Internal; Over 20,000 public APIs* (*according to …); External Partners; Connected Apps & Data]
  21. The Need for Management [Figure labels: HTTP Basic Auth; OAuth 2.0; OpenID; API Keys; Test; Production; Rate limiting; Analytics; Transformations; Quotas; Caching; CORS]
  22. Platform Architecture [Figure labels: Client 1, Client 2, Client …n; Load Balancer (Nginx, HAProxy, ELB, etc.); API Nodes; Database; Back End; API Endpoint 1, API Endpoint 2, API Endpoint …n]
  23. API Requirements • Performance: good for high-performance workloads (>1,000 TPS) • Reliability: all workloads completed with 100% message completion • Complexity: multiple plugins enabled
  24. RabbitMQ • Open source message broker platform • Created in 2007 and managed by Pivotal Software • Uses an exchange to receive messages from producers and pushes them to the registered consumers • The broker pushes messages, which are queued in random order, toward the consumers. • Brokers are persistently connected to consumers and know which ones are subscribed to which queues • Consumers cannot fetch specific messages, but can receive them unordered, unaware of the queue state • Messages, queues, and exchanges do not persist unless otherwise instructed. – If a broker is restarted or fails, the messages are lost – Settings are available to make both queues and messages durable; moreover, non-critical messages can be tagged by the producer so they are not sent to a durable queue • Allows producers’ and consumers’ code to declare new queues and exchanges • Several replication and load-balancing alternatives
  25. Amazon Kinesis • Similar to Kafka • Comes in an enterprise-ready package • Amazon users pay by the shard-hour and payload
  26. Apache Pulsar • Originally developed at Yahoo • Began its incubation at Apache in late 2016 • Has been in production at Yahoo since 2013 • Used in popular services and applications such as Yahoo! Mail, Finance, Sports, Flickr, Gemini Ads, and Sherpa • Follows the publisher-subscriber (pub-sub) model, and has the same producers, topics, and consumers as some of the aforementioned technologies • Uses built-in multi-datacenter replication • Architected for multi-tenancy and uses the concepts of properties and namespaces
  27. Streamlio • Enterprise-ready deployment of Pulsar • Unified solution for connecting, processing, and storing fast-moving data • The unified messaging model has three components: – Consumption – Acknowledgement – Retention • Three modes of subscription: exclusive, failover, and shared. • Supports both persistent and non-persistent states. • Has a configurable time-to-live (TTL) feature that can be set to handle messages that have not been consumed. • A unified platform gives enterprises the best of both the streaming and message queuing worlds.
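The slide names the three subscription modes without defining them. The sketch below illustrates one common reading, stated here as an assumption: exclusive admits a single consumer, failover keeps one active consumer with the rest on standby, and shared spreads messages round-robin across consumers. The `dispatch` function is hypothetical, and treating the first listed consumer as the active one in failover mode is a simplification.

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for one subscription (illustrative only)."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    delivered = {name: [] for name in consumers}
    if mode == "exclusive":
        # Assumed semantics: a second consumer is rejected outright.
        if len(consumers) > 1:
            raise ValueError("an exclusive subscription allows a single consumer")
        delivered[consumers[0]].extend(messages)
    elif mode == "failover":
        # Assumed semantics: one active consumer, the others wait as standbys.
        delivered[consumers[0]].extend(messages)
    elif mode == "shared":
        # Assumed semantics: messages are spread round-robin, queue style.
        for message, name in zip(messages, cycle(consumers)):
            delivered[name].append(message)
    else:
        raise ValueError(f"unknown subscription mode: {mode}")
    return delivered


shared = dispatch([1, 2, 3, 4], ["a", "b"], "shared")
failover = dispatch([1, 2, 3, 4], ["a", "b"], "failover")
```

Under these assumptions, exclusive and failover behave like streaming (one reader sees every message in order), while shared behaves like a work queue, which is the sense in which a single model can cover both styles.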
  28. Workloads Are Distinguished By • The number of topics • The size of the messages being produced and consumed • The number of subscriptions per topic • The number of producers per topic • The rate at which producers produce messages (per second) • The size of the consumer’s backlog (in gigabytes)
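These workload dimensions combine into rough capacity figures. The back-of-the-envelope sketch below rests on simplifying assumptions that are not from the slides: every topic and producer is identical, the message rate is per producer, every subscription receives every message, and the example numbers are made up.

```python
def ingress_bytes_per_second(topics, producers_per_topic, msgs_per_second, msg_size_bytes):
    """Bytes written per second, assuming identical topics and producers."""
    return topics * producers_per_topic * msgs_per_second * msg_size_bytes


def egress_bytes_per_second(ingress, subscriptions_per_topic):
    """Bytes read per second, assuming each subscription receives every message."""
    return ingress * subscriptions_per_topic


def seconds_to_drain(backlog_bytes, consume_bytes_per_second, produce_bytes_per_second):
    """Time to clear a backlog while producers keep writing; inf if it never clears."""
    surplus = consume_bytes_per_second - produce_bytes_per_second
    return backlog_bytes / surplus if surplus > 0 else float("inf")


# Made-up workload: 10 topics, 2 producers each, 500 msg/s per producer, 1 KiB messages.
ingress = ingress_bytes_per_second(
    topics=10, producers_per_topic=2, msgs_per_second=500, msg_size_bytes=1024
)
egress = egress_bytes_per_second(ingress, subscriptions_per_topic=3)

# A 2 GB backlog read at twice the ingress rate.
drain = seconds_to_drain(
    backlog_bytes=2 * 10**9,
    consume_bytes_per_second=2 * ingress,
    produce_bytes_per_second=ingress,
)
```

The drain calculation shows why backlog size matters as a workload dimension: a consumer that reads no faster than producers write never catches up, however small the backlog.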
  29. Creating a Streaming Application • Configure the application • Serialize data • Set up tables for change logs
  30. Migrating ETL to Stream Processing • Sessionization of event data • Tools to acquire: – Message bus – Data storage (e.g., HDFS with S3) – Operations support
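Sessionization is commonly done by grouping each user's events and starting a new session whenever the gap between consecutive events exceeds a timeout. The sketch below shows that rule on a small in-memory list. The `sessionize` function, the `(user_id, timestamp)` event shape, and the 30-minute default gap are assumptions for illustration; a stream processor would apply the same rule incrementally with per-key state rather than sorting a finished batch.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Group (user_id, timestamp) events into per-user sessions split on idle gaps."""
    by_user = defaultdict(list)
    for user_id, timestamp in events:
        by_user[user_id].append(timestamp)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()  # tolerate misordered arrival within the batch
        user_sessions = [[stamps[0]]]
        for timestamp in stamps[1:]:
            if timestamp - user_sessions[-1][-1] > gap_seconds:
                user_sessions.append([timestamp])  # idle too long: new session
            else:
                user_sessions[-1].append(timestamp)
        sessions[user_id] = user_sessions
    return sessions


# Timestamps are seconds; u1 is idle for 3,400 s between the second and third event.
events = [("u1", 0), ("u2", 100), ("u1", 4000), ("u1", 600)]
sessions = sessionize(events)
```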
  31. Biggest Challenges in Streaming • Getting data live at scale • Accenting data with metadata • Misordered events • Job recovery • High operational workload
  32. Future of Data Integration [Figure labels: Source; Dest; Connect API; Connect API; Streaming Solution; Streams API; App; Transformations; Streaming Platform; DW; Hadoop; RDBMS; NoSQL; Apps; Real-time Analytics; Search; Monitoring; Web Services API; Big Data Analysis; Parallel Tools; Multi-Threaded; Math Libraries; Cluster support]
  33. In Conclusion • Streaming and message queuing have lasting value to organizations. • They will be as prevalent as ETL was and is in the world of data warehousing and integration. • APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints. • Streaming and messaging will be able to meet the data volume, variety, and timing requirements of the coming years. • Data-driven organizations will benefit from these technologies because they will allow them to ingest data and operate at a scale that would have been practically impossible just a few years ago.
  34. Second Thursday of Every Month, at 2:00 ET • Presented by: William McKnight, President, McKnight Consulting Group • (214) 514-1444 • #AdvAnalytics