Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Designing Modern Streaming Data Applications

942 visualizaciones

Publicado el

Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.

In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We also share case studies from the IoT, gaming, and healthcare as well as their experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering their perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially of the order of hundreds of millions) set of data streams.

Topics include:

* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across data different frameworks
* Case studies: Healthcare and the IoT
* Data sketches for mining insights from data streams

Publicado en: Tecnología

Designing Modern Streaming Data Applications

  1. 1. 1 DESIGNING MODERN STREAMING DATA APPLICATIONS ARUN KEJARIWAL, KARTHIK RAMASAMY @arun_kejariwal @karthikz
  2. 2. 2 ABOUT US
  3. 3. 3
  4. 4. 4 Connected World
  5. 5. 5 Ubiquity of Real-Time Data Streams
  6. 6. 6 Data-Driven Decision Making
  7. 7. UNLOCKING INSIGHTS 7 GUIDE DECISION MAKING DETERMINING TOP TRAFFIC INTENSIVE IP ADDRESSES IDENTIFYING ARBITRAGE OPPORTUNITIES IN FINANCIAL TRAINING DETERMINING SESSIONS WHOSE DURATION >2X THAN NORMAL FORECASTING DETERIORATION IN HEALTH METRICS # UNIQUE USERS/DAY OF A MOBILE APP TARGETED ADVERTISING
  8. 8. ON-THE-FLY INSIGHTS 8 “PERISHABLE” RECENCY IS CRITICAL
  9. 9. STREAMING/FAST DATA PROCESSING 9 ✦ Events are analyzed and processed as they arrive ✦ Decisions are timely, contextual and based on fresh data ✦ Decision latency is eliminated ✦ Data in motion Ingest/ Buffer Analyze Act
  10. 10. MICROSERVICES MODEL INFERENCEWORKFLOWS ANALYTICS MONITORING STREAMING APPLICATION PATTERNS
  11. 11. STREAM PROCESSING PATTERN 11 ComputeMessaging Storage Data Inges6on Data Processing Results StorageData Storage Data Serving
  12. 12. LANDSCAPE PLATFORMS APPLICATION DOMAINS PRACTICAL TAKEAWAYS ANALYTICS BUSINESS IMPACT OUTLINE 12 FOCUS AREAS
  13. 13. 13 Landscape of Platforms for Streaming Data
  14. 14. PLATFORM 14 COMPONENTS PROCESSING VELOCITY STORAGE UNBOUNDED VARIETY MESSAGING VOLUME
  15. 15. 15 APACHE KAFKA RABBITMQ AKKA METAQ DATABUS ACTIVEMQ APACHE FLUME KESTREL SCRIBE APACHE PULSAR Messaging
  16. 16. NEXT GENERATION MESSAGING 16 DURABILITYCONSISTENCY EASE OF OPERATIONS ISOLATION RESILIENCY SCALABILITY KEY PROPERTIES b b b b b b INFINITE RETENTIONb
  17. 17. APACHE PULSAR 17 Flexible Messaging + Streaming System backed by a durable log storage
  18. 18. APACHE PULSAR 18 Bookie Bookie Bookie Broker Broker Broker Producer Consumer SERVING Brokers can be added independently Traffic can be shifted quickly across brokers STORAGE Bookies can be added independently New bookies will ramp up traffic quickly
  19. 19. APACHE PULSAR - BROKER 19 ✦ Broker is the only point of interaction for clients (producers and consumers) ✦ Brokers acquire ownership of group of topics and “serve” them ✦ Broker has no durable state ✦ Provides service discovery mechanism for client to connect to right broker
  20. 20. APACHE PULSAR - BROKER 20
  21. 21. APACHE PULSAR - CONSISTENCY 21 Bookie Bookie BookieBrokerProducer
  22. 22. APACHE PULSAR - DURABILITY 22 Bookie Bookie BookieBrokerProducer Journal Journal Journal fsync fsync fsync
  23. 23. APACHE PULSAR - ISOLATION 23
  24. 24. APACHE PULSAR - SEGMENT STORAGE 24 234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4
  25. 25. APACHE PULSAR - RESILIENCY 25 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4
  26. 26. APACHE PULSAR - SEAMLESS CLUSTER EXPANSION 26 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4 Segment Y Segment Z Segment X
  27. 27. APACHE PULSAR - TIERED STORAGE 27 Low Cost Storage 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1Segment 4 Segment 4
  28. 28. PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE? 28
  29. 29. PARTITIONS VS. SEGMENTS - WHY SHOULD YOU CARE? 29 ✦ In Kafka, partitions are assigned to brokers “permanently” ✦ A single partition is stored entirely in a single node ✦ Retention is limited by a single node storage capacity ✦ Failure recovery and capacity expansion require expensive “rebalancing” ✦ Rebalancing has a big impact over the system, affecting regular traffic
  30. 30. UNIFIED MESSAGING MODEL - STREAMING 30 Pulsar topic/ partition Producer 2 Producer 1 Consumer 1 Consumer 2 Subscription A M4 M3 M2 M1 M0 M4 M3 M2 M1 M0 X Exclusive
  31. 31. UNIFIED MESSAGING MODEL - STREAMING 31 Pulsar topic/ partition Producer 2 Producer 1 Consumer 1 Consumer 2 Subscription B M4 M3 M2 M1 M0 M4 M3 M2 M1 M0 Failover In case of failure in consumer 1
  32. 32. UNIFIED MESSAGING MODEL - QUEUING 32 Pulsar topic/ partition Producer 2 Producer 1 Consumer 2 Consumer 3 Subscription C M4 M3 M2 M1 M0 Shared Traffic is equally distributed across consumers Consumer 1 M4M3 M2M1M0
  33. 33. DISASTER RECOVERY 33 Topic (T1) Topic (T1) Topic (T1) Subscrip6on (S1) Subscrip6on (S1) Producer (P1) Consumer (C1) Producer (P3) Producer (P2) Consumer (C2) Data Center A Data Center B Data Center C Integrated in the broker message flow Simple configuration to add/remove regions Asynchronous (default) and synchronous replication
  34. 34. MULTITENANCY 34 Apache Pulsar Cluster Product Safety ETL Fraud Detection Topic-1 Account History Topic-2 User Clustering Topic-1 Risk Classification MarketingCampaigns ETL Topic-1 Budgeted Spend Topic-2 Demographic Classification Topic-1 Location Resolution Data Serving Microservice Topic-1 Customer Authentication 10 TB 7 TB 5 TB 3 TB 2 TB 8 TB 4 TB 3 TB ✦ Authentication ✦ Authorization ✦ Software isolation ๏ Storage quotas, flow control, back pressure, rate limiting ✦ Hardware isolation ๏ Constrain some tenants on a subset of brokers/bookies
  35. 35. PULSAR CLIENTS 35 Apache Pulsar Cluster Java Python Go C++ C
  36. 36. PULSAR PRODUCER 36 PulsarClient client = PulsarClient.create( “http://broker.usw.example.com:8080”); Producer producer = client.createProducer( “persistent://my-property/us-west/my-namespace/my-topic”); // handles retries in case of failure producer.send("my-message".getBytes()); // Async version: producer.sendAsync("my-message".getBytes()).thenRun(() -> { // Message was persisted });
  37. 37. PULSAR CONSUMER 37 PulsarClient client = PulsarClient.create( "http://broker.usw.example.com:8080"); Consumer consumer = client.subscribe( "persistent://my-property/us-west/my-namespace/my-topic", "my-subscription-name"); while (true) { // Wait for a message Message msg = consumer.receive(); System.out.println("Received message: " + msg.getData()); // Acknowledge the message so that it can be deleted by broker consumer.acknowledge(msg); }
  38. 38. SCHEMA REGISTRY 38 ✦ Provides type safety to applications built on top of Pulsar ✦ Two approaches ๏ Client side - type safety enforcement up to the application ๏ Server side - system enforces type safety and ensures that producers and consumers remain synced ✦ Schema registry enables clients to upload data schemas on a topic basis. ✦ Schemas dictate which data types are recognized as valid for that topic
  39. 39. PULSAR SCHEMAS - HOW DO THEY WORK? 39 ✦ Enforced at the topic level ✦ Pulsar schemas consists of ๏ Name - Name refers to the topic to which the schema is applied ๏ Payload - Binary representation of the schema ๏ Schema type - JSON, Protobuf and Avro ๏ User defined properties - Map of strings to strings (application specific - e.g git hash of the schema)
  40. 40. SCHEMA VERSIONING 40 PulsarClient client = PulsarClient.builder() .serviceUrl(“http://broker.usw.example.com:6650") .build() Producer<SensorReading> producer = client.newProducer(JSONSchema.of(SensorReading.class)) .topic(“sensor-data”) .sendTimeout(3, TimeUnit.SECONDS) .create() Scenario What happens No schema exists for the topic Producer is created using the given schema Schema already exists; producer connects using the same schema that’s already stored Schema is transmitted to the broker, determines that it is already stored Schema already exists; producer connects using a new schema that is compatible Schema is transmitted, compatibility determined and stored as new schema
  41. 41. HOW TO PROCESS DATA MODELED AS STREAMS 41 ✦ Consume data as it is produced (pub/sub) ✦ Light weight compute - transform and react to data as it arrives ✦ Heavy weight compute - continuous data processing ✦ Interactive query of stored streams
  42. 42. LIGHT WEIGHT COMPUTE 42 f(x) Incoming Messages Output Messages ABSTRACT VIEW OF COMPUTE REPRESENTATION
  43. 43. TRADITIONAL COMPUTE REPRESENTATION 43 DAG % % % % % Source 1 Source 2 Action Action Action Sink 1 Sink 2
  44. 44. REALIZING COMPUTATION - EXPLICIT CODE 44 public static class SplitSentence extends BaseBasicBolt { @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } @Override public Map<String, Object> getComponentConfiguration() { return null; } public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) { String sentence = tuple.getStringByField("sentence"); String words[] = sentence.split(" "); for (String w : words) { basicOutputCollector.emit(new Values(w)); } } } STITCHED BY PROGRAMMERS
  45. 45. REALIZING COMPUTATION - FUNCTIONAL 45 Builder.newBuilder() .newSource(() -> StreamletUtils.randomFromList(SENTENCES)) .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("s+"))) .reduceByKeyAndWindow(word -> word, word -> 1, WindowConfig.TumblingCountWindow(50), (x, y) -> x + y);
  46. 46. TRADITIONAL REAL TIME - SEPARATE SYSTEMS 46 Messaging Compute
  47. 47. TRADITIONAL REAL TIME SYSTEMS 47 DEVELOPER EXPERIENCE ✦ Powerful API but complicated ๏ Does everyone really need to learn functional programming? ✦ Configurable and scalable but management overhead ✦ Edge systems have resource and management constraints
  48. 48. TRADITIONAL REAL TIME SYSTEMS 48 OPERATIONAL EXPERIENCE ✦ Multiple systems to operate ๏ IoT deployments routinely have thousands of edge systems ✦ Semantic differences ๏ Mismatch and duplication between systems ๏ Creates developer and operator friction
  49. 49. LESSONS LEARNED - USE CASES 49 ✦ Data transformations ✦ Data classification ✦ Data enrichment ✦ Data routing ✦ Data extraction and loading ✦ Real time aggregation ✦ Microservices Significant set of processing tasks are exceedingly simple
  50. 50. EMERGENCE OF CLOUD - SERVERLESS 50 ✦ Simple function API ✦ Functions are submitted to the system ✦ Runs per events ✦ Composition APIs to do complex things ✦ Wildly popular
  51. 51. SERVERLESS VS. STREAMING 51 ✦ Both are event driven architectures ✦ Both can be used for analytics and data serving ✦ Both have composition APIs ๏ Configuration based for serverless ๏ DSL based for streaming ✦ Serverless typically does not guarantee ordering ✦ Serverless is pay per action
  52. 52. STREAM NATIVE COMPUTE USING FUNCTIONS 52 ✦ Simplest possible API -function or a procedure ✦ Support for multi language ✦ Use of native API for each language ✦ Scale developers ✦ Use of message bus native concepts - input and output as topics ✦ Flexible runtime - simple standalone applications vs managed system applications APPLYING INSIGHT GAINED FROM SERVERLESS
  53. 53. PULSAR FUNCTIONS 53 SDK LESS API import java.util.function.Function; public class ExclamationFunction implements Function<String, String> { @Override public String apply(String input) { return input + "!"; } }
  54. 54. PULSAR FUNCTIONS 54 SDK API import org.apache.pulsar.functions.api.PulsarFunction; import org.apache.pulsar.functions.api.Context; public class ExclamationFunction implements PulsarFunction<String, String> { @Override public String process(String input, Context context) { return input + "!"; } }
  55. 55. PULSAR FUNCTIONS 55 ✦ Simplest possible API -function or a procedure ✦ Support for multi language ✦ Use of native API for each language ✦ Scale developers ✦ Use of message bus native concepts - input and output as topics ✦ Flexible runtime - simple standalone applications vs managed system applications
  56. 56. PULSAR FUNCTIONS 56 ✦ Function executed for every message of input topic ✦ Support for multiple topics as inputs ✦ Function output goes into output topic - can be void topic as well ✦ SerDe takes care of serialization/deserialization of messages ๏ Custom SerDe can be provided by the users ๏ Integration with schema registry
  57. 57. PROCESSING GUARANTEES 57 ✦ ATMOST_ONCE ๏ Message acked to Pulsar as soon as we receive it ✦ ATLEAST_ONCE ๏ Message acked to Pulsar after the function completes ๏ Default behavior - don’t want people to loose data ✦ EFFECTIVELY_ONCE ๏ Uses Pulsar’s inbuilt effectively once semantics ✦ Controlled at runtime by user
  58. 58. DEPLOYING FUNCTIONS - BROKER 58 Broker 1 Worker Function wordcount-1 Function transform-2 Broker 1 Worker Function transform-1 Function dataroute-1 Broker 1 Worker Function wordcount-2 Function transform-3 Node 1 Node 2 Node 3
  59. 59. DEPLOYING FUNCTIONS - WORKER NODES 59 Worker Function wordcount-1 Function transform-2 Worker Function transform-1 Function dataroute-1 Worker Function wordcount-2 Function transform-3 Node 1 Node 2 Node 3 Broker 1 Broker 2 Broker 3 Node 4 Node 5 Node 6
  60. 60. DEPLOYING FUNCTIONS - KUBERNETES 60 Function wordcount-1 Function transform-1 Function transform-3 Pod 1 Pod 2 Pod 3 Broker 1 Broker 2 Broker 3 Pod 7 Pod 8 Pod 9 Function dataroute-1 Function wordcount-2 Function transform-2 Pod 4 Pod 5 Pod 6
  61. 61. BUILT-IN STATE MANAGEMENT IN FUNCTIONS 61 ✦ Functions can store state in inbuilt storage ๏ Framework provides a simple library to store and retrieve state ✦ Support server side operations like counters ✦ Simplified application development ๏ No need to standup an extra system
  62. 62. DISTRIBUTED STATE IN FUNCTIONS 62 import org.apache.pulsar.functions.api.Context; import org.apache.pulsar.functions.api.PulsarFunction; public class CounterFunction implements PulsarFunction<String, Void> { @Override public Void process(String input, Context context) throws Exception { for (String word : input.split(".")) { context.incrCounter(word, 1); } return null; } }
  63. 63. PULSAR - DATA IN AND OUT 63 ✦ Users can write custom code using Pulsar producer and consumer API ✦ Challenges ๏ Where should the application to publish data or consume data from Pulsar? ๏ How should the application to publish data or consume data from Pulsar? ✦ Current systems have no organized and fault tolerant way to run applications that ingress and egress data from and to external systems
  64. 64. PULSAR IO TO THE RESCUE 64 Apache Pulsar ClusterSource Sink
  65. 65. INTERACTIVE QUERYING OF STREAMS - PULSAR SQL 65 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4 Segment Reader Segment Reader Segment Reader Segment Reader Coordinator
  66. 66. PULSAR IO - EXECUTION 66 Broker 1 Worker Sink Cassandra-1 Source Kinesis-2 Broker 2 Worker Source Kinesis-1 Source Twitter-1 Broker 3 Worker Sink Cassandra-2 Source Kinesis-3 Node 1 Node 2 Node 3 Fault tolerance Parallelism Elasticity Load Balancing On-demand updates
  67. 67. EASE OF OPERATIONS 67 ✦ Brokers don’t have durable state ๏ Easily replaceable ๏ Topics are immediately reassigned to healthy brokers ✦ Expanding capacity ๏ Simply add new broker node ๏ If other brokers are overloaded, traffic will be automatically assigned ✦ Load manager ๏ Monitor traffic load on all brokers (CPU, memory, network, topics) ๏ Initially place topics to least loaded brokers ๏ Reassign topics when a broker is overloaded
  68. 68. CONSUMER BACKLOG 68 ✦ Metrics are available to make assessments ๏ When a problem started ๏ How big is backlog? messages? disk space? ๏ How fast is draining? ๏ What’s the ETA to catch up with publishers? ✦ Establish where is the bottleneck ๏ Application is not fast enough ๏ Disk read I/O
  69. 69. ENFORCING MULTITENANCY 69 ✦ Ensure tenants don’t cause performance issues on other tenants ๏ Backlog quotas ๏ Soft isolation - flow control and throttling ✦ In cases when user behavior is triggering performance degradation ๏ Hard isolation as a last resource for quick reaction while proper fix is deployed ๏ Isolate tenant on a subset of brokers ๏ Can be also applied at the storage node level
  70. 70. PULSAR PERFORMANCE 70
  71. 71. PULSAR PERFORMANCE - LATENCY 71
  72. 72. APACHE PULSAR VS. APACHE KAFKA 72 Mul$-tenancy A single cluster can support many tenants and use cases Seamless Cluster Expansion Expand the cluster without any down $me High throughput & Low Latency Can reach 1.8 M messages/s in a single par$$on and publish latency of 5ms at 99pct Durability Data replicated and synced to disk Geo-replica$on Out of box support for geographically distributed applica$ons Unified messaging model Support both Topic & Queue seman$c in a single model Tiered Storage Hot/warm data for real $me access and cold event data in cheaper storage Pulsar Func$ons Flexible light weight compute Highly scalable Can support millions of topics, makes data modeling easier
  73. 73. 73 SPARK STREAMING APACHE APEX APACHE FLINK SAMZA APACHE BEAM (MILLWHEEL, DATAFLOW) ATHENAX STYLUS APACHE STORM STREAMALERT APACHE HERON Processing
  74. 74. PROCESSING 74 TASK ISOLATIONLOW LATENCY MULTIPLE API'S MULTIPLE SEMANTICS DIVERSE WORKLOADS FAULT TOLERANCE KEY PROPERTIES
  75. 75. HERON TERMINOLOGY 75 Topology Directed acyclic graph ver6ces = computa6on, and edges = streams of data tuples Spouts Sources of data tuples for the topology Examples - Pulsar/KaPa/MySQL/Postgres Bolts Process incoming tuples, and emit outgoing tuples Examples - filtering/aggrega6on/join/any func6on , %
  76. 76. HERON TOPOLOGY 76 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5
  77. 77. HERON TOPOLOGY - PHYSICAL EXECUTION 77 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5 %% %% %% %% %%
  78. 78. HERON GROUPINGS 78 01 Shuffle Grouping Random distribution of tuples / Fields Grouping Group tuples by a field or multiple fields All Grouping Replicates tuples to all tasks 02 . 03 - 04 Global Grouping Send the entire stream to one task ,
  79. 79. HERON TOPOLOGY - PHYSICAL EXECUTION 79 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5 %% %% %% %% %% Shuffle Grouping Shuffle Grouping Fields Grouping Fields Grouping Fields Grouping Fields Grouping
  80. 80. HERON TOPOLOGY - PHYSICAL EXECUTION 80 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5 %% %% %% %% %% Shuffle Grouping Shuffle Grouping Fields Grouping Fields Grouping Fields Grouping Fields Grouping
  81. 81. HERON ARCHITECTURE 81 Scheduler Topology 1 Topology 2 Topology N Topology Submission
  82. 82. HERON ARCHITECTURE 82 Topology Master ZK Cluster Stream Manager I1 I2 I3 I4 Stream Manager I1 I2 I3 I4 Logical Plan, Physical Plan and Execution State Sync Physical Plan DATA CONTAINER DATA CONTAINER Metrics Manager Metrics Manager Metrics Manager Health Manager MASTER CONTAINER
  83. 83. TOPOLOGY MASTER 83 Monitoring of containers Gateway for metrics Assigns role
  84. 84. STREAM MANAGER 84 Routes tuples Implements backpressure Processing Semantics
  85. 85. STREAM MANAGER 85 % % S1 B2 B3 % B4
  86. 86. STREAM MANAGER 86 S1 B2 B3 Stream Manager Stream Manager Stream Manager Stream Manager S1 B2 B3 B4 S1 B2 B3 S1 B2 B3 B4 B4 PHYSICAL EXECUTION
  87. 87. STREAM MANAGER BACK PRESSURE 87 Spout based back pressureTCP backpressure Stage by stage back pressure
  88. 88. HERON INSTANCE 88 Runs only one task (spout/bolt) Exposes Heron API Collects several metrics API G
  89. 89. HERON INSTANCE 89 Stream Manager Metrics Manager Gateway Thread Task Execution Thread data-in queue data-out queue metrics-out queue
  90. 90. HERON DEPLOYMENT 90 Topology 1 Topology 2 Topology N Heron Tracker Heron API Server Heron Web ZK Cluster Aurora Services Observability
  91. 91. HERON VISUALIZATION 91
  92. 92. HERON TOPOLOGY COMPLEXITY 92
  93. 93. HERON TOPOLOGY SCALE 93 CONTAINERS - 1 TO 600 INSTANCES - 10 TO 6000
  94. 94. HERON HAPPY FACTS :) 94 ✦ No more pages during midnight for Heron team ๏ Backlog quotas ๏ Soft isolation - flow control and throttling ✦ Very rare incidents for Heron customer teams ✦ Easy to debug during incidents for quick turn around ๏ Reduced resource utilization saving cost (~3x cost savings)
  95. 95. 95 Coffee Break
  96. 96. HERON DEVELOPER ISSUES 96 01 02 Container resource alloca$on Parallelism tuning
  97. 97. HERON OPERATIONAL ISSUES 97 01 02 03 Slow Hosts Network Issues Data Skew / . - 04 Load Variations , 05 SLA Violations /
  98. 98. SLOW HOSTS 98 Memory Parity Errors Impending Disk Failures Lower GHZ
  99. 99. NETWORK ISSUES 99 Network Partitioning G Network Slowness
  100. 100. NETWORK SLOWNESS 100 01 02 03 Delays processing Data is accumula$ng Timelines of results Is affected
  101. 101. DATA SKEW 101 Multiple Keys Several keys map into single instance and their count is high Single Key Single key maps into a instance and its count is high H C
  102. 102. LOAD VARIATIONS 102 Multiple Keys Several keys map into single instance and their count is high Single Key Single key maps into a instance and its count is high H C
  103. 103. SELF REGULATING STREAMING SYSTEMS 103 Automate Tuning SLO Maintenance Self Regula$ng Streaming Systems Tuning Manual, 6me-consuming and error-prone task of tuning various systems knobs to achieve SLOs SLO Maintenance of SLOs in the face of unpredictable load varia6ons and hardware or soWware performance degrada6on Self Regula$ng Streaming Systems System that adjusts itself to the environmental changes and con6nue to produce results
  104. 104. SELF REGULATING STREAMING SYSTEMS 104 Self tuning Self stabilizing Self healing G !g Several tuning knobs Time consuming tuning phase The system should take as input an SLO and automatically configure the knobs. The system should react to external shocks and a u t o m a t i c a l l y reconfigure itself Stream jobs are long running Load variations are common The system should identify internal faults and attempt to recover from them System performance affected by hardware or software delivering degraded quality of service
  105. 105. ENTER DHALION 105 Dhalion periodically executes well- specified policies that optimize execution based on some objective. We created policies that dynamically provision resources in the presence of load variations and auto-tune streaming applications so that a throughput SLO is met. Dhalion is a policy based framework integrated into Heron
  106. 106. DHALION POLICY FRAMEWORK 106 Symptom Detector 1 Symptom Detector 2 Symptom Detector 3 Symptom Detector N .... Diagnoser 1 Diagnoser 2 Diagnoser M .... Resolver Invocation D iagnosis 1 Diagnosis 2 D iagnosis M Symptom 1 Symptom 2 Symptom 3 Symptom N Symptom Detection Diagnosis Generation Resolution Resolver 1 Resolver 2 Resolver M .... Resolver Selection Metrics
  107. 107. DYNAMIC RESOURCE PROVISIONING 107 Policy This policy reacts to unexpected load varia6ons (workload spikes) Goal Goal is to scale up and scale down the topology resources as needed - while keeping the topology in a steady state where back pressure is not observed H C
  108. 108. DYNAMIC RESOURCE PROVISIONING 108 Pending Tuples Detector Backpressure Detector Processing Rate Skew Detector Resource Over provisioning Diagnoser Resource Under Provisioning Diagnoser Data Skew Diagnoser Resolver Invocation Diagnosis Symptoms Symptom Detection Diagnosis Generation Resolution Metrics Slow Instances Diagnoser Bolt Scale Down Resolver Bolt Scale Up Resolver Data Skew Resolver Restart Instances Resolver
  109. 109. PROCESSING HERON PULSAR BOOKKEEPER UNIFICATION 109 MESSAGING STORAGE STREAMLIO
  110. 110. STREAMLIO 110 UNIFIED ARCHITECTURE Interactive Querying Storm API Streamlet SQL Application Builder Pulsar API BK/ HDFS API Kubernetes Metadata Management Operational Monitoring Chargeback Security Authentication Quota Management Kafka API
  111. 111. STREAMLIO ARCHITECTURE 111 KEY HIGHLIGHTS DAG PROCESSING GEO-REPLICATION DURABILITY CONSISTENCY FUNCTION PROCESSING INFINITE RETENTION SQL
  112. 112. 112 ETL
  113. 113. DATA PROCESSING 113 CLEANSING AND TRANSFORMATION Data format consistency M:0:Male, F:1:Female Missing Data Imputation Pre-computation Sum, Max, Min, Distance Timestamp Hour of the day, Date
  114. 114. DATA ENRICHMENT 114 MULTIPLE DATA SOURCES Operators Associate Aggregate Filter Properties Scalable Fault-tolerant Secure Challenges Encryption Merge Union Sort
  115. 115. ETL DATA PIPELINES - RETAIL 115 Scenario Consumer retail company migrating multiple data and analytics applications to public cloud Challenges Needed a solution to connect data to dashboards, data warehouse, and applications Solution Cloud data pipeline built on Apache Pulsar to support data movement, transformation, and access
  116. 116. ETL DATA PIPELINES - RETAIL 116 Cloud data warehouse Batch reporting and analytics Real-time dashboards and alerts Data sources and applications Messaging, queuing, data transformations Cloud data lake
  117. 117. 117 Enterprise Event Bus
  118. 118. DATA SILOS 118 INTEGRATION Disparate Data Sources Multiple Data Types Different Frequency Data Fidelity
  119. 119. DATA FUSION 119 EXTRACTING INSIGHTS Why? Complementarity → Completeness Redundancy → Accuracy Cooperative → Dependability Planning Decision-Making Applications Smart Buildings Remote sensing Transportation Multi-Target Tracking (MTT)
  120. 120. 120 STREAMING JOINS CHALLENGES Bounded sliding window of tuples Traditional join semantics Hard to exploit many-core parallelism Sequential operation model (store and process) Linear data flow model (left-to-right / right-to-left) Performance ๏ Deployment parameters ❖ Processing capacity, level of parallelism ๏ Time-varying input parameters ❖ Varying sources, Rate of tuples SplitJoin * Guilsano et al., “Performance Modeling of Stream Joins”, DEBS 2017.
  121. 121. ENTERPRISE EVENT BUS 121 Apache Pulsar Cluster Application 2 Application 5 Application 1 Application 3 Application 4 Application 6
  122. 122. ENTERPRISE EVENT BUS - YAHOO! 122 Scenario Need to collect and distribute user and data events to distributed global applications at Internet scale Challenges ✦ Multiple technologies to handle messaging needs ✦ Multiple, siloed messaging clusters ✦ Hard to meet scale and performance ✦ Complex, fragile environment Solution ✦ Central event data bus using Apache Pulsar ✦ Consolidated multiple technologies and clusters into a single solution ✦ Fully-replicated across 8 global datacenter ✦ Processing >100B messages / day, 2.3M topics
  123. 123. 123 UNBOUNDED UNORDERED EVENT TIME PROCESSING TIME LOW LATENCY LIMITED MEMORY SINGLE PASS , INCREMENTAL APPROXIMATE
  124. 124. APPROXIMATION 124 Guaranteed bounds on approximation factors Error greater than 𝜖 with probability 𝛿 BOUNDS PROBABILISTICDETERMINISTIC
  125. 125. APPROXIMATION 125 CHEBYSHEV HOEFFDING MARKOV PROBABILISTIC BOUNDS X: Random variable, μ: expectation, 𝜖>0
  126. 126. 126 BIAS VS. VARIANCE PROPERTIES BIAS VARIANCE How much the average of the estimate differs from the true mean Variance of the estimate around its mean
  127. 127. 127 BIAS VS. VARIANCE TRADE-OFF * Illustra6on borrowed from h`ps://web.stanford.edu/~has6e/ElemStatLearn/ *
  128. 128. DATA SKETCHES 128
  129. 129. 129 SAMPLING FILTERING CARDINALITY QUANTILE FREQUENT MOMENTS SPECTRUM OF DATA SKETCHES
  130. 130. SAMPLING 130 Volume ๏ Unbounded Velocity Types ๏ Uniform ๏ Non-uniform ❖ Class imbalance Latency
  131. 131. SAMPLING 131 ✦ Maintain dynamic sample ๏ A data stream is a continuous process ๏ Not known in advance how many points may elapse before an analyst may need to use a representative sample ✦ Reservoir sampling[1] ๏ Probabilistic insertions and deletions on arrival of new stream points ๏ Probability of successive insertion of new points reduces with progression of the stream ❖ An unbiased sample contains a larger and larger fraction of points from the distant history of the stream ✦ Practical perspective ๏ Data stream may evolve and hence, the majority of the points in the sample may represent the stale history [1] J. S. Vi`er. Random Sampling with a Reservoir. ACM Transac6ons on Mathema6cal SoWware, Vol. 11(1):37–57, March 1985. OBTAINING A REPRESENTATIVE SAMPLE
  132. 132. SAMPLING 132 ✦ Sliding window approach* (sample size k, window width n) ๏ Sequence-based ❖ Replace expired element with newly arrived element ‣ Disadvantage: highly periodic ❖ Chain-sample approach ‣ Select element ith with probability Min(i,n)/n ‣ Select uniformly at random an index from [i+1, i+n] of the element which will replace the ith item ‣ Maintain k independent chain samples ๏ Timestamp-based ❖ # elements in a moving window may vary over time ❖ Priority-sample approach * B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002. 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
  133. 133. SAMPLING 133 COMPRESSED SENSING Distributed ๏ Communication cost ๏ Non-uniform sampling rates High dimensionality ๏ Sparsity Applications Sensor networks Network traffic monitoring Facial recognition Mitigate latency impact
  134. 134. 134 COMPRESSED SENSING EARLY WORK [Candès et al. 2006] [Donoho 2006] [Cormode and Muthukrishnan 2006] [Freris et al. 2013] Recursive approach COMPRESSED SENSING FOR STREAMING DATA [Yan et al. 2015] DISTRIBUTED OUTLIER DETECTION COMPRESSIVE SAMPLING [Candès and Wakin 2008] RELATED WORKS
  135. 135. 135
  136. 136. 136 Document 2015 EVENTS E-COMMERCE OPPORTUNITY
  137. 137. MODEL 137 INCREMENTAL HANDLE NON-STATIONARITY, CONCEPT DRIFT EVOLVE CONTINUOUSLY
  138. 138. PERSONALIZATION 138 THE HOLY GRAIL Continuously evolving data streams Time decay Multi-armed bandit Contextual Deep learning Privacy
  139. 139. INVENTORY MANAGEMENT 139 FORECASTING Mining retail transactions Seasonality Virality New product launches
  140. 140. 140 Real-Time Trending
  141. 141. 141 HERBERT SIMON “A wealth of information creates a poverty of attention.”
  142. 142. 142 CONTINUOUS (PREFERENCE) TOP-K ELEMENTS * Moura6dis et al., “Con8nuous monitoring of top-k queries over sliding windows”, SIGMOD 2006. # Yu et al., “Processing a large number of con8nuous preference top-k queries”, SIGMOD 2012. ^ Zhang et al., “Reverse k-ranks query”, VLDB 2014.
  143. 143. 143 TWITTER TRENDING NEWS
  144. 144. TRENDING 144 Spotify, Apple Music, Pandora AUDIO
  145. 145. 145 YouTube, Vimeo, DailyMotion TRENDING VIDEO
  146. 146. Facebook Live, Periscope, Twitch 146 TRENDING LIVE VIDEO
  147. 147. 147 FAKE ADDRESSING INTEGRITY
  148. 148. 148 BIAS VARIOUS SOURCES Data bias ๏ Public image datasets ❖ Geographic representation ๏ Induces algorithmic bias[1] ❖ Online advertising ❖ Jobs recommendation ❖ Customer profiling Anomalies Data fidelity ๏ Bad actors ❖ Bots ❖ Data farms * Figure borrowed from Baeza-Yates, “Bias on the web”, 2018. [1] Hajian et al., “Algorithmic bias: From Discrimina8on Discovery to Fairness-aware Data mining”, KDD 2016. *
  149. 149. 149 FACT CHECKING CROWDSOURCING
  150. 150. REAL TIME TRENDING IN PULSAR & HERON 150 Streamlio (Apache Pulsar and Apache Heron) Data Source 2 clean-fn 2 Data Source 1 Data Source 3 clean-fn 1 trend- topology 3 Trending Application T1 T2 T3
  151. 151. LATENCY 151 USER EXPERIENCE
  152. 152. THE TAIL[1] 152 CLASSIFICATION[2] ✦ Whether elements can only be added or can be both inserted and deleted? ๏ Cash register model ๏ Turnstile model ✦ What operations are allowed on the elements? ๏ Comparison model ๏ Fixed-universe model ✦ Whether the algorithm is deterministic/randomized? ๏ Monte Carlo randomization QUANTILE ESTIMATION [1] J. Deam and L. Barroso, “The tail at scale”, 2013. [2] G. Luo et al., “Quan8les over data streams: experimental comparisons, new analyses, and further improvements”, 2016.
  153. 153. QUANTILE ESTIMATION 153 [Arasu and Manku 2005] SLIDING WINDOWS [Cormode et al. 2005] [Yi et al. 2009] CONTINUOUS MONITORING [Shrivastava et al. 2004] [Agarwal et al. 2012] DISTRIBUTED DATA [Cormode et al. 2006] BIASED FLAVORS * Wang et al., “Quan8les over Data Streams: An Experimental Study”, SIGMOD 2013.
  154. 154. QUANTILE ESTIMATION 154 PICK [Blum et al. 1973] EXACT MEDIAN [Munro and Patterson 1980] Q-DIGEST [Shrivastava et al. 2002] Interest in this problem may be traced to the realm of sports and the design of (traditionally, tennis) tournaments to select the first- and second-best players. In 1883, Lewis Carroll published an article denouncing the unfair method by which the second- best player is usually determined in a "knockout tournament" -- the loser of the final match is often not the second-best! (Any of the players who lost only to the best player may be second-best.) Around 1930, Hugo Steinhaus brought the problem into the realm of algorithmic complexity by asking for the minimum number of matches required to (correctly) select both the first- and second-best players from a field of n contestants. I T-DIGEST [Dunning and Ertl 2017] p passes space
  155. 155. Max signal value # Elements Compression Factor Complete binary tree Q-DIGEST[1] 155 ✦ Groups values in variable size buckets of almost equal weights ๏ Unlike a traditional histogram, buckets can overlap ✦ Key features ๏ Detailed information about frequent values preserved ๏ Less frequent values lumped into larger buckets ✦ Using message of size m, answer within an error of ✦ Except root and leaf nodes, a node v ∈ q-digest iff [1] Shrivastava et al., “Medians and Beyond: New Aggrega8on Techniques for Sensor Networks”. SenSys, 2004.
  156. 156. Q-DIGEST (CONTD.) 156 ✦ Building a q-digest ✦ q-digests can be constructed in a distributed fashion ๏ Merge q-digests
  157. 157. T-DIGEST[1] 157 [1] T. Dunning and O. Ertl, ”Compu8ng Extremely Accurate Quan8les using t-digests”, 2017. h`ps://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf ✦ Approximation of rank-based statistics ๏ Compute quantile q with an accuracy relative to max(q, 1-q) ๏ Compute hybrid statistics such as trimmed statistics ✦ Key features ๏ Robust with respect to highly skewed distributions ๏ Independent of the range of input values (unlike q-digest) ๏ Relative error is bounded ❖ Non-equal bin sizes ❖ Few samples contribute to the bins corresponding to the extreme quantiles ๏ Merging independent t-digests ❖ Reasonable accuracy
  158. 158. T-DIGEST (CONTD.) 158 ✦ Group samples into sub-sequences ๏ Smaller sub-sequences near the ends ๏ Larger sub-sequences in the middle ✦ Scaling function ๏ Mapping k is monotonic ❖ k(0) = 1 and k(1) = δ ❖ k-size of each subsequence < 1 s Notional Index Compression parameter Quantile
  159. 159. T-DIGEST (CONTD.) 159 ✦ Estimating quantile via interpolation ๏ Sub-sequences contain centroid of the samples ๏ Estimate the boundaries of the sub-sequences ✦ Error ๏ Scales quadratically in # samples ๏ Small # samples in the sub-sequences near q=0 and q=1 improves accuracy ๏ Lower accuracy in the middle of the distribution ❖ Larger sub-sequences in the middle ✦ Two flavors ๏ Progressive merging (buffering based) and clustering variant s
  160. 160. 160 RELATED WORKS SUMMARY DATA STRUCTURE [Greenwald and Khanna 2001] [Luo et al. 2016] RANDOM BQ-SUMMARY [Zhang et al. 2006] One-scan randomized MR [Cormode et al. 2006] A SMALL SNAPSHOT
  161. 161. 161
  162. 162. PERFORMANCE 162 MONITORING CPM (Cost per 1000 Impressions), vCPM, CPCV CPC (Cost per Click) CPI (Cost per Install) CPE (Cost per Engagement) CPA (Cost per Action), CPO
  163. 163. PERFORMANCE 163 MONITORING AD FORMATS POP UNDER INTERSTITIAL REDIRECTBANNERS REWARDED VIDEO POP UP SOCIAL VIDEO
  164. 164. FRAUD 164 DETECTION IMPRESSION CLICK INSTALL ENGAGEMENT
  165. 165. FRAUD DETECTION USING APACHE PULSAR 165 Apache Pulsar Data Source 2 clean-fn 2 Data Source 1 Data Source 3 clean-fn 1 fraud-fn 1 Fraud Detection T1 T2 T3 fraud-fn 2
  166. 166. IP/DEVICE ID BLACKLISTING 166 CUCKOO FILTER [Fan et al. CoNext 2014] QUOTIENT FILTER [Bender et al. Hot Storage 2011] MORTON FILTER [Breslow and Jayasena, VLDB 2018] BLOOM FILTER [Bloom, CACM 1970] FILTERING
  167. 167. BLOOM FILTER 167 [1] Illustra6on borrowed from h`p://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf [1]
  168. 168. BLOOM FILTER 168 ✦ Natural generalization of hashing ✦ False positives are possible ✦ No false negatives No deletions allowed ✦ For false positive rate ε, # hash functions = log2(1/ε) where, n = # elements, k = # hash functions m = # bits in the array
  169. 169. VARIATIONS 169 BLOOM FILTER [Mitzenmacher 2002] COMPRESSED [Fan et al. 2000] COUNTING [Cohen and Matias 2003] SPECTRAL [Chazelle et al. 2004] BLOOMIER FILTER [Guo et al. 2010] BUFFERED [Lall and Ogihara, 2007] BITWISE [Yoon 2010] A2 [Canim et al. 2010] DYNAMIC [Einziger and Friedman, 2014] BLOOM-g [Debnath et al. 2011] Bloomflash [Naor and Yogev, 2013] SLIDING [Qiao et al. 2014] Tinyset
  170. 170. QUOTIENT FILTER 170 ✦ Supports insertion, deletion ✦ Merging/resizing feasible ✦ Lookups incur a single cache miss ✦ 20% bigger than Bloom Filter ✦ Stores p-bit fingerprints of elements ✦ Compact hash table ✦ P(hard collision) = Employs quotienting: remainder (r least significant bits) is stored in the bucket indexed by the quotient (q most significant bits) where, ⍺ = n/m, m=2q , n: # elements, m: # slots
  171. 171. VARIATIONS 171 QUOTIENT FILTER [Bender et al. 2011] CASCADE [Pandey et al. 2017] COUNTING RANK-AND-SELECT ✦ Multiset of integer, each of width p-bits ✦ l =log2( n/M) + O(1) in-flash QFs of exponentially increasing size, QF1, QF2, …, QFl stored contiguously, M: RAM size ✦ Search: O(log(n/M)) block reads ✦ Insert: O((log( n/M))/B) amortized block writes/erases, B: natural block size of the flash [Dutta et al. 2014] STREAMING [Bungeroth, 2018]
  172. 172. CUCKOO FILTER 172 ✦ Key Highlights ๏ Add and remove items dynamically ๏ For false positive rate ε < 3%, more space efficient than Bloom filter ๏ Higher performance than Bloom filter for many real workloads ๏ Asymptotically worse performance than Bloom filter ❖ Min fingerprint size α log (# entries in table) ✦ Overview ๏ Stores only a fingerprint of an item inserted ❖ Original key and value bits of each item not retrievable ๏ Set membership query for item x: search hash table for fingerprint of x
  173. 173. Cuckoo Hashing [1] CUCKOO FILTER 173 [1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004. [2] Illustra6on borrowed from “Fan et al. Cuckoo Filter: Prac6cally Be`er Than Bloom. In Proceedings of the 10th ACM Interna6onal on Conference on Emerging Networking Experiments and Technologies, 2014.” [2] Illustra6on of Cuckoo hashing [2] ✦ High space occupancy ✦ Practical implementations: multiple items/bucket ✦ Example uses: Software-based Ethernet switches Cuckoo Filter [2] ✦ Uses a multi-way associative Cuckoo hash table ✦ Employs partial-key cuckoo hashing ๏ Store fingerprint of an item ๏ Relocate existing fingerprints to their alternative locations [2]
  174. 174. CUCKOO FILTER 174 ✦ Deletion ๏ Item must have been previously inserted ✦ Partial-key cuckoo hashing ✦ Fingerprint hashing ensures uniform distribution of items in the table ✦ Length of fingerprint << Size of h1 or h2 ✦ Possible to have multiple entries of a fingerprint in a bucket Alternate bucket Significantly shorter than h1 and h2
  175. 175. CONCURRENT CUCKOO HASHING VARIATIONS 175 CUCKOO HASHING AND CUCKOO FILTER [Li et al. 2014] [Eppstein et al. 2017] 2-3 CUCKOO FILTER HORTON TABLES [Sun et al. 2017] SMART CUCKOO [Breslow et al. 2016]
  176. 176. MORTON FILTER 176 ✦ Key Highlights ๏ Insertions heavily biased towards the hash function H1 ❖ Fingerprint retrieval require fewer hardware cache accesses ๏ Overflow Tracking Array (OTA): simple bit vector ❖ Tracks when fingerprints cannot be placed using H1 ❖ Most negative lookups only require accessing a single bucket ๏ Fullness counters ❖ Replace the empty slots (compression) and track the load of each logical bucket ❖ Reads and updates happen in-situ without explicit need for materialization ๏ Most insertions require accessing a single cache line ๏ Fewer fingerprint comparisons than a Cuckoo filter
  177. 177. MORTON FILTER 177 ✦ Hashing ๏ Reduces TLB misses, DRAM row buffer misses and page faults ๏ Does not require total # buckets to be a power of 2 K’s fingerprint Buckets per block Total buckets (multiple of 2) Bucket index where K’s fingerprint is stored Hash function ๏ Compute H’(K) and remap K’s fingerprint without knowing K itself ๏ For any key, candidate buckets mostly fall within the same page of memory Power of 2
  178. 178. SPEND 178 DYNAMIC (RE)-ALLOCATION SANs Facebook, Google, Twitter, Yahoo, Snapchat Ad Networks Affiliates Offer wall TV
  179. 179. 179
  180. 180. EXCHANGES 180 PROGRAMMATIC DoubleClick MoPub OpenX AppNexus
  181. 181. USER PROFILE 181 MULTI-DIMENSIONAL TARGETING Platform Device Type OS Version Apps Installed Age, Gender
  182. 182. STATISTICAL ARBITRAGE 182 PREDICTION Click Through Rate (CTR) Click to Install (CTI) Life Time Value (LTV) Churn probability REAL-TIME ANOMALY DETECTION
  183. 183. ANOMALY DETECTION 183 Temporal Distribution based A N O M A L Y DIFFERENT FLAVORS safaribooksonline.com/library/view/understanding-anomaly-detection/9781491983676/ “On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data”, by Choudhary et al. 2017. (https://arxiv.org/abs/1710.04735)
  184. 184. ANOMALY DETECTION 184 ROOTED IN STUDIES IN PROBABILITY 1777 1713 Reconciling discrepant observations Euler 1749 Irregularities in the motion of Saturn and Jupiter Mayer 1750 Study of lunar libration Boscovich 1755 Measurements of the mean ellipticity of the Earth
  185. 185. ANOMALY DETECTION 185 LONG HISTORY First formalization of an exact rule for rejection of anomalies
  186. 186. ANOMALY DETECTION 186 CHARACTERISTICS MAGNITUDE Severity WIDTH Actionability FREQUENCY Reliability DIRECTION Positive/Negative Global Local
  187. 187. ANOMALY DETECTION 187 WHY IT’S NON-TRIVIAL SEASONALITY TREND BREAKOUTSTATIONARITYNOISE
  188. 188. ANOMALY DETECTION 188 ASSUMPTIONS Normality, Stationarity MOVING AVERAGES Params: Width, Decay RULE BASED µ ± σ COMMON APPROACHES
  189. 189. ANOMALY DETECTION 189 MAD[1] Median Absolute Deviation MEDIAN MCD[2] Minimum Covariance Determinant MVEE[3,4] Minimum Volume Enclosing Ellipsoid ROBUST MEASURES [1]P. J. Rousseeuw and C. Croux, “Alterna6ves to the Median Absolute Devia6on”, 1993. [2] h`p://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract [3] P. J. Rousseeuw and A. M. Leroy.,“Robust Regression and Outlier Detec6on”, 1987. [4] M. J.Todda and E. A. Yıldırım , “On Khachiyan's algorithm for the computa6on of minimum-volume enclosing ellipsoids”, 2007.
  190. 190. ANOMALY DETECTION 190 CONTEXT ✦ Data types ๏ Text, Audio, Video ✦ Data veracity ๏ Wearables ๏ Smart cities, Connected Home ๏ Internet of Things LIVE DATA ✦ Multi-dimensional ✦ Mow memory footprint ✦ Accuracy-speed tradeoff CHALLENGES
  191. 191. 191 ANOMALY DETECTION DIFFERENT APPLICATION DOMAINS RADIO ANOMALY DETECTION MARKER DISCOVERY CROWDED SCENE ANALYSIS ANOMALOUS HUMAN BEHAVIOR HUMAN TRAJECTORY PREDICTION [Popoola and Wang 2012] [Alahi et al. 2016] [O’Shea et al. 2016] [Schegl et al. 2017] [Sabokrou et al. 2017]
  192. 192. 192 Real-Time Security
  193. 193. THREAT DETECTION 193 POTENTIAL HIGH BUSINESS DOWNSIDE DDos Attack Network Intrusion Harassement Account hacking Identity theft Purchases
  194. 194. MONITORING 194 AUDIT LOGS Answering the “What, When, Who, Why” Downloading of unusual amount of data Troubleshoot systems, networks Real-time packet analysis Challenge: Encryption Suspicious access pattern
  195. 195. REAL TIME SECURITY USING APACHE PULSAR 195 Apache Pulsar Data Source 2 threat-fn 2 Data Source 1 Data Source 3 threat-fn 1 Query Audit Logs (SQL) Streaming threats & anomalies T1 T2 T3 Audit Log Investigation
  196. 196. DEVICE IDS/IP ADDRESSES 196 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS ✦ Require large amount of memory ✦ Long tail - low business value EXACT ✦ Types ๏ Counter-based ❖ Example: Frequent Algorithm [Demaine et al. 2002] ๏ Sketch-based ❖ Example: GroupTest [Cormode and Muthukrishnan, 2002] ✦ Parameters ๏ # counters, # update operations, error APPROXIMATE
  197. 197. FLAVORS 197 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS l 1 heavy hitter ๏ m can be known/unknown ๏ Deterministic ❖ [Misra and Gries 1982, Karp et al. 2003] ๏ Randomized ❖ [Estan and Varghese 2003, Kumar and Xu 2006] [1] Woodruff, 2016. “New Algorithms for Heavy Hi`ers in Data Streams”.ICDT. Frequency of item i # elements in stream Output
  198. 198. FLAVORS 198 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS l 2 heavy hitter ๏ Example ❖ [Charikar et al. 2004] ๏ Applications ❖ Compressed sensing ❖ Numerical linear algebra ❖ Cascaded aggregates Frequency of item i # dis6nct items Output
  199. 199. COUNT-MIN SKETCH[1] 199 ✦ A two-dimensional array counts with w columns and d rows ✦ Each entry of the array is initially zero ✦ d hash functions are chosen uniformly at random from a pairwise independent family ✦ Update ๏ For a new element i, for each row j and k = hj(i), increment the kth column by one ✦ Point query where, sketch is the table ✦ Parameters [1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applica6ons". J. Algorithms 55: 29–38. ),( δε ! ! " # # $ = ε e w ! ! " # # $ = δ 1 lnd }1{}1{:,,1 wnhh d ……… →
  200. 200. COUNT-MIN SKETCH 200 ✦ Count-Min sketch with conservative update (CU sketch) ๏ Update an item with frequency c ๏ Avoid unnecessary updating of counter values => Reduce over-estimation error ๏ Prone to over-estimation error on low-frequency items ✦ Lossy Conservative Update (LCU) - SWS ๏ Divide stream into windows ๏ At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤ [1] Cormode, G. 2009. Encyclopedia entry on ’Count-MinSketch’. In Encyclopedia of Database Systems. Springer., 511–516. VARIANTS [1]
  201. 201. COUNTER TREE[1] 201 ✦ Motivation ๏ Equi-bitwidth counters → inefficient space usage owing variance in counts ✦ Two-dimensional counter sharing ✦ m virtual counters (V[i], 0 ≤ i < m) ๏ Path from a leaf node to the root ๏ Counter at layer j in V[i] given by: [1] Chen et al., “Counter Tree: A Scalable Counter Architecture for Per-FLow Traffic Measurement”, TON 2017. # leaf nodes Total space (# bits) Counter bit width Tree height Degree of a non-leaf node
  202. 202. COUNTER TREE (CONTD.) 202 ✦ Counting range ✦ Virtual counter array Vf ๏ Vf[i] = V[hi(f)], 0 ≤ i < r, ✦ Recording ๏ Increment a counter in Vf, chosen uniformly at random, by one ๏ May result in update of multiple counters # counters in the bo`om layer
  203. 203. COUNTER TREE (CONTD.) 203 ✦ Estimation ๏ CT based Estimation (CTE) ❖ k = dh-1 ❖ Xi is the value of the subtree rooted at C[v], where C[v] is the component counter of V[u] at the highest level and ๏ CT based Maximum Likelihood Estimation (CTM) Reduce positive bias (overestimation)
  204. 204. VARIATIONS 204 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS LOSSY COUNTING [Manku and Motwani, 2002] [Homem and Carvalho, 2010] FILTERED SPACE SAVING COUNT-SKETCH [Dimitropoulos et al. 2008] PROBABILISTIC LOSSY COUNTING [Charikar et al, 2002] [Pitel and Fouquier, 2015] COUNT-MIN-LOG SPACE SAVING ALGORITHM [Metwally et al. 2005] FDPM [Yu etal., 2006] [Sivaraman et al. 2017] HASHPIPE
  205. 205. 205
  206. 206. HEALTH CARE 206 WEARABLES Smart watches, smart glasses, smart clothing, implantables Data veracity ๏ Hear rate (ECG and HRV) ๏ Brainwave (EEG) ๏ Muscle bio-signals (EMG) ๏ Blood pressure ๏ Calories burned ๏ Body temperature ๏ Steps walked Other applications ๏ Blood alcohol content ๏ Athletic performance
  207. 207. MONITORING 207 OIL DRILLING Early detection of leaks and/or spills ๏ Oil ๏ Flammable gases ๏ Hazardous chemicals Equipment health ๏ Corrosion Operational metrics ๏ Pressure ๏ Temperature ๏ Flow ๏ Air quality
  208. 208. QUALITY OF SERVICE 208 CUSTOMER FOCUS WiFi, Video streaming, E-commerce Metrics ๏ Availability ๏ Throughput ๏ Response time ๏ Document Complete ๏ Jitter ๏ Packet loss ratio ๏ Delivery success rate (messaging) ๏ Location accuracy Common issues ๏ Anomalies, Breakouts
  209. 209. INDUSTRIAL IOT USING APACHE PULSAR 209 Apache Pulsar Cloud Data Source 2 clean-fn 2 Data Source 1 Data Source 3clean-fn 1 logic-fn 1 T1 T2 T3 logic-fn 2 Apache Pulsar Edge filter-fn 1 filter-fn 1
  210. 210. NUMBER OF DISTINCT DATA SOURCES 210 CARDINALITY ESTIMATION ✦ Hash values as strings ✦ Occurrence of particular patterns in the binary representation ✦ Example: Hyperloglog [Flajolet et al. 2008] BIT-PATTERN OBSERVABLES ✦ Hash values as real numbers ✦ k-th smallest value ๏ Insensitive to distribution of repeated values ✦ Examples: MinCount [Giroire, 2000] ORDER STATISTIC OBSERVABLES
  211. 211. CARDINALITY ESTIMATION 211 ANOTHER CLASSIFICATION SKETCH-BASED Scan the entire data set once Hash the items Create sketch SAMPLING Potentially large estimation error UNIFORM HASHING Employs a uniform hash function LOGARITHMIC HASHING Keeps track of the most uncommon element observed so far Example: Hyperloglog INTERVAL BASED How packed an interval is? BUCKET BASED P(bucket is non-empty)
  212. 212. HYPERLOGLOG 212 where ✦ Apply hash function h to every element in a multiset ✦ Cardinality of multiset is 2max(ϱ) where 0ϱ-11 is the bit pattern observed at the beginning of a hash value ✦ Above suffers with high variance ๏ Employ stochastic averaging ๏ Partition input stream into m sub-streams Si using first p bits of hash values (m = 2p)
  213. 213. HYPERLOGLOG 213 OPTIMIZATIONS ✦ Use of 64-bit hash function ๏ Total memory requirement 5 * 2p -> 6 * 2p, where p is the precision ✦ Empirical bias correction ๏ Uses empirically determined data for cardinalities smaller than 5m and uses the unmodified raw estimate otherwise ✦ Sparse representation ๏ For n≪m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w) ๏ Use variable length encoding for integers that uses variable number of bytes to represent integers ๏ Use difference encoding - store the difference between successive elements ✦ Other optimizations [1, 2] [1] h`p://druid.io/blog/2014/02/18/hyperloglog-op6miza6ons-for-real-world-systems.html [2] h`p://an6rez.com/news/75
  214. 214. MIN-COUNT VARIATIONS 214 CARDINALITY ESTIMATION [Giroire, 2009] Optimal statistical efficiency [Ting, 2014] DISCRETE MAX-COUNT F0 AND L0 ESTIMATION [Chen et al. 2011] S-BITMAP [Kane et al. 2010] Optimal space complexity
  215. 215. WRAPPING UP 215 PLATFORM UNIFICATION APPLICATIONS ANALYTICS
  216. 216. QUESTIONS 216
  217. 217. STAY IN TOUCH @arun_kejariwal @karthikz TWITTER EMAIL arun_kejariwal@acm.org karthik@streaml.io 217
  218. 218. 218 @arun_kejariwal @karthikz
  219. 219. READINGS 219 Data Streams: Algorithms and Applications”, by M. Muthukrishnan “Sketching as a Tool for Numerical Linear Algebra”, by D. Woodruff “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”, by G. Cormode, M. Garofalakis and P. J. Haas “Data Streams: Models and Algorithms”, by C. Aggarwal. “Graph Streaming Algorithms”, by A. McGregor “Data Sketching”, by G. Cormode.
  220. 220. READINGS 220 “The algebra of stream processing functions”, TCS 2001. STREAM CALCULUS “A conductive calculus of streams”, MSCS 2005. “Temporal stream algebra”, 2012. “Stream processing col-algebraically”, SCP 2013. “An algebra for pattern matching, time aware aggregates and partitions on relational data streams”, DEBS 2015.
  221. 221. READINGS 221 “Streaming Algorithms for Robust Distinct Elements”, SIGMOD 2016. “Pyramid Sketch”, VLDB 2017. “Bias-Aware Sketches”, VLDB 2017. “Efficient Adaptive Detection of Complex Event Patterns”, VLDB 2018. “Cardinality Estimation: An Experimental Survey”, VLDB 2018. “Augmented Sketch: Faster and More Accurate Stream Processing”, SIGMOD 2016. “BPTree: an L2 heavy hitters algorithm using constant memory”, PODS 2017. “A comparative analysis of state-of-the-art SQL- on-Hadoop systems for interactive analytics”, BigData 2018. “Unsupervised real-time anomaly detection for streaming data”, Neurocomputing2017. “Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams”, SIGMOD 2016.
  222. 222. 222 READINGS “Streaming Anomaly Detection Using Randomized Matrix Sketching”, VLDB 2015. “Graph Sketches: Sparsification, Spanners, and Subgraphs”, PODS 2012. “Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams”, SIAM JoC 2009. “Querying and mining data streams: You only get one look”, SIGMOD 2002. “Models and Issues in Data Stream Systems”, PODS 2002. “Clustering Data Streams”, FOCS 2000. “Persistent Data Sketching”, VLDB 2015. “SCREEN: Stream Data Cleaning under Speed Constraints”, SIGMOD 2015. “An optimal algorithm for the distinct elements problem”, PODS 2010. “Statistical Analysis of Sketch Estimators”, SIGMOD 2007.
  223. 223. OPEN SOURCE 223 Data Sketches (https://datasketches.github.io/) Algebird (https://github.com/twitter/algebird) StreamDM (http://huawei-noah.github.io/streamDM/) stream (https://cran.r-project.org/web/packages/stream/) FluRS (https://github.com/takuti/flurs)
  224. 224. RESOURCES 224 https://www.slideshare.net/arunkejariwal/correlation-analysis-on-live-data-streams (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/modern-realtime-streaming-architectures (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265 (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/anomaly-detection-in-realtime-data-streams-using-heron (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/real-time-analytics-algorithms-and-systems (VLDB 2015)
  225. 225. RESOURCES 225 http://www.cs.toronto.edu/~koudas/courses/csc2508/SQL-on-Hadoop-Final.pdf (VLDB 2015) https://www.cse.ust.hk/vldb2002/program-info/tutorial-slides/T5garofalalis.pdf (VLDB 2002) https://datasketches.github.io/docs/Research.html https://people.cs.umass.edu/~mcgregor/slides/07-nipsslides.pdf (NIPS 2007) http://dimacs.rutgers.edu/~graham/pubs/slides/streammining.pdf
  226. 226. RESOURCES 226 https://people.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf http://dmac.rutgers.edu/Workshops/WGUnifyingTheory/Slides/cormode.pdf https://archive.siam.org/meetings/sdm13/gupta.pdf (SDM 2013) http://www.outlier-analytics.org/odd13kdd/papers/slides_charu_aggarwal.pdf (ODD 2013) https://www.siam.org/meetings/sdm08/TS2.ppt (SDM 2008)
  227. 227. RESOURCES 227 https://www.cc.gatech.edu/~jx/reprints/talks/sigm07_tutorial.pdf (SIGMETRICS 2007) hanj.cs.illinois.edu/bk3/bk3_slides/12Outlier.ppt https://gist.github.com/debasishg/8172796 https://www.euroscipy.org/2017/descriptions/19827.html

×