SlideShare una empresa de Scribd logo
1 de 46
© Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead
San Francisco Hadoop Users Group, June 14th 2016
San Francisco, CA
Streaming ETL for All
© Rocana, Inc. All Rights Reserved. | 2
Slides
http://bit.ly/streaming-etl-slides
© Rocana, Inc. All Rights Reserved. | 3
Context
© Rocana, Inc. All Rights Reserved. | 4
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
© Rocana, Inc. All Rights Reserved. | 5
© Rocana, Inc. All Rights Reserved. | 6
History
© Rocana, Inc. All Rights Reserved. | 7
Spark
Impala
“Legacy” data architecture
HDFS
Avro/Parquet FilesFlume/Sqoop
Data Producers
MapReduc
e
Visualization/Query
© Rocana, Inc. All Rights Reserved. | 8
Flink
Storm
Stream data architecture
Kafka
Avro Serialized
Recrods
Data Producers Spark Streaming
Real-time Visualization
HDFS
Avro/Parquet FilesKafka Consumers
© Rocana, Inc. All Rights Reserved. | 9
Flink
Storm
Stream data architecture
Kafka
Avro Serialized
Recrods
Data Producers Spark Streaming
Real-time Visualization
HDFS
Avro/Parquet FilesKafka Consumers
© Rocana, Inc. All Rights Reserved. | 10
Stream processing
A primer
© Rocana, Inc. All Rights Reserved. | 11
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
© Rocana, Inc. All Rights Reserved. | 12
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
© Rocana, Inc. All Rights Reserved. | 13
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
• Data transformation
© Rocana, Inc. All Rights Reserved. | 14
Apache Storm
• "Distributed real-time computation system"
• Applications packaged into topologies (think MapReduce job)
• Topologies operate over streams of tuples
• Spout: source of a stream
• Bolt: arbitrary operation such as filtering, aggregating, joining, or
executing arbitrary functions
© Rocana, Inc. All Rights Reserved. | 15
Apache Spark
• Supports batch and stream processing
• Continuous stream of records discretized into a DStream
• DStream: a sequence of RDDs (batches of records)
• Micro-batch
© Rocana, Inc. All Rights Reserved. | 16
Apache Flink
• Supports batch and stream processing
• DataStream: unbounded collection of records
• Operations can apply to individual records or windows of records
• Supports record-at-a-time processing (like Storm)
© Rocana, Inc. All Rights Reserved. | 17
Apache Kafka
• Pub-sub messaging system implemented as a distributed commit log
• Popular as a source and sink for data streams
• Scalability, durability, and easy-to-understand delivery guarantees
• Can do stream processing directly in Kafka consumers
• Kafka Streams
© Rocana, Inc. All Rights Reserved. | 18
Data transformation
© Rocana, Inc. All Rights Reserved. | 19
Filter
filter
© Rocana, Inc. All Rights Reserved. | 20
Extract
127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326
ts: 1436576671000
body: <binary blob>
event_type_id: 100
...
extract
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
date: "[31/March/2016]"
request: "GET /index.html HTTP/1.0"
status_code: "200"
size: "2326"
}
© Rocana, Inc. All Rights Reserved. | 21
Project
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
date: "[31/March/2016]"
request: "GET /index.html HTTP/1.0"
status_code: "200"
size: "2326"
}
ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
project
© Rocana, Inc. All Rights Reserved. | 22
Problem
© Rocana, Inc. All Rights Reserved. | 23
Who
• Developers
• Data engineers
• Sysadmins
• Analysts
© Rocana, Inc. All Rights Reserved. | 24
Tools
© Rocana, Inc. All Rights Reserved. | 25
The dark art of data science
• Feature engineering
• “Getting a mess of raw data that can be used as input to a machine
learning algorithm” - @josh_wills
• Video from Midwest.io 2014
© Rocana, Inc. All Rights Reserved. | 26
Data transformation for all
© Rocana, Inc. All Rights Reserved. | 27
Rocana Transform
• Library
• Java
• Rocana configuration
• JSON + comments + specific numeric types - excess quoting
© Rocana, Inc. All Rights Reserved. | 28
Data model
• Event schema
• id: A globally unique identifier for this event
• ts: Epoch timestamp in milliseconds
• event_type_id: ID indicating the type of the event
• location: Location from which the event was generated
• host: Hostname, IP, or other device identifier from which the event was
generated
• service: Service or process from which the event was generated
• body: Raw event content in bytes
• attributes: Event type-specific key/value pairs
© Rocana, Inc. All Rights Reserved. | 29
Example event
{
"id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
"event_type_id": 100,
"ts": 1436576671000,
"location": "aws/us-west-2a",
"host": "example01.rocana.com",
"service": "dhclient",
"body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
"attributes": {
"syslog_timestamp": "1436576671000",
"syslog_process": "dhclient",
"syslog_pid": "865",
"syslog_facility": "3",
"syslog_severity": "6",
"syslog_hostname": "example01",
"syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
}
}
© Rocana, Inc. All Rights Reserved. | 30
Filter, extract, and flatten
© Rocana, Inc. All Rights Reserved. | 31
Filter, extract, and flatten
• Filter out events without type id 100
• Filter out events without hostname prefix "ex"
• Extract a numeric prefix from the syslog message
• Flatten syslog attributes to top-level fields in a different avro schema
© Rocana, Inc. All Rights Reserved. | 32
Filter, extract, and flatten
{
load-event: {},
// Filter by event_type_id
filter: { expression: "${event_type_id == 100}" },
// Extract hostname prefix
regex: { ... },
filter: { expression: "${host_prefix.match.group.1 == 'ex'}",
// Extract a numeric prefix from the syslog message
regex: { ... },
// Build flattened record
build-avro-record: { ... },
// Accumulate output record
accumulate-output: {
value: "${output_record}"
}
}
© Rocana, Inc. All Rights Reserved. | 33
Extract hostname prefix
{
load-event: {},
filter: { expression: "${event_type_id == 100}" },
regex: {
pattern: "^(.{2}).*$",
value: "${attr.syslog_hostname}",
destination: "host_prefix"
},
filter: { expression: "${host_prefix.match.group.1 == 'ex'}",
...
}
© Rocana, Inc. All Rights Reserved. | 34
Extract numeric prefix
...
filter: { expression: "${host_prefix.match.group.1 == 'ex'}",
regex: {
pattern: "^([0-9]*)",
value: "${attributes['syslog_message']}",
destination: "msg",
match-actions: {
set-values: { extracted_field: "${msg.match.group.1}" }
},
no-match-actions: {
set-values: { extracted_field: "" }
}
},
...
© Rocana, Inc. All Rights Reserved. | 35
Build flattened record
...
build-avro-record: {
schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
destination: "output_record",
field-mapping: {
ts: "${ts}",
event_type_id: "${event_type_id}",
source: "${source}",
syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
...
syslog_message: "${attributes['syslog_message']}",
syslog_pid: "${convert:toInt(attributes['syslog_pid)}",
extracted_field: "${extracted_field}"
},
},
...
© Rocana, Inc. All Rights Reserved. | 36
Extract metrics from log data
© Rocana, Inc. All Rights Reserved. | 37
Extract metrics
• Input: HTTP status logs
• Extract request latency
• Extract counts by HTTP status code
• Metric types
• Guage: A value that varies over time (think latency, CPU %, etc.)
• Counter: A value that accumulates over time (think event volume, status codes,
etc.)
© Rocana, Inc. All Rights Reserved. | 38
Example metric event
{
"id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
"event_type_id": 107,
"ts": 1436576671000,
"location": "aws/us-west-2a",
"host": "web01.rocana.com",
"service": "httpd",
"attributes": {
"m.http.request.latency": "4.2000000000E1|g",
"m.http.status.401.count": "1.0000000000E0|c",
}
}
© Rocana, Inc. All Rights Reserved. | 39
Extract metrics
{
load-event: {},
build-metric: {
gauge-mapping: {
http.request.latency: "${convert:toDouble(attributes['latency'])}"
},
destination: "latency_metric"
},
accumulate-output: { value: "${latency_metric}" },
build-metric: {
dynamic-counter-mapping: [
"${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
],
destination: "status_metric"
},
accumulate-output: { value: "${status_metric}" }
}
© Rocana, Inc. All Rights Reserved. | 40
Architecture
© Rocana, Inc. All Rights Reserved. | 41
Java action objects
Architecture
Configuration file Java action objects Context
Variables
Driver
1. Parse config
2. Initialize
context
5. Copy output
3. Execute actions
4. Read/write
variables
© Rocana, Inc. All Rights Reserved. | 42
Custom actions
• Actions loaded at runtime using Java services framework
• Add your jar to the classpath
• Custom actions appear as top-level keywords just like regular actions
• Implement the execute() method of the Action interface
• Implement the build() method of the ActionBuilder interface
© Rocana, Inc. All Rights Reserved. | 43
Custom actions
• Parse custom log formats
• Cisco ACS
• Citrix
• Juniper
• Customer-specific formats
• Lookup IP addresses in the MaxMind GeoIP2 database
• Reference dataset lookups
• Device id to device name
© Rocana, Inc. All Rights Reserved. | 44
Putting it all together
• Stream processing is causing us to re-think how we analyze data
• Limiting accessibility of data transformation side increases costs and
decreases velocity
• Reduce your reliance on developers to code custom pipelines
• Re-use transformation configuration in any stream processing framework
or batch job
© Rocana, Inc. All Rights Reserved. | 45
Coming soon
• Rocana transform will be released under the ASL 2.0
• The configuration library is available today:
• https://github.com/scalingdata/rocana-configuration
© Rocana, Inc. All Rights Reserved. | 46
Questions?

Más contenido relacionado

La actualidad más candente

Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Databricks
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streamingdatamantra
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Streaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka StreamsStreaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka StreamsLightbend
 
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ confluent
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
Designing Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, AivenDesigning Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, AivenHostedbyConfluent
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Apache Apex
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualizationhadoopsphere
 
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...Lightbend
 

La actualidad más candente (20)

How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Streaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka StreamsStreaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka Streams
 
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Designing Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, AivenDesigning Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
 
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
 

Destacado

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @WindwardDemi Ben-Ari
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark Summit
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkAnu Shetty
 
Test Automation and Continuous Integration
Test Automation and Continuous Integration Test Automation and Continuous Integration
Test Automation and Continuous Integration TestCampRO
 
Distributed Testing Environment
Distributed Testing EnvironmentDistributed Testing Environment
Distributed Testing EnvironmentŁukasz Morawski
 
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Lucidworks
 
Production Readiness Testing Using Spark
Production Readiness Testing Using SparkProduction Readiness Testing Using Spark
Production Readiness Testing Using SparkSalesforce Engineering
 
Ektron 8.5 RC - Search
Ektron 8.5 RC - SearchEktron 8.5 RC - Search
Ektron 8.5 RC - SearchBillCavaUs
 
Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Kevin Risden
 
Events, Signals, and Recommendations
Events, Signals, and RecommendationsEvents, Signals, and Recommendations
Events, Signals, and RecommendationsLucidworks
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolutionivan provalov
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkabandatamantra
 
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Lucidworks
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Lucidworks
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbLucidworks
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 

Destacado (20)

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
Test Automation and Continuous Integration
Test Automation and Continuous Integration Test Automation and Continuous Integration
Test Automation and Continuous Integration
 
Distributed Testing Environment
Distributed Testing EnvironmentDistributed Testing Environment
Distributed Testing Environment
 
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Production Readiness Testing Using Spark
Production Readiness Testing Using SparkProduction Readiness Testing Using Spark
Production Readiness Testing Using Spark
 
Ektron 8.5 RC - Search
Ektron 8.5 RC - SearchEktron 8.5 RC - Search
Ektron 8.5 RC - Search
 
Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016
 
Events, Signals, and Recommendations
Events, Signals, and RecommendationsEvents, Signals, and Recommendations
Events, Signals, and Recommendations
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolution
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 

Similar a Streaming ETL for All

Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Felicia Haggarty
 
Fabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkFabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkShashank Gautam
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017iguazio
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)Eran Duchan
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...confluent
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer confluent
 
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016cdmaxime
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3uzzal basak
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Eric Sammer
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETconfluent
 

Similar a Streaming ETL for All (20)

Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015
 
Fabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkFabric - Realtime stream processing framework
Fabric - Realtime stream processing framework
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark etl
Spark etlSpark etl
Spark etl
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NET
 

Más de Joey Echeverria

The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityJoey Echeverria
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and ClouderaJoey Echeverria
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoopJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real worldJoey Echeverria
 

Más de Joey Echeverria (10)

Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
 
Big data security
Big data securityBig data security
Big data security
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
 

Último

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Último (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Streaming ETL for All

  • 1. © Rocana, Inc. All Rights Reserved. | 1 Joey Echeverria, Platform Technical Lead San Francisco Hadoop Users Group, June 14th 2016 San Francisco, CA Streaming ETL for All
  • 2. © Rocana, Inc. All Rights Reserved. | 2 Slides http://bit.ly/streaming-etl-slides
  • 3. © Rocana, Inc. All Rights Reserved. | 3 Context
  • 4. © Rocana, Inc. All Rights Reserved. | 4 Joey • Where I work: Rocana – Platform Technical Lead • Where I used to work: Cloudera (’11-’15), NSA • Distributed systems, security, data processing, big data
  • 5. © Rocana, Inc. All Rights Reserved. | 5
  • 6. © Rocana, Inc. All Rights Reserved. | 6 History
  • 7. © Rocana, Inc. All Rights Reserved. | 7 Spark Impala “Legacy” data architecture HDFS Avro/Parquet FilesFlume/Sqoop Data Producers MapReduc e Visualization/Query
  • 8. © Rocana, Inc. All Rights Reserved. | 8 Flink Storm Stream data architecture Kafka Avro Serialized Recrods Data Producers Spark Streaming Real-time Visualization HDFS Avro/Parquet FilesKafka Consumers
  • 9. © Rocana, Inc. All Rights Reserved. | 9 Flink Storm Stream data architecture Kafka Avro Serialized Recrods Data Producers Spark Streaming Real-time Visualization HDFS Avro/Parquet FilesKafka Consumers
  • 10. © Rocana, Inc. All Rights Reserved. | 10 Stream processing A primer
  • 11. © Rocana, Inc. All Rights Reserved. | 11 Stream processing • Filter • Extract • Project • Aggregate • Join • Model
  • 12. © Rocana, Inc. All Rights Reserved. | 12 Stream processing • Filter • Extract • Project • Aggregate • Join • Model
  • 13. © Rocana, Inc. All Rights Reserved. | 13 Stream processing • Filter • Extract • Project • Aggregate • Join • Model • Data transformation
  • 14. © Rocana, Inc. All Rights Reserved. | 14 Apache Storm • "Distributed real-time computation system" • Applications packaged into topologies (think MapReduce job) • Topologies operate over streams of tuples • Spout: source of a stream • Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
  • 15. © Rocana, Inc. All Rights Reserved. | 15 Apache Spark • Supports batch and stream processing • Continuous stream of records discretized into a DStream • DStream: a sequence of RDDs (batches of records) • Micro-batch
  • 16. © Rocana, Inc. All Rights Reserved. | 16 Apache Flink • Supports batch and stream processing • DataStream: unbounded collection of records • Operations can apply to individual records or windows of records • Supports record-at-a-time processing (like Storm)
  • 17. © Rocana, Inc. All Rights Reserved. | 17 Apache Kafka • Pub-sub messaging system implemented as a distributed commit log • Popular as a source and sink for data streams • Scalability, durability, and easy-to-understand delivery guarantees • Can do stream processing directly in Kafka consumers • Kafka Streams
  • 18. © Rocana, Inc. All Rights Reserved. | 18 Data transformation
  • 19. © Rocana, Inc. All Rights Reserved. | 19 Filter filter
  • 20. © Rocana, Inc. All Rights Reserved. | 20 Extract 127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326 ts: 1436576671000 body: <binary blob> event_type_id: 100 ... extract ts: 1436576671000 body: <binary blob> event_type_id: 100 attributes: { ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" date: "[31/March/2016]" request: "GET /index.html HTTP/1.0" status_code: "200" size: "2326" }
  • 21. © Rocana, Inc. All Rights Reserved. | 21 Project ts: 1436576671000 body: <binary blob> event_type_id: 100 attributes: { ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" date: "[31/March/2016]" request: "GET /index.html HTTP/1.0" status_code: "200" size: "2326" } ts: 1459444413000 ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" request: "GET /index.html HTTP/1.0" status_code: 200 size: 2326 project
  • 22. © Rocana, Inc. All Rights Reserved. | 22 Problem
  • 23. © Rocana, Inc. All Rights Reserved. | 23 Who • Developers • Data engineers • Sysadmins • Analysts
  • 24. © Rocana, Inc. All Rights Reserved. | 24 Tools
  • 25. © Rocana, Inc. All Rights Reserved. | 25 The dark art of data science • Feature engineering • “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills • Video from Midwest.io 2014
  • 26. © Rocana, Inc. All Rights Reserved. | 26 Data transformation for all
  • 27. © Rocana, Inc. All Rights Reserved. | 27 Rocana Transform • Library • Java • Rocana configuration • JSON + comments + specific numeric types - excess quoting
  • 28. © Rocana, Inc. All Rights Reserved. | 28 Data model • Event schema • id: A globally unique identifier for this event • ts: Epoch timestamp in milliseconds • event_type_id: ID indicating the type of the event • location: Location from which the event was generated • host: Hostname, IP, or other device identifier from which the event was generated • service: Service or process from which the event was generated • body: Raw event content in bytes • attributes: Event type-specific key/value pairs
  • 29. © Rocana, Inc. All Rights Reserved. | 29 Example event { "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====", "event_type_id": 100, "ts": 1436576671000, "location": "aws/us-west-2a", "host": "example01.rocana.com", "service": "dhclient", "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …", "attributes": { "syslog_timestamp": "1436576671000", "syslog_process": "dhclient", "syslog_pid": "865", "syslog_facility": "3", "syslog_severity": "6", "syslog_hostname": "example01", "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)" } }
  • 30. © Rocana, Inc. All Rights Reserved. | 30 Filter, extract, and flatten
  • 31. © Rocana, Inc. All Rights Reserved. | 31 Filter, extract, and flatten • Filter out events without type id 100 • Filter out events without hostname prefix "ex" • Extract a numeric prefix from the syslog message • Flatten syslog attributes to top-level fields in a different avro schema
  • 32. © Rocana, Inc. All Rights Reserved. | 32 Filter, extract, and flatten { load-event: {}, // Filter by event_type_id filter: { expression: "${event_type_id == 100}" }, // Extract hostname prefix regex: { ... }, filter: { expression: "${host_prefix.match.group.1 == 'ex'}", // Extract a numeric prefix from the syslog message regex: { ... }, // Build flattened record build-avro-record: { ... }, // Accumulate output record accumulate-output: { value: "${output_record}" } }
  • 33. © Rocana, Inc. All Rights Reserved. | 33 Extract hostname prefix { load-event: {}, filter: { expression: "${event_type_id == 100}" }, regex: { pattern: "^(.{2}).*$", value: "${attr.syslog_hostname}", destination: "host_prefix" }, filter: { expression: "${host_prefix.match.group.1 == 'ex'}", ... }
  • 34. © Rocana, Inc. All Rights Reserved. | 34 Extract numeric prefix ... filter: { expression: "${host_prefix.match.group.1 == 'ex'}", regex: { pattern: "^([0-9]*)", value: "${attributes['syslog_message']}", destination: "msg", match-actions: { set-values: { extracted_field: "${msg.match.group.1}" } }, no-match-actions: { set-values: { extracted_field: "" } } }, ...
  • 35. © Rocana, Inc. All Rights Reserved. | 35 Build flattened record ... build-avro-record: { schema-uri: "resource:avro-schemas/flattened-syslog.avsc", destination: "output_record", field-mapping: { ts: "${ts}", event_type_id: "${event_type_id}", source: "${source}", syslog_facility: "${convert:toInt(attributes['syslog_facility'])}", syslog_severity: "${convert:toInt(attributes['syslog_severity'])}", ... syslog_message: "${attributes['syslog_message']}", syslog_pid: "${convert:toInt(attributes['syslog_pid)}", extracted_field: "${extracted_field}" }, }, ...
  • 36. © Rocana, Inc. All Rights Reserved. | 36 Extract metrics from log data
  • 37. © Rocana, Inc. All Rights Reserved. | 37 Extract metrics • Input: HTTP status logs • Extract request latency • Extract counts by HTTP status code • Metric types • Guage: A value that varies over time (think latency, CPU %, etc.) • Counter: A value that accumulates over time (think event volume, status codes, etc.)
  • 38. © Rocana, Inc. All Rights Reserved. | 38 Example metric event { "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====", "event_type_id": 107, "ts": 1436576671000, "location": "aws/us-west-2a", "host": "web01.rocana.com", "service": "httpd", "attributes": { "m.http.request.latency": "4.2000000000E1|g", "m.http.status.401.count": "1.0000000000E0|c", } }
  • 39. © Rocana, Inc. All Rights Reserved. | 39 Extract metrics { load-event: {}, build-metric: { gauge-mapping: { http.request.latency: "${convert:toDouble(attributes['latency'])}" }, destination: "latency_metric" }, accumulate-output: { value: "${latency_metric}" }, build-metric: { dynamic-counter-mapping: [ "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D ], destination: "status_metric" }, accumulate-output: { value: "${status_metric}" } }
  • 40. © Rocana, Inc. All Rights Reserved. | 40 Architecture
  • 41. © Rocana, Inc. All Rights Reserved. | 41 Java action objects Architecture Configuration file Java action objects Context Variables Driver 1. Parse config 2. Initialize context 5. Copy output 3. Execute actions 4. Read/write variables
  • 42. © Rocana, Inc. All Rights Reserved. | 42 Custom actions • Actions loaded at runtime using Java services framework • Add your jar to the classpath • Custom actions appear as top-level keywords just like regular actions • Implement the execute() method of the Action interface • Implement the build() method of the ActionBuilder interface
  • 43. © Rocana, Inc. All Rights Reserved. | 43 Custom actions • Parse custom log formats • Cisco ACS • Citrix • Juniper • Customer-specific formats • Lookup IP addresses in the MaxMind GeoIP2 database • Reference dataset lookups • Device id to device name
  • 44. © Rocana, Inc. All Rights Reserved. | 44 Putting it all together • Stream processing is causing us to re-think how we analyze data • Limiting accessibility of data transformation side increases costs and decreases velocity • Reduce your reliance on developers to code custom pipelines • Re-use transformation configuration in any stream processing framework or batch job
  • 45. © Rocana, Inc. All Rights Reserved. | 45 Coming soon • Rocana transform will be released under the ASL 2.0 • The configuration library is available today: • https://github.com/scalingdata/rocana-configuration
  • 46. © Rocana, Inc. All Rights Reserved. | 46 Questions?