This is my presentation to the inaugural meetup of the Amazon Kinesis London User Group.
In it I briefly introduced Snowplow, explained why we were excited about Kinesis (drawing on my "three eras" blog post) and then set out how we are updating Snowplow to run on Kinesis. I concluded with a live demo of what we have running on Kinesis so far.
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London User Group
1. Let’s introduce Amazon Kinesis
Inaugural meetup of the Amazon Kinesis - London User Group
2. This evening
• Introducing Amazon Kinesis, Ian Meyers, AWS
• Pizza and drinks break
• Kinesis and Snowplow, Alex Dean, Snowplow Analytics
• Drinks
• All courtesy of our hosts:
6. Today, Snowplow is primarily an open source web analytics platform
[Diagram: the Snowplow data pipeline collects, transforms and enriches data from a website / webapp and delivers it to Amazon S3 and Amazon Redshift / PostgreSQL]
• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data
• Join your web analytics data with any other data set
7. Snowplow was born out of our frustration with traditional web analytics tools…
• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit
• Web analytics tools don’t understand the entities that matter to business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions
• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
8. …and out of the opportunities that new technologies presented to tame big data
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
9. Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable
• 1. Trackers: generate event data
  • Examples: JavaScript tracker; Python / Lua / No-JS / Arduino tracker
• 2. Collectors: receive data from trackers and log it to S3
  • Examples: CloudFront collector; Clojure collector for Amazon EB
• 3. Enrich: clean and enrich raw data
  • Built on Scalding / Cascading / Hadoop and powered by Amazon EMR
• 4. Storage: store data ready for analysis
  • Examples: Amazon Redshift; PostgreSQL; Amazon S3
• 5. Analytics
• A to D (the links between the subsystems): standardised data protocols
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
11. A quick history lesson: the three eras of business data processing
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
12. The classic era, 1996+
[Diagram: inside the company's own data center, narrow data silos (CMS, e-comm, ERP, CRM), each with its own low-latency local loop, are linked by point-to-point connections. A nightly batch ETL process feeds a data warehouse used for management reporting: wide data coverage and full data history, but high latency.]
13. The hybrid era, 2005+
[Diagram: narrow data silos with low-latency local loops, spread across a cloud vendor / own data center and SaaS vendors #1 to #3. Systems shown: search, CMS, e-comm, ERP, CRM, email marketing and web analytics. APIs and bulk exports feed stream processing, micro-batch processing and batch processing. Low-latency outputs: product rec's and systems monitoring. High-latency outputs: management reporting and ad hoc analytics, on top of a data warehouse and Hadoop.]
14. The unified era, 2013+
[Diagram: narrow data silos (search, CMS, e-comm, ERP, CRM, email marketing) at a cloud vendor / own data center and at SaaS vendors #1 and #2, with some low-latency local loops, publish their eventstream via APIs and streaming APIs / web hooks into a unified log (low latency, wide data coverage, few days' data history). The unified log is archived to Hadoop (high latency, wide data coverage, full data history). Applications shown: ad hoc analytics, management reporting, product rec's, systems monitoring, fraud detection and churn prevention.]
15. The unified log is Kinesis (or Kafka)
[Diagram: the unified-era architecture from the previous slide, with the unified log at its centre.]
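To make the unified log idea concrete, here is a minimal in-memory sketch in Python. It is not Kinesis or Kafka code: the `UnifiedLog` and `Consumer` classes and the `retention` parameter are hypothetical names invented for illustration. It shows the two properties the diagram relies on: every consumer reads the same ordered log at its own pace, and the log only keeps a limited history, which is why an archive is needed for full data history.

```python
from collections import deque


class UnifiedLog:
    """Toy append-only log with bounded retention (hypothetical, in-memory)."""

    def __init__(self, retention=1000):
        self._records = deque()      # retained (sequence_number, event) pairs
        self._next_seq = 0           # sequence number for the next append
        self._retention = retention  # max records kept before trimming

    def append(self, event):
        seq = self._next_seq
        self._records.append((seq, event))
        self._next_seq += 1
        # Drop the oldest records once the retention limit is exceeded.
        while len(self._records) > self._retention:
            self._records.popleft()
        return seq

    def read_from(self, seq):
        """Return all retained (seq, event) pairs with sequence number >= seq."""
        return [(s, e) for (s, e) in self._records if s >= seq]


class Consumer:
    """Reads the log independently, remembering its own position."""

    def __init__(self, log):
        self._log = log
        self._position = 0
        self.seen = []

    def poll(self):
        batch = self._log.read_from(self._position)
        for seq, event in batch:
            self.seen.append(event)
            self._position = seq + 1
        return len(batch)


log = UnifiedLog(retention=3)
fraud = Consumer(log)    # low-latency consumer that polls often
archive = Consumer(log)  # slower consumer that polls rarely

log.append("page_view")
log.append("add_to_basket")
fraud.poll()
log.append("checkout")
log.append("page_view")
fraud.poll()
archive.poll()  # too late for the first event: it has already been trimmed
```

In this run the frequent poller sees all four events, while the late poller misses the first one. A real unified log trims by age rather than by count, but the consequence is the same: consumers that need the full history have to read from the archive.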
16. Can we implement Snowplow on top of Kinesis?
[Diagram: the same unified-era architecture, with the unified log at its centre.]
18. Where we are heading with our Kinesis architecture
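[Diagram: Snowplow Trackers → Scala Stream Collector → raw event stream. The raw event stream feeds an S3 sink Kinesis app (→ S3) and an Enrich Kinesis app, which writes to an enriched event stream and a bad raw events stream. A Redshift sink Kinesis app loads the enriched event stream into Redshift.]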
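The topology on this slide can be sketched in a few lines of Python. This is an illustration only, not Snowplow code: plain lists stand in for the Kinesis streams and storage targets, the function names are hypothetical, and the "enrichment" is a placeholder that just parses JSON and checks for an assumed `event` field.

```python
import json

# In-memory stand-ins for the Kinesis streams and storage targets (hypothetical).
raw_stream = []
enriched_stream = []
bad_stream = []
s3_archive = []
redshift_rows = []


def collector(payload):
    """Collector stand-in: write the raw payload to the raw event stream."""
    raw_stream.append(payload)


def s3_sink():
    """S3 sink stand-in: archive every raw record unchanged."""
    s3_archive.extend(raw_stream)


def enrich_app():
    """Enrich stand-in: parse each raw record, route failures to the bad stream."""
    for payload in raw_stream:
        try:
            event = json.loads(payload)  # JSONDecodeError is a ValueError
            if not isinstance(event, dict) or "event" not in event:
                raise ValueError("missing 'event' field")
            event["enriched"] = True  # placeholder for real enrichment
            enriched_stream.append(event)
        except ValueError as err:
            bad_stream.append({"line": payload, "error": str(err)})


def redshift_sink():
    """Redshift sink stand-in: load enriched events as rows."""
    redshift_rows.extend(enriched_stream)


collector('{"event": "page_view", "page": "/home"}')
collector("not json")
s3_sink()
enrich_app()
redshift_sink()
```

The point of the sketch is that the S3 sink and the Enrich app both read the same raw stream independently, and that a record which fails enrichment is kept in the bad stream instead of being dropped.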
19. We took an important first step in our last release…
• pre-0.8.12: hadoop-etl
• 0.8.12: scala-hadoop-enrich and scala-kinesis-enrich, both built on scala-common-enrich (record-level enrichment functionality)
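The idea behind this refactoring, one shared record-level enrichment library with thin batch and streaming wrappers around it, can be sketched as follows. The real projects are written in Scala; this Python version, its function names, the two-field tab-separated format and the sample lines are all assumptions made up for illustration.

```python
def enrich_record(line):
    """Shared record-level logic (the role a common enrichment library plays)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None  # malformed record
    timestamp, page = fields
    return {"timestamp": timestamp, "page": page.lower()}


def enrich_batch(lines):
    """Batch wrapper: process a whole collection of lines in one go."""
    results = [enrich_record(line) for line in lines]
    return [r for r in results if r is not None]


def enrich_stream(records):
    """Stream wrapper: lazily yield enriched events as records arrive."""
    for record in records:
        enriched = enrich_record(record)
        if enriched is not None:
            yield enriched


# Made-up sample data: two valid lines and one malformed line.
lines = ["2014-01-01T00:00:00\t/Home", "malformed", "2014-01-01T00:00:05\t/Blog"]
batch_result = enrich_batch(lines)
stream_result = list(enrich_stream(iter(lines)))
```

Because both wrappers delegate to the same `enrich_record`, the batch and streaming paths produce identical output for the same input, which is the property that makes it safe to run both pipelines side by side.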
20. … and the next release should get us much closer
[Diagram: the Kinesis architecture from slide 18 (Snowplow Trackers, Scala Stream Collector, raw event stream, S3 sink Kinesis app, S3, Enrich Kinesis app, enriched event stream, Redshift sink Kinesis app, Redshift, bad raw events stream).]