Modeling the Smart and Connected City of the Future with Kafka and Spark

Eric Frenkiel, CEO & Co-Founder, MemSQL
@ericfrenkiel
MAKE DATA WORK
DECEMBER 1-3, 2015  SINGAPORE

MemSQL at a Glance
Enterprise Focused
• Our Mission: Make every company a real-time enterprise.
• Real-time database for transactions and analytics
• Founded in 2011, based in San Francisco
• Founders are former Facebook and SQL Server database engineers
• $50 million in funding to date

What does a Smart City Look Like?

3.9 billion people live in cities today

By 2050, we’ll add another 2.5 billion people

We need to create sustainable cities

We need to use technology to help us

We don’t live in Tomorrowland

The good news: the technology of today can build smart cities.

A Smart City Should Have…
• City-wide WiFi
• City app to report issues
• Open-data initiatives to share data with the public
• Most importantly, an adaptive IT department

A Model Application: MemCity
Capturing data from 1.4 million households
Total AWS hardware costs at $2.35 per hour

MemCity Reach: 1.4 million households (approximately the size of Chicago)

Capturing data from 8 devices in each home, every minute
#MemCity

186,667 transactions per second from Kafka > Spark > MemSQL

1.4 Million Households
8 Devices per Household
186K Events per Second
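
The headline rate follows from the figures above. A quick check in Python, assuming every device reports exactly once per minute (the variable names are illustrative, not from the deck):

```python
# Back-of-the-envelope check of the MemCity event rate.
households = 1_400_000        # from the slide
devices_per_household = 8     # from the slide
report_interval_s = 60        # assumption: one reading per device per minute

events_per_minute = households * devices_per_household
events_per_second = events_per_minute / report_interval_s

print(f"{events_per_minute:,} events per minute")
print(f"{events_per_second:,.0f} events per second")
```

Rounded to the nearest event, this gives the 186,667 per second quoted on the slide.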

Designing the Ideal Real-Time Pipeline
Message Queue > Transformation > Speed/Serving Layer
End-to-End Data Pipeline Under One Second

Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization

Spark
• In-memory execution engine
• High-level operators for procedural and programmatic analytics
• Faster than MapReduce

MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications

Subscribing to Kafka
Event: (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Serialized: 0111001010101111101111100000001010111100001110101100000010010010111…
Publish to Kafka Topic
Event added to message queue
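
A minimal sketch of how a reading like the one above might be turned into bytes before it is published. The field order (time, house_id, device_id, watts) is taken from the table later in the deck; the JSON encoding and the function name are assumptions, since the slides do not state the wire format.

```python
import json

def serialize_event(time, house_id, device_id, watts):
    """Encode one device reading as UTF-8 JSON bytes (assumed wire format)."""
    record = {
        "time": time,
        "house_id": house_id,
        "device_id": device_id,
        "watts": watts,
    }
    return json.dumps(record, separators=(",", ":")).encode("utf-8")

# The sample reading from the slide.
payload = serialize_event("2015-07-06T16:43:40.33Z", 329280, 23, 60)
print(payload)
```

With a Kafka client library, these bytes would be passed as the message value when sending to a topic; that call is left out here because it needs a running broker.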

Enrich and Transform the Data
Spark polling Kafka for new messages
Deserialization: 0111001010101111101111100000001010111100001110101100000010010010111… > (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment: (2015-07-06T16:43:40.33Z, 329280, 23, 60) > (2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
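
The two steps can be sketched as plain Python functions. The lookup tables below are hypothetical stand-ins for whatever reference data maps a house to its zip code and a device to its type, and the JSON input format is an assumption; only the before and after tuples come from the slide.

```python
import json

# Hypothetical reference data; a real pipeline would load this from a table.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}

def deserialize(payload):
    """Decode JSON bytes into a (time, house_id, device_id, watts) tuple."""
    record = json.loads(payload.decode("utf-8"))
    return (record["time"], record["house_id"],
            record["device_id"], record["watts"])

def enrich(event):
    """Add zip and device_type, giving the six-field row shown on the slide."""
    time, house_id, device_id, watts = event
    return (time, house_id, HOUSE_ZIP.get(house_id),
            device_id, DEVICE_TYPE.get(device_id), watts)

raw = b'{"time":"2015-07-06T16:43:40.33Z","house_id":329280,"device_id":23,"watts":60}'
print(enrich(deserialize(raw)))
```

In a Spark job, functions like these would typically be applied to each record with a map over the incoming stream.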

Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                      house_id   zip     device_id   device_type           watts
2015-07-06T16:43:40.33Z   329280     94110   23          ‘kitchen_appliance’   60
…                         …          …       …           …                     …
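
To make the persistence step concrete without a cluster, the sketch below writes the enriched row into a table with the same columns. Python's built-in sqlite3 is used purely as a stand-in for MemSQL, and the column types are assumptions; the deck only shows the column names.

```python
import sqlite3

# The enriched row from the slide.
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
]

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
conn.execute("""
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
""")

# Batch insert with bound parameters rather than string formatting.
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
print(stored)
```

MemSQL is compatible with the MySQL wire protocol, so against a real cluster the same statements would go through a MySQL client, whose Python drivers generally use `%s` placeholders instead of `?`.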

Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
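
As an illustration of the kind of query an application might run against the table, the sketch below averages power draw by zip code and device type. It again uses sqlite3 as a stand-in for MemSQL; only the first sample row comes from the deck, and the other two rows and the query itself are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
conn.execute(
    "CREATE TABLE memcity_table (time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
        ("2015-07-06T16:43:40.33Z", 329281, 94110, 7, "kitchen_appliance", 100),  # made-up row
        ("2015-07-06T16:43:40.33Z", 329282, 94103, 4, "thermostat", 5),           # made-up row
    ],
)

# Hypothetical dashboard query: average draw per zip code and device type.
query = """
    SELECT zip, device_type, AVG(watts) AS avg_watts, COUNT(*) AS readings
    FROM memcity_table
    GROUP BY zip, device_type
    ORDER BY zip, device_type
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```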

We can use in-memory technology to build interactive applications for cities.

So We Can Optimize…
• Urban planning
• Efficient power consumption
• Efficient transportation
• Sustainable energy practices

Creating real-time pipelines should be push-button easy.

MemSQL Streamliner for IoT Applications
• One-click deployment of integrated Apache Spark
• Put Spark in the Fast Lane
  • GUI pipeline setup
  • Multiple data pipelines
  • Real-time transformation
• Eliminates batch ETL
• Open source on GitHub

Simple Deployment Process
Diagram labels: Application, Cluster
1. Deploy MemSQL: In-Memory | Distributed | Relational
2. Deploy Spark
Kafka Connects to Each Node

Streamliner Architecture
First of many integrated Apache Spark solutions
Diagram labels: Real-Time Data Sources, Other, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution, Application

Streamliner ETL Detail
STREAMLINER: Extract > Transform > Load
Diagram labels: Real-Time Data Sources, Other, Apache Spark, Future Solution, Future Machine Learning Solution, Application
Stage options shown: Custom, Future Extractor, JSON, Custom, Future Transformer
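
One way to picture the extract, transform, and load stages is as three pluggable callables. This is a hypothetical sketch of the idea only; the function names and signatures are invented for illustration and are not Streamliner's actual interface.

```python
import json

def json_transformer(raw_bytes):
    """Transform stage: decode one JSON message into a dict."""
    return json.loads(raw_bytes.decode("utf-8"))

def run_pipeline(extract, transform, load):
    """Pull records from extract(), transform each one, hand it to load()."""
    count = 0
    for raw in extract():
        load(transform(raw))
        count += 1
    return count

def sample_extractor():
    """Extract stage stand-in: yields made-up messages instead of reading a queue."""
    yield b'{"house_id": 329280, "watts": 60}'
    yield b'{"house_id": 329281, "watts": 75}'

sink = []  # load stage stand-in: collect rows in memory
processed = run_pipeline(sample_extractor, json_transformer, sink.append)
print(processed, sink)
```

Swapping in a custom extractor or transformer then means passing a different callable, which mirrors the "Custom" boxes in the diagram.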

Extending Analytics with Lambda Architecture
Real-Time Analytics Streaming
Analytic Applications, Not Excel Reports
• Financial Services
• Adtech
• eCommerce
• IoT
• Consumer Internet
• Energy
• Federal
Lambda Architecture
New Real-Time Processing
Existing Batch Processing
Msg Queue

In-Memory Databases Rise Up
• Multi-TB on commodity hardware
• Store the “state of the model”
• Easily build applications
• Avoid direct disk access at all costs

Comprehensive Architecture
Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
Historical: Batch Layer, Fast Appends, Columnstore
Transactions, Analytics
Execution engine that spans the data spectrum

Simplified Lambda Architectures with MemSQL
Layer     Traditional Lambda   MemSQL Lambda
Batch     Hadoop               MemSQL Column Store
Speed     Storm, Spark         Kafka > Spark > MemSQL
Serving   Cassandra, HBase     MemSQL
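
In a traditional Lambda layout, the serving layer answers a query by combining a precomputed batch view with a view of recent events from the speed layer. A toy sketch of that merge, with in-memory counters standing in for the two stores (all names and numbers here are made up):

```python
from collections import Counter

# Hypothetical per-zip totals (watts) from the batch layer, computed up to
# some cutoff, and from the speed layer for events since that cutoff.
batch_view = Counter({"94110": 1200, "94103": 800})
speed_view = Counter({"94110": 60, "94107": 15})

def serve(batch, speed):
    """Serving-layer merge: add the real-time deltas onto the batch totals."""
    return dict(batch + speed)

merged = serve(batch_view, speed_view)
print(merged)
```

When a single database holds both the historical and the real-time data, as in the right-hand column of the table, this merge step reduces to an ordinary query.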

Lambda Applies to Real-Time Data Pipelines
Diagram labels: Inputs, Message Queue, Batch, Transformation, Database, Application

Kafka, Spark, and MemSQL Make it Simple
Diagram labels: Inputs, Batch, Application

Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
• 1 GB/sec totaling 72 TB/day
Real-time analytics

Using Real-Time for Personalization
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times
Diagram labels: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Star Schema, MicroStrategy

300k events/sec
Reduced Latency from 30 minutes to Sub-Second
Real-time Analytics

Sample Pipeline: Analyzing Twitter Data in Real Time
SPARK STREAMLINER: Extract > Transform > Load
Diagram labels: Public API “Garden Hose”, Python, Apache Spark, Application

Install MemSQL and Apache Spark in < 1 min
With MemSQL Ops and Streamliner

Run Kafka in Docker Container and Create a New Topic: TWITTER

Fill Out Extract, Transform and Load Details to Set Up Pipeline

Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing
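
The script itself is not included in the deck. A minimal sketch of what such a loader could look like is below, assuming a producer object that exposes `send(topic, value)`, which is the shape of kafka-python's `KafkaProducer`. `RecordingProducer` and the sample tweets are hypothetical stand-ins so the example runs without a broker.

```python
import json

class RecordingProducer:
    """Stand-in producer that records messages instead of sending them."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

def load_tweets(producer, tweets, topic="TWITTER"):
    """Serialize each tweet as JSON bytes and send it to the given topic."""
    count = 0
    for tweet in tweets:
        producer.send(topic, json.dumps(tweet).encode("utf-8"))
        count += 1
    return count

# Made-up tweets; a real script would read them from the Twitter API.
sample_tweets = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
producer = RecordingProducer()
sent_count = load_tweets(producer, sample_tweets)
print(sent_count, producer.sent[0])
```

Against a real cluster, the stand-in would be replaced by a producer connected to the broker running in the Docker container.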

Connect to MemSQL Database and Run SQL Queries Instantly

Run Online Alter Table to Optimize Query Performance

Streamliner: Dynamic Resource Management
Without Streamliner
• Pipeline 1: Driver (P1 only), Executors (P1 only)
• Pipeline 2: Driver (P2 only), Executors (P2 only)
• Each Spark Worker hosts executors dedicated to one pipeline
With Streamliner
• All Pipelines: one Streamliner Driver
• Each Spark Worker hosts Executors (P1 or P2)

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications
Scoring Real-Time Data with Predictive Models
Diagram labels: Input (Industrial Equipment Sensor Data), Streamliner, User Jar, SAS Generated PMML
Sensors S1, S2, S3 feed Predictive Models P1, P2, P3 (Sensor 1 > Predictive Model 1)
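
Real-time scoring means applying an already-trained model to each reading as it arrives. The deck's models are exported from SAS as PMML, which is not reproduced here; the sketch below uses a hand-written logistic function with made-up coefficients purely as a placeholder for such a model, with one model per sensor as in the diagram.

```python
import math

def make_logistic_model(weight, bias):
    """Return a scoring function; the coefficients are placeholders, not trained."""
    def score(reading):
        return 1.0 / (1.0 + math.exp(-(weight * reading + bias)))
    return score

# Hypothetical sensor-to-model mapping (S1 > P1, S2 > P2).
MODELS = {
    "S1": make_logistic_model(0.5, -35.0),
    "S2": make_logistic_model(0.25, -10.0),
}

def score_stream(readings):
    """Score each (sensor_id, value) reading with that sensor's model."""
    for sensor_id, value in readings:
        yield sensor_id, value, MODELS[sensor_id](value)

scored = list(score_stream([("S1", 70.0), ("S2", 50.0)]))
for sensor_id, value, probability in scored:
    print(sensor_id, value, round(probability, 3))
```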

GET YOUR FREE COPY:
memsql.com/oreilly
