1. Modeling the Smart and Connected City of the Future with Kafka and Spark
Eric Frenkiel, CEO & Co-Founder, MemSQL
@ericfrenkiel
MAKE DATA WORK
DECEMBER 1-3, 2015 SINGAPORE
2. MemSQL at a Glance
Our Mission: Make every company a real-time enterprise.
Enterprise focused
Real-time database for transactions and analytics
Founded in 2011, based in San Francisco
Founders are former Facebook and SQL Server database engineers
$50 million in funding to date
13. A Smart City Should Have…
City-wide WiFi
City app to report issues
Open-data initiatives to share data with the public
Most importantly, an adaptive IT department
21. Designing the Ideal Real-Time Pipeline
Message Queue → Transformation → Speed/Serving Layer
End-to-End Data Pipeline Under One Second
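The three stages named on this slide can be sketched in plain Python. This is a minimal illustration, not the architecture from the talk: `queue.Queue` stands in for the message queue, a dictionary stands in for the speed/serving layer, and the field names `sensor`, `value`, and `sent_at` are hypothetical.

```python
import json
import queue
import time


def transform(raw):
    """Parse one raw JSON message and keep only the fields the serving layer needs."""
    event = json.loads(raw)
    return {
        "sensor": event["sensor"],
        "value": float(event["value"]),
        "sent_at": event["sent_at"],
    }


def run_pipeline(messages, serving_layer):
    """Push messages through queue -> transform -> serving layer.

    Returns the end-to-end latency (in seconds) observed for each record.
    """
    mq = queue.Queue()  # stand-in for the message queue
    for message in messages:
        mq.put(message)

    latencies = []
    while not mq.empty():
        row = transform(mq.get())
        # Stand-in for the serving layer: values grouped by sensor.
        serving_layer.setdefault(row["sensor"], []).append(row["value"])
        latencies.append(time.time() - row["sent_at"])
    return latencies


serving_layer = {}
messages = [
    json.dumps({"sensor": f"s{i % 2}", "value": i, "sent_at": time.time()})
    for i in range(4)
]
latencies = run_pipeline(messages, serving_layer)
print(serving_layer, max(latencies))
```

Recording a send timestamp on each message and comparing it with the time the row reaches the serving layer is one simple way to check the "under one second" target.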
22. Kafka
A high-throughput distributed messaging system
Publish and subscribe to Kafka “topics”
Centralized data transport for the organization
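The publish/subscribe idea can be illustrated with a toy in-memory model. This is not the Kafka client API: `ToyBroker` and its methods are hypothetical names, and the sketch only shows that a topic behaves like an append-only log and that each consumer group keeps its own read position (offset).

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only log."""

    def __init__(self):
        self._logs = defaultdict(list)
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic):
        """Return every message this group has not read yet, then advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
broker.publish("traffic", {"junction": "A", "cars": 12})
broker.publish("traffic", {"junction": "B", "cars": 7})

dashboard_batch = broker.poll("dashboard", "traffic")  # both messages
archive_batch = broker.poll("archive", "traffic")      # the same two messages
dashboard_again = broker.poll("dashboard", "traffic")  # nothing new
```

Because every group reads the same log independently, one topic can feed several downstream systems, which is the sense in which the queue acts as centralized data transport.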
23. Spark
In-memory execution engine
High-level operators for procedural and programmatic analytics
Faster than MapReduce
33. MemSQL Streamliner for IoT Applications
One-click deployment of integrated Apache Spark
Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation
Eliminates batch ETL
Open source on GitHub
38. Streamliner Architecture
First of many integrated Apache Spark solutions
[Diagram: Other Real-Time Data Sources, Application, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution]
44. Extending Analytics with Lambda Architecture
Real-Time Analytics Streaming
Analytic Applications
Not Excel Reports
Financial Services
Adtech
eCommerce
IoT
Consumer Internet
Energy
Federal
Lambda Architecture
New Real-Time Processing
Existing Batch Processing
Msg Queue
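The Lambda Architecture pairing of existing batch processing with new real-time processing can be sketched as follows. This is a minimal illustration under assumptions: the page-view events, the `page` field, and the names `batch_view`, `SpeedLayer`, and `query` are all hypothetical. The point is only that a query merges a precomputed batch view with a continuously updated real-time view.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: recompute counts over the full history (run periodically)."""
    return Counter(event["page"] for event in historical_events)


class SpeedLayer:
    """Real-time layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event["page"]] += 1


def query(page, batch, speed):
    """Serving step: merge the batch view with the real-time view at query time."""
    return batch[page] + speed.counts[page]


history = [{"page": "/parking"}, {"page": "/parking"}, {"page": "/transit"}]
batch = batch_view(history)

speed = SpeedLayer()
for event in [{"page": "/parking"}, {"page": "/air-quality"}]:
    speed.ingest(event)

parking_total = query("/parking", batch, speed)
air_total = query("/air-quality", batch, speed)
```

When the next batch run absorbs the recent events, the speed layer's counters for that period would be reset so nothing is counted twice.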
45. In-Memory Databases Rise Up
Multi-TB on commodity hardware
Store the “state of the model”
Easily build applications
Avoid direct disk access at all costs
55. Massive Ingest and Concurrent Analytics
Instant accuracy to the latest repin
Build real-time analytic applications
1 GB/sec totaling 72 TB/day
Real-time analytics
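The two throughput figures on this slide can be related with a quick unit conversion. The sketch below assumes decimal units (1 TB = 1000 GB) and a full 24-hour day; both are assumptions, since the slide does not say how the numbers were measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
GB_PER_TB = 1000                # assumption: decimal units


def daily_tb(rate_gb_per_sec):
    """Daily volume in TB for a rate sustained over a whole day."""
    return rate_gb_per_sec * SECONDS_PER_DAY / GB_PER_TB


def avg_gb_per_sec(tb_per_day):
    """Average rate in GB/sec implied by a daily volume."""
    return tb_per_day * GB_PER_TB / SECONDS_PER_DAY


sustained = daily_tb(1.0)       # volume if 1 GB/sec were held all day
average = avg_gb_per_sec(72.0)  # average rate implied by 72 TB/day
print(f"{sustained:.1f} TB/day, {average:.2f} GB/sec")
```

Under these assumptions, 72 TB/day averages out to roughly 0.83 GB/sec, so one plausible reading is that 1 GB/sec is a rounded or peak ingest rate.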
56. Using Real-Time for Personalization
[Diagram: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Star Schema, MicroStrategy]
Reach overlap and ad optimization
Over 60,000 queries per second
Millisecond response times
58. Sample Pipeline: Analyzing Twitter Data in Real Time
[Diagram: Public API “Garden Hose”, Python (Extract, Transform, Load), Apache Spark, SPARK STREAMLINER, Application]
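The extract, transform, and load steps for a tweet stream can be sketched in plain Python. This is not Streamliner's Python interface: the function names are hypothetical, the sample records are made up for illustration, and a Python list stands in for the destination table. The sketch assumes each stream line is a JSON object carrying `id`, `text`, and `user.screen_name`.

```python
import json


def extract(raw_line):
    """Parse one line from the stream; return None for blank, malformed, or non-tweet records."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if isinstance(record, dict) and "text" in record:
        return record
    return None


def transform(tweet):
    """Flatten a tweet into the columns we want to query."""
    words = tweet["text"].split()
    return {
        "id": tweet["id"],
        "user": tweet.get("user", {}).get("screen_name"),
        "text": tweet["text"],
        "hashtags": [w[1:].lower() for w in words if w.startswith("#") and len(w) > 1],
    }


def load(rows, table):
    """Append rows to the destination (a list standing in for a database table)."""
    table.extend(rows)


# Made-up sample input: two tweets, one blank line, one delete notice.
raw_stream = [
    '{"id": 1, "text": "Traffic is light on the #MRT today", "user": {"screen_name": "alice"}}',
    "",
    '{"delete": {"status": {"id": 2}}}',
    '{"id": 3, "text": "#SmartCity sensors are live", "user": {"screen_name": "bob"}}',
]

table = []
tweets = [t for t in (extract(line) for line in raw_stream) if t is not None]
load([transform(t) for t in tweets], table)
```

Filtering in the extract step keeps keep-alive lines and delete notices out of the table, so the transform step only ever sees records with a `text` field.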
59. Install MemSQL and Apache Spark in < 1min
With MemSQL Ops and Streamliner
60. Run Kafka in Docker Container and Create a New Topic: TWITTER
67. Adding Real-Time Scoring to Predictive Applications
Scoring Real-Time Data with Predictive Models
[Diagram: Industrial Equipment, Sensor Data (S1, S2, S3), Streamliner, Input, User Jar, SAS Generated PMML, Predictive Models (P1, P2, P3); legend: S1 = Sensor 1, P1 = Predictive Model 1]
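The idea of routing each sensor's readings to its own predictive model can be sketched as follows. This is an illustration under assumptions: the threshold functions are hypothetical stand-ins for models exported as PMML, and the limits and readings are made up. It does not load PMML or use Streamliner.

```python
def make_threshold_model(limit):
    """Hypothetical stand-in for a PMML model.

    The score is how far a reading exceeds `limit`, as a fraction of `limit`,
    clipped to the range [0, 1].
    """
    def score(reading):
        return max(0.0, min(1.0, (reading - limit) / limit))
    return score


# One model per sensor, mirroring the S1..S3 / P1..P3 pairing on the slide.
MODELS = {
    "S1": make_threshold_model(100.0),
    "S2": make_threshold_model(50.0),
    "S3": make_threshold_model(10.0),
}


def score_stream(readings, models):
    """Score each (sensor, value) reading as it arrives; skip sensors with no model."""
    for sensor, value in readings:
        model = models.get(sensor)
        if model is None:
            continue
        yield sensor, value, model(value)


readings = [("S1", 90.0), ("S2", 75.0), ("S3", 25.0), ("S9", 1.0)]
scored = list(score_stream(readings, MODELS))
```

Because `score_stream` is a generator, each reading is scored as soon as it is pulled from the input, which matches the real-time framing of the slide.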