This document provides an overview of how Kafka and modern databases like MemSQL can benefit applications and analytics. It discusses how businesses now require faster data access and intra-day processing to drive real-time decisions. Traditional database solutions struggle to meet these demands. MemSQL is presented as a solution that provides scalable SQL, fast ingestion of streaming data, and high concurrency to enable both transactions and analytics on large datasets. The document demonstrates how MemSQL distributes data and queries across nodes and allows horizontal scaling through its architecture.
4. Who am I?
AT MEMSQL
Sr. Sales Engineer, San Francisco
BEFORE MEMSQL
Worked on the Globus project at the University of Chicago
PREVIOUS TALKS
Real Time, Geospatial, Maps
Image Recognition on Streaming
Real Time w/ Spark & MemSQL
6. Data-Driven Requirements Driving Database Modernization
Organizations want more of their data to support faster decisions and optimize customer experiences.
This puts pressure on database performance and scalability, without sacrificing familiar tooling and skills.
7. Businesses Require Intra-Day
Slow Data Loading
Batch processing
Hours to load
Sampled data views
8. Growing Data Slows Performance
Lengthy Query Execution
Slow query responses
Slow reports
No real-time response
9. Data Access Requirements Surging
Limited User Access
Single-threaded operations
Challenge with mixed workloads
Single box performance
10. Multi / Hybrid Cloud Strategy
● Existing solutions have unclear path to cloud
● Data growing exponentially year over year
● Still managing on-premises data
● Requires database to run anywhere
12. Double Down on Existing Database
More CPUs or memory
Specialized HW racks
Database options
Boosting hardware or adding more DB options introduces cost
13. Introduce Caching Tiers
Limited data durability
Weak SQL coverage
Another layer to manage
Adding data grids, caches, and accelerators introduces complexity
14. Try Object Store based NoSQL Solutions
Slow performing analytics
Developer intensive queries
Breaks BI tool compatibility
15. Latency Holding Back the Enterprise
Lengthy Query Execution
Slow query responses
Slow reports
No real-time response
Limited User Access
Single-threaded operations
Challenge with mixed workloads
Single box performance
Slow Data Loading
Batch processing
Hours to load
Sampled data views
16. The Enterprise Requires Performance
Fast Queries
Scalable ANSI SQL
Petabyte scale
Live and historical insights
Scalable User Access
Scale-out for performance
Converged transactions and analytics
Multi-threaded processing
Live Loading
Stream data
On-the-fly transformation
Multiple sources
17. MemSQL: The No-Limits Database
For Every Workload and Infrastructure
On-premises or any cloud
Transactions and analytics
Familiar, standard scalable SQL
Distributed architecture
Relational ANSI SQL
Performance for Demanding Applications
Fast ingest
Low-latency queries
27. MemSQL: The No-Limits Database
● Massive Scale
● Query Performance
● High Concurrency
The transactional scale of NoSQL with familiar relational SQL for fast analytics
38. Apache Kafka
● Messaging Queue
● Distributed
● Durable
● Publish-Subscribe
● Process
● “Source of Truth”
● Open Source
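The publish-subscribe bullets above can be pictured with a toy model: an append-only log that several subscriber groups read independently, each at its own offset. This is only a sketch. `TopicLog`, `publish`, and `poll` are hypothetical names, not the Kafka client API, and the model has no durability, partitioning, or distribution.

```python
class TopicLog:
    """Toy append-only log with one read offset per subscriber group."""

    def __init__(self):
        self._messages = []   # the log itself; entries are never modified
        self._offsets = {}    # subscriber group name -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for a group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
log.publish("tweet-1")
log.publish("tweet-2")
first_read = log.poll("analytics")   # this group reads both messages
second_read = log.poll("analytics")  # nothing new for this group
other_group = log.poll("audit", 1)   # a separate group starts from offset 0
```

Because the log keeps every message and each group tracks only its own position, a new subscriber can replay the full history, which is the sense in which the log acts as a "source of truth".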
39. Deliver Faster Insights
● Scalable ANSI SQL
● Full ACID capabilities
● Support for JSON, Geospatial, and Full-Text Search
● Fast Query Vectorization and Compilation
● Extensibility with Stored Procedures, UDFs, UDAs
40. Fast Data Ingestion
● Stream ingestion
● Fast parallel bulk loading
● Built-in CREATE PIPELINE
● Transactional Consistency
● Exactly-Once Semantics
● Native integrations with Kafka, AWS S3, Azure Blob, HDFS
42.
CREATE PIPELINE twitter_pipeline AS
LOAD DATA KAFKA "public-kafka.memcompute.com:9092/tweets-json"
INTO TABLE tweets
WITH TRANSFORM ('/path/to/executable', 'arg1', 'arg2')
(id, tweet);
START PIPELINE twitter_pipeline;
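The WITH TRANSFORM clause above names an executable that receives the incoming data on standard input and writes rows for loading to standard output. A minimal sketch of such a transform in Python follows. It assumes the input arrives as one JSON object per line with hypothetical fields `id` and `text`, and that tab-separated output matches the `(id, tweet)` column list; the actual framing of Kafka messages and the expected output format depend on the MemSQL version and pipeline settings, so check the product documentation before relying on this shape.

```python
import json
import sys


def transform(lines):
    """Yield one tab-separated (id, tweet) row per JSON input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Replace tabs and newlines so the text stays inside one TSV field.
        # "id" and "text" are assumed field names, not confirmed by the source.
        text = record["text"].replace("\t", " ").replace("\n", " ")
        yield f"{record['id']}\t{text}"


def main():
    """Stream stdin through transform() to stdout."""
    for row in transform(sys.stdin):
        sys.stdout.write(row + "\n")

# When installed as the transform executable, end the file with a call to main().
```

Keeping the parsing logic in `transform` and the I/O in `main` makes the logic easy to test without a running pipeline.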
43.
Data Source (ex: NFS, S3, HDFS, Kafka)
MemSQL PIPELINE
(1) MemSQL polls for changes from a source system.
44.
Data Source (ex: NFS, S3, HDFS, Kafka)
MemSQL PIPELINE
(1) MemSQL polls for changes from a source system.
(2) MemSQL pulls the data into its memory space (no commit), where a transform can be applied.
45.
Data Source (ex: NFS, S3, HDFS, Kafka)
MemSQL PIPELINE
(1) MemSQL polls for changes from a data source system.
(2) MemSQL pulls the data into its memory space (no commit), where a transform can be applied.
(3) The data is committed in a transaction (and in parallel).
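The three pipeline steps (poll, pull and transform without committing, commit in a transaction) can be sketched as a small loop. The sketch below is an illustration under stated assumptions, not MemSQL's implementation: SQLite stands in for the database, a Python list stands in for the data source, and the `pipeline_offsets` table, `make_db`, and `run_batch` are hypothetical names. Parallel commits are not modeled. The point it shows is that writing the rows and the new source offset in the same transaction means a failed batch leaves neither behind, which is one common way to obtain exactly-once loading.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a target table and an offset table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, tweet TEXT)")
    conn.execute("CREATE TABLE pipeline_offsets (next_offset INTEGER)")
    conn.execute("INSERT INTO pipeline_offsets VALUES (0)")
    conn.commit()
    return conn


def run_batch(conn, source, transform, batch_size=2):
    """Run one poll/pull/commit cycle and return the number of rows loaded."""
    # Step 1: poll. Compare the stored offset with what the source holds.
    (offset,) = conn.execute(
        "SELECT next_offset FROM pipeline_offsets").fetchone()
    if offset >= len(source):
        return 0  # nothing new at the source
    # Step 2: pull a batch into memory (nothing committed) and transform it.
    staged = [transform(m) for m in source[offset:offset + batch_size]]
    # Step 3: commit the rows and the new offset together. The connection
    # context manager commits on success and rolls back on an exception.
    with conn:
        conn.executemany(
            "INSERT INTO tweets (id, tweet) VALUES (?, ?)", staged)
        conn.execute("UPDATE pipeline_offsets SET next_offset = ?",
                     (offset + len(staged),))
    return len(staged)


conn = make_db()
source = [(1, "first"), (2, "second"), (3, "third")]
loaded = [run_batch(conn, source, lambda m: m) for _ in range(3)]
```

Calling `run_batch` repeatedly drains the source in batches and then returns 0 once the stored offset has caught up; if a batch raises (for example on a duplicate key), the rollback keeps the offset where it was, so the same messages are retried rather than skipped or double-loaded.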