As the world moves from batch to online data processing, real-time data pipelines will supersede siloed data warehouse and transaction processing systems as core infrastructure.
While many analytics solutions tout query execution speed, this is only half of the equation.
For real-time workloads, stale data renders query speed irrelevant, because the results and insights are already out of date.
Beyond just “online queries,” real-time enterprises need “online datasets” that continuously update and make data accessible across the organization.
This session will cover approaches to building real-time pipelines with MemSQL, Hadoop, and Spark. Topics will include:
Key industry trends and the move to real-time data pipelines
How MemSQL customer Novus built the premier financial portfolio management platform using MemSQL as a real-time data store and query engine.
Operationalizing Spark for Advanced Analytics
Demonstration of how Pinterest is using the MemSQL Spark Connector to derive real-time insights on interesting and meaningful user activity with MemSQL and Spark.
Introduction to the MemSQL Spark Connector
Strategies for integrating Spark and Hadoop with real-time systems for transaction processing and operational analytics.
Presenters include MemSQL CEO Eric Frenkiel, Novus CTO Robert Stepeck, and Pinterest Software Engineer Yu Yang.
In a world of web portals and push notifications, users have developed demanding expectations for a real-time experience. Continuous updates, a responsive interface, and short loading times have become the norm. Most business analysts and data scientists, whose workflows remain bound by legacy tools and complex data pipelines, lack this fast, simple user experience.
From a business perspective, latency and complexity impede revenue by preventing access to the right data at the right time. Businesses that recognize the value of access to real-time data now have options to meet stringent objectives. They understand that serving “always up to date” data for analysis requires converging transactions and analytics in a real-time system. This session will highlight these architectures and customer achievements.
1. Bringing OLAP Fully Online
Analyze Changing Datasets in MemSQL and Spark with Pinterest Demo
Eric Frenkiel, MemSQL CEO
Rob Stepeck, Novus CTO
Yu Yang, Pinterest Software Engineer
Feb 19, 2015 • San Jose, CA
2. What’s in store for this presentation
▸MemSQL: The real-time database for transactions and analytics
▸Case Study with Novus CTO, Rob Stepeck
▸New Developments in Spark
▸Advanced Analytics with Demo from Pinterest Software Engineer, Yu Yang
4. MemSQL Snapshot
▸Experienced Leadership
• Microsoft, Facebook, Oracle, Fusion-io
▸Inspired by Enterprise architecture gap
▸A real-time database for transactions and analytics
• In-memory, distributed, SQL
▸Broad customer adoption across verticals
▸Top tier investors
5. Four ways your DBMS is holding you back
▸ETL (Extract, Transform, Load)
▸Analytic Latency
▸Synchronization
▸Copies of data
Source: Gartner, Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation
6. The Real-Time Database for Transactions and Analytics
[Diagram: MemSQL Cluster. Labels: Data Loading and Queries; Aggregator Nodes; Leaf Nodes; Availability Group 1; Availability Group 2.]
7. HOW NOVUS ENABLES INVESTORS TO CONSISTENTLY MAXIMIZE THEIR PERFORMANCE POTENTIAL USING MEMSQL
Novus Case Study
8. Quick Background on Novus
Rob Stepeck
Chief Technology Officer
▸Investment acumen, risk, insights and data management
▸$2 trillion in client assets
▸Used by 100 of the world’s top investment managers and investors
▸Founded in 2007 by group of investors, data scientists and engineers
9. Before MemSQL
Problem:
▸ Write operations were inefficient
▸ Loading data was a 24-hour operation
▸ Failures could significantly impact subsequent processes
▸ Loading client data degraded system performance
▸ Scaling was non-trivial
▸ Prospect data integration trade-offs
10. MemSQL Implementation
Novus chose MemSQL based on the following data management requirements:
▸Reduce Latency
▸SQL Support
▸Scale with Ease
11. After MemSQL
Results:
▸ 24-hour data cycle reduced to several hours
▸ Scale is achieved by adding/removing clusters with ease
▸ Learning curve is nonexistent
▸ Eliminated data ‘hand-holding’ so the team can focus on more important initiatives
▸ Sales are more effective because they can use a customer’s actual data
12. Example: ‘Refresh a Client’
[Diagram labels: Raw Data; Convert to In-memory; Backing Store. Before MemSQL: 90 min. After MemSQL: 2 min.]
14. Interest in Spark
▸Recent survey of 2100 developers
– 82% of users choose Spark to replace MapReduce
– 78% of users need faster processing of larger datasets
Source: Typesafe, APACHE SPARK - Preparing for the Next Wave of Reactive Big Data
15. Spark Data Processing Framework
▸Intuitive, concise, and expressive operations needed for analytics
[Diagram: Apache Spark components: Spark SQL; Spark Streaming; MLlib (machine learning); GraphX (graph).]
16. Enterprises Seek Simple Ways to Use Spark
▸Spark with operational data stores delivers new use cases
▸In-memory, distributed databases such as MemSQL fit well
18. MemSQL and Spark Use Cases
▸Operationalize models built in Spark
▸Stream and event processing
▸Live dashboards and automated reports
▸Extend MemSQL analytics
19. Operationalize Models Built in Spark
▸Process in Spark, persist to MemSQL
▸Go to production and iterate faster
[Diagram: Spark Cluster and MemSQL Cluster. Labels: Data into Spark; Model Creation; Model Persistence; Enterprise Consumption.]
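The "process, then persist" pattern on this slide can be sketched without a cluster. This is a minimal illustration only, not the MemSQL Spark Connector API: a plain-Python least-squares fit stands in for model creation in Spark, and the standard-library `sqlite3` module stands in for MemSQL. The table name `models`, the model name `demo`, and all function names are hypothetical.

```python
import sqlite3


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in for Spark model creation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def persist_model(conn, name, slope, intercept):
    """Store the fitted parameters in a SQL table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS models "
        "(name TEXT PRIMARY KEY, slope REAL, intercept REAL)"
    )
    conn.execute("REPLACE INTO models VALUES (?, ?, ?)", (name, slope, intercept))
    conn.commit()


def score(conn, name, x):
    """Apply the persisted model inside the database with plain SQL."""
    row = conn.execute(
        "SELECT slope * ? + intercept FROM models WHERE name = ?", (x, name)
    ).fetchone()
    return row[0]


# sqlite3 is only a local stand-in for the operational store.
conn = sqlite3.connect(":memory:")
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
persist_model(conn, "demo", slope, intercept)
prediction = score(conn, "demo", 10.0)
```

Once the parameters live in a SQL table, any application that can issue a query can score new inputs, which is one way to read "Enterprise Consumption" in the diagram.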
20. Stream and Event Processing
▸Structure event data on the fly
▸Pass to MemSQL for persistent, queryable format
[Diagram: Spark Cluster and MemSQL Cluster. Labels: Real-time Streaming Data; Data Transformation; Persistent, Queryable Format; Enterprise Consumption.]
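As a rough sketch of "structure event data on the fly, then pass it to a persistent, queryable format", the following uses plain Python micro-batches in place of Spark Streaming and `sqlite3` in place of MemSQL. The event fields (`user_id`, `action`, `ts`), the action names, and the `events` table are assumptions made for illustration; they do not come from the presentation.

```python
import json
import sqlite3


def structure_event(raw):
    """Turn one raw JSON event into a (user_id, action, ts) row, or None if malformed."""
    try:
        event = json.loads(raw)
        return (int(event["user_id"]), str(event["action"]), int(event["ts"]))
    except (ValueError, KeyError, TypeError):
        return None


def micro_batches(stream, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(conn, stream, batch_size=2):
    """Structure each micro-batch and append it to a queryable table; returns rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id INTEGER, action TEXT, ts INTEGER)"
    )
    written = 0
    for batch in micro_batches(stream, batch_size):
        rows = [row for row in map(structure_event, batch) if row is not None]
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        conn.commit()  # each batch becomes visible to queries once committed
        written += len(rows)
    return written


raw_stream = [
    '{"user_id": 1, "action": "pin", "ts": 100}',
    '{"user_id": 2, "action": "repin", "ts": 101}',
    "not json",
    '{"user_id": 1, "action": "repin", "ts": 102}',
]
conn = sqlite3.connect(":memory:")  # local stand-in for the operational store
written = ingest(conn, raw_stream)
counts = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
```

Malformed records are dropped here for brevity; a production pipeline would more likely route them to a separate table or log for inspection.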
21. Extend MemSQL Analytics
▸The freshest data for analysis in Spark
▸Load from MemSQL to Spark and write results on return
[Diagram: Spark Cluster, MemSQL Cluster, and MemSQL Replicated Cluster. Labels: Applications, Data Streams; Real-time Replica; Access to Live Production Data; Interactive Analytics, Machine Learning.]
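The "load, analyze, write results on return" round trip can be illustrated as follows. This is a sketch under stated assumptions: `sqlite3` stands in for the MemSQL replica, a plain-Python per-group median stands in for the analysis that would run in Spark, and the `measurements` and `measurement_medians` tables are hypothetical.

```python
import sqlite3
from collections import defaultdict
from statistics import median


def load_rows(conn):
    """Read the current rows from the store (stand-in for loading into Spark)."""
    return conn.execute("SELECT source, value FROM measurements").fetchall()


def median_by_key(rows):
    """Compute a per-key median in application code."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}


def write_results(conn, results):
    """Write the computed results back so they are queryable with SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurement_medians "
        "(source TEXT PRIMARY KEY, median_value REAL)"
    )
    conn.executemany(
        "REPLACE INTO measurement_medians VALUES (?, ?)", sorted(results.items())
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # local stand-in for the replica
conn.execute("CREATE TABLE measurements (source TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("a", 2.0), ("b", 4.0), ("b", 6.0)],
)
write_results(conn, median_by_key(load_rows(conn)))
medians = conn.execute(
    "SELECT source, median_value FROM measurement_medians ORDER BY source"
).fetchall()
```

Reading from a replica rather than the primary, as the diagram suggests, keeps analytical reads from competing with production writes.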
22. Live Dashboards and Automated Reports
▸Serve live dashboards from MemSQL
▸Run custom reports on live data with Spark
[Diagram: Spark Cluster and MemSQL Cluster. Labels: SQL Transactions and Analytics; Live Dashboards; Access to Live Production Data; Custom Reporting.]
26. Why Spark
▸Pinterest has high traffic and an active community
▸Always looking for new ways to help users
▸Processing event data presents unique challenges
▸Spark is the leading processing framework for big data deployments
▸Spark Streaming is ideal for real-time data structuring