Providing truly interactive and scalable BI on Hadoop has proven to be one of the biggest challenges preventing the completion of the migration from legacy EDW OLAP systems to Hadoop. While we have all seen many benchmarks that run consecutive queries and claim success, we have yet to see thousands of concurrent business users sending complicated generated queries from their dashboards over billions of records at interactive speed.
In this session we will discuss how an architecture that replaces the full-scan, brute-force approach with adaptive indexing and auto-generated cubes can dramatically reduce the resources and effort per query, resulting in interactive performance for high-concurrency workloads, and explain how this is achieved with minimal data engineering effort. We will also discuss how this architecture can be seamlessly integrated with Hive to provide a complete OLAP-on-Hadoop solution.
The session will include a live demo of complex business dashboards connected to Hive and accessing billions of rows at interactive speed.
Speaker
Boaz Raufman, CTO and Co-Founder, JethroData
2. Interactive BI is a Unique Use-Case
• Data Science, ETL, Reporting, Machine Learning
– Non-interactive
– Managed set of queries
– Few concurrent users
• Interactive BI
– Interactive
– Variety of generated queries
– Many concurrent users
3. Interactive BI challenges: Performance
• My query is too slow!
• Resolution:
– Data engineering
• Partitioning, sorting, de-normalization, pre-aggregation, pre-calculation, etc.
– Increase cluster size
• Cost:
– Effort, time, and costs $$$
– Resources $$$
• Limitations
– Data engineering can’t optimize all queries
4. Interactive BI challenges: Variety
• My dashboard generates many different queries
– Multiple dimensions, multiple measures, complex expressions, various filters, low/high cardinality filters, various table relations, …
• Resolution:
– More data engineering
• Cost:
– Effort, time, and costs $$$
– Delayed application development and deployment $$$
• Limitations:
– Imposes limitations on the app
– Performance degradation
Manual data engineering is costly and cannot completely resolve the variety of business needs in a timely manner
5. Interactive BI challenges: Concurrency
• A single dashboard interaction can issue many queries
• I have many concurrent users
• Resolution:
– Increase cluster size
• Cost:
– Resources $$$
– Impacts other workloads on my Hadoop cluster
Resizing resources will never catch up with business needs
6. SQL-on-Hadoop Engines Are Not a Fit for Interactive BI
Pros
• General purpose
• Parallel execution
• Scalable resource utilization
• Can eventually resolve every query via full scan
• Great for ETL, reporting, machine learning, data discovery
Cons
• Resource consuming
• Struggle with concurrency
• Optimizations require manual data engineering
• Not optimized for the variety and concurrency requirements of interactive BI use cases
An interactive BI acceleration tool is complementary to SQL-on-Hadoop engines
7. Solution Requirements
• Consistent interactive response times (<10 sec)
• Efficiently handle a variety of BI queries
• Minimal resource utilization per query, allowing high concurrency
• Scalable
• Automatic: data engineering should be handled by the data platform
In addition:
• Consistent performance upon ingestion of new data
8. The Realm of Queries
• Select * from …
• Select sum(a), sum(b)
• Select sum(a), sum(b) group by c,d
• Select sum(a)
• Select sum(b)
• Select a,b,d where e=x
• Select sum(a), sum(b) where c=y group by d
• Select sum(a), sum(b) where e=x group by d
We need to be optimized only for the subset of queries that is relevant for Interactive BI
9. Jethro Adaptive Approach to Interactive BI
• Interactive BI is about visualizing data for humans
• It is composed mainly of:
– Aggregations grouped by low cardinality dimensions
– Filters of either low or high cardinality
• To handle aggregation we use pre-aggregation (cubes)
• To handle high cardinality filtering we use indexes
• Engine adapts to dashboard queries
– Acceleration objects are automatically generated based on user queries
10. Cubes or Indexes? You need BOTH!
• Cubes: good for accelerating aggregated queries
– Poor at detailed queries
• Indexes: good for accelerating granular queries
– Poor at summary queries
[Chart: performance (poor to good) against type of query (detailed to summary), with one curve for Cubes and one for Indexes]
Jethro is unique in providing BOTH - accelerates ALL queries
11. Heavy Lifting is done in the Background
• Query Servers (live query)
– Answer queries from indexes and cubes
• Builder Servers (background)
– Build indexes and cubes
• Cubes and indexes (stored on Hadoop)
• Performance gain ~5x-50x
• Cluster resources ~0.2X
• Fully automated
12. LIVE Demo
• Point browser at: tableau.jethrodata.com
– Login: demo / demo
• Point browser at: jethrodata.qlik.com/
– No login needed
AWS Servers:
Component   AWS HW                 Monthly Cost
Jethro      2x 120GB / 16 cores    $500 (spot)
Storage     EFS                    $200
Data:
• Based on TPC-DS benchmark
• 1TB raw data
• Fact table: ~2.9B rows
• Dimension tables: 6
13. Jethro Indexes Accelerate BI Drill Downs
• Efficient
– EVERY column can be indexed
• Effective
– The more you filter, the faster it gets
– Dataset size doesn’t impact filtered query perf
• Efficient
– Multi-level index for direct access, no need for in-mem
Users NOT dependent on a single partition col for performance
Table:
Row_ID   Customer   Item   Price
1        1          …      …
2        7          …      …
3        32         …      …
4        1          …      …
5        14         …      …
6        23         …      …
7        23         …      …
8        6          …      …
9        1          …      …
10       4          …      …
Index:
Customer   Row_IDs
1          1,4,9
4          10
6          8
7          2
14         5
23         6,7
32         3
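The index on this slide maps each Customer value to the Row_IDs where it occurs. The following is a minimal sketch of that idea only, using the ten rows from the slide; the function names `build_index` and `lookup` are hypothetical and say nothing about how Jethro actually stores its multi-level indexes.

```python
from collections import defaultdict

# (Row_ID, Customer) pairs copied from the slide's example table.
rows = [(1, 1), (2, 7), (3, 32), (4, 1), (5, 14),
        (6, 23), (7, 23), (8, 6), (9, 1), (10, 4)]

def build_index(rows):
    """Map each customer value to the list of Row_IDs that contain it."""
    index = defaultdict(list)
    for row_id, customer in rows:
        index[customer].append(row_id)
    return dict(index)

def lookup(index, customer):
    """Return the Row_IDs for a customer without scanning the table."""
    return index.get(customer, [])

customer_index = build_index(rows)
print(lookup(customer_index, 1))   # [1, 4, 9]
print(lookup(customer_index, 23))  # [6, 7]
```

A filtered query then touches only the listed rows, which is why a more selective filter means less work.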
14. Auto-Cubes: How it Works
sales transactions (5B rows):
state   cust, prod, …   $sale
AL      …               $2.00
AK      …               $4.50
AZ      …               $1.00
…       …               …
WY      …               $4.25
Customer query:
select sum(sales) … where state='AZ'
Process:
use index to find all rows for 'AZ'. Sum $sale for selected rows
Response: $1,643
Jethro auto gen query (move filter col into group by):
select sum(sales) … group by state
sales-by-state (50 rows):
State   $sale
AK      $256
AZ      $1,643
…       …
WY      $4,654
Subsequent queries served from auto-cube:
where state='AK'
where state in ('CA', 'NY')
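The step of moving the filter column into the group by can be illustrated with a small sketch. It assumes a tiny in-memory list standing in for the sales table; the data values and the names `build_cube` and `sum_sales` are hypothetical and are not Jethro's implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for the sales transactions table: (state, sale) pairs.
sales = [("AZ", 1.00), ("AK", 4.50), ("AZ", 2.50), ("WY", 4.25), ("AK", 0.50)]

def build_cube(rows):
    """Pre-aggregate once: select state, sum(sale) ... group by state."""
    cube = defaultdict(float)
    for state, sale in rows:
        cube[state] += sale
    return dict(cube)

def sum_sales(cube, states):
    """Answer sum(sale) where state in (...) from the cube alone."""
    return sum(cube.get(state, 0.0) for state in states)

cube = build_cube(sales)
print(sum_sales(cube, ["AZ"]))        # 3.5
print(sum_sales(cube, ["AK", "WY"]))  # 9.25
```

Once the cube exists, any filter on state reads at most one small row per state instead of the full transaction table.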
15. Jethro Auto-Cubes Accelerate BI Aggregations
• Automated
– Based on actual BI queries
• Adaptive
– Automatically adjust to changes in apps and data
• Efficient
– Dozens of small and highly efficient cubes, matching every aggregation
– Use indexes for granular queries instead of creating large cubes
[Diagram: the sales transactions table (5B rows) aggregated into a sales by State cube (50 rows)]
Jethro Auto Cubes drive uninterrupted self-service BI
16. Jethro Query Optimization Process
1. Result-Cache
• Exact repeat of previous query
• Results were saved in storage
2. Auto Cube
• Scan existing cubes for a match
• Cubes evaluated from smallest to largest
3. Index Access
• Apply filters using indexes
• Fetch and process ONLY relevant rows and cols
Optimizer
• Rewrite query: join elimination, partition pruning, predicate push down…
• Select best execution path: cache, cubes or indexes
The BEST way to speed up a SQL query is to have it do LESS work
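The three-tier order on this slide (result cache, then cubes from smallest to largest, then index access) can be sketched as a simple fallback chain. This is an illustration under stated assumptions: a query is reduced to a hypothetical `(filter_column, value)` pair asking for a sum of sales, cubes are plain dictionaries, and `index_scan` is a stand-in that filters a small list rather than a real index.

```python
def answer(query, cache, cubes, index_scan):
    """Return (path, result), trying cache, then cubes, then index access."""
    # 1. Result cache: exact repeat of a previous query.
    if query in cache:
        return "cache", cache[query]
    col, val = query
    # 2. Auto cubes: evaluate candidate cubes from smallest to largest.
    for cube in sorted(cubes, key=lambda c: len(c["data"])):
        if cube["group_by"] == col and val in cube["data"]:
            path, result = "cube", cube["data"][val]
            break
    else:
        # 3. Index access: process only the rows that match the filter.
        path, result = "index", index_scan(col, val)
    cache[query] = result
    return path, result

# Hypothetical data for the illustration.
rows = [{"state": "AZ", "cust": 1, "sale": 1.0},
        {"state": "AK", "cust": 2, "sale": 4.5},
        {"state": "AZ", "cust": 2, "sale": 2.5}]

def index_scan(col, val):
    # Stand-in only: a real index would locate matching rows directly.
    return sum(r["sale"] for r in rows if r[col] == val)

cubes = [{"group_by": "state", "data": {"AZ": 3.5, "AK": 4.5}}]
cache = {}

first = answer(("state", "AZ"), cache, cubes, index_scan)   # served by cube
second = answer(("state", "AZ"), cache, cubes, index_scan)  # served by cache
third = answer(("cust", 2), cache, cubes, index_scan)       # served by index
print(first, second, third)
```

Each tier does less work than the one after it, so the cheapest path that can answer the query is tried first.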
17. Incremental Updates Do Not Impact Performance
• Background: incremental update of Indexes and Cubes
[Diagram labels: ETL; Watch Folder; Data, Cubes, and Indexes, each shown with an Original part and an Incremental part]
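One way to see why an incremental update need not rebuild everything: for additive measures such as sums, a cube built from only the new rows can be merged into the existing cube. The sketch below assumes a sum-by-state cube with made-up values; the names `build_cube` and `merge_cubes` are hypothetical, and the slide does not say this is how Jethro performs its updates.

```python
def build_cube(rows):
    """Aggregate (state, sale) rows into {state: total_sale}."""
    cube = {}
    for state, sale in rows:
        cube[state] = cube.get(state, 0) + sale
    return cube

def merge_cubes(base, delta):
    """Combine an existing cube with a cube built from newly arrived rows."""
    merged = dict(base)
    for state, total in delta.items():
        merged[state] = merged.get(state, 0) + total
    return merged

# Hypothetical original data and a newly ingested batch.
original_rows = [("AZ", 10), ("AK", 5), ("AZ", 2)]
new_rows = [("AZ", 3), ("WY", 7)]

original_cube = build_cube(original_rows)
incremental_cube = build_cube(new_rows)
merged_cube = merge_cubes(original_cube, incremental_cube)

# Merging gives the same result as rebuilding from all rows.
print(merged_cube == build_cube(original_rows + new_rows))  # True
```

Only the new batch is scanned, so the cost of the update depends on the size of the increment rather than on the full dataset.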
18. Scales to 1,000’s of Users
• Servers are stateless, data centrally shared
– Cubes, indexes, results shared by servers
• Automated load balancing
– Dynamically add / drop Jethro servers
• Minimal sensitivity to cluster load
– Segregate workload by designating specific servers to specific groups
21. Performance, Scale, Cost
• Performance – responds in seconds
– ALL BI queries, 100’s of concurrent users, billions of rows
• Self driving – no manual performance engineering
– Cubes and Indexes are fully automated
• Resource efficiency – reduced cluster usage
– All BI compute on Jethro nodes, significantly fewer resources
• App compatibility – “as is”
– No changes to BI apps or data model
EDW Performance at Hadoop Scale & Cost
24. Jethro System Diagram
Client Applications
• Commercial BI Tools
• Homegrown Viz Apps
• SQL Clients
SQL 92 via ODBC / JDBC
Jethro Acceleration Engine
• AutoCubes
• Full Indexing
• Intelligent Cache
Jethro Manager
• Cube and Index Builder
Network Storage
Any ETL
Source Data
• Hadoop (Hive, Impala,…)
• EDW
• Text Files
25. Interactive BI Market Map
[Market map diagram. Axis labels: Non-interactive, Interactive. Approaches: Full-Scan, Full-Scan, Manual Cube, Auto Cube, Auto Index. Use cases: Data Science, Interactive BI]
26. Customer Insights & Profitability
Industry: Car Rental
– Leading global car rental company
– Multiple brands, 5,000+ locations, 150+ countries
– Millions of transactions, billions of marketing and sales data points
Results:
– Performance: dashboards return in 10 sec instead of 10 min
– Self-Service: end users are able to create their own analytics without IT
– Data Lake: data for all brands and geos in one place
Leading Car Rental Company
Before: Transactions, marketing; Hortonworks HDP; Oracle Data Mart; Tableau
After: Transactions, marketing; Hortonworks HDP; Jethro Acceleration; Tableau
27. Physician Patient Tracking
Industry: Health Care
– Leading data & tech provider in the health care industry
– 500 healthcare organizations, 850K physicians, 375K clinical facilities, more than 230M Americans
Results
– Scale: 1,000’s of concurrent users
– Performance: 85% of interactions under 5 sec
– Security: Access control by user; HIPAA
Leading Health Data Provider
Before: Physician / Patient Details; Hortonworks HDP; Teradata Data Mart; Tableau
After: Physician / Patient Details; Hortonworks HDP; Jethro Acceleration; Tableau
28. Financial operational apps over
Industry: Banking
– Top 15 global Bank
– Operations in 35+ countries
– Personal, business, public sector and institutional clients
Results
– Functional: offload BI apps “as-is” from legacy EDW to Hadoop
– $Savings: eliminate need for annual EDW expansion
– ROI: increase usage and value of data lake investment
Before: Many data sources; Hortonworks HDP; Vertica, other EDW Data Marts; Tableau, other BI
After: Many data sources; Hortonworks HDP; Jethro Acceleration; Tableau, other BI