Enabling Interactive BI on Hadoop
Boaz Raufman
CTO / Co-Founder
Jethro
Interactive BI is a Unique Use-Case
Data Science, ETL, Reporting, Machine Learning:
• Non-interactive
• Managed set of queries
• Few concurrent users
Interactive BI:
• Interactive
• Variety of generated queries
• Many concurrent users
Interactive BI challenges: Performance
• My query is too slow!
• Resolution:
– Data engineering
• Partitioning, Sorting, De-normalize,
Pre-aggregation, Pre-calculation, etc.
– Increase cluster size
• Cost:
– Effort, time, and cost $$$
– Resources $$$
• Limitations
– Data engineering can’t optimize
all queries
Interactive BI challenges: Variety
• My dashboard generates many different
queries
– Multiple dimensions, multiple measures,
complex expressions, various filters, low/high
cardinality filters, various tables relations, …
• Resolution:
– More data engineering
• Cost:
– Effort, time, and cost $$$
– Delays application development and
deployment $$$
• Limitations:
– Imposes limitations on the app
– Performance degradation
Manual data engineering is costly and cannot completely
resolve the variety of business needs in a timely manner
Interactive BI challenges: Concurrency
• Single dashboard interaction can
issue many queries
• I have many concurrent users
• Resolution:
– Increase cluster size
• Cost:
– Resources $$$
– Impacts other workloads on my
Hadoop cluster
Resizing resources will never catch up with
business needs
SQL-on-Hadoop Engines Are Not a Fit for Interactive BI
Pros
• General purpose
• Parallel execution
• Scalable resource utilization
• Eventually can resolve
every query via full scan
• Great for ETL, Reporting,
Machine learning, Data
Discovery
Cons
• Resource consuming
• Struggle with concurrency
• Optimizations require
manual data engineering
• Not optimized for variety
and concurrency
requirements of
interactive BI use cases
An interactive BI acceleration tool is complementary to SQL-on-Hadoop engines
Solution Requirements
• Consistent interactive response times (<10 sec)
• Efficiently handle a variety of BI queries
• Minimal resource utilization per query allowing high
concurrency
• Scalable
• Automatic – data engineering should be handled by the data
platform
In addition:
• Consistent performance upon ingestion of new data
The Realm of Queries
Select * from …
Select sum(a), sum(b)
Select sum(a), sum(b) group by c,d
Select sum(a)
Select sum(b)
Select a,b,d where e=x
Select sum(a), sum(b) where c=y group by d
Select sum(a), sum(b) where e=x group by d
We need to optimize only for the subset of queries
that is relevant for Interactive BI
Jethro Adaptive Approach to Interactive BI
• Interactive BI is about visualizing data for humans
• It is composed mainly of:
– Aggregations grouped by low-cardinality dimensions
– Filters of either low or high cardinality
• To handle aggregations we use pre-aggregation (cubes)
• To handle high-cardinality filtering we use indexes
• Engine adapts to dashboard queries
– Acceleration objects are automatically generated based on user
queries
Cubes or Indexes? You need BOTH!
[Chart: query performance (poor to good) by type of query (summary to detailed), comparing Cubes and Indexes]
Cubes: good for accelerating aggregated queries
– Poor at detailed queries
Indexes: good for accelerating granular queries
– Poor at summary queries
Jethro is unique in providing BOTH - accelerates ALL queries
Heavy Lifting is Done in the Background
• Query Servers (live query): answer queries from indexes and cubes
• Builder Servers (background): build indexes and cubes
• Cubes and indexes are stored on Hadoop
• Performance gain: ~5x-50x
• Cluster resources: ~0.2x
• Fully automated
LIVE Demo
• Point browser at: tableau.jethrodata.com
– Login: demo / demo
• Point browser at: jethrodata.qlik.com/
– No login needed
AWS Servers
Component   AWS HW                 Monthly Cost
Jethro      2x 120GB / 16 cores    $500 (spot)
Storage     EFS                    $200
Data:
• Based on TPC-DS benchmark
• 1TB raw data
• Fact table: ~2.9B rows
• Dimension tables: 6
Customer Row_IDs
1 1,4,9
4 10
6 8
7 2
14 5
23 6,7
32 3
Row_ID Customer Item Price
1 1 … …
2 7 … …
3 32 … …
4 1 … …
5 14 … …
6 23 … …
7 23 … …
8 6 … …
9 1 … …
10 4 … …
Jethro Indexes Accelerate BI Drill Downs
• Efficient
– EVERY column can be indexed
• Effective
– The more you filter, the faster it gets
– Dataset size doesn’t impact filtered query performance
• Efficient
– Multi-level index for direct access, no need for
in-mem
Users are NOT dependent on a single partition column for performance
Index Table
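The index table above can be sketched in a few lines of Python. This is only an illustration of the general idea (a map from each column value to the row IDs that contain it), not Jethro's actual on-disk format; the function names `build_index` and `lookup` are hypothetical. The rows are taken from the example table, with the elided Item and Price columns left out.

```python
from collections import defaultdict

# Rows from the slide's example table: (row_id, customer).
# Item and Price are elided on the slide, so they are omitted here.
ROWS = [(1, 1), (2, 7), (3, 32), (4, 1), (5, 14),
        (6, 23), (7, 23), (8, 6), (9, 1), (10, 4)]


def build_index(rows):
    """Map each customer value to the sorted list of row IDs that contain it."""
    index = defaultdict(list)
    for row_id, customer in rows:
        index[customer].append(row_id)
    return {value: sorted(ids) for value, ids in index.items()}


def lookup(index, values):
    """Return the row IDs matching any of the given customer values."""
    matched = set()
    for value in values:
        matched.update(index.get(value, []))
    return sorted(matched)


customer_index = build_index(ROWS)
print(lookup(customer_index, [1]))      # [1, 4, 9]
print(lookup(customer_index, [23, 4]))  # [6, 7, 10]
```

A filter on Customer then reads only the listed rows instead of scanning the whole table, which is why a more selective filter means less work.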
Auto-Cubes: How it Works
sales transactions (5B rows)
state   cust, prod, …   $sale
AL      …               $2.00
AK      …               $4.50
AZ      …               $1.00
…       …               …
WY      …               $4.25
Customer query:
select sum(sales) … where state='AZ'
Process:
use index to find all rows for 'AZ'. Sum $sale for selected rows
Response: $1,643
Jethro auto-generated query (move filter column into group by):
select sum(sales) … group by state
sales-by-state (50 rows)
State   $sale
AK      $256
AZ      $1,643
…       …
WY      $4,654
Subsequent queries served from auto-cube:
where state='AK'
where state in ('CA', 'NY')
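The auto-cube idea above (move the filter column into the group by, store the small result, and answer later filters from it) can be sketched as follows. This is a minimal illustration, not Jethro's implementation: the transaction rows are invented sample data, and `build_cube` and `sum_sales` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical sample of the sales transactions table: (state, sale).
# The slide's table has 5B rows; these few rows are invented for illustration.
TRANSACTIONS = [("AL", 2.00), ("AK", 4.50), ("AZ", 1.00),
                ("AZ", 3.25), ("AK", 0.50), ("WY", 4.25)]


def build_cube(rows):
    """Run the rewritten query once: sum of sales grouped by state."""
    cube = defaultdict(float)
    for state, sale in rows:
        cube[state] += sale
    return dict(cube)


def sum_sales(cube, states):
    """Answer a sum-of-sales query filtered on state from the cube alone."""
    return sum(cube.get(state, 0.0) for state in states)


cube = build_cube(TRANSACTIONS)
print(sum_sales(cube, ["AZ"]))        # 4.25
print(sum_sales(cube, ["AK", "WY"]))  # 9.25
```

After the one-time aggregation, each filtered query touches at most one cube cell per requested state, regardless of how many transactions sit behind it.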
Jethro Auto-Cubes Accelerate BI Aggregations
• Automated
– Based on actual BI queries
• Adaptive
– Automatically adjust to changes in apps and
data
• Efficient
– Dozens of small and highly efficient cubes,
matching every aggregation
– Use indexes for granular queries instead of
creating large cubes
Jethro Auto Cubes drive uninterrupted self-service BI
[Figure: sales transactions (5B rows) aggregated into sales by State (50 rows)]
Jethro Query Optimization Process
1. Result-Cache
• Exact repeat of
previous query
• Results were saved
in storage
2. Auto Cube
• Scan existing
cubes for a match
• Cubes evaluated
from smallest to
largest
3. Index Access
• Apply filters using
indexes
• Fetch and process
ONLY relevant
rows and cols
Optimizer
• Rewrite query: join elimination, partition pruning,
predicate push down…
• Select best execution path: cache, cubes or indexes
The BEST way to speed up a SQL query is to have it do LESS work
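The three-step order described above (result cache, then cubes from smallest to largest, then index access) can be sketched as a simple cascade. Everything here is an assumption made for illustration: the sample rows, the in-memory dictionaries standing in for cache, cubes and indexes, and the function name `sum_sales`. It only handles a sum of sales with a single equality filter.

```python
# Hypothetical base rows: (state, customer, sale).
ROWS = [("AZ", 1, 1.00), ("AZ", 7, 3.25), ("AK", 1, 4.50), ("WY", 23, 4.25)]

# Index on customer: value -> positions in ROWS (assumed built in the background).
CUSTOMER_INDEX = {}
for pos, (_, customer, _) in enumerate(ROWS):
    CUSTOMER_INDEX.setdefault(customer, []).append(pos)

# Cubes keyed by their single dimension column: column -> {value: sum of sales}.
CUBES = {"state": {}}
for state, _, sale in ROWS:
    CUBES["state"][state] = CUBES["state"].get(state, 0.0) + sale

RESULT_CACHE = {}  # (column, value) -> saved result


def sum_sales(column, value):
    """Sum sales where column equals value; return (path used, result)."""
    key = (column, value)
    if key in RESULT_CACHE:                     # 1. exact repeat of a previous query
        return "cache", RESULT_CACHE[key]
    cubes = sorted(CUBES.items(), key=lambda kv: len(kv[1]))  # smallest cube first
    for dim, cells in cubes:                    # 2. scan existing cubes for a match
        if dim == column:
            result = cells.get(value, 0.0)
            RESULT_CACHE[key] = result
            return "cube", result
    if column == "customer":                    # 3. index access: only matching rows
        result = sum(ROWS[pos][2] for pos in CUSTOMER_INDEX.get(value, []))
        RESULT_CACHE[key] = result
        return "index", result
    raise ValueError(f"no acceleration path for column {column!r}")


print(sum_sales("state", "AZ"))   # ('cube', 4.25)
print(sum_sales("customer", 1))   # ('index', 5.5)
print(sum_sales("customer", 1))   # ('cache', 5.5)
```

Each step does less work than the one after it, so the cheapest applicable path is tried first.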
Incremental Updates Do not Impact Performance
[Diagram: ETL delivers new data to a watch folder; a background process performs an incremental update of indexes and cubes. Data, cubes, and indexes each have an original part and an incremental part.]
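One way to picture an incremental update is to fold only the new rows into the existing cube and index, leaving what was already built untouched. The sketch below assumes a sum-by-state cube and a state index with invented starting values; `apply_increment` is a hypothetical name, and the deck does not describe Jethro's actual update mechanics at this level.

```python
def apply_increment(cube, index, next_row_id, new_rows):
    """Fold a batch of new (state, sale) rows into an existing cube and index.

    Only the new rows are read; existing cube cells and index entries are
    extended in place rather than rebuilt from the full table.
    """
    for offset, (state, sale) in enumerate(new_rows):
        cube[state] = cube.get(state, 0.0) + sale                  # update aggregate
        index.setdefault(state, []).append(next_row_id + offset)   # append row ID
    return next_row_id + len(new_rows)


# Hypothetical starting state built from three original rows (IDs 1 to 3).
cube = {"AZ": 1.00, "AK": 4.50, "WY": 4.25}
index = {"AZ": [1], "AK": [2], "WY": [3]}

# A new batch arrives (for example, files dropped by ETL into a watch folder).
next_id = apply_increment(cube, index, 4, [("AZ", 3.25), ("CA", 2.00)])
print(cube["AZ"], index["AZ"], next_id)  # 4.25 [1, 4] 6
```

The cost of the update depends on the size of the new batch, not on the size of the data already loaded.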
Scales to 1,000’s of Users
…
• Servers are stateless, data centrally
shared
– Cubes, indexes, results shared by
servers
• Automated load balancing
– Dynamically add / drop Jethro servers
• Minimal sensitivity to cluster load
– Segregate workload by designating
specific servers to specific groups
…
Stressed and Hardened by Customers in Production
Jethro and Integration (Hive 3)
[Diagram labels: Queries, security, Sentry]
Performance, Scale, Cost
• Performance – responds in seconds
– ALL BI queries, 100’s of concurrent users, billions of rows
• Self driving – no manual performance engineering
– Cubes and Indexes are fully automated
• Resource efficiency – reduced cluster usage
– All BI compute on Jethro nodes, significantly fewer resources
• App compatibility – “as is”
– No changes to BI apps or data model
EDW Performance at Hadoop Scale & Cost
Thank You
Backup Slides
Jethro System Diagram
Client Applications
• Commercial BI Tools
• Homegrown Viz Apps
• SQL Clients
SQL 92 via ODBC / JDBC
• AutoCubes
• Full Indexing
• Intelligent Cache
Source Data
• Hadoop (Hive, Impala,…)
• EDW
• Text Files
Jethro Acceleration Engine
Any ETL
• Cube and Index Builder
Jethro Manager
Network
Storage
Interactive BI Market Map
[Diagram: approaches positioned from non-interactive to interactive: Full-Scan, Manual Cube, Auto Cube, Auto Index; use cases shown: Data Science, Interactive BI]
Customer Insights & Profitability
• Industry: Car Rental
– Leading global car rental company
– Multiple brands, 5,000+ locations,
150+ countries
– Millions of transactions, billions of
marketing and sales data points
• Results:
– Performance: dashboards return in
10 sec instead of 10 min
– Self-Service: end-users are able to
create their own analytics without IT
– Data Lake: data for all brands and
geos in one place
Leading Car Rental Company
Before: Transactions, marketing -> Hortonworks HDP -> Oracle Data Mart -> Tableau
After: Transactions, marketing -> Hortonworks HDP -> Jethro Acceleration -> Tableau
Physician Patient Tracking
• Industry: Health Care
– Leading data & tech provider in the
health care industry
– 500 healthcare organizations, 850K
physicians, 375K clinical facilities, more
than 230M Americans
• Results:
– Scale: 1,000’s of concurrent users
– Performance: 85% of interactions
under 5 sec
– Security: Access control by user; HIPAA
Leading Health Data Provider
Before: Physician / Patient Details -> Hortonworks HDP -> Teradata Data Mart -> Tableau
After: Physician / Patient Details -> Hortonworks HDP -> Jethro Acceleration -> Tableau
Financial operational apps over
• Industry: Banking
– Top 15 global Bank
– Operations in 35+ countries
– Personal, business, public sector and
institutional clients
• Results:
– Functional: offload BI apps “as-is” from
legacy EDW to Hadoop
– $Savings: eliminate need for annual
EDW expansion
– ROI: increase usage and value of data
lake investment
Before: Many data sources -> Hortonworks HDP -> Vertica, other EDW Data Marts -> Tableau, other BI
After: Many data sources -> Hortonworks HDP -> Jethro Acceleration -> Tableau, other BI



Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Último (20)

Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Enabling real interactive BI on Hadoop

  • 1. Enabling Interactive BI on Hadoop Boaz Raufman CTO / Co-Founder Jethro
  • 2. Interactive BI is a Unique Use-Case • Data Science, ETL, Reporting, Machine Learning: non-interactive, managed set of queries, few concurrent users • Interactive BI: interactive, variety of generated queries, many concurrent users
  • 3. Interactive BI challenges: Performance • My query is too slow! • Resolution: – Data engineering • Partitioning, Sorting, De-normalize, Pre-aggregation, Pre-calculation, etc. – Increase cluster size • Cost: – Effort time and costs $$$ – Resources $$$ • Limitations – Data engineering can’t optimize all queries
  • 4. Interactive BI challenges: Variety • My dashboard generates many different queries – Multiple dimensions, multiple measures, complex expressions, various filters, low/high cardinality filters, various table relations, … • Resolution: – More data engineering • Cost: – Effort, time and costs $$$ – Delays application development and deployment $$$ • Limitations: – Imposes limitations on the app – Performance degradation Manual data engineering is costly and cannot completely resolve the variety of business needs in a timely manner
  • 5. Interactive BI challenges: Concurrency • A single dashboard interaction can issue many queries • I have many concurrent users • Resolution: – Increase cluster size • Cost: – Resources $$$ – Impacts other workloads on my Hadoop cluster Resource resizing will never catch up with business needs
  • 6. SQL on Hadoop Engines are not a fit for Interactive BI Pros • General purpose • Parallel execution • Scalable resource utilization • Can eventually resolve every query via full scan • Great for ETL, Reporting, Machine learning, Data Discovery Cons • Resource consuming • Struggle with concurrency • Optimizations require manual data engineering • Not optimized for the variety and concurrency requirements of interactive BI use cases An interactive BI acceleration tool is complementary to SQL on Hadoop Engines
  • 7. Solution Requirements • Consistent interactive response times (<10 sec) • Efficiently handle a variety of BI queries • Minimal resource utilization per query, allowing high concurrency • Scalable • Automatic – data engineering should be handled by the data platform In addition: • Consistent performance upon ingestion of new data
  • 8. The Realm of Queries • Select * from … • Select sum(a), sum(b) • Select sum(a), sum(b) group by c,d • Select sum(a) • Select sum(b) • Select a,b,d where e=x • Select sum(a), sum(b) where c=y group by d • Select sum(a), sum(b) where e=x group by d We need to be optimized only for the sub-set of queries that is relevant for Interactive BI
  • 9. Jethro Adaptive Approach to Interactive BI • Interactive BI is about visualizing data for humans • It is composed mainly of: – Aggregations grouped by low cardinality dimensions – Filters of either low or high cardinality • To handle aggregations we use pre-aggregation (cubes) • To handle high cardinality filtering we use indexes • Engine adapts to dashboard queries – Acceleration objects are automatically generated based on user queries
  • 10. Cubes or Indexes? You need BOTH! • Cubes: good for accelerating aggregated (summary) queries – Poor at detailed queries • Indexes: good for accelerating granular (detailed) queries – Poor at summary queries Jethro is unique in providing BOTH - accelerates ALL queries (Chart: performance, from good to poor, of cubes and indexes by type of query, from summary to detailed)
  • 11. Heavy Lifting is done in the Background • Live (query servers): answer queries from indexes and cubes • Background (builder servers): build indexes and cubes • Cubes and indexes: fully automated (stored on Hadoop) • Performance gain ~5x-50x • Cluster resources ~0.2x
  • 12. LIVE Demo • Point browser at: tableau.jethrodata.com – Login: demo / demo • Point browser at: jethrodata.qlik.com/ – No login needed AWS servers (component: AWS HW, monthly cost): • Jethro: 2x 120GB / 16 cores, $500 (spot) • Storage: EFS, $200 Data: • Based on TPC-DS benchmark • 1TB raw data • Fact table: ~2.9B rows • Dimension tables: 6
  • 13. Jethro Indexes Accelerate BI Drill Downs • Efficient – EVERY column can be indexed • Effective – The more you filter, the faster it gets – Dataset size doesn’t impact filtered query perf • Efficient – Multi-level index for direct access, no need for in-mem Users NOT dependent on a single partition col for performance Table (Row_ID: Customer; Item and Price elided): 1: 1, 2: 7, 3: 32, 4: 1, 5: 14, 6: 23, 7: 23, 8: 6, 9: 1, 10: 4 Index (Customer: Row_IDs): 1: 1,4,9; 4: 10; 6: 8; 7: 2; 14: 5; 23: 6,7; 32: 3
  • 14. Auto-Cubes: How it Works • Customer query: select sum(sales) … where state='AZ' • Process: use index to find all rows for 'AZ', then sum $sale for the selected rows of the sales transactions table (5B rows; columns state, cust, prod, …, $sale; sample rows AL $2.00, AK $4.50, AZ $1.00, …, WY $4.25) • Response: $1,643 • Jethro auto-generated query (moves the filter column into the group by): select sum(sales) … group by state • Result: sales-by-state auto-cube (50 rows): AK $256, AZ $1,643, …, WY $4,654 • Subsequent queries served from the auto-cube: where state='AK'; where state in ('CA', 'NY')
  • 15. Jethro Auto-Cubes Accelerate BI Aggregations • Automated – Based on actual BI queries • Adaptive – Automatically adjust to changes in apps and data • Efficient – Dozens of small and highly efficient cubes, matching every aggregation – Use indexes for granular queries instead of creating large cubes Jethro Auto Cubes drive uninterrupted self-service BI (Figure: sales transactions table, 5B rows, with columns state, cust, prod, …, $sale, aggregated into a sales-by-state cube, 50 rows: AK $256, AZ $1,643, …, WY $4,654)
  • 16. Jethro Query Optimization Process Optimizer • Rewrite query: join elimination, partition pruning, predicate push down… • Select best execution path: cache, cubes or indexes 1. Result-Cache • Exact repeat of a previous query • Results were saved in storage 2. Auto Cube • Scan existing cubes for a match • Cubes evaluated from smallest to largest 3. Index Access • Apply filters using indexes • Fetch and process ONLY relevant rows and cols The BEST way to speed up a SQL query is to have it do LESS work
  • 17. Incremental Updates Do not Impact Performance • ETL writes new data to a watch folder • Background incremental update of indexes and cubes (Diagram: original and incremental portions of data, indexes and cubes)
  • 18. Scales to 1,000’s of Users … • Servers are stateless, data centrally shared – Cubes, indexes, results shared by servers • Automated load balancing – Dynamically add / drop Jethro servers • Minimal sensitivity to cluster load – Segregate workload by designating specific servers to specific groups …
  • 19. Stressed and Hardened by Customers in Production
  • 20. Jethro and Integration (Hive 3) security Queries Sentry
  • 21. Performance, Scale, Cost • Performance – responds in seconds – ALL BI queries, 100’s of concurrent users, BB’s of rows • Self driving – no manual performance engineering – Cubes and Indexes are fully automated • Resource efficiency – reduced cluster usage – All BI compute on Jethro nodes, significantly fewer resources • App compatibility – “as is” – No changes to BI apps or data model EDW Performance at Hadoop Scale & Cost
  • 24. Jethro System Diagram Client Applications • Commercial BI Tools • Homegrown Viz Apps • SQL Clients SQL 92 via ODBC / JDBC • AutoCubes • Full Indexing • Intelligent Cache Source Data • Hadoop (Hive, Impala,…) • EDW • Text Files Jethro Acceleration Engine Any ETL • Cube and Index Builder Jethro Manager Network Storage
  • 25. Interactive BI Market Map Non interactive Interactive Full-Scan Full-Scan Manual Cube Auto Cube Auto Index Data Science Interactive BI
  • 26. Customer Insights & Profitability • Industry: Car Rental – Leading global car rental – Multiple brands, 5,000+ locations, 150+ countries – MM’s of transactions, BB’s of marketing and sales data points • Results: – Performance: dashboards return in 10sec instead of 10min – Self-Service: end-users are able to create own analytics without IT – Data Lake: data for all brands and geos in one place (Before/After diagram labels: Leading Car Rental Company; Hortonworks HDP; Jethro Acceleration; Hortonworks HDP; Oracle Data Mart; Transactions, marketing; Tableau)
  • 27. Physician Patient Tracking • Industry: Health Care – Leading data & tech provider in the health care industry – 500 healthcare organizations, 850K physicians, 375K clinical facilities, more than 230M Americans • Results – Scale: 1,000’s of concurrent users – Performance: 85% of interactions under 5sec – Security: Access control by user; HIPAA (Before/After diagram labels: Leading Health Data Provider; Hortonworks HDP; Jethro Acceleration; Hortonworks HDP; Teradata Data Mart; Physician / Patient Details; Tableau)
  • 28. Financial operational apps over • Industry: Banking – Top 15 global Bank – Operations in 35+ countries – Personal, business, public sector and institutional clients • Results – Functional: offload BI apps “as-is” from legacy EDW to Hadoop – $Savings: eliminate need for annual EDW expansion – ROI: increase usage and value of data lake investment (Before/After diagram labels: Hortonworks HDP; Jethro Acceleration; Hortonworks HDP; Vertica, other EDW Data Marts; Many data sources; Tableau, other BI)
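
The index, auto-cube and optimizer slides (13, 14 and 16) can be illustrated together with a small toy model. This is only a sketch of the general idea, not Jethro's implementation: the Row_ID to Customer mapping is copied from slide 13, while the sale amounts, the `Engine` class and all function names are hypothetical, and the cube is built synchronously here where the slides describe a background builder.

```python
from collections import defaultdict

# Row_ID -> Customer, copied from slide 13.
CUSTOMER = {1: 1, 2: 7, 3: 32, 4: 1, 5: 14, 6: 23, 7: 23, 8: 6, 9: 1, 10: 4}
# Hypothetical sale amounts (the slide elides Price): each row's sale equals its Row_ID.
SALE = {row_id: float(row_id) for row_id in CUSTOMER}


def build_index(column):
    """Inverted index: map each distinct value to the sorted row IDs that hold it."""
    index = defaultdict(list)
    for row_id in sorted(column):
        index[column[row_id]].append(row_id)
    return dict(index)


def build_cube(column, measure):
    """Pre-aggregation: sum(measure) grouped by the column's values."""
    cube = defaultdict(float)
    for row_id, value in column.items():
        cube[value] += measure[row_id]
    return dict(cube)


class Engine:
    """Toy engine that tries result cache, then cube, then index (order from slide 16)."""

    def __init__(self):
        self.index = build_index(CUSTOMER)
        self.cube = None   # no auto-cube exists until a first query has been seen
        self.cache = {}    # result cache keyed by the exact filter set

    def sum_sales(self, customers):
        """Return (sum of sales for the given customers, execution path used)."""
        key = frozenset(customers)
        if key in self.cache:                      # 1. exact repeat of a previous query
            return self.cache[key], "cache"
        if self.cube is not None:                  # 2. answer from the pre-aggregated cube
            total = sum(self.cube.get(c, 0.0) for c in key)
            path = "cube"
        else:                                      # 3. index access: touch only matching rows
            row_ids = [r for c in key for r in self.index.get(c, [])]
            total = sum(SALE[r] for r in row_ids)
            path = "index"
            # Move the filter column into the group by (slide 14) and keep the result,
            # so later filters on the same column avoid the fact table entirely.
            self.cube = build_cube(CUSTOMER, SALE)
        self.cache[key] = total
        return total, path
```

In this sketch the first filtered query is answered from the index and triggers the cube build, an exact repeat is answered from the result cache, and a different filter on the same column is answered from the cube without touching the fact table.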