Big Data is everywhere. We generate enough data to track every single transaction, sensor, and interaction. But what do we do with it? Using JethroData and Qlik, you can easily unlock the value of Big Data. Qlik can help you find outliers and trends in billions of rows and make what seems a long, esoteric process easy. The presentation will go through the problem statement and the Qlik approach to solving the problem. We will also introduce JethroData, a new Qlik partner.
Qonnections2015 - Why Qlik is better with Big Data
1. Why Qlik® is Better with Big Data
John Park, Qlik
Eli Singer, JethroData
28 April, 2015
2. Presenters (#qonnections)
Eli Singer
• CEO & Co-Founder, JethroData
• Data management, information security, e-commerce
• Over 20 years leading start-ups
• Twitter: @jethrodata
John Park
• Senior Solutions Architect, Partner Engineering
• ETL, data warehousing, software design, *NIX, architecture
• 2 years at Qlik, 7 years as a data warehouse consultant
• Twitter: @jpark328
4. Agenda: Why Qlik® is Better with Big Data
• Introduction
• Notes about Big Data / Hadoop / Data Lakes
• Current state of Data Lake implementations
• JethroData overview
• Demo: Let’s analyse 2.5 billion rows from the Data Lake
• JethroData architecture
• Why Qlik is better with Big Data
• Key takeaways
7. Big Data / Hadoop / Data Lakes
Collection of services and toolkits working with HDFS
8. Big Data / Hadoop / Data Lakes
Architecture pattern where data is stored/landed for processing by *
9. What is happening?
• Enterprises are adopting Hadoop for data scientists (science project -> real “production”)
• Hadoop is maturing
• Data Lake architecture is being adopted
• Companies and startups are building innovative services on Hadoop
• Data is becoming the lifeblood of companies
However, there are an incredible number of challenges in using Hadoop as the single source of truth and sole underpinning of a BI platform:
Image source: Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business, Gartner (August 2014)
10. Data Lake vision vs. reality
Scalability, low cost and performance were promised but…
It’s been a challenge working with BI tools
Is the Data Lake frozen?
12. Reasons why we are struggling
• Data Lakes were not designed to run ad-hoc / interactive queries
• Hadoop was designed to run batch processes (ML, ETL, predictive, canned reporting)
• Designed for data scientists, not business analysts
• Our trade-off: cost vs. scale vs. performance
13. Strategies for bridging the gap
• Use the best available technology to close the gap
• Solve technical problems one issue at a time
• Let’s do interactive queries on Hadoop!
15. Who is JethroData?
• A Qlik Technology Partner
• Developing a next-generation analytical database
• Focused on making interactive BI on Big Data a reality
• Combines analytical DB design with full-indexing technology
• JethroData 1.0 went GA on April 7, 2015
• Offices in NY and Israel
• Backed by world-class VCs
16. Common use cases
• Challenge: BI on Hadoop is slow
• Current solution: Replicate data into a separate EDW system for fast BI
• Challenge: EDW is expensive
• Current solution: Migrate batch processes to Hadoop, keep BI on EDW
• Pain: Maintaining two separate data systems is expensive and an operational nightmare
• With Jethro: Run fast BI directly on data in Hadoop
17. Why Is Hadoop So Slow?
Architecture: MPP / Full-Scan (all SQL-on-Hadoop)
Query: list books by author “Stephen King”
Process: each librarian is assigned a rack; they then view each book and check if the author is “Stephen King”; if so, they get the book title
Result: too slow, costly, unscalable
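The librarian analogy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Book` type, the `RACKS` data, and the `full_scan` function are hypothetical names invented here, with each rack standing in for one data partition. It shows that a full scan inspects every book no matter how few match.

```python
from typing import NamedTuple


class Book(NamedTuple):
    title: str
    author: str


# Hypothetical library: each inner list is one rack (one data partition).
RACKS = [
    [Book("It", "Stephen King"), Book("Emma", "Jane Austen")],
    [Book("Dune", "Frank Herbert"), Book("Misery", "Stephen King")],
    [Book("Ulysses", "James Joyce"), Book("Persuasion", "Jane Austen")],
]


def full_scan(racks, author):
    """Each 'librarian' views every book on its rack and checks the author."""
    titles = []
    inspected = 0
    for rack in racks:  # an MPP engine would process racks in parallel
        for book in rack:
            inspected += 1
            if book.author == author:
                titles.append(book.title)
    return titles, inspected


titles, inspected = full_scan(RACKS, "Stephen King")
print(titles, "books inspected:", inspected)
```

Adding librarians shortens the wall-clock time, but the total number of books inspected still grows with the size of the library.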
18. JethroData and Big Data: Index Access
Architecture: Index Access (only JethroData)
Query 1: list books by author “Stephen King”
Process: go to the Author index, find the entry for “Stephen King”, get the list of books, fetch only these books
Result: fast, minimal resources, scalable
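A minimal sketch of the index-access idea, again in Python. The names `BOOKS`, `build_author_index`, and `titles_by_author` are hypothetical, and a plain dictionary stands in for the index; this is not JethroData's actual on-disk format. It shows that a lookup touches only the matching rows.

```python
from collections import defaultdict

# Hypothetical book table: row id is the position in the list.
BOOKS = [
    ("It", "Stephen King"),
    ("Emma", "Jane Austen"),
    ("Dune", "Frank Herbert"),
    ("Misery", "Stephen King"),
    ("Ulysses", "James Joyce"),
]


def build_author_index(books):
    """Map each author to the list of row ids holding that author's books."""
    index = defaultdict(list)
    for row_id, (_title, author) in enumerate(books):
        index[author].append(row_id)
    return index


def titles_by_author(books, index, author):
    """Go to the index entry, then fetch only the rows it lists."""
    row_ids = index.get(author, [])
    return [books[row_id][0] for row_id in row_ids], len(row_ids)


AUTHOR_INDEX = build_author_index(BOOKS)
titles, fetched = titles_by_author(BOOKS, AUTHOR_INDEX, "Stephen King")
print(titles, "rows fetched:", fetched)
```

The cost of the query now depends on the size of the result rather than the size of the table. The keys of the index also give the distinct values of the column without reading any rows.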
20. Qlik and Jethro Demo Setup
• Hadoop environment: CDH 5.3; servers: NN: 1x r3.large; DN: 9x m1.large
• JethroData server: 1x r3.8xlarge, 244 GB RAM, 320 GB SSD (spot)
• Qlik Sense 1.10: r3.2xlarge, 4 vCPU, 30 GB RAM
• Demo 1: TPC-DS
– Based on TPC-DS, Replication Factor 1,000
– Sales_demo table based on store_sales fact table with dimension data added
– 2.5B rows, 33 columns
– 600 GB raw data
• Demo 2: Airline
– Airline data: 123 million commercial flight records + CSV (hybrid architecture)
[Architecture diagram: Client, HTTP/HTTPS protocol, Qlik Server (Business Discovery apps), Direct Discovery over ODBC protocol, Big Data indexing, Big Data platform (Hadoop, HDFS). Everything hosted on Amazon Web Services.]
21. SQL-on-Hadoop Architecture: MPP/Full-Scan
Client: SELECT state, sum(sales) FROM t1 WHERE prod='abc' GROUP BY state
[Diagram: five data nodes, each with its own query planner/manager and query executor]
Performance and resources based on the size of the dataset
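To illustrate why performance and resources follow dataset size in this architecture, here is a small Python sketch of the client query being fanned out to every data node. The `PARTITIONS` data and the `execute_on_node` and `run_query` functions are hypothetical stand-ins for the executors and planner in the diagram, not any engine's real API.

```python
from collections import Counter

# Hypothetical slices of table t1, one per data node: (state, prod, sales).
PARTITIONS = [
    [("NY", "abc", 10), ("CA", "xyz", 5)],
    [("CA", "abc", 7), ("NY", "abc", 3)],
    [("TX", "xyz", 4), ("CA", "abc", 1)],
]


def execute_on_node(rows, prod):
    """Query executor: scan every local row, filter, and partially aggregate."""
    partial = Counter()
    for state, row_prod, sales in rows:
        if row_prod == prod:
            partial[state] += sales
    return partial, len(rows)


def run_query(partitions, prod):
    """Planner/manager: send the scan to all nodes, then merge partial sums."""
    total = Counter()
    rows_read = 0
    for rows in partitions:  # a real engine runs these scans in parallel
        partial, scanned = execute_on_node(rows, prod)
        total.update(partial)  # Counter.update adds the partial sums
        rows_read += scanned
    return dict(total), rows_read


result, rows_read = run_query(PARTITIONS, "abc")
print(result, "rows read:", rows_read)
```

Only four of the six rows match the filter, yet all six are read, and that ratio worsens as the table grows or the filter becomes more selective.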
23. SQL on Hadoop: competitive landscape
Full-scan based solutions: read all rows, every time.
• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• Teradata/SQL-H
• Microsoft/PDW
Index-based solution: read ONLY needed rows.
• JethroData
Use-case comparison:
• Full-scan: optimal for ETL, predictive
• Index: optimal for interactive BI
24. JethroData: system overview
[Diagram components: data source, JethroLoader, Hadoop, JethroServer, queries; steps numbered 1 to 4 as described below]
Data source
• Hadoop
• EDW
• Streams
• …
Queries
• BI tools
• SQL client
• …
1. Initial load: extract data from relevant sources and load through JethroLoader. Incremental data can be loaded at short intervals.
2. Index and column files are stored in HDFS (or S3). Typical size is 35% of raw data.
3. Queries are sent via standard ODBC/JDBC interface. Automatic load balancing across servers.
4. JethroServer communicates directly with HDFS to retrieve relevant data. No MapReduce or Spark is used.
25. JethroData: architecture highlights
Every column is indexed!
Allows users to slice and dice any way they choose and always get a fast response.
Scales to any size
From 100M to 100B, columns and indexes are compressed and partitioned.
Super-easy to implement
Compatible with every Hadoop distribution. Installs on separate server(s) from the Hadoop cluster.
26. Jethro Indexes: innovative technology
• Fast to read
– Simple: inverted-list indexes map each column value to a list of rows
– Fast: direct O(1) access to each value entry
– Scale: distributed, highly hierarchical compressed bitmaps
• Fast to write
– Index files are appended, duplicate entries allowed
– Incremental: new data indexed as it comes in
– No locks, no random read/write
Patent pending: http://www.google.com/patents/WO2013001535A3?cl=en
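The read and write properties listed on this slide can be sketched with an append-only inverted index in Python. This is an illustration only: `AppendOnlyIndex` and its methods are hypothetical names, plain dictionaries replace the compressed bitmaps, and nothing here reflects the patented design. Each load appends a new segment, so the same value may have an entry in several segments, and a read merges them.

```python
class AppendOnlyIndex:
    """Inverted-list index for one column, written as append-only segments."""

    def __init__(self):
        # Each segment maps a column value to the row ids loaded in that batch.
        self.segments = []

    def append_batch(self, first_row_id, values):
        """Index a new batch without touching segments already written."""
        segment = {}
        for offset, value in enumerate(values):
            segment.setdefault(value, []).append(first_row_id + offset)
        self.segments.append(segment)

    def rows_for(self, value):
        """One hash lookup per segment; duplicate entries are merged on read."""
        rows = []
        for segment in self.segments:
            rows.extend(segment.get(value, []))
        return rows


index = AppendOnlyIndex()
index.append_batch(0, ["NY", "CA", "NY"])  # initial load, rows 0..2
index.append_batch(3, ["CA", "NY"])        # incremental load, rows 3..4
print(index.rows_for("NY"))
```

Because a load only adds a segment, writers never need to lock or rewrite existing index data; the trade-off is that readers combine entries from several segments.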
27. Intelligent caching technology
• Reuse of intermediate/final query results
– Repeat queries in sub-seconds
• Addresses wide top-of-the-funnel queries
– Analysis starts with queries with no/few filters
– Those queries are often repeated in dashboard scenarios
• Transparently adapts to incremental loads
– Execution on delta data + merge with saved results
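The "delta data + merge" idea can be sketched as follows. This is a simplified illustration, assuming a single cached aggregate (sum of sales per state); the `CachedSum` class and its fields are hypothetical and do not describe JethroData's cache. A repeated query reads no rows, and a query after an incremental load reads only the new rows.

```python
from collections import Counter


class CachedSum:
    """Caches SUM(sales) GROUP BY state and refreshes it from delta rows only."""

    def __init__(self):
        self.rows = []          # loaded data as (state, sales) pairs
        self.cached = Counter() # saved result of the last execution
        self.rows_covered = 0   # number of rows the saved result includes
        self.rows_scanned = 0   # bookkeeping: total rows read by queries

    def load(self, new_rows):
        """Incremental load: append rows, leave the saved result untouched."""
        self.rows.extend(new_rows)

    def query(self):
        """Run on the delta only, then merge into the saved result."""
        delta = self.rows[self.rows_covered:]
        for state, sales in delta:
            self.cached[state] += sales
        self.rows_scanned += len(delta)
        self.rows_covered = len(self.rows)
        return dict(self.cached)


table = CachedSum()
table.load([("NY", 10), ("CA", 5)])
first = table.query()      # scans 2 rows
repeat = table.query()     # scans 0 rows, served from the saved result
table.load([("NY", 1)])
refreshed = table.query()  # scans only the 1 new row
print(first, repeat, refreshed, table.rows_scanned)
```

This works directly for additive aggregates such as sums and counts; other aggregates would need different merge logic.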
[Charts: query speed (slow to fast) vs. query selectivity (few to more); query repeat (low to high) vs. query selectivity (few to more)]
28. Advantages of Qlik and JethroData
• Now we can get scale, cost, and performance
• Faster queries with more selection criteria
• Faster Direct Discovery load time, because dimensional unique values are already available in JethroData indexes
• Use of system-wide optimization and smart caching
• Creates an interactive Hadoop for Business Discovery
• Allows Qlik customers to have the same experience across all data sources, including Hadoop
[Diagram: Qlik application, Direct Discovery queries]
29. Why Qlik Is Better for Big Data
• Allows users to analyze big data the way business users want to
• Scalability, extensibility, and beautiful visualizations to tell your story
30. Key takeaways
• Qlik can do Big Data well with the right complementary technology
• Qlik's associative interface can open up new use cases with Big Data
• Qlik and Jethro can provide true interactive BI with Hadoop
31. Follow-up resources
• Qlik Sense hub, open to the public: http://jethrodata.qlik.com
• Free evaluation version of Jethro available at http://jethrodata.com/download