Big Data Developers Moscow Meetup #1: SQL on Hadoop
1. Big Data Developers Meetup #1 Aug 2014
Andrey.vykhodtsev@ru.ibm.com
Central & Eastern Europe BigData Tech Sales
2. First Meetup 2014
•About SQL on Hadoop
•As objective an overview as possible, and a constructive dialogue
•Based on respect for other technologies, including competing ones
•No holy wars
•Light snacks are provided, so help yourself
•Time: 19:00 to 22:00; the program ends at 21:00, and we must leave the building by 22:00
3. Agenda
•What is this Hadoop thing?
•Why SQL on Hadoop?
•What is Hive?
•SQL-on-Hadoop landscape
•InfoSphere BigInsights for Hadoop with Big SQL
•What is it?
•SQL capabilities
•Architecture
•Application portability and integration
•Enterprise capabilities
•Performance
•Conclusion
4. Big Data Scenarios Span Many Industries – and rely on Hadoop
Data Warehouse Modernization
•Optimize existing EDW environment – size, performance, and TCO
•Capture, offload, and analyze massive amounts of data to get new insights
360 View of the Customer
•Text analytics on social media commentary around life events
•Link social media profiles to actual customers
Cyber Security
•Analyze massive volumes of data that can’t be handled by existing SIEM systems
•Monitor web and email traffic to identify potential threats such as Internet drug trafficking and prostitution
5. The Goal of Hadoop
Manage large volumes of data
Scalable to any volume
Off-load from the warehouse
Identify unique customers
Reduce Costs
Commodity hardware
Common tools
In-house skills
Analyze new data types
Improve business decisions
Understand sentiment
Analyze data-in-motion
6. What is Hadoop?
[Figure: MapReduce data flow. Input files (splits 0 to 5) pass through the Map phase into intermediate files, then through the Reduce phase into output files (outputs 0 to 2). A client (C) and a master (M) are also shown.]
•Framework to process big data in parallel on a cluster
•What's new/different?
•Free, open source
•Uses commodity hardware
•“Move programs to the data”
•Scale both processing and storage by simply adding nodes
•Makes big data processing accessible to everyone
•Two key things to understand Hadoop:
•How files are stored
•How files are processed
7. How files are stored: HDFS
•Key ideas:
•Divide big files in blocks and store blocks randomly across cluster
•Provide API to ask: where are the pieces of this file?
•=> Programs can be shipped to nodes for parallel distributed processing
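The key ideas above can be sketched in a few lines of Python. This is a toy in-memory model, not the HDFS API: the names `store_file` and `locate_blocks`, the tiny block size, and the replication factor of 3 are all assumptions made for illustration.

```python
import random

def store_file(data, nodes, block_size=4, replication=3, seed=0):
    """Split data into fixed-size blocks and place each block on random nodes.

    Returns (blocks, placement): blocks maps block id -> bytes,
    placement maps block id -> list of node names holding a copy.
    """
    rng = random.Random(seed)  # seeded so the illustration is repeatable
    blocks = {}
    placement = {}
    for block_id, start in enumerate(range(0, len(data), block_size)):
        blocks[block_id] = data[start:start + block_size]
        # Each copy of a block lands on a different node.
        placement[block_id] = rng.sample(nodes, replication)
    return blocks, placement

def locate_blocks(placement):
    """Answer the question: where are the pieces of this file?"""
    return {block_id: sorted(holders) for block_id, holders in placement.items()}

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks, placement = store_file(b"0123456789abcdef", nodes)
locations = locate_blocks(placement)
```

With the answer from `locate_blocks`, a scheduler can ship a program to a node that already holds the block it needs instead of moving the block to the program.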
[Figure: a logical file is divided into blocks 1 to 4; three copies of each block are stored across the cluster.]
8. How Files are Processed: MapReduce
•Common pattern in data processing: apply a function, then aggregate
grep "World Cup" *.txt | wc -l
•User simply writes two pieces of code: “mapper” and “reducer”
•Mapper code executes on every split of every file
•Reducer consumes/aggregates mapper outputs
•The Hadoop MR framework takes care of the rest (resource allocation, scheduling, coordination, temping of intermediate results, storage of final result on HDFS)
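The grep-and-count pattern above can be sketched as a mapper and a reducer in plain Python. This is a single-process illustration, not Hadoop code: the function names and the sample splits are assumptions, and in a real cluster the framework would run each mapper on the node that holds its split.

```python
def mapper(split, phrase):
    """Runs on one split: count the lines that contain the phrase."""
    return sum(1 for line in split if phrase in line)

def reducer(partial_counts):
    """Aggregates the mapper outputs into a single result."""
    return sum(partial_counts)

# Three splits of a logical file; in Hadoop each would sit on a different node.
splits = [
    ["World Cup final tonight", "weather report"],
    ["stock prices", "World Cup tickets sold out"],
    ["World Cup schedule", "World Cup highlights", "local news"],
]

partial_counts = [mapper(split, "World Cup") for split in splits]
total = reducer(partial_counts)
```

The mapper plays the role of `grep` and the reducer plays the role of `wc -l`; everything between them is what the framework takes care of.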
[Figure: a logical file is divided into splits 1 to 3 stored across the cluster; a Map task runs on each split and a Reduce step produces the result.]
9. SQL on Hadoop and Hive
•Hadoop can process data of any kind (as long as it's splittable, etc)
•A very common scenario:
•Tabular data
•Programs that “query” the data
•Java Hadoop APIs are the wrong tool for this
•Too low level, steep learning curve
•Require strong programming expertise
•Universally accepted solution: SQL
•Enter Hive ...
1.Impose relational structure on plain files
2.Translate SELECT statements to MapReduce jobs
3.Hide all the low level details
10. Why SQL on Hadoop?
Hadoop stores large volumes and varieties of data
SQL gets information and insight out of Hadoop
SQL leverages existing IT skills resulting in quicker time to value and lower cost
11. Hive
•One of the most popular Hadoop-related technologies
•Ships with all major Hadoop distributions
•Hive opens up Hadoop to anyone with SQL skills
•Simplified and shortened development cycle
•Little Java/MapReduce knowledge required
•Three key concepts
•Hive SerDe
•Hive Table
•Hive Metastore
12. Hive SerDes
•SerDe = Serializer + Deserializer
•Deserializer = Java code that implements mapping from Hadoop “record” to Hive “row”
•A Hadoop record is just a byte array
•A Hive row has columns with names and data types
•Serializer maps Hive row to Hadoop record (for writing)
•Many built-in SerDes
•Delimited text files
•JSON
•XML
•REGEX
•AVRO
•Can add your own custom serdes
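A minimal sketch of the idea, written in Python for brevity. Real Hive SerDes are Java classes; the column names, the `|` delimiter, and the `deserialize`/`serialize` function names here are hypothetical and only illustrate the record-to-row mapping.

```python
# Hypothetical schema: column names paired with the Python type used to parse them.
SCHEMA = [("ipaddress", str), ("status", int), ("message", str)]

def deserialize(record, schema=SCHEMA, delim="|"):
    """Map a raw record (a byte array) to a row with named, typed columns."""
    fields = record.decode("utf-8").split(delim)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

def serialize(row, schema=SCHEMA, delim="|"):
    """Map a row back to a raw record for writing."""
    return delim.join(str(row[name]) for name, _ in schema).encode("utf-8")

record = b"10.0.0.1|404|page not found"
row = deserialize(record)
```

The file itself stays a sequence of bytes; the column names and types exist only in the mapping code, which is why a different SerDe can read the same file differently.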
13. Hive Tables
•A Hive table imposes a relational “schema” (list of column names and types) on a file
•Schema is purely logical
•Data in the file is not altered in any way
•“Schema on read” (as opposed to the “schema on write” of traditional RDBMSs)
•Hive table = Metadata + Data
•CREATE TABLE statement (metadata)
•A directory containing one or more files (data)
CREATE TABLE logEvents
(ipaddress STRING, eventtime TIMESTAMP, message STRING)
ROW FORMAT SERDE 'org.apache.hive…LazySimpleSerde'
WITH SERDEPROPERTIES ( 'field.delim' = '|' )
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.TextOutputFormat'
LOCATION '/user/hive/warehouse/sample.db/logevents';
14. Hive MetaStore
•The Hive metastore stores metadata about all the tables
•Usually backed by a conventional relational db (not on HDFS)
•Default: Derby
•MySQL, DB2, Oracle
•Table metadata
•Schema (column names and types)
•Location (directory on HDFS)
•SerDe
•Hadoop InputFormat/OutputFormat
•Partition information
•Properties (column and row delimiters, etc)
•Security (access control)
15. Hadoop Latency and Hive SQL Features
•Hive was not designed to be an RDBMS, but to hide the low-level details of MapReduce
•But the inevitable questions came up …
•Hadoop Latency
•Why is my query so slow compared to XYZ?
•Why does it take so long to retrieve a few rows?
•Hive SQL Features
•How do I define a view, stored procedure, …?
•What’s wrong with this subquery?
•No DATE, DECIMAL, VARCHAR data types?
16. SQL-on-Hadoop landscape
•The SQL-on-Hadoop landscape changes constantly!
•Being relatively new to the SQL game, these solutions have all generally meant compromising on one or more of:
•Speed
•Robust SQL
•Enterprise features
•Interoperability with the Hadoop ecosystem
•IBM InfoSphere BigInsights for Hadoop with Big SQL is based upon tried and true IBM relational technology, addressing all of these areas
17. Introducing Big SQL 3.0
•Goal: bring SQL on Hadoop to the next level
•Low-latency HDFS-based parallelism
•Move programs to the data
•No MapReduce
=> MPP engine
•Avoid unnecessary temping
=> Message passing
•Avoid process startup/teardown
=> Daemon processes
•Full SQL support
[Figure: a SQL-based application connects through the IBM data server client to the Big SQL engine (SQL MPP run-time), which reads data on HDFS in CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats.]
19. Big SQL highlights
•Full support for subqueries
•In SELECT, FROM, WHERE and HAVING clauses
•Correlated and uncorrelated
•Equality, non-equality subqueries
•EXISTS, NOT EXISTS, IN, ANY, SOME, etc.
•All standard join operations
•Standard and ANSI join syntax
•Inner, outer, and full outer joins
•Equality, non-equality, cross join support
•Multi-value join
•UNION, INTERSECT, EXCEPT
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY
s_name
ORDER BY
numwait desc,
s_name;
20. Big SQL in the Hadoop Ecosystem
• Fully integrated with ecosystem
– Hive Metastore
– Hive Tables
– Hive SerDes
– Hive partitioning
– Hive Statistics
– Columnar formats
• ORC
• Parquet
• RCFile
• Completely open, without compromises
• No proprietary storage format
[Figure: Hive, Pig, Sqoop, and Big SQL all reach the Hive Metastore on the Hadoop cluster through the Hive APIs.]
21. Architected for performance
•Architected from the ground up for low latency and high throughput
•MapReduce replaced with a modern MPP architecture
•Compiler and runtime are native code (not Java)
•Big SQL worker daemons live directly on cluster
•Continuously running (no startup latency)
•Processing happens locally at the data
•Message passing allows data to flow directly between nodes
•Operations occur in memory with the ability to spill to disk
•Supports aggregations and sorts larger than available RAM
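How a sort can exceed available memory is easiest to see in a classic external merge sort. The sketch below is that textbook technique, not Big SQL's actual implementation; the function name `external_sort` and the tiny `max_in_memory` limit are assumptions for illustration.

```python
import heapq
import os
import tempfile

def external_sort(values, max_in_memory=3):
    """Sort integers while buffering at most max_in_memory of them at a time.

    Full buffers are sorted and spilled to temporary files as runs;
    the runs are then combined with a streaming k-way merge.
    """
    run_paths = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it to disk as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    files = [open(p) for p in run_paths]
    try:
        runs = [(int(line) for line in f) for f in files]
        # heapq.merge reads one value per run at a time; the result is
        # collected into a list here only to keep the sketch short.
        return list(heapq.merge(*runs))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

sorted_values = external_sort([9, 4, 7, 1, 8, 2, 6, 3, 5, 0])
```

A real engine would stream the merged output to the next operator rather than collect it, and would spill only when memory pressure requires it.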
[Figure: cluster layout on HDFS/GPFS. One head node runs Big SQL and another runs the Hive Metastore; each of four compute nodes runs a Task Tracker, a Data Node, and Big SQL.]
22. Extreme parallelism
•Massively parallel SQL engine that replaces MR
•Shared-nothing architecture that eliminates scalability and networking issues
•Engine pushes processing out to data nodes to maximize data locality. Hadoop data accessed natively via C++ and Java readers and writers.
•Inter- and intra-node parallelism: work is distributed across multiple worker nodes, and on each node multiple worker threads collaborate on I/O and data processing (scale out horizontally and scale up vertically)
•Intelligent data partition elimination based on SQL predicates
•Fault tolerance through active health monitoring and management of parallel data and worker nodes
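Partition elimination, mentioned above, can be illustrated with a small sketch. The directory layout, the `month` partitioning column, and the `prune_partitions` helper are hypothetical; the point is only that a predicate on the partitioning column lets the engine skip whole directories before reading any data.

```python
# Hypothetical layout: one directory per value of the partitioning column.
PARTITIONS = {
    "2014-06": "/warehouse/sales/month=2014-06",
    "2014-07": "/warehouse/sales/month=2014-07",
    "2014-08": "/warehouse/sales/month=2014-08",
}

def prune_partitions(partitions, low, high):
    """Keep only partitions whose key can satisfy low <= month <= high."""
    return [path for key, path in sorted(partitions.items()) if low <= key <= high]

# WHERE month BETWEEN '2014-07' AND '2014-08': the June directory is never read.
to_scan = prune_partitions(PARTITIONS, "2014-07", "2014-08")
```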
24. Big SQL 3.0 – Architecture (cont.)
•Big SQL's runtime execution engine is all native code
•For common table formats a native I/O engine is utilized
•e.g. delimited, RC, SEQ, Parquet, …
•For all others, a Java I/O engine is used
•Maximizes compatibility with existing tables
•Allows for custom file formats and SerDes
•All Big SQL built-in functions are native code
•Customer-built UDxs can be developed in C++ or Java
•Maximize performance without sacrificing extensibility
[Figure: a management node runs Big SQL; each compute node runs a Task Tracker, a Data Node, and a Big SQL worker. The worker contains a native I/O engine, a Java I/O engine (SerDe and I/O format), and a runtime with Java UDFs and native UDFs.]
25. Resource management
•Big SQL doesn't run in isolation
•Nodes tend to be shared with a variety of Hadoop services
•Task tracker
•Data node
•HBase region servers
•MapReduce jobs
•etc.
•Big SQL can be constrained to limit its footprint on the cluster
•% of CPU utilization
•% of memory utilization
•Resources are automatically adjusted based upon workload
•Always fitting within constraints
•Self-tuning memory manager that re-distributes resources across components dynamically
•Default workload management (WLM) concurrency control for heavy queries
[Figure: a compute node shared by a Task Tracker, a Data Node, Big SQL, HBase, and several MR tasks.]
26. Performance
•Query rewrites
•Exhaustive query rewrite capabilities
•Leverages additional metadata such as constraints and nullability
•Optimization
•Statistics and heuristic driven query optimization
•Query optimizer based upon decades of IBM RDBMS experience
•Tools and metrics
•Highly detailed explain plans and query diagnostic tools
•Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE
PERIOD.PERKEY=DAILY_SALES.PERKEY AND
PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND
STORE.STOREKEY=DAILY_SALES.STOREKEY AND
CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND
STORE_NUMBER='03' AND
CATEGORY=72
GROUP BY ITEM_DESC
[Figure: query transformation (dozens of query transformations) feeds access plan generation (hundreds or thousands of access plan options). The candidate plans shown join Store, Product, Daily Sales, and Period in different orders using NLJOIN, HSJOIN, and ZZJOIN operators.]
27. Statistics are key to performance
•Table statistics:
•Cardinality (count)
•Number of Files
•Total File Size
•Column statistics (this applies to column group stats also):
•Minimum value
•Maximum value
•Cardinality (non-nulls)
•Distribution (Number of Distinct Values)
•Number of null values
•Average Length of the column value (for string columns)
•Histogram
•Frequent Values (MFV)
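To make the column statistics concrete, here is a small Python sketch that computes several of them for one column. It is an illustration only, not how Big SQL gathers statistics; the `column_stats` function and the sample values are assumptions, and `None` stands in for SQL NULL.

```python
from collections import Counter

def column_stats(values, top_n=2):
    """Compute basic column statistics; None stands for SQL NULL."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    stats = {
        "cardinality": len(non_null),               # non-null values
        "nulls": len(values) - len(non_null),       # number of null values
        "distinct": len(counts),                    # number of distinct values
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "frequent_values": counts.most_common(top_n),
    }
    # Average length only makes sense for string columns.
    if non_null and all(isinstance(v, str) for v in non_null):
        stats["avg_length"] = sum(len(v) for v in non_null) / len(non_null)
    return stats

stats = column_stats(["RU", "US", "RU", None, "DE", "RU", "US"])
```

An optimizer can use such numbers to estimate how many rows a predicate will return, for example by treating an equality predicate as selecting roughly one row in every `distinct` values when no histogram is available.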
28. Application portability and integration
•Big SQL 3.0 adopts IBM's standard Data Server Client Drivers
•Robust, standards compliant ODBC, JDBC, and .NET drivers
•Same driver used for DB2 LUW, DB2/z and Informix
•Expands support to numerous languages (Python, Ruby, Perl, etc.)
•Putting the story together….
•Big SQL shares a common SQL dialect with DB2
•Big SQL shares the same client drivers with DB2
•Data warehouse augmentation just got significantly easier
Compatible SQL
Compatible Drivers
Portable Application
29. Application portability and integration (cont.)
•This compatibility extends beyond your own applications
•Open integration across Business Analytic Tools
•IBM Optim Data Studio performance tool portfolio
•Superior enablement for IBM Software – e.g. Cognos
•Enhanced support by 3rd party software – e.g. Microstrategy
30. Query federation
•Data never lives in isolation
•Whether Hadoop serves as a landing zone or as a queryable archive, it is desirable to query data across Hadoop and active data warehouses
•Big SQL provides the ability to query heterogeneous systems
•Join Hadoop to other relational databases
•Query optimizer understands capabilities of external system
•Including available statistics
•As much work as possible is pushed to each system to process
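The pushdown idea can be sketched with an in-memory SQLite database standing in for the external warehouse. Everything here is an assumption made for illustration (the tables, the sample rows, and the `federated_join` helper); it does not show Big SQL's federation syntax, only the principle that the filter runs inside the remote system and just the matching rows come back for the join.

```python
import sqlite3

# Stand-in for an external relational warehouse (an assumption for this sketch).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "RU"), (2, "US"), (3, "RU")],
)

# Stand-in for rows read from a table on Hadoop.
hadoop_clicks = [
    {"customer_id": 1, "url": "/home"},
    {"customer_id": 2, "url": "/pricing"},
    {"customer_id": 3, "url": "/docs"},
    {"customer_id": 1, "url": "/docs"},
]

def federated_join(country):
    """Join Hadoop-side clicks to warehouse customers from one country."""
    # Push the filter down: the warehouse evaluates the WHERE clause itself
    # and returns only the matching customer ids.
    rows = warehouse.execute(
        "SELECT id FROM customers WHERE country = ?", (country,)
    ).fetchall()
    wanted = {row[0] for row in rows}
    # Finish the join locally against the Hadoop-side rows.
    return [click for click in hadoop_clicks if click["customer_id"] in wanted]

ru_clicks = federated_join("RU")
```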
[Figure: a Big SQL head node and four compute nodes, each running a Task Tracker, a Data Node, and Big SQL.]
31. Enterprise security
•Users may be authenticated via
•Operating system
•Lightweight directory access protocol (LDAP)
•Kerberos
•User authorization mechanisms include
•Full GRANT/REVOKE based security
•Group and role based hierarchical security
•Object level, column level, or row level (fine-grained) access controls
•Auditing
•You may define audit policies and track user activity
•Transport layer security (TLS)
•Protect integrity and confidentiality of data between the client and Big SQL
32. Monitoring
•Comprehensive runtime monitoring infrastructure that helps answer the question: what is going on in my system?
•SQL interfaces to the monitoring data via table functions
•Ability to drill down into more granular metrics for problem determination and/or detailed performance analysis
•Runtime statistics collected during the execution of the section for a (SQL) access plan
•Support for event monitors to track specific types of operations and activities
•Protect against and discover unknown or unacceptable behaviors by monitoring data access via the audit facility
[Figure: in Big SQL 3.0, worker threads collect metrics locally and push data up incrementally through connection control blocks to the reporting level (example: service class); a monitor query extracts data directly from the reporting level.]
33. Performance, Benchmarking, Benchmarketing
•Performance matters to customers
•Benchmarking appeals to engineers and drives product innovation
•Benchmarketing is used to convey performance in a memorable and appealing way
•SQL over Hadoop is in the “Wild West” of benchmarketing
•100x claims! Compared to what? Conforming to what rules?
•The TPC (Transaction Processing Performance Council) is the granddaddy of all multi-vendor SQL-oriented organizations
•Formed in August 1988
•TPC-H and TPC-DS are the most relevant to SQL over Hadoop
–R/W nature of workload not suitable for HDFS
•Big Data Benchmarking Community (BDBC) formed
34. Power of Standard SQL
•Everyone loves performance numbers, but that's not the whole story
•How much work do you have to do to achieve those numbers?
•A portion of our internal performance numbers are based upon industry standard benchmarks
•Big SQL is capable of executing
•All 22 TPC-H queries without modification
•All 99 TPC-DS queries without modification
Original Query
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (
SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name
Re-written for Hive
SELECT s_name, count(1) AS numwait
FROM
(SELECT s_name FROM
(SELECT s_name, t2.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
WHERE l_receiptdate > l_commitdate
GROUP BY l_orderkey) t2
RIGHT OUTER JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM
(SELECT s_name, t1.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
GROUP BY l_orderkey) t1
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM orders o
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM nation n
JOIN supplier s
ON s.s_nationkey = n.n_nationkey
AND n.n_name = 'INDONESIA'
JOIN lineitem l
ON s.s_suppkey = l.l_suppkey
WHERE l.l_receiptdate > l.l_commitdate) l1
ON o.o_orderkey = l1.l_orderkey
AND o.o_orderstatus = 'F') l2
ON l2.l_orderkey = t1.l_orderkey) a
WHERE (count_suppkey > 1) or ((count_suppkey=1)
AND (l_suppkey <> max_suppkey))) l3
ON l3.l_orderkey = t2.l_orderkey) b
WHERE (count_suppkey is null)
OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
35. Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
36. Comparing Big SQL and Hive 0.12 for Decision Support Queries
Big SQL is 10x faster than Hive 0.12 (total workload elapsed time)
* Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
37. How many times faster is Big SQL than Hive 0.12?
* Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
[Figure: per-query speedup of Big SQL over Hive 0.12, with queries sorted by speedup ratio (worst to best). Max speedup of 74x; average speedup of 20x.]
38. Conclusion
•Today, it seems, performance numbers are the name of the game
•But in reality there is so much more…
•How rich is the SQL?
•How difficult is it to (re-)use your existing SQL?
•How secure is your data?
•Is your data still open for other uses on Hadoop?
•Can your queries span your enterprise?
•Can other Hadoop workloads co-exist in harmony?
•…
•With Big SQL 3.0 performance doesn't mean compromise
39. Try it now! InfoSphere BigInsights Quick Start
Free, no limit, non-production version of BigInsights
Features Big SQL, BigSheets, Text Analytics, Big R, management console, development tools
Tutorials and education available
ibm.co/QuickStart
40. Please Note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
41. Topics for upcoming meetups
•R on Hadoop
•File systems
•MapReduce/Spark/etc. engines
•Hadoop Security
•Spreadsheet analysis
•Text analysis
•?