IBM Db2 Big SQL

IBM Db2 Big SQL and Open Source

Published in: Technology

  1. 1. IBM Db2 Big SQL and Open Source Hebert Pereyra Big SQL and Data Virtualization Chief Architect pereyra@ca.ibm.com
  2. 2. IBM Cloud Legal Disclaimer Copyright © IBM Corporation 2018. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication, or disclosure restricted by GSA ADP Schedule Contract with IBM Corporation. THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON CURRENT THINKING REGARDING TRENDS AND DIRECTIONS, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. FUNCTION DESCRIBED HEREIN MAY NEVER BE DELIVERED BY IBM. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE. IBM, the IBM logo, ibm.com and Db2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
  3. 3. IBM Cloud IBM and Hortonworks Focus on extending data science and machine learning to analyze the data in Apache Hadoop systems. Consumers get the best in class open technology • #1 Rank by Gartner 2017 Data Science Magic Quadrant • Leader in SQL technology for Hadoop (www.tpc.org) • Leader in data and analytics solutions for Hybrid Cloud • Provides Data Science & Machine Learning • Leader in Hadoop Open Source Distribution • 1000+ customers and 2100+ ecosystem partners • Hadoop original architects, developers employed by Hortonworks • Provides Open Hadoop Data Platform Commitment to progressing advanced analytics through open source + 2017-18: 100s of deals closed together. Over 40 joint events across 17 countries. Integration with IHAH, Big SQL, DSX, Truata GDPR offering. Working on embedded offerings for IoT, Watson. Monthly Exec Interlocks. Weekly Sales Exec Meetings. Engineering & OM Interlocks.
  4. 4. IBM Cloud IBM and Hortonworks – High Value
  5. 5. IBM Cloud SQL Ad-hoc queries, data preparation Federation Operational with fast lookups High performance and scalability Integrated Spark and Machine Learning Complex SQL, Deep analytics, Many users Application portability Db2 Big SQL – For all DWH needs in Hadoop SQL-based Application Common SQL Engine Client driver Db2 Big SQL Data Storage SQL MPP Run-time DFS Hadoop
  6. 6. © 2017 IBM Corporation DB2 Warehouse (DPF) drives all nodes to read a table partitioned across multiple nodes. [Diagram: SELECT SUM (…) FROM some_table arrives at the DB2 Coordinator; node 1 through node n each compute Sum(..) over their own partition.]
  7. 7. DB2 Warehouse (DPF) drives all nodes to read a table partitioned across multiple nodes. [Diagram: node 1 through node n each hold a Sum(..); the DB2 Coordinator returns the Query Result.]
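
The scatter-gather pattern on slides 6 and 7 can be sketched in a few lines of Python. This is only an illustration of the idea, not DB2 code: the modulo-based partitioning, the function names, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict


def partition_rows(rows, num_nodes):
    """Spread (key, value) rows across nodes by key modulo node count.

    The modulo rule is a stand-in for a real distribution key (assumption).
    """
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[key % num_nodes].append(value)
    return [partitions[n] for n in range(num_nodes)]


def coordinator_sum(rows, num_nodes):
    """Each node sums its own partition; the coordinator adds the partial sums."""
    partitions = partition_rows(rows, num_nodes)
    partial_sums = [sum(p) for p in partitions]  # the per-node Sum(..)
    return sum(partial_sums), partial_sums


# Hypothetical table: 100 rows with integer keys 1..100 and value = key * 10
rows = [(i, i * 10) for i in range(1, 101)]
total, partials = coordinator_sum(rows, num_nodes=4)
print(total, partials)
```

The point of the sketch is that only one number per node travels back to the coordinator, regardless of how many rows each node scanned.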
  8. 8. Let’s extend the MPP processing concept now to Hadoop… [Diagram labels: Database Client; Big SQL (head); Big SQL Scheduler; NameNode; node 1 through node n; HDFS blocks A and B spread across the nodes; A + B = Complete Table.]
  9. 9. Big SQL Query Execution [Diagram labels: Database Client; Big SQL (head); Big SQL Scheduler; NameNode; node 1 through node n; HDFS blocks A and B spread across the nodes; A + B = Complete Table.]
  10. 10. IBM Cloud Db2 Big SQL V5.0 Core Capabilities. Panels: Applications; Core SQL Engine; Security; Administration; Performance; Federation; Integration. Capabilities listed: comprehensive ANSI SQL coverage; advanced cost-based optimizer; SQL-based RBAC; Ranger; roles; automatic workload management (WLM); automatic memory management; DSM, Ambari; query rewrite for optimized execution; Elastic Boost (logical worker nodes); SQL compatibility (Db2, Oracle, Netezza); federation; MQTs; Spark integration; SQL and NoSQL; structured & unstructured data. Application types: batch SQL (minutes to hours); interactive SQL (seconds to minutes); self-service / interactive BI (sub-second); data augmentation (Spark integration); application portability. Use cases listed: • ETL • Reporting • Data mining • Deep analytics • Reporting • Complex queries • BI tools: Cognos, Tableau, etc. • Ad-hoc, exploratory • BI tools: Cognos, Tableau, etc. • Query EDW • Join data • Use ML • Reuse applications • Reuse skills. Benchmarks: see www.tpc.org (TPC-H and TPC-DS) for Big SQL vs Impala vs Hive. Db2 Big SQL 5.0 is 2X faster than Hive LLAP with Tez (the gap increases rapidly with higher concurrency), 3X faster than Spark SQL (more optimal resource utilization and execution engine), and faster than Cloudera Impala 2.9.0 (scalability hindered by architecture).
  11. 11. IBM Cloud No Vendor Lock-in & Integration with Hive (5.0.3 Ecosystem Integration) • Db2 Big SQL preserves the open source foundation • Separation of compute and storage - data is part of Hadoop • Alternate execution engine to MapReduce • Hive and Db2 Big SQL share common metadata • Db2 Big SQL is optimized to push projections and predicate filters down to the storage I/O engine • Db2 Big SQL can invoke native Hive UDFs efficiently using Parameter Style Hive • Work on Hive ACID tables from Big SQL. [Diagram labels: SQL Execution Engines: Db2 Big SQL (IBM), Hive (open source); Hive Metastore (open source); Open Source Storage Model: CSV, Tab Delim., Parquet, ORC, Others …]
  12. 12. IBM Cloud New & Improved Reader/Writer for all File Formats (5.0.3) Db2 Big SQL 5.0.3 Java I/O: Java reader/writer for all file formats, covering the formats supported by the C++ reader and more. [Diagram labels: ORC; PARQUET; TEXT; RCFILE; AVRO; SEQUENCE FILE; HBASE; OBJECT STORE; CUSTOM SERDES; COMPLEX TYPES; VARBINARY; DATE STORED AS DATE; ANALYZE; EVERY OTHER FORMAT.] ▪ Improved performance on all tables / file types, particularly text files ▪ Product stability ▪ Reduced complexity in the Db2 Big SQL architecture ▪ Reduced cost − personnel and processes to maintain 2 I/O engines − customer PMRs and critical situations ▪ Reduced out-of-memory and instability issues ▪ Reduced resource allocation ▪ Better interoperability with open source. Enhanced stability and robustness for the product.
  13. 13. IBM Cloud Big SQL High Availability (Head Node) Scheduler Big SQL Master (Primary) Catalogs Scheduler Big SQL Master (Standby) Catalogs … HDFS Data Worker Node Worker Node Worker Node Database Logs + Data Shipping HDFS Data HDFS Data ▪ Big SQL master node high availability − Scheduler automatically restarted upon failure − Catalog changes (metadata) replicated real time to “warm” standby instance − Standby automatically takes over if the primary fails − Worker nodes automatically detect and re-connect to acting “primary” − Automatic Client Re-route (ACR): clients automatically re-connect to acting “primary”
  14. 14. IBM Cloud Query Performance at a Glance – vs Hive LLAP with Tez. Hadoop-DS @ 10 TB, 85 common queries. Performance, 1 stream: Db2 Big SQL 1.8X faster. Performance, 6 streams: Db2 Big SQL 2.3X faster. Resource utilization, 6 streams: 1.5x fewer CPU cycles used. Workload: scale factor 10 TB; file format ORC (ZLIB); concurrency 6 streams; query subset 85 queries. Stack: HDP 2.6.1; Db2 Big SQL 5.0.1; Hive 2.1 LLAP on Tez. Interesting facts: fastest query 5.4X faster (Db2 Big SQL: 1.5 sec, Hive: 8.1 sec); slowest query (query 67) 1.7X faster (Db2 Big SQL: 6827 sec, Hive: 11830 sec); Db2 Big SQL faster for 80% of queries run. [Charts: working compliant queries, 6 streams; elapsed times in hrs.]
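
The per-query speedups quoted on slide 14 follow directly from the elapsed times shown next to them. A minimal check, assuming speedup is defined as Hive elapsed time divided by Db2 Big SQL elapsed time (the function name is made up for this example):

```python
def speedup(baseline_secs, candidate_secs):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_secs / candidate_secs


# Elapsed times as quoted on the slide (seconds)
fastest = speedup(baseline_secs=8.1, candidate_secs=1.5)      # fastest query
slowest = speedup(baseline_secs=11830, candidate_secs=6827)   # query 67

print(f"fastest query: {fastest:.1f}x, slowest query: {slowest:.1f}x")
```

The headline 1.8X and 2.3X figures are totals over the 85-query workload and cannot be recomputed from the two queries above.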
  15. 15. IBM Cloud Combining Hadoop Technologies Not Mutually Exclusive. Hive, Db2 Big SQL & Spark SQL can co-exist and complement and leverage each other in a cluster Hive Db2 Big SQL Spark SQL Geospatial analytics ACID capabilities Fast ingest Federation Complex Queries High Concurrency Enterprise ready Application portability All open source files Machine learning Data exploration Simpler SQL Leveraged for metastore and UDFs IBM’s proprietary SQL engine for Hadoop IBM’s open source SQL engine for Hadoop
  16. 16. IBM Cloud PERFORMANCE Db2 Big SQL 5.0 is 3.2x faster than Spark SQL 2.1 (4 Concurrent Streams) SNAPSHOT OF 100TB HADOOP-DS I/O (vs Spark) Db2 Big SQL reads 12x less data Db2 Big SQL writes 30x less data COMPRESSION 60% SPACE SAVED WITH PARQUET AVERAGE CPU USAGE 76.4% MAX I/O THROUGHPUT READ 4.4 GB/SEC WRITE 2.8 GB/SEC WORKING QUERIES Leads performance metrics on high volumes of data and concurrent streams Blog on benchmark: https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/ Query Performance at a Glance – Db2 Big SQL & Spark SQL
  17. 17. IBM Cloud Materialized Query Tables (MQTs) on Hadoop Tables (5.0.3) [Flow: the user executes a complex query; Db2 Big SQL performs query rewrite and generates one plan against the base tables and one against the MQTs; the plans are compared and the best one is picked; query results are returned.] • MQTs have results pre-computed and stored • Enables sub-second response times for complex queries with aggregates and joins on dimension tables • The Db2 Big SQL optimizer automatically recognizes the MQTs for faster response
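
The plan comparison on slide 17 can be illustrated with a toy model. This is a sketch under loud assumptions: the sample rows, the "rows touched" cost measure, and every function name are invented for the example, and the real optimizer's cost model is far richer.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]


def build_mqt(rows):
    """Pre-compute SUM(amount) GROUP BY region, as an MQT would store it."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


sales_by_region_mqt = build_mqt(sales)


def plan_base(region):
    """Plan 1: scan every base row. Cost = rows scanned (toy measure)."""
    cost = len(sales)
    return cost, lambda: sum(a for r, a in sales if r == region)


def plan_mqt(region):
    """Plan 2: look up the pre-computed aggregate. Cost = one row touched."""
    cost = 1
    return cost, lambda: sales_by_region_mqt.get(region, 0)


def run_query(region):
    """Generate both plans, compare estimated costs, execute the cheaper one."""
    candidates = {"base": plan_base(region), "mqt": plan_mqt(region)}
    chosen = min(candidates, key=lambda name: candidates[name][0])
    _, execute = candidates[chosen]
    return chosen, execute()


chosen, result = run_query("east")
print(chosen, result)
```

Both plans return the same answer; the only difference is how much data each has to touch, which is why the rewrite is transparent to the user.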
  18. 18. IBM Cloud MQT Performance – Star Schema Benchmark Queries Quick metric queries Product insight queries Customer insight queries Using Scale Factor 1000, tested 13 queries that join 1 fact with 4 dimension tables 6 Billion Lineitems & 30 Million Customers rows Response time in secs Query performance on non-MQT table 5.0.3
  19. 19. IBM Cloud Federation – Virtualize Heterogeneous Data. Db2 Big SQL queries heterogeneous systems in a single query. Only SQL-on-Hadoop that virtualizes more than 10 different data sources: RDBMS, NoSQL, HDFS or Object Store. Transparent ▪ Appears to be one source ▪ Programmers don’t need to know how / where data is stored. High Function ▪ Full query support against all data ▪ Capabilities of sources as well. Autonomous ▪ Non-disruptive to data sources, existing applications, systems. High Performance ▪ Optimization of distributed queries. [Diagram labels: Db2 Big SQL; Hortonworks Data Platform (HDP); MS SQL Server; Netezza (PDA); Oracle; PostgreSQL; Teradata; DB2 LUW, Db2z, DB2 on i; Informix; WebHDFS; Object Store (S3); Hive; HBase; HDFS; NoSQL; ML Model.]
  20. 20. IBM Cloud Federation - Rich Capabilities that Bring Data Together ✓ Easily access information on demand ✓ Combine data in Hadoop with disparate sources to form a data lake ✓ Quickly extend your data warehouse by enriching it. Connect, Query, Monitor, Data Placement: ▪ Quick access to data value ▪ Common framework ▪ ODBC/JDBC ▪ Spark integration enables new data sources ▪ Connect all data sources in a single query ▪ Intelligent query routing ▪ Cost-based optimizer ▪ SQL pushdown ▪ Local data caching ▪ ANSI-compliant SQL ▪ Easily define & manage through a common UI ▪ Simple point & click to discover and query ▪ Monitor and visualize active queries ▪ Schema conversion when moving data ▪ Bulk data copy to Hadoop ▪ Filtered subsets of data. Think 2018 / 9071A - Live Data Analytics using Db2 Big SQL and Big Replicate / March 19, 2018 / © 2018 IBM Corporation
  21. 21. IBM Cloud Federation – Db2 Big SQL 5.0.3 Enhancements. Access data from new data sources: federation capabilities extend to data sources such as MySQL, PostgreSQL, MariaDB, and MongoDB, with enriched capabilities of query pushdown, best execution plan, secured access, and optimized execution time. Function mapping for best query results: when a function in one data source can be mapped to the same or a similar function in the remote data source, the results returned from the query pushdown are refined, rather than large result sets being returned as happens with no function mapping. Create a local cache for federated data: when federated data does not change frequently, it can be cached locally by creating Hadoop MQTs or regular MQTs, which gives better performance than accessing the remote data. Use a computational group to improve performance: a computational partition group enables dynamically redistributing nickname data to parallelize processing in Hadoop. This improves performance, especially when there is a large amount of nickname data or when queries get complex.
  22. 22. IBM Cloud Application Portability: Move Applications without Re-tooling. Offload data: data warehouse offload to Hadoop is now made easy: • Write once, run anywhere… • Easy porting of applications • Reuse skills of DBAs/developers who know ANSI SQL. Db2 Big SQL is the best platform for offloading Oracle Data Marts and Warehouses to Hadoop.
  23. 23. IBM Cloud Here’s why Db2 Big SQL can get you the best execution for complex queries and many concurrent users with high performance Db2 Big SQL - Query Execution Self Tuning Memory Manager World Class Cost Based Optimizer Query rewrite Advanced Statistics Native Row & Columnar stores Elastic Boost SQL Compatibility Hardened runtime Advanced Workload manager Materialized Query Tables Performance Concurrent users Complex query
  24. 24. IBM Cloud For more details check the blog: https://developer.ibm.com/hadoop/2017/11/07/ibm-big-sql-machine-learning-demo/ Operationalize Machine Learning Models using SQL
  25. 25. IBM Cloud Db2 Big SQL - Security
  26. 26. IBM Cloud Db2 Big SQL - Security and Governance (5.0.3) In addition to SQL-level security for row-level and column-level access control, Db2 Big SQL integrates with other components. Apache Ranger: with the Apache Ranger Db2 Big SQL plugin, you can set up policies for access to Db2 Big SQL tables: create, alter, analyze, load, truncate, drop, insert, select, update, and delete. Ranger Audit is supported. InfoSphere Metadata Asset Manager: Big SQL also integrates with Information Governance Catalog by enabling easy shared imports to InfoSphere Metadata Asset Manager, which allows you to analyze assets, utilize assets in jobs, and designate stewards for the assets.
  27. 27. IBM Cloud Db2 Big SQL - Advanced Workload Management Benefits • Identification and control of applications • Direct control of the execution environment • Detection and control of rogue queries – prevent bad queries from executing • Query concurrency – optimize query throughput • Advanced monitoring Avoid under-utilizing or over saturating the resources Resources Assigned: CPU I/O Memory Service Classes categorize work and set goals: Response time Velocity System Discretionary WLM Checks every 10 secs 5.0.3
  28. 28. IBM Cloud Custom WLM Db2 Big SQL - Advanced Workload Management 5.0.3
  29. 29. IBM Cloud Db2 Big SQL and HBase Support: Query streaming data in HBase with data at-rest • HBase is a columnar NoSQL data store for Hadoop • HBase offers a flexible schema, low latency key lookups and small key scans. However, HBase is approximately 3x to 5x slower than Hive for large table scans • HBase tables are updateable (ACID), and could be used for versioned dimensions (though native Db2 tables local to the Head Node are generally a better choice here) • Db2 Big SQL supports the full range of SQL operations against HBase tables • Combine streaming data in HBase with relational data in Hive using Big SQL. [Diagram labels: Db2 Big SQL; HBase; Hive; Analytical SQL; NZ; RDBMS; Offloaded Data; High Data Ingest Rates from External Applications.]
  30. 30. IBM Cloud Db2 Big SQL - Tables over S3 Object Storage Create Tables over Data residing in Object Store directly (no copy required into Hadoop) Once configured, Object Store tables work like any other table in Big SQL Benefits: No need to copy data into Hadoop first! Query data where it resides. Partitioning supported! Tradeoff: Expect reduced performance relative to HDFS local tables CREATE HADOOP TABLE staff ( … ) LOCATION 's3a://s3atables/staff'; LOAD FROM Object Store also supported!
  31. 31. IBM Cloud Db2 Big SQL - Tables over WebHDFS (5.0.3, Technical Preview) Transparently access data on any platform implementing WebHDFS. Example: Microsoft Azure Data Lake (ADL) service. Once set up, WebHDFS tables work like any other table in Big SQL. Limitations: WebHDFS via Knox is not supported. Performance is not well understood; reduced performance is expected. CREATE HADOOP TABLE staff ( … ) PARTITIONED BY (JOB VARCHAR(5)) LOCATION 'webhdfs://namenode.acme.com:50070/path/to/table/staff'; LOAD FROM WebHDFS is also supported! [Diagram labels: Db2 Big SQL; Local Hadoop Cluster; Remote Hadoop Cluster or WebHDFS-enabled Storage.]
  32. 32. IBM Cloud Db2 Big SQL – Integration with Yarn and Spark
  33. 33. IBM Cloud HDFS Big SQL Head Node Big SQL Worker Big SQL Worker Big SQL Worker Big SQL Worker Spark Exec. Spark Exec. Spark Exec. Spark Exec. = Fast data transfer over shared memory Db2 Big SQL – Deep Integration with Spark
  34. 34. IBM Cloud Exploit Db2 Big SQL from Spark. Requirements: db2jcc.jar must be added to the classpath of the Spark application (found in /home/bigsql/java/).
       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Row;
       import org.apache.spark.sql.SparkSession;
       …
       // spark is the application's SparkSession
       // Read a Big SQL table over JDBC into a DataFrame
       Dataset<Row> tableDf = spark.read()
           .format("jdbc")
           .option("driver", "com.ibm.db2.jcc.DB2Driver")
           .option("url", "jdbc:db2://server1.foo.bar.com:32051/BIGSQL")
           .option("user", "joe")
           .option("password", "joespwd")
           .option("dbtable", "myschema.mytable")
           .load();
       // Register the DataFrame as a temporary view and query it with Spark SQL
       tableDf.createOrReplaceTempView("myTable");
       Dataset<Row> queryDF = spark.sql("SELECT col2, col3 FROM myTable WHERE col1 > 100");
     Big SQL secures data for self-service data exploration. Used this way, Spark users are subject to Big SQL row/column security.
  35. 35. IBM Cloud Exploit Spark from Big SQL. Bring the best of Spark into Big SQL! • Machine learning • Cache remote tables (Spark has a rich library of connectors) • Graph processing • General in-memory processing. Example: Spark schema discovery for JSON.
       SELECT doc.*
         FROM TABLE(
           SYSHADOOP.EXECSPARK(
             class => 'DataSource',
             load => 'hdfs://host.port.com:8020/user/bigsql/demo.json')
         ) AS doc
        WHERE doc.language = 'English';
     The structure of the JSON document is determined at run time.
  36. 36. © 2018 IBM Corporation Big SQL Elastic Boost – new in v5.x (and 4.2.5): Multiple Logical Workers per Host. [Diagram labels: Users; Big SQL Head; several Big SQL Worker containers on each host; HDFS. Legend: Container; Big SQL Components.]
  37. 37. Big SQL Elastic Boost – new in v5.x (and 4.2.5): Logical workers allow multiple YARN containers per host. The Big SQL Slider package implements the Slider Client APIs. [Diagram labels: Users; Big SQL Head; Slider Client; YARN Resource Manager & Scheduler; Big SQL AM; NM on each host with Big SQL Worker containers; HDFS. Legend: Container; YARN components; Slider components; Big SQL components.]
  38. 38. Elastic Big SQL Capacity • Remember that with Big SQL, 1 container (1 worker) can service hundreds of concurrent SQL jobs. It does so by being a long-running service that stays resident in memory. So what does 50% mean?
  39. 39. What does 50% mean? (default post enablement) ▪ YARN (not Big SQL) decides where containers/workers are started. These situations (and others) are possible. ▪ As the capacity target increases above 50%, the opportunity for skew is reduced. [Diagram: four example worker placements across hosts, labeled Ideal, Minor Skew, Significant Skew, and Significant Skew.]
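
One way to see why a higher capacity target leaves less room for skew is to count how many hosts could end up with no worker at all in the worst placement. The sketch below is an illustration only: the 4-host, 4-slots-per-host cluster and the "idle hosts" measure are assumptions made for the example, not figures from the slides.

```python
import math


def workers_for_capacity(hosts, slots_per_host, capacity):
    """Number of workers started for a capacity target between 0 and 1."""
    return math.ceil(hosts * slots_per_host * capacity)


def worst_case_idle_hosts(hosts, slots_per_host, capacity):
    """Hosts left with no worker if placement packs workers as tightly as possible."""
    workers = workers_for_capacity(hosts, slots_per_host, capacity)
    busy_hosts = math.ceil(workers / slots_per_host)
    return hosts - busy_hosts


# Hypothetical cluster: 4 hosts, up to 4 logical workers per host
for capacity in (0.5, 0.75, 1.0):
    print(capacity, worst_case_idle_hosts(4, 4, capacity))
```

Under these assumptions a 50% target can leave half the hosts idle in the worst case, while a 100% target leaves none, which matches the direction of the claim on slide 39.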
  40. 40. IBM Cloud Db2 Big SQL w/ Elastic Boost - INSERT Performance For both 1 and 10 TB TPC-DS dataset 2 Workers/Node: 1.6x speedup 4 Workers/Node: 2.2x speedup In each scenario, the same TOTAL CPU/memory is used INSERT…SELECT performance with Elastic Boost # Workers / Node
  41. 41. IBM Cloud Db2 Big SQL 5.0 – How it fits with Hortonworks Big SQL deploys on top of Hortonworks Data Platform(HDP) Capability Apache Hadoop and Apache Spark and Ecosystem IBM Big SQL Support Community Support Hortonworks Data Platform for IBM Support Offering for HDP includes support for Big SQL BUY Hortonworks Data Platform