Pivotal Big Data Suite: A Technical Overview

How and why are companies like Uber, Netflix and Airbnb so successful? What do you need to do to become successful in the same way, and how can Pivotal help you with that?

Speaker: Les Klein, EMEA CTO Data, Pivotal

Published in: Technology

  1. TECHNICAL OVERVIEW: Pivotal Big Data Suite. Les Klein, Field CTO Data, Pivotal. @LesKlein #PivotalForum #Istanbul #BigData #Analytics
  2. Forward-Looking Statements. This presentation contains “forward-looking statements” as defined under the Federal Securities Laws. Actual results could differ materially from those projected in the forward-looking statements as a result of certain risk factors, including but not limited to: (i) adverse changes in general economic or market conditions; (ii) delays or reductions in information technology spending; (iii) the relative and varying rates of product price and component cost declines and the volume and mixture of product and services revenues; (iv) competitive factors, including but not limited to pricing pressures and new product introductions; (v) component and product quality and availability; (vi) fluctuations in VMware, Inc.’s operating results and risks associated with trading of VMware stock; (vii) the transition to new products, the uncertainty of customer acceptance of new product offerings and rapid technological and market change; (viii) risks associated with managing the growth of our business, including risks associated with acquisitions and investments and the challenges and costs of integration, restructuring and achieving anticipated synergies; (ix) the ability to attract and retain highly qualified employees; (x) insufficient, excess or obsolete inventory; (xi) fluctuating currency exchange rates; (xii) threats and other disruptions to our secure data centers and networks; (xiii) our ability to protect our proprietary technology; (xiv) war or acts of terrorism; and (xv) other one-time events and other important factors disclosed previously and from time to time in the filings of EMC Corporation, the parent company of Pivotal, with the U.S. Securities and Exchange Commission. EMC and Pivotal disclaim any obligation to update any such forward-looking statements after the date of this release.
  3. Pivotal Big Data Suite: open source data management portfolio. Complete platform • Hadoop Native SQL • Deployment options • Based on open source • Flexible licensing • Advanced data services. PIVOTAL GREENPLUM DATABASE: data warehouse database based on the open source Greenplum Database. PIVOTAL HDB: open source analytical database for Apache Hadoop, based on Apache HAWQ. PIVOTAL GEMFIRE: open source application and transaction data grid, based on Apache Geode. © 2016 Pivotal Software, Inc. All rights reserved.
  4. Great software companies leverage Big Data to fundamentally change the consumer experience and pioneer entirely new business models
  5. CLOUD NATIVE SOFTWARE IS CHANGING INDUSTRIES: Data is Fueling Software. $4BN Financial Services • $26BN Hospitality • $50BN Transportation • $54BN Entertainment • $30BN Automotive • $3.2BN Industrial Products
  6. Disruptors Use a LOT of Data: hundreds of thousands of “trip” events each day • 400+ billion viewing-related events per day • five billion training data points for the Price Tip feature
  7. Data manifests as features in an app: “We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked” • “The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.” • “Over 75% of what people watch come from our recommendations”
  8. How are they accomplishing this? (Data) Microservices: loosely coupled services architecture, bounded by context • Cloud-Native Platforms: enabling continuous delivery & automated operations • Open Source Database Innovation: extreme scale & performance advantages, built for the cloud • Machine Learning: use of predictive analytics to build smart apps
  9. These companies… • Release new features in minutes, multiple times a day • Support a microservices architecture • Consume a wide range of data sources and protocols • Store and analyze all their data • Update algorithms and predictive models daily • Continuously ask lots of questions of their data • Modify data pipelines and add processing steps daily
  10. …but most enterprises are not quite there yet • Applications scalability limited by databases • Real-time data insights limited by disconnected OLTP and OLAP systems • Data services are not ready for cloud platforms. [Diagram labels: App 1, App 2, App 3; Bottleneck; Transactional Database; ETL / ELT Batches; Δt; TRANSACTIONS; ANALYTICS; Analytic Database; Continuous Delivery]
  11. Modern Cloud-Native Data Architecture. [Diagram labels: Stream + Batch Processing; Programming + Operating Model; Cloud-Native Platform; Microservices Framework; Platform Runtime; Hadoop; DW; Spark; Microservices and Polyglot Persistence; IMDG; K/V Store; Relational DB; Big Data & Machine Learning; Cloud Infrastructure]
  12. New pressures are breaking fragile systems • Applications scalability limited by databases • Real-time data insights limited by disconnected OLTP and OLAP systems • Data services are not ready for cloud platforms. [Diagram labels: App 1, App 2, App 3; Bottleneck; Transactional Database; ETL / ELT Batches; Δt; TRANSACTIONS; ANALYTICS; Analytic Database; Continuous Delivery]
  13. Apps scalability limited by scalability of databases: scale-out applications vs scale-up databases. DB scalability limitations are aggravated by additional devices, clients and apps. [Diagram labels: Existing Applications (App 1, App 2, App 3); New devices and clients; New cloud native scalable data apps; Bottleneck; Transactional Database]
  14. GemFire: cloud-scale high performance transactional data • Horizontally scalable • Ultra-fast, low-latency in-memory transactions • Fully configurable data consistency • Reliable eventing and notification model • Highly available, auto-healing • Inter-cluster WAN replication. [Diagram labels: Custom Apps (App 1, App 2); Pivotal GemFire; Transactional Native API; REST / HTTP; Push Updates]
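The "push updates" and "reliable eventing and notification model" items on this slide can be pictured with a toy in-memory key/value region that notifies subscribers on every write. This is a plain-Python sketch of the idea only, not the GemFire API; `Region`, `subscribe`, `put` and `get` are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, List

# A listener receives (key, old_value, new_value) for every write.
Listener = Callable[[str, Any, Any], None]


class Region:
    """Toy in-memory key/value store that pushes change events to listeners."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        """Register a callback that is invoked on every put()."""
        self._listeners.append(listener)

    def put(self, key: str, value: Any) -> None:
        """Store the value in memory, then push the change to all listeners."""
        old = self._data.get(key)
        self._data[key] = value
        for listener in self._listeners:
            listener(key, old, value)

    def get(self, key: str) -> Any:
        return self._data.get(key)


# Usage: an app subscribes once and is notified instead of polling.
events = []
region = Region()
region.subscribe(lambda key, old, new: events.append((key, old, new)))
region.put("trip:42", "requested")
region.put("trip:42", "en_route")
```

A real data grid adds partitioning, replication and delivery guarantees on top of this pattern; the sketch only shows why an event-driven store lets apps react to writes without issuing repeated queries.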
  15. Batch-mode latency prevents real-time analysis • Applications scalability limited by databases • Real-time data insights limited by disconnected OLTP and OLAP systems • Data services are not ready for cloud platforms. [Diagram labels: App 1, App 2, App 3; Bottleneck; Transactional Database; ETL / ELT Batches; Δt; TRANSACTIONS; ANALYTICS; Analytic Database; Continuous Delivery]
  16. Real-time data analytics is limited by data integration batches: overnight ETL / ELT jobs expose data that is already outdated • Analytical processes don’t have access to the latest data • ETL/ELT processes are expensive and hard to maintain • Batch process windows limit data scalability. [Diagram labels: App 1, App 2, App 3; Transactional Database; ETL / ELT Batches; Δt; TRANSACTIONS; ANALYTICS; MPP; Data Temperature: Hot, Cold]
  17. Operationalized data insights need an event-driven architecture: a combination of SQL analytics and NoSQL event-driven transactions is needed • Data insights must be immediately pushed to applications • Apps should be able to react in real time to analytical findings. [Diagram labels: App 1, App 2, App 3; Transactional Database; TRANSACTIONS; ANALYTICS; MPP; Machine Learning; Advanced Analytics; ANSI SQL; APIs / NoSQL; Data Insights]
  18. GemFire and GPDB: Big Data meets Fast Data. [Diagram labels: Custom Apps (App 1, App 2); Pivotal GemFire (Transactional Native API, REST / HTTP, Push Updates); Pivotal Greenplum (Analytical ANSI SQL; data science, analytics & ML); Parallel Configurable Data Load; Transactional data: write behind; Analytical data to cache; Data Temperature: Warm, Hot]
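The "write behind" path between the in-memory grid and the analytical database can be sketched as follows: writes are acknowledged from memory, queued, and flushed to the slower store in batches. This is an assumption-laden toy, not GemFire or Greenplum code; `WriteBehindCache`, `batch_size` and the `warehouse` list standing in for the analytic database are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class WriteBehindCache:
    """Toy cache that answers from memory and flushes writes in batches."""

    def __init__(self, batch_size: int = 2) -> None:
        self.cache: Dict[str, Any] = {}
        self.pending: Deque[Tuple[str, Any]] = deque()
        self.warehouse: List[Tuple[str, Any]] = []  # stand-in for the analytic DB
        self.batch_size = batch_size

    def put(self, key: str, value: Any) -> None:
        """Update memory immediately; defer the warehouse write."""
        self.cache[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Drain the queue into the warehouse in arrival order."""
        while self.pending:
            self.warehouse.append(self.pending.popleft())


wb = WriteBehindCache(batch_size=2)
wb.put("order:1", 10.0)
pending_after_first = len(wb.pending)  # first write is still only in memory
wb.put("order:2", 25.5)                # second write reaches the batch size
```

The trade-off is the usual one for write-behind designs: low write latency for the application, at the cost of a short window in which the analytical store lags the cache.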
  19. …but most enterprises are not quite there yet • Applications scalability limited by databases • Real-time data insights limited by disconnected OLTP and OLAP systems • Data services are not ready for cloud platforms. [Diagram labels: App 1, App 2, App 3; Bottleneck; Transactional Database; ETL / ELT Batches; Δt; TRANSACTIONS; ANALYTICS; Analytic Database; Continuous Delivery]
  20. Cloud-native apps are better suited to NoSQL: enabling fast and scalable event-driven data services • Unidirectional, request-response SQL vs bidirectional, event-driven APIs • Monolithic apps needed complex, schema-based SQL databases • Microservices need much simpler schemas, but much better scalability. [Diagram labels: SQL; API]
  21. GemFire for Pivotal Cloud Foundry: lightning-fast in-memory persistence for cloud native apps • One-click provisioning • Pre-packaged configuration • Embedded monitoring by Pulse • Auto application binding • Multi-cloud support • Reliable data replication between PCF sites. [Diagram labels: Pivotal Cloud Foundry; Pivotal GemFire; Click to Deploy]
  22. Next-generation databases must keep up with cloud native apps. Can your database do all of this? GemFire IMDG DOES. • Cloud-ready, infrastructure-agnostic • Horizontal scalability • Automatic fail-over • Reliable eventing model • Multi-site high availability • Seamless integration with analytical databases. [Diagram labels: App 1, App 2, App 3]
  23. Pivotal Greenplum: World’s First Open Source Massively Parallel Data Warehouse
  24. Greenplum Database Mission & Strategy • Relational database system for big data and data warehousing • Mission-critical, system-of-record product with supporting tools and ecosystem • Fully open source with a global community of developers and users • Large, industrial-focused system • PostgreSQL based • Multi-platform technology: on-premise, cloud, enterprise appliance • It’s a software product
  25. Highlighted Greenplum Successes. Government: tax & benefits fraud detection; economic statistics research. Financial Services: wealth management data science and product development for Commercial Banking; risk and trade repositories reporting; 401K providers analytics on investment choices. Pharmaceutical: vaccine potency prediction based on manufacturing sensors. IoT: predictive maintenance for auto manufacturer, industrial equipment and government agencies. Semiconductor: fab sensor analytics and reporting. Cyber Security & Surveillance: internal email and communication surveillance and reporting; corporate network anomalous behavior and intrusion detection. Oil & Gas: drilling equipment predictive maintenance. Communications: mobile telephone company enterprise data warehouse; network performance and availability analytics. Retail: customer purchases analytics. Transportation: airlines loyalty program analytics.
  26. Greenplum Overview (Greenplum DB layers: SYSTEM ACCESS, DATA PROCESSING, DATA STORAGE). CLIENT ACCESS: PSQL, ODBC, JDBC • BULK LOAD/UNLOAD: GPLoad, GPFdist, External Tables, GPHDFS • ADMIN TOOLS: GP Perfmon, GP Support • 3rd PARTY TOOLS: compatible with industry standard BI & ETL tools • SQL STANDARD COMPLIANCE • MASSIVELY PARALLEL PROCESSING (MPP) • BIG DATA QUERY OPTIMIZER • IN-DATABASE PROGRAMMING LANGUAGES: PL/pgSQL, PL/Python, PL/R, PL/Perl, PL/Java, PL/C • IN-DATABASE ANALYTICS & EXTENSIONS: MADlib, PostGIS, PGCrypto • POLYMORPHIC STORAGE: HEAP, Append Only, Columnar, External, Compression • MULTI-VERSION CONCURRENCY CONTROL (MVCC) • FULLY ACID COMPLIANT TRANSACTIONAL DATABASE • INDEXES: B-Tree, Bitmap, GiST
  27. PostgreSQL Heritage; Greenplum Open Source Launch • Widely used • Open source • PostgreSQL License • Enterprise-class open source relational engine
  28. MPP Shared Nothing Architecture: flexible framework for processing large datasets • Master Host and Standby Master Host: the master coordinates work with the Segment Hosts • Segment Host with one or more Segment Instances: Segment Instances process queries in parallel • Segment Hosts have their own CPU, disk and memory (shared nothing) • High-speed interconnect for continuous pipelining of data processing. [Diagram labels: SQL; Master Host; Interconnect; Segment Hosts node1, node2, node3 … nodeN, each with several Segment Instances; Greenplum DB]
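The master/segment split described on this slide can be sketched as a scatter/gather aggregate: each segment computes a partial result over its own rows and the master combines the partials. The code below is an illustrative toy that assumes three in-process "segments" and uses threads in place of separate hosts; `segment_partial` and `master_avg` are hypothetical names, not Greenplum internals.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Each inner list stands in for the rows stored on one segment instance.
segments: List[List[float]] = [
    [10.0, 20.0, 30.0],
    [5.0, 15.0],
    [40.0],
]


def segment_partial(rows: List[float]) -> Tuple[float, int]:
    """Work done locally on one segment: partial sum and row count."""
    return sum(rows), len(rows)


def master_avg(segs: List[List[float]]) -> float:
    """Master: dispatch to all segments in parallel, then combine partials."""
    with ThreadPoolExecutor(max_workers=len(segs)) as pool:
        partials = list(pool.map(segment_partial, segs))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


average = master_avg(segments)
```

Only the small (sum, count) pairs travel back to the master, which is why an average over all rows does not require moving the rows themselves.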
  29. Loading: Massively-Parallel Ingest. Extreme speed and immediate usability from files, ETL, Hadoop & S3. Fast Parallel Load & Unload: no Master Node bottleneck; 10+ TB/hour per rack; linear scalability. Low Latency: data immediately available; no intermediate stores; no data “reorganization”. Load/Unload To & From: file systems; any ETL product; Hadoop & Amazon S3. [Diagram labels: External Sources (loading, streaming, etc.); ETL; File Systems; Network Interconnect; Master Servers (query planning & dispatch); Segment Servers (query processing & data storage); Greenplum DB]
  30. Polymorphic Storage™: User Definable Storage Layout. Column-oriented: • Columnar storage compresses better • Optimized for retrieving a subset of the columns when querying • Compression can be set differently per column: gzip (1-9), quicklz, delta, RLE. Row-oriented: • Row-oriented is faster when returning all columns • HEAP for many updates and deletes • Use indexes for drill-through queries. External HDFS or S3: • Keep less-accessed partitions on external storage and seamlessly query all data • All major Hadoop distributions • Amazon S3 storage • Others in development. [Diagram labels: TABLE ‘SALES’ with partitions Jun, Jul, Aug, Sep, Oct, Nov, Dec, Year - 1, Year - 2]
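One reason "columnar storage compresses better" is that values from a single column tend to repeat, so a scheme such as run-length encoding (RLE, one of the options listed on the slide) collapses them. The sketch below is a generic RLE illustration in plain Python with made-up sample rows; it does not reproduce Greenplum's on-disk format.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(column: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs: List[Tuple[str, int]]) -> List[str]:
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical SALES rows: (month, region, amount).
rows = [("Jun", "EU", 10), ("Jun", "EU", 12), ("Jun", "US", 7), ("Jul", "US", 9)]

# Column-oriented view: all month values are stored together, so runs are long.
months = [row[0] for row in rows]
encoded = rle_encode(months)
```

Stored row by row, the same month values would be interleaved with regions and amounts, and no such runs would form.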
  31. Partitions and External Partitions • Hash distribution to evenly spread data across all segment instances • Range partition within a segment instance to minimize scan work • Partitioned tables: support for external tables as a partition (readable external table; host file system, NFS mount, HDFS or Amazon S3). [Diagram labels: Parent table; Jan 2013; Jan 2014; Feb 2014; ...; Dec 2014; RET; External; Greenplum DB]
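The two mechanisms on this slide can be combined in a small sketch: a hash of the distribution key picks the segment, and a per-month range partition inside each segment lets a query skip partitions it does not need. The segment count, the `zlib.crc32` hash and all function names are assumptions for illustration; Greenplum's actual hash function and catalog layout are not shown here.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple

NUM_SEGMENTS = 4  # hypothetical cluster size
Row = Tuple[str, str, float]  # (customer_id, month as "YYYY-MM", amount)

# storage[segment_id][partition_name] -> rows held by that partition
storage: DefaultDict[int, DefaultDict[str, List[Row]]] = defaultdict(
    lambda: defaultdict(list)
)


def segment_for(customer_id: str) -> int:
    """Hash distribution: a stable hash of the key selects the segment."""
    return zlib.crc32(customer_id.encode("utf-8")) % NUM_SEGMENTS


def insert(row: Row) -> None:
    """Route the row to its segment, then to that segment's month partition."""
    customer_id, month, _ = row
    storage[segment_for(customer_id)][month].append(row)


def total_for_month(month: str) -> Tuple[float, int]:
    """Scan only the matching partition on each segment (partition pruning)."""
    total, scanned = 0.0, 0
    for partitions in storage.values():
        for _, _, amount in partitions.get(month, []):
            total += amount
            scanned += 1
    return total, scanned


for r in [
    ("c1", "2014-01", 10.0),
    ("c2", "2014-01", 5.0),
    ("c3", "2014-02", 7.0),
    ("c1", "2014-12", 1.0),
]:
    insert(r)

total, scanned = total_for_month("2014-01")
```

Of the four rows stored, the January query touches only the two in the `2014-01` partitions, which is the "minimize scan work" effect the slide refers to.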
  32. Hybrid Queries: Pivotal External Tables (Roadmap) • Readable Ext-Table MVP • Readable Gzip Files • Writable Ext-Table • Investigation: Enhanced Security/Roles • Investigation: Additional File Formats. S3 External Tables. GemFire External Tables • Hi Speed Ingestion • Hi Concurrency Query Cache. GPHDFS
  33. Greenplum Database Features for Data Scientists* • Window functions: perform calculations across a set of table rows that are somehow related to the current row • Analytics extensions: in-database machine learning at scale using MADlib • Procedural language extensions: extended functionality using non-SQL programming languages and packages (e.g. Python and R) • Client access: ODBC and JDBC access to support connections to 3rd party tools. (* Only a subset of Greenplum Database features)
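To make "calculations across a set of table rows that are somehow related to the current row" concrete, the sketch below computes a running total per customer, which is what the standard SQL expression `SUM(amount) OVER (PARTITION BY customer ORDER BY day)` would return. It is a pure-Python emulation on made-up rows, intended only to show the semantics; in the database the same result comes from a single query.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List

rows: List[Dict] = [
    {"customer": "a", "day": 1, "amount": 10},
    {"customer": "b", "day": 1, "amount": 5},
    {"customer": "a", "day": 2, "amount": 7},
    {"customer": "b", "day": 3, "amount": 1},
]


def running_total(data: List[Dict]) -> List[Dict]:
    """Emulate SUM(amount) OVER (PARTITION BY customer ORDER BY day)."""
    out = []
    # Sort so each customer's rows are contiguous and ordered by day.
    ordered = sorted(data, key=itemgetter("customer", "day"))
    for _, group in groupby(ordered, key=itemgetter("customer")):
        total = 0
        for row in group:
            total += row["amount"]  # accumulate within the partition only
            out.append({**row, "running_total": total})
    return out


result = running_total(rows)
```

Unlike a `GROUP BY`, every input row is kept in the output; the window only adds a column derived from the row's neighbours.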
  34. Procedural Languages • User Defined Types • User Defined Functions • User Defined Aggregates • Import of libraries from open source
  35. Scalable, In-Database Machine Learning • Open source: https://github.com/apache/incubator-madlib • Downloads and docs: http://madlib.incubator.apache.org/ • Wiki: https://cwiki.apache.org/confluence/display/MADLIB/
  36. Functions (April 2016). Linear Systems: • Sparse and Dense Solvers • Linear Algebra. Matrix Factorization: • Singular Value Decomposition (SVD) • Low Rank. Generalized Linear Models: • Linear Regression • Logistic Regression • Multinomial Logistic Regression • Ordinal Regression • Cox Proportional Hazards Regression • Elastic Net Regularization • Robust Variance (Huber-White), Clustered Variance, Marginal Effects. Other Machine Learning Algorithms: • Principal Component Analysis (PCA) • Association Rules (Apriori) • Topic Modeling (Parallel LDA) • Decision Trees • Random Forest • Support Vector Machines (SVM) • Conditional Random Field (CRF) • Clustering (K-means) • Cross Validation • Naïve Bayes. Descriptive Statistics: Sketch-Based Estimators (• CountMin (Cormode-Muthukrishnan) • FM (Flajolet-Martin) • MFV (Most Frequent Values)); Correlation and Covariance; Summary. Utility Modules: Array and Matrix Operations; Sparse Vectors; Random Sampling; Probability Functions; Data Preparation; PMML Export; Conjugate Gradient; Stemming. Inferential Statistics: Hypothesis Tests. Time Series: • ARIMA. Path Functions: • Operations on Pattern Matches
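Among the sketch-based estimators listed, CountMin is easy to illustrate: it keeps a small table of counters, hashes each item into one counter per row, and reports the minimum of those counters as the frequency estimate. The class below is a generic textbook-style sketch in plain Python, not MADlib's implementation; the width, depth and SHA-256-based hashing are assumptions chosen for simplicity.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Approximate frequency counter; estimates never fall below the true count."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        """Derive one hash function per row by salting the item with the row id."""
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        """Take the minimum across rows to limit the effect of collisions."""
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


cms = CountMinSketch()
for _ in range(5):
    cms.add("uberPOOL")
cms.add("uberX", 2)
```

Because two sketches of the same shape can be merged by adding their tables cell by cell, this kind of structure suits a design where each segment summarizes its own rows and the results are combined afterwards.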
  37. GPDB Geospatial. Current key features: • Points, lines, polygons, perimeter, area, intersection, contains, distance, long/lat, spatial indexes & bounding boxes • Round-earth calculations • Ability to store geospatial data and query it with joins and operators • Raster image processing
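As a small example of a "round-earth calculation", the haversine formula gives the great-circle distance between two longitude/latitude points on a sphere. This is a generic sketch that assumes a spherical Earth with a mean radius of 6371.0088 km; it is not a description of how PostGIS or Greenplum compute distances internally, and `haversine_km` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
```

A flat-plane distance on raw longitude/latitude values would drift further from this result as the points move away from the equator, which is why round-earth support matters for location queries.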
  38. Pivotal HDB: Hadoop Native SQL Database
  40. Making the Hadoop Data Lake More Consumable: enabling data science and machine learning at scale. 1) Important people and tools are cut off because of SQL completeness or performance. 2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale. 3) There are multiple data sets and formats within Hadoop. [Diagram labels: SQL App; BUSINESS ANALYSTS; DATA SCIENTISTS; DATA LAKE; Hive, HBase, etc.]
  41. Making the Hadoop Data Lake More Consumable: as the lingua franca of analytics, SQL can't be ignored. Neither can performance. 1) Important people and tools are cut off because of SQL completeness or performance. 2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale. 3) There are multiple data sets and formats within Hadoop. [Diagram labels: SQL App; BUSINESS ANALYSTS; DATA SCIENTISTS; DATA LAKE; Hive, HBase, etc.]
  42. Hadoop data lakes sit underutilized: lack of interactive, ANSI SQL capabilities inhibits adoption and value • Producing complex queries, large joins, interactive queries • Existing investments in visualization and BI tools • Large population of users with SQL skills. [Diagram labels: DATA LAKE; DATA SCIENTISTS; BUSINESS ANALYSTS; SQL App]
  43. HDB: The Hadoop Native SQL Database. High performance, interactive SQL queries on Hadoop • Highly efficient MPP (massively parallel processing) • Low latency • Petabyte scalability • ACID transaction support • SQL-92, 99, 2003 compatibility • Advanced cost-based optimizer. [Diagram labels: DATA LAKE; SQL App; BUSINESS ANALYSTS; DATA SCIENTISTS]
  44. Making the Hadoop Data Lake More Consumable: integrate SQL and data science tools into an interactive, operationalized environment. 1) Important people and tools are cut off because of SQL completeness or performance. 2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale. 3) There are multiple data sets and formats within Hadoop. [Diagram labels: SQL App; BUSINESS ANALYSTS; DATA SCIENTISTS; DATA LAKE; Hive, HBase, etc.]
  45. Predictive analytics not scaling with Python or R: using traditional, single-node Python or R for analytics means using subsets because of the lack of parallelization. Implications: • Time-consuming data movement • Working with small sample sizes requires extra testing cycles against larger data sets • Slow feature generation limits algorithm development. [Diagram labels: DATA LAKE; SAMPLE 1; SAMPLE 2; … SAMPLE n]
  46. In-database analytics speeds predictive modeling. Apache™ MADlib® (incubating) is an open-source library for scalable in-database analytics: scale-out mathematical, statistical and machine learning methods for structured and unstructured data • SQL-based • Analyze without sampling • Open source • Runs on HDB, Greenplum, and Postgres • Complements support for procedural languages: PL/R, PL/Python, PL/Java. [Diagram labels: Train a model...; Predict for new data...; DATA LAKE]
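One way to see how "analyze without sampling" can work on a parallel database: for simple linear regression, each segment only needs to return a handful of sums over its own rows, and the master merges those sums and solves for the coefficients. The sketch below shows that idea on made-up data in plain Python; it is an illustration of mergeable sufficient statistics under stated assumptions, not MADlib's code, and `segment_stats`, `merge`, `fit` and `predict` are hypothetical names.

```python
from functools import reduce
from typing import List, Tuple

Point = Tuple[float, float]                      # (x, y)
Stats = Tuple[float, float, float, float, float]  # n, sum x, sum y, sum x*x, sum x*y


def segment_stats(rows: List[Point]) -> Stats:
    """Computed locally on each segment over all of its rows (no sampling)."""
    return (
        float(len(rows)),
        sum(x for x, _ in rows),
        sum(y for _, y in rows),
        sum(x * x for x, _ in rows),
        sum(x * y for x, y in rows),
    )


def merge(a: Stats, b: Stats) -> Stats:
    """Combining two segments' statistics is element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))  # type: ignore[return-value]


def fit(stats: Stats) -> Tuple[float, float]:
    """Train: ordinary least squares for y = slope * x + intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def predict(model: Tuple[float, float], x: float) -> float:
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data spread over three segments; every point lies on y = 2x + 1.
segments = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]
model = fit(reduce(merge, (segment_stats(rows) for rows in segments)))
prediction = predict(model, 6.0)
```

The model is fitted on every row, yet only five numbers per segment cross the network, in contrast to the single-node workflow that pulls samples out of the data lake.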
  47. Making the Hadoop Data Lake More Consumable: overcome complexity. 1) Important people and tools are cut off because of SQL completeness or performance. 2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale. 3) There are multiple data sets and formats within Hadoop. [Diagram labels: SQL App; BUSINESS ANALYSTS; DATA SCIENTISTS; DATA LAKE; Hive, HBase, etc.]
  48. Simplifying the data lake with data federation: HDB’s Pivotal eXtension Framework (PXF) and HCatalog integration • Enables connectivity between Pivotal HDB and other stores (Hive, HBase, HDFS files) • Provides an extensible framework to add support for custom services • Low latency on large data sets • Considers cost model of federated sources. [Diagram labels: Schema Read; HDFS; DATA LAKE; HCatalog; CSV; TXT; Avro; Custom Extensions]
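The "extensible framework to add support for custom services" can be pictured as a registry of format readers behind one query function, with column types applied while reading rather than at load time. The sketch below is a hypothetical plain-Python illustration and does not use or describe the PXF API; `register`, `query`, the `jsonl` format and the sample schema are all assumptions.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, List

Reader = Callable[[str], Iterable[Dict[str, str]]]
READERS: Dict[str, Reader] = {}


def register(fmt: str) -> Callable[[Reader], Reader]:
    """Decorator that plugs a new format reader into the registry."""
    def wrap(fn: Reader) -> Reader:
        READERS[fmt] = fn
        return fn
    return wrap


@register("csv")
def read_csv(text: str) -> Iterable[Dict[str, str]]:
    return csv.DictReader(io.StringIO(text))


@register("jsonl")  # a "custom extension" added without touching query()
def read_jsonl(text: str) -> Iterable[Dict[str, str]]:
    return (json.loads(line) for line in text.splitlines() if line.strip())


def query(fmt: str, text: str, schema: Dict[str, Callable]) -> List[Dict]:
    """Apply the schema (column name -> type) at read time."""
    return [
        {col: cast(record[col]) for col, cast in schema.items()}
        for record in READERS[fmt](text)
    ]


schema = {"id": int, "amount": float}
csv_rows = query("csv", "id,amount\n1,9.5\n2,3.0\n", schema)
jsonl_rows = query("jsonl", '{"id": "3", "amount": "4.25"}\n', schema)
```

The caller sees the same typed rows regardless of the underlying format, which is the consumer-facing benefit of federation that the slide describes.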
  49. HDB as part of an architecture: Next Likely Purchase. Providing information in context with the right architecture and the right algorithms. [Diagram labels: CUSTOMER APP; INTERNAL APP; PURCHASE; NEXT OFFER; REAL-TIME VIEW OF TRANSACTIONS AND OFFERS; REPORTS]
  50. HDB as part of an architecture: Next Likely Purchase. Providing information in context with the right architecture and the right algorithms. 1. Ingest, transform, and land data into HDFS. 2. Score streaming data and serve to application. [Diagram labels: CUSTOMER APP; INTERNAL APP; PURCHASE; NEXT OFFER; REAL-TIME VIEW OF TRANSACTIONS AND OFFERS; TRANSACTIONS; PMML; Model Creation & Training; HDB Tables; HDFS Staging; DATA SCIENCE & AD HOC QUERIES; REPORTS]
  51. Pivotal HDB Advantages. Advanced Analytics Performance: exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance. Most Complete Language Compliance: higher degree of SQL compatibility (SQL-92, 99, 2003, OLAP); leverage existing SQL skills. Advanced Query Optimizer: maximize performance and do advanced queries with confidence. Elastic Architecture for Scalability: scale up/down or scale in/out; expand/shrink clusters on the fly. Integrated with MADlib Machine Learning: advanced MPP analytics, data science at scale, directly on Hadoop data.
  52. Real-time and Personalised Information in Context is what Wins! “Companies need to learn how to catch people or things in the act of doing something and affect the outcome” PAUL MARITZ, Executive Chairman, Pivotal
