Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon QuickSight - AWS Online Tech Talks

Learning Objectives:
- Understand how to build a serverless big data solution quickly and easily
- Learn how to discover and prepare all your data for analytics
- Learn how to query and visualize analytics on all your data to create actionable insights

  1. Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon QuickSight. Ben Snively, Specialist Solutions Architect, Data and Analytics. October 12, 2017. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Agenda: • What is Serverless? • Enterprise Data Warehouse on AWS (Amazon Redshift) • Serverless Queries from your Data Warehouse (Redshift Spectrum) • Serverless Data Catalog (AWS Glue) • Serverless ETL (AWS Glue) • Serverless BI (Amazon QuickSight) • Demonstration • Wrap up
  3. What is Serverless? Virtualized: You can easily provision servers and focus on the OS and above. Managed: You focus higher in the stack but still need to consider servers, how much CPU is needed, and how much RAM. AWS manages them based on the customer configuration. Serverless: Build applications and services without thinking about servers. Don’t be concerned about provisioning, scaling, and maintaining servers for fault tolerance and availability. AWS does all of this for you.
  4. Serverless characteristics: • No servers to provision or manage • Scales with usage • Never pay for idle resources • Availability and fault tolerance built in
  5. Amazon Redshift, Enterprise Data Warehouse: a lot faster, a lot simpler, a lot cheaper. • Managed, massively parallel, petabyte-scale data warehouse • Streaming backup/restore to S3 • Load data from S3, DynamoDB, and EMR • Extensive security features • Online scaling from 160 GB to 2 PB
  6. Selected Amazon Redshift customers
  7. We innovate quickly: well over 140 new features added since launch; release every two weeks; automatic patching. Feature timeline: Service Launch (2/14); PDX (4/2); Temp Credentials (4/11); DUB (4/25); SOC1/2/3 (5/8); Unload Encrypted Files NRT (6/5); JDBC Fetch Size (6/27); Unload logs (7/5); SHA1 Builtin (7/15); 4 byte UTF-8 (7/18); Sharing snapshots (7/18); Statement Timeout (7/22); Timezone, Epoch, Autoformat (7/25); WLM Timeout/Wildcards (8/1); CRC32 Builtin, CSV, Restore Progress (8/9); Resource Level IAM (8/9); PCI (8/22); UTF-8 Substitution (8/29); JSON, Regex, Cursors (9/10); Split_part, Audit tables (10/3); SIN/SYD (10/8); HSM Support (11/11); Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13); Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13); EIP Support for VPC Clusters (12/28); New query monitoring system tables and diststyle all (1/13); Redshift on DW2 (SSD) Nodes (1/23); Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13); Resize progress indicator & Cluster Version (3/21); Regex_Substr, COPY from JSON (3/25); 50 slots, COPY from EMR, ECDHE ciphers (4/22); 3 new regex features, Unload to single file, FedRAMP (5/6); Rename Cluster (6/2); Copy from multiple regions, percentile_cont, percentile_disc (6/30); Free Trial (7/1); pg_last_unload_count (9/15); AES-128 S3 encryption (9/29); UTF-16 support (9/29)
  8. Amazon Redshift Spectrum: Run SQL queries directly against data in S3 using thousands of nodes. • Fast @ exabyte scale • Elastic & highly available • On-demand, pay-per-query • High concurrency: multiple clusters access the same data • No ETL: query data in place using open file formats • Full Amazon Redshift SQL support (Diagram labels: S3; SQL.)
  9. Redshift Spectrum: • Leverages Amazon Redshift’s advanced cost-based optimizer • Pushes down projections, filters, aggregations, and join reduction • Dynamic partition pruning to minimize data processed • Automatic parallelization of query execution against S3 data • Efficient join processing within the Amazon Redshift cluster (Diagram labels: JDBC/ODBC; Customer VPC; Internal VPC; 10 GigE (HPC); Ingestion, Backup, Restore; Redshift Nodes; Spectrum Nodes 1 2 3 4 ... N; Amazon S3, exabyte-scale object storage; AWS Glue Data Catalog; Apache Hive Metastore.)
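
The "dynamic partition pruning" bullet on slide 9 can be illustrated with a small plain-Python sketch. This is not Redshift Spectrum's implementation or API; the `Partition` class, the date-keyed layout, the `example-bucket` paths, and the byte sizes are all hypothetical, chosen only to show how a filter on the partition key reduces the amount of S3 data that has to be scanned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One partition of a hypothetical external table, keyed by an ISO date."""
    key: str         # partition value, e.g. "2017-10-01"
    s3_prefix: str   # hypothetical S3 location of the partition's files
    size_bytes: int  # made-up size, used only to show the savings


def prune_partitions(partitions, lo, hi):
    """Keep only partitions whose key lies in [lo, hi].

    ISO dates compare correctly as strings, so no date parsing is needed.
    """
    return [p for p in partitions if lo <= p.key <= hi]


# Hypothetical catalog entries (in the talk, partition info comes from the
# Glue Data Catalog or a Hive Metastore).
catalog = [
    Partition("2017-09-30", "s3://example-bucket/sales/dt=2017-09-30/", 400),
    Partition("2017-10-01", "s3://example-bucket/sales/dt=2017-10-01/", 500),
    Partition("2017-10-02", "s3://example-bucket/sales/dt=2017-10-02/", 600),
    Partition("2017-10-03", "s3://example-bucket/sales/dt=2017-10-03/", 700),
]

# A filter such as: WHERE dt BETWEEN '2017-10-01' AND '2017-10-02'
selected = prune_partitions(catalog, "2017-10-01", "2017-10-02")
scanned = sum(p.size_bytes for p in selected)
total = sum(p.size_bytes for p in catalog)
print(f"scanning {len(selected)} of {len(catalog)} partitions, {scanned}/{total} bytes")
```

Because the service is billed per query (slide 8), skipping partitions in this way would reduce both the work done and the data processed.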
  10. Life of a query, step 1: Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY… (Diagram labels, repeated on slides 10 to 18: JDBC/ODBC; Amazon Redshift; nodes 1 2 3 4 ... N; Amazon S3, exabyte-scale object storage; Glue Data Catalog; Apache Hive Metastore.)
  11. Life of a query, step 2: Query is optimized and compiled at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum.
  12. Life of a query, step 3: Query plan is sent to all compute nodes.
  13. Life of a query, step 4: Compute nodes obtain partition info from Data Catalog; dynamically prune partitions.
  14. Life of a query, step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
  15. Life of a query, step 6: Amazon Redshift Spectrum nodes scan your S3 data.
  16. Life of a query, step 7: Amazon Redshift Spectrum projects, filters, and aggregates.
  17. Life of a query, step 8: Final aggregations and joins with local Amazon Redshift tables done in-cluster.
  18. Life of a query, step 9: Result is sent back to client.
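
The split between work done in the Spectrum layer and work done in the cluster (the scan, filter, and aggregate steps followed by the final in-cluster aggregation) can be sketched as a two-stage `COUNT(*) ... GROUP BY`. This is a plain-Python illustration under assumed data, not how Redshift is implemented; `spectrum_scan`, `final_aggregate`, the `region` and `amount` columns, and the sample rows are all hypothetical.

```python
from collections import Counter


def spectrum_scan(rows, predicate, group_col):
    """Stand-in for one request to the Spectrum layer.

    Scans the rows of one S3 object, applies the filter, projects the
    grouping column, and returns a partial COUNT(*) per group.
    """
    partial = Counter()
    for row in rows:
        if predicate(row):
            partial[row[group_col]] += 1
    return partial


def final_aggregate(partials):
    """Stand-in for the in-cluster step: merge partial counts into the result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Three hypothetical S3 objects, each a list of rows.
s3_objects = [
    [{"region": "us", "amount": 10}, {"region": "eu", "amount": 5}],
    [{"region": "us", "amount": 0}, {"region": "us", "amount": 7}],
    [{"region": "eu", "amount": 3}, {"region": "ap", "amount": 9}],
]

# Roughly: SELECT region, COUNT(*) FROM ext_table WHERE amount > 0 GROUP BY region
partials = [
    spectrum_scan(obj, lambda r: r["amount"] > 0, "region") for obj in s3_objects
]
result = final_aggregate(partials)
print(result)
```

Only the small per-object partial counts travel back to the cluster in this sketch, which is the point of pushing filters and aggregations down to the layer that reads S3.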
  19. Glue’ing it together
  20. AWS Glue. Discover: Automatically discovers and categorizes your data to make it immediately searchable and queryable. Develop: Generates code to clean, enrich, and reliably move data between data stores; you can also use your favorite tools to build ETL jobs. Deploy: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources.
  21. AWS Glue: Components. Data Catalog: • Apache Hive Metastore compatible with enhanced functionality • Crawlers automatically extract metadata and create tables • Integrated with Amazon Athena, Amazon Redshift Spectrum. Job Execution: • Runs jobs on a serverless Spark platform • Provides flexible scheduling • Handles dependency resolution, monitoring, and alerting. Job Authoring: • Auto-generates ETL code • Built on open frameworks: Python and Spark • Developer-centric: editing, debugging, sharing
  22. AWS Glue Data Catalog: Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable.
  23. Crawlers: Classifiers. Built-in classifiers: MySQL; MariaDB; PostgreSQL; Aurora; SQL Server / Oracle; Redshift; Avro; Parquet; ORC; JSON & BSON; Logs (Apache, Linux, MS, Ruby, Redis, and many others); Delimited (comma, pipe, tab, semicolon); Compressed formats (ZIP, BZIP, GZIP, LZ4, Snappy). Create additional custom classifiers with Grok! (Diagram labels: IAM Role; Glue Crawler; Data Lakes; Data Warehouse; Databases; Amazon RDS; Amazon Redshift; Amazon S3; JDBC Connection; Object Connection.)
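
As a rough illustration of what a classifier does, the sketch below tries a list of named regular expressions against sample lines and reports the first format that matches every line, together with the column names it found. It is loosely in the spirit of Grok (named patterns) but uses only Python's `re` module; it is not the AWS Glue classifier API, and the pattern names, the `classify` function, and the sample lines are hypothetical.

```python
import re

# Hypothetical pattern table: (format name, compiled regex with named groups).
CLASSIFIERS = [
    (
        "apache_common_log",
        re.compile(
            r'^(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
            r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
        ),
    ),
    (
        "csv_3_columns",
        re.compile(r"^(?P<c1>[^,]*),(?P<c2>[^,]*),(?P<c3>[^,]*)$"),
    ),
]


def classify(sample_lines):
    """Return (format name, column names) for the first pattern matching all lines."""
    for name, pattern in CLASSIFIERS:
        if sample_lines and all(pattern.match(line) for line in sample_lines):
            # Order the named groups by their position in the pattern.
            columns = sorted(pattern.groupindex, key=pattern.groupindex.get)
            return name, columns
    return "UNKNOWN", []


log_sample = [
    '127.0.0.1 - frank [10/Oct/2017:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
]
csv_sample = ["1,alice,10", "2,bob,20"]

print(classify(log_sample))
print(classify(csv_sample))
print(classify(["free-form text with no structure"]))
```

Trying patterns in a fixed order and stopping at the first full match is a design choice made for this sketch only.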
  24. Building your Data Catalog
  25. Job authoring in AWS Glue: You have choices on how to get started. • Python code generated by AWS Glue • Connect a notebook or IDE to AWS Glue • Existing code brought into AWS Glue
  26. Job authoring: Automatic code generation. 1. Customize the mappings. 2. Glue generates transformation graph and Python code. 3. Connect your notebook to development endpoints to customize your code.
  27. Job authoring: Relationalize() transform. Converts a semi-structured schema into a relational schema. • Transforms and adds new columns, types, and tables on-the-fly • Tracks keys and foreign keys across runs • SQL on the relational schema is orders of magnitude faster than JSON processing (Diagram: semi-structured schema with fields A, B, C {X, Y}, D [ ]; relational schema with columns A, B, C.X, C.Y, FK and a second table with columns PK, Offset, Value.)
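
The idea behind the Relationalize() slide can be sketched in plain Python: nested structs become dotted columns (C.X, C.Y), and each array moves to a child table linked by a key. This is only an illustration of the concept, not the AWS Glue transform or its output format; the function name, the `root` table name, and the `id`/`offset`/`value` column names are assumptions, array items are assumed to be scalars, and the sketch does not track keys across runs.

```python
from itertools import count


def relationalize(records, root="root"):
    """Flatten nested records into a dict of flat tables (lists of row dicts)."""
    tables = {root: []}
    next_key = count(1)  # generator of link keys shared by all child tables

    def flatten(obj, prefix, row):
        for name, value in obj.items():
            column = f"{prefix}{name}"
            if isinstance(value, dict):
                # Nested struct: recurse and use dotted column names.
                flatten(value, f"{column}.", row)
            elif isinstance(value, list):
                # Array: keep a foreign key in the parent row and move the
                # items to a child table, one row per array element.
                key = next(next_key)
                row[column] = key
                child = tables.setdefault(f"{root}_{column}", [])
                for offset, item in enumerate(value):
                    child.append({"id": key, "offset": offset, "value": item})
            else:
                row[column] = value

    for record in records:
        row = {}
        flatten(record, "", row)
        tables[root].append(row)
    return tables


# One record shaped like the slide's example: fields A, B, struct C, array D.
records = [{"A": 1, "B": "x", "C": {"X": 2, "Y": 3}, "D": [10, 20]}]
tables = relationalize(records)
for table_name, rows in tables.items():
    print(table_name, rows)
```

The resulting flat tables can be joined on the key column with ordinary SQL, which is what makes them suitable for loading into a warehouse.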
  28. Serverless ETL to populate your warehouse
  29. Amazon QuickSight is a business analytics service that lets business users quickly and easily visualize, explore, and share insights from their data.
  30. Basic Concepts (Diagram labels: Retail Data; Ops Data; Marketing Data; Relational Databases; Flat Files; More data sources coming soon!; Microsoft Active Directory; Local User Definition.)
  31. Deep Integration with AWS Data Sources: QuickSight is deeply integrated with AWS data sources like Redshift, RDS, S3, Athena, and others, as well as third-party sources like Excel and Salesforce and on-premises databases. (Diagram labels: Amazon RDS, Aurora; Amazon Redshift; Amazon Athena; Amazon S3; Flat Files.)
  32. Super-fast Performance with SPICE
  33. Putting the pieces together
  34. Demonstration
  35. What did we cover… • AWS Glue: Automatically discovers and categorizes your data to make it immediately searchable and queryable. • Amazon QuickSight: Business analytics service that lets business users quickly and easily visualize, explore, and share insights from their data. • Amazon Redshift Spectrum: Extends analytic power beyond the data stored in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”. • AWS Glue: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources.
  36. Thank you
