
Building a Modern Data Platform in the Cloud


Modern data is massive, quickly evolving, unstructured, and increasingly hard to catalog and understand across multiple consumers and applications. This presentation will guide you through the best practices for designing a robust data architecture, highlighting the benefits and typical challenges of data lakes and data warehouses. We will build a scalable solution based on managed services such as Amazon Athena, AWS Glue, and AWS Lake Formation.


  1. Building a Modern Data Platform in the Cloud (DAT1)
     Steven Bryen | AWS Technical & Developer Evangelism | @steven_bryen | sbryen@amazon.com
     London – March 2019
     © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. To Become a Leader, Data is Your Differentiator
     Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth.*
     Organic revenue growth: Leaders 24%, Followers 15%
     *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
  3. For Data to Be a Differentiator, Customers Need to Be Able to…
     • Capture and store new non-relational data at PB–EB scale in real time
     • Run new types of analytics that go beyond batch reporting to incorporate real-time, predictive, voice, and image recognition
     • Democratize access to data in a secure and governed way
     New types of analytics: Dashboards, Predictive, Image Recognition, Voice, Real-time
     New types of data
  4. Traditionally, Analytics Used to Look Like This
     Diagram labels: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence
     • Relational data
     • TBs–PBs scale
     • Schema defined prior to data load
     • Operational reporting and ad hoc
     • Large initial CAPEX + $10K–$50K/TB/Year
  5. Data Lakes Extend the Traditional Approach
     Diagram labels: OLTP, ERP, CRM, LOB; Devices, Web, Sensors, Social; Data Lake; Data Warehouse; Business Intelligence; Big Data processing, real-time, Machine Learning
     • Relational and non-relational data
     • TBs–EBs scale
     • Diverse analytical engines
     • Low-cost storage & analytics
  6. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale
  7. More data lakes & analytics on AWS than anywhere else
  8. Data Lakes and Analytics from AWS
     Data Lake on AWS: Cost-effective | Scalable and durable | Secure | Open and comprehensive
     Diagram labels: Analytics | Machine Learning | Real-time Data Movement | On-premises Data Movement
  9. Data Lakes, Analytics, and ML Portfolio from AWS
     Broadest, deepest set of analytic services
     Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
     Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
     On-premises Data Movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS Storage Gateway
     Real-time Data Movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
     Data Lake on AWS: Storage | Archival Storage | Data Catalog
  10. Amazon S3: Object Storage
      Security and Compliance: Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
      Flexible Management: Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
      Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
      Query in Place: Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
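The lifecycle-policy point on the Amazon S3 slide can be sketched as a plain Python dictionary in the shape that the S3 lifecycle configuration API accepts. This is a minimal sketch: the `raw/` prefix, the day thresholds, and the rule ID are illustrative assumptions, not values from the presentation.

```python
import json


def build_lifecycle_rule(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build one S3 lifecycle rule that tiers objects down and then expires them.

    All thresholds are hypothetical; pick values that match your own retention needs.
    """
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must increase: IA < Glacier < expiration")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Move to Infrequent Access first, then to Glacier.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


# Hypothetical policy for objects under the "raw/" prefix.
lifecycle_configuration = {"Rules": [build_lifecycle_rule("raw/", 30, 90, 365)]}
print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3 installed and credentials configured, a dictionary of this shape is what the S3 client's `put_bucket_lifecycle_configuration` call takes as its `LifecycleConfiguration` argument.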
  11. Store Data in the Format You Want
      Amazon S3 | Amazon Glacier | AWS Glue
      Open and comprehensive
      • Text files like CSV
      • Columnar like Apache Parquet and Apache ORC
      • Logstash like Grok
      • JSON (simple, nested), Avro
      • And more…
  12. Amazon Glacier: Backup and Archive
      Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
      Secure: Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
      Retrieves data in minutes: Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
      Inexpensive: Lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
  13. Storing is Not Enough, Data Needs to Be Discoverable
      “Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” Gartner IT Glossary, 2018, https://www.gartner.com/it-glossary/dark-data
      Diagram labels: CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured
  14. Data Preparation Accounts for ~80% of the Work
      Chart categories: Building training sets, Cleaning and organizing data, Collecting data sets, Mining data for patterns, Refining algorithms, Other
  15. Use AWS Glue to cleanse, prep, and catalog
      AWS Glue Data Catalog: a single view across your data lake
      • Automatically discovers data and stores schema
      • Makes data searchable and available for ETL
      • Contains table definitions and custom metadata
      Use AWS Glue ETL jobs to cleanse, transform, and store processed data
      • Serverless Apache Spark environment
      • Use Glue ETL libraries or bring your own code
      • Write code in Python or Scala
      • Call any AWS API using the AWS boto3 SDK
      Diagram labels: Amazon S3 (Raw data), Amazon S3 (Staging data), Amazon S3 (Processed data), each with Crawlers; AWS Glue Data Catalog
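The cleanse step between raw and processed data can be illustrated in plain Python. This is only a stand-in under stated assumptions: a real AWS Glue job would run on Apache Spark, and the `order_id` and `amount` fields are hypothetical, not taken from the presentation.

```python
def cleanse(records):
    """Normalize raw records: trim strings, cast amounts, drop incomplete rows and duplicates."""
    seen_ids = set()
    cleaned = []
    for record in records:
        order_id = (record.get("order_id") or "").strip()
        amount_text = (record.get("amount") or "").strip()
        if not order_id or not amount_text:
            continue  # drop rows missing a required field
        if order_id in seen_ids:
            continue  # drop duplicates, keeping the first occurrence
        try:
            amount = float(amount_text)
        except ValueError:
            continue  # drop rows whose amount is not numeric
        seen_ids.add(order_id)
        cleaned.append({"order_id": order_id, "amount": amount})
    return cleaned


# Hypothetical raw input with the usual problems.
raw = [
    {"order_id": " A1 ", "amount": "19.99"},
    {"order_id": "A1", "amount": "19.99"},  # duplicate of the first row
    {"order_id": "A2", "amount": "n/a"},    # non-numeric amount
    {"order_id": "", "amount": "5.00"},     # missing id
    {"order_id": "A3", "amount": " 7.5 "},
]
print(cleanse(raw))
```

Only the rows for `A1` and `A3` survive; the same rules would carry over to a Spark job largely unchanged.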
  16. AWS Glue: ETL Service
      Make ETL scripting and deployment easy
      • Automatically generates ETL code
      • Code is customizable with Python and Spark
      • Endpoints provided to edit, debug, test code
      • Jobs are scheduled or event-based
      • Serverless
  17. AWS Glue: Data Catalog
      Make data discoverable
      • Automatically discovers data and stores schema
      • Catalog makes data searchable and available for ETL
      • Catalog contains table and job definitions
      • Computes statistics to make queries efficient
      Diagram labels: Glue Data Catalog; Discover data and extract schema; Compliance
  18. AWS Glue Crawlers: Automatically catalog your data
      Crawlers automatically build your Data Catalog and keep it in sync.
      • Automatically discover new data and extract schema definitions
      • Detect schema changes and version tables
      • Detect Hive-style partitions on Amazon S3
      • Built-in classifiers for popular types; custom classifiers using Grok expressions
      • Run ad hoc or on a schedule; serverless, so you only pay when a crawler runs
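Hive-style partitioning is a naming convention in which folder names carry `key=value` pairs. The sketch below shows how such a key can be read back into partition columns; it illustrates the convention only and is not how Glue crawlers are implemented. The object key is a made-up example.

```python
def parse_hive_partitions(object_key):
    """Return the key=value folder segments of an S3 object key as a dict."""
    partitions = {}
    for segment in object_key.split("/")[:-1]:  # the last segment is the file name
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


# Hypothetical object key laid out by year, month, and day.
key = "logs/year=2019/month=03/day=05/part-0000.parquet"
print(parse_hive_partitions(key))
```

A query engine that knows these partition columns can skip whole folders when a query filters on `year`, `month`, or `day`.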
  19. Data Lakes, Analytics, and ML Portfolio from AWS (portfolio overview repeated from slide 9)
  20. Data Movement From On-premises Datacenters
      AWS Snowball, Snowball Edge and Snowmobile: Petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
      AWS Direct Connect: Establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
      AWS Storage Gateway: Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
      AWS Database Migration Service: Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely with minimal downtime to applications
  21. Data Movement From Real-time Sources
      Amazon Kinesis Video Streams: Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
      Amazon Kinesis Data Firehose: Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
      Amazon Kinesis Data Streams: Build custom, real-time applications that process data streams using popular stream processing frameworks
      AWS IoT Core: Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
  22. Data Lakes, Analytics, and ML Portfolio from AWS (portfolio overview repeated from slide 9)
  23. Amazon EMR: Big Data Processing
      Low cost: Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs by 50–80%
      Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
      Latest versions: Updated with the latest open source frameworks within 30 days of release
      Use S3 storage: Process data directly in the S3 data lake securely with high performance using the EMRFS connector
  24. Amazon Redshift: Data Warehousing
      Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
      Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
      Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
      Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
  25. Amazon Redshift Spectrum
      Extend the data warehouse to exabytes of data in the S3 data lake
      Diagram labels: S3 data lake; Redshift data; Redshift Spectrum query engine
      • Exabyte-scale Redshift SQL queries against S3
      • Join data across Redshift and S3
      • Scale compute and storage separately
      • Stable query performance and unlimited concurrency
      • CSV, ORC, Grok, Avro, & Parquet data formats
      • Pay only for the amount of data scanned
  26. Amazon Kinesis: Real Time
      Kinesis Data Firehose: Load data streams into AWS data stores
      Kinesis Data Streams: Build custom applications that analyze data streams
      Kinesis Video Streams: Capture, process, and store video streams for analytics
      Kinesis Data Analytics: Analyze data streams with SQL
  27. Example: Real-time Log Analytics With SQL
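The transcript does not include the query behind this example. A common pattern for real-time log analytics is a tumbling-window aggregation, which a streaming SQL engine would express with a windowed GROUP BY. The sketch below shows the same idea in plain Python; the event tuples and the 60-second window are assumptions made for illustration.

```python
from collections import Counter


def count_status_per_window(events, window_seconds=60):
    """Group (timestamp_seconds, http_status) events into fixed windows and count statuses."""
    windows = {}
    for timestamp, status in events:
        # Every event belongs to exactly one non-overlapping window.
        window_start = timestamp - (timestamp % window_seconds)
        windows.setdefault(window_start, Counter())[status] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical web log events: (seconds since start, HTTP status code).
events = [(0, 200), (12, 200), (59, 500), (60, 200), (95, 404), (119, 404)]
print(count_status_per_window(events))
```

Each window reports how many requests returned each status code, which is the kind of per-minute metric a dashboard or alarm would consume.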
  28. Amazon Athena: Interactive Analysis
      • Interactive query service to analyze data in Amazon S3 using standard SQL
      • No infrastructure to set up or manage and no data to load
      • Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
      Query Instantly: Zero setup cost; just point to S3 and start querying
      Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
      Easy: Serverless, with zero infrastructure and zero administration; integrated with QuickSight
      Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression
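Because Athena charges by the amount of data scanned, compressing the data reduces the cost of each query. The arithmetic can be sketched as follows; the price per GB and the 4:1 compression ratio are hypothetical values chosen for illustration, so check current pricing before relying on any figure.

```python
def query_cost(data_gb, price_per_gb_scanned, compression_ratio=1.0):
    """Estimate the cost of one query that scans the whole dataset.

    compression_ratio is uncompressed size divided by compressed size (1.0 = uncompressed).
    """
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    scanned_gb = data_gb / compression_ratio
    return scanned_gb * price_per_gb_scanned


PRICE = 0.005  # hypothetical USD per GB scanned, not a quoted price

uncompressed = query_cost(1000, PRICE)
compressed = query_cost(1000, PRICE, compression_ratio=4.0)
saving = 1 - compressed / uncompressed
print(f"uncompressed: ${uncompressed:.2f}, compressed: ${compressed:.2f}, saving: {saving:.0%}")
```

Under these assumptions a 4:1 ratio cuts the per-query cost by 75%, which falls inside the 30–90% range quoted on the slide.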
  29. Amazon QuickSight
      Easy | Empower everyone | Seamless connectivity | Fast analysis | Serverless
  30. Data Lakes, Analytics, and ML Portfolio from AWS (portfolio overview repeated from slide 9)
  31. Data Lakes from AWS
      Data Lake on AWS: Cost-effective | Scalable and durable | Secure | Open and comprehensive
      Diagram labels: Analytics | Machine Learning | Real-time Data Movement | On-premises Data Movement
  32. AWS Provides Highest Levels of Security
      Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
      Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
      Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
      Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
      Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
  33. Security: Machine Learning-Powered Security (Amazon Macie)
      • Machine learning to discover, classify, and protect data
      • Continuously monitors data access for anomalies
      • Generates alerts when it detects unauthorized access
      • Recognizes PII or intellectual property
  34. Encryption: Data-at-Rest and in Motion
      • Only cloud that offers three forms of encryption
        • Server-side encryption
        • Encryption with keys managed by the AWS Key Management Service
        • Encryption with keys that customers manage
      • Only cloud that encrypts data in transit when replicating across regions
      • Data movement services can use the same Key Management Service
      • SSL endpoints
  35. Compliance: Log and Audit all AWS Activity
      • Log and continuously monitor all account activity and API calls with CloudTrail
      • Increase visibility into your user and resource activity
      • Enables governance, compliance, and operational and risk auditing
      Diagram: Store data in S3. An account event occurs, generating API activity; CloudTrail captures and records the API activity; a log of API calls is delivered to an S3 bucket and optionally to CloudWatch Logs and CloudWatch Events
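Once CloudTrail delivers its log files to S3, they can be summarized with a few lines of code. The sketch below assumes the usual JSON layout with a top-level `Records` list; the sample records are hand-written for illustration and the user names are made up.

```python
import json
from collections import Counter

# Hand-written sample in the shape of a CloudTrail log file; every value is made up.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventTime": "2019-03-05T10:00:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "GetObject", "userIdentity": {"type": "IAMUser", "userName": "analyst"}},
        {"eventTime": "2019-03-05T10:00:07Z", "eventSource": "s3.amazonaws.com",
         "eventName": "GetObject", "userIdentity": {"type": "IAMUser", "userName": "analyst"}},
        {"eventTime": "2019-03-05T10:01:30Z", "eventSource": "s3.amazonaws.com",
         "eventName": "DeleteObject", "userIdentity": {"type": "IAMUser", "userName": "etl-job"}},
    ]
})


def count_calls_by_user(log_text):
    """Count API calls per (user name, event name) pair in one CloudTrail log file."""
    records = json.loads(log_text)["Records"]
    return Counter(
        (record["userIdentity"].get("userName", "unknown"), record["eventName"])
        for record in records
    )


for (user, event), calls in sorted(count_calls_by_user(SAMPLE_LOG).items()):
    print(f"{user:10} {event:15} {calls}")
```

A summary like this is a starting point for the auditing described on the slide, for example to spot an identity issuing unexpected delete calls.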
  36. Compliance: Virtually Every Regulatory Agency
      Global: CSA (Cloud Security Alliance Controls), ISO 9001 (Global Quality Standard), ISO 27001 (Security Management Controls), ISO 27017 (Cloud Specific Controls), ISO 27018 (Personal Data Protection), PCI DSS Level 1 (Payment Card Standards), SOC 1 (Audit Controls Report), SOC 2 (Security, Availability, & Confidentiality Report), SOC 3 (General Controls Report)
      United States: CJIS (Criminal Justice Information Services), DoD SRG (DoD Data Processing), FedRAMP (Government Data Standards), FERPA (Educational Privacy Act), FIPS (Government Security Standards), FISMA (Federal Information Security Management), GxP (Quality Guidelines and Regulations), FFIEC (Financial Institutions Regulation), HIPAA (Protected Health Information), ITAR (International Arms Regulations), MPAA (Protected Media Content), NIST (National Institute of Standards and Technology), SEC Rule 17a-4(f) (Financial Data Standards), VPAT/Section 508 (Accountability Standards)
      Asia Pacific: FISC [Japan] (Financial Industry Information Systems), IRAP [Australia] (Australian Security Standards), K-ISMS [Korea] (Korean Information Security), MTCS Tier 3 [Singapore] (Multi-Tier Cloud Security Standard), My Number Act [Japan] (Personal Information Protection)
      Europe: C5 [Germany] (Operational Security Attestation), Cyber Essentials Plus [UK] (Cyber Threat Protection), G-Cloud [UK] (UK Government Standards), IT-Grundschutz [Germany] (Baseline Protection Methodology)
  37. Data Lakes from AWS
      Data Lake on AWS: Cost-effective | Scalable and durable | Secure | Open and comprehensive
      Diagram labels: Analytics | Machine Learning | Real-time Data Movement | On-premises Data Movement
  38. AWS Runs the Largest Global Cloud Infrastructure
      For example: Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second
      “…the scale at which AWS operates its public cloud storage services dwarfs the other vendors in this Magic Quadrant.” Gartner Magic Quadrant for Public Cloud Storage Services, Worldwide; Raj Bala, Arun Chandrasekaran, John McArthur; July 24, 2017
      Chart axes: time, customer data
  39. Any Scale
      • S3 has trillions of objects and exabytes of data
      • Built to store any amount of data
      • Run analytic engines at the largest scale by spinning up any amount of compute resources in minutes
      • Runs on the world’s largest global cloud infrastructure
  40. Unmatched Durability and Availability
      • Designed to deliver 99.999999999% durability
      • Geographic redundancy & automatic replication
      • Store data in multiple data centers across 3 AZs in a single region
      • Seamlessly replicates data between any regions
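To make the eleven-nines figure concrete, the sketch below treats 99.999999999% as an average annual per-object durability, which is an interpretation assumed here rather than stated on the slide. The object count is an arbitrary example.

```python
from fractions import Fraction

# 1 - 0.99999999999 (eleven nines), read as expected loss per object per year.
ANNUAL_LOSS_RATE = Fraction(1, 10**11)


def expected_annual_loss(object_count):
    """Expected number of objects lost per year, on average."""
    return object_count * ANNUAL_LOSS_RATE


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)


objects = 10_000_000  # hypothetical number of stored objects
print(float(expected_annual_loss(objects)))
print(int(years_per_lost_object(objects)))
```

Under this reading, storing ten million objects corresponds to losing one object roughly every 10,000 years on average. This is a statistical design target, not a guarantee for any individual object.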
  41. Data Lakes from AWS
      Data Lake on AWS: Lowest cost | Scalable and durable | Secure | Open and comprehensive
      Diagram labels: Analytics | Machine Learning | Real-time Data Movement | On-premises Data Movement
  42. Tiered Storage to Optimize Price/Performance
      • Storage tiers: S3 Standard, S3 Standard-Infrequent Access, S3 One Zone-Infrequent Access, Amazon Glacier
      • Migrate between tiers based on lifecycle policies
      • Store data at $0.023/GB/month with S3
      • Store data at $0.004/GB/month with Glacier
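The effect of tiering on the monthly bill can be estimated from the two storage prices quoted on this slide. This is a rough sketch: the 100 TB dataset and the 20/80 split are hypothetical, and retrieval, request, and transition fees are deliberately ignored.

```python
# USD per GB per month, as quoted on the slide (prices change over time).
PRICE_PER_GB_MONTH = {"S3 Standard": 0.023, "Glacier": 0.004}


def monthly_storage_cost(gb_by_tier):
    """Sum storage-only cost across tiers; excludes retrieval and request fees."""
    return sum(PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())


# Hypothetical 100,000 GB dataset: everything hot vs. 80% archived.
all_standard = monthly_storage_cost({"S3 Standard": 100_000})
tiered = monthly_storage_cost({"S3 Standard": 20_000, "Glacier": 80_000})
print(f"all in S3 Standard: ${all_standard:,.0f}/month")
print(f"20% Standard, 80% Glacier: ${tiered:,.0f}/month")
```

With these inputs, storage cost falls from about $2,300 to about $780 per month. Whether that holds in practice depends on how often the archived data has to be retrieved.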
  43. Pay Only for the Resources You Use as You Scale
      • Pay-as-you-go for the resources you consume
      • As low as $0.05/GB scanned with Athena
      • EMR and Athena can automatically scale down resources after a job completes, saving you costs
      • Commit to a set term and save up to 75% with Reserved Instances
      • Run on spare compute capacity with EMR and save up to 90% with Spot
      Traditional (rigid): fixed capacity leads to waste. Unmet demand means upset players and missed revenue; excess capacity means wasted $$$
      AWS (elastic): pay for the capacity you use
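The rigid-versus-elastic comparison on this slide comes down to simple bookkeeping. The sketch below uses a made-up hourly demand curve and a fixed fleet of 100 servers; both are assumptions chosen only to show how wasted and unmet capacity are counted.

```python
def provisioning_summary(demand, fixed_capacity):
    """Compare fixed provisioning with pay-per-use for a series of demand values."""
    wasted = sum(max(fixed_capacity - d, 0) for d in demand)  # paid for but idle
    unmet = sum(max(d - fixed_capacity, 0) for d in demand)   # demand that could not be served
    return {
        "fixed_paid": fixed_capacity * len(demand),
        "wasted": wasted,
        "unmet": unmet,
        "elastic_paid": sum(demand),  # elastic capacity tracks demand exactly
    }


# Hypothetical servers needed in each of six consecutive hours.
hourly_demand = [20, 35, 80, 120, 60, 25]
print(provisioning_summary(hourly_demand, fixed_capacity=100))
```

In this example the fixed fleet pays for 600 server-hours, wastes 280 of them, and still misses 20 server-hours of peak demand, while the elastic model pays for 340.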
  44. AWS databases and analytics
      Broad and deep portfolio, built for builders
      Analytics: Amazon Redshift (Data warehousing), Amazon EMR (Hadoop + Spark), Athena (Interactive analytics), Kinesis Analytics (Real-time), Amazon Elasticsearch Service (Operational analytics)
      Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Aurora (MySQL, PostgreSQL), DynamoDB (Key value, Document), ElastiCache (Redis, Memcached), Neptune (Graph), Timestream (Time series), QLDB (Ledger database), RDS on VMware
      Business Intelligence & Machine Learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens
      Data Lake: S3/Amazon Glacier, AWS Glue (ETL & Data Catalog), Lake Formation (Data Lakes)
      Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
      Blockchain: Managed Blockchain, Blockchain Templates
      AWS Marketplace: 250+ solutions, 730+ Database solutions, 600+ Analytics solutions, 25+ Blockchain solutions, 20+ Data lake solutions, 30+ solutions
  45. Fortnite | 125+ million players
      Challenge: Need to create a constant feedback loop for designers. Gain an up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, resulting in the most popular game played in the world.
  46. Epic Games uses Data Lakes and analytics
      • Entire analytics platform running on AWS
      • S3 leveraged as a Data Lake
      • All telemetry data is collected with Kinesis
      • Real-time analytics done through Spark on EMR, with DynamoDB to create scoreboards and real-time queries
      • Use Amazon EMR for large batch data processing
      • Game designers use data to inform their decisions
      Diagram labels: Game clients, Game servers, Launcher, Game services; Kinesis; Near-real-time pipelines: Spark on EMR, DynamoDB, Scoreboards API, Grafana, Limited Raw Data (real-time ad hoc SQL), User ETL (metric definition); Batch pipelines: S3 (Data Lake), ETL using EMR, Tableau/BI, Ad hoc SQL; APIs, Databases, S3, Other sources
  48. Demo Overview
  50. Typical steps of building a data lake
      1. Set up storage
      2. Move data
      3. Cleanse, prep, and catalog data
      4. Configure and enforce security and compliance policies
      5. Make data available for analytics
  51. Building data lakes can still take months
  52. Sample of steps required
      • Find sources
      • Create Amazon Simple Storage Service (Amazon S3) locations
      • Configure access policies
      • Map tables to Amazon S3 locations
      • ETL jobs to load and clean data
      • Create metadata access policies
      • Configure access from analytics services
      • Rinse and repeat for other data sets, users, and end-services
      And more:
      • Manage and monitor ETL jobs
      • Update metadata catalog as data changes
      • Update policies across services as users and permissions change
      • Manually maintain cleansing scripts
      • Create audit processes for compliance
      • …
      Manual | Error-prone | Time consuming
  53. AWS Lake Formation (join the preview)
      Build, secure, and manage a data lake in days
      Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
      Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
      Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
  54. How it works: AWS Lake Formation
      Build data lakes quickly
      • Identify, crawl, and catalog sources
      • Ingest and clean data
      • Transform into optimal formats
      Simplify security management
      • Enforce encryption
      • Define access policies
      • Implement audit logging
      Enable self-service and combined analytics
      • Analysts discover all data available for analysis from a single data catalog
      • Use multiple analytics tools over the same data
      Diagram labels: S3, IAM, KMS; OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social, Kinesis; Data Catalog; Athena, Amazon Redshift, AI Services, Amazon EMR, Amazon QuickSight
  55. Customer interest in AWS Lake Formation
      “We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. … Additionally, AWS Lake Formation will be HIPAA compliant from day one …” Aaron Symanski, CTO, Change Healthcare
      “I can’t wait for my team to get our hands on AWS Lake Formation. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.” Joshua Couch, VP Engineering, Fender Digital
  56. Thank You!
      Steven Bryen | AWS Technical & Developer Evangelism | @steven_bryen | sbryen@amazon.com
      London – March 2019
