Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Siva Raghupathy, Senior Manager
Big Data Solutio...
Agenda
Data Lake Concepts
Simplify Data Lake
What technologies should you use?
• Why?
• How?
Reference architecture
Design...
What is a Data Lake?
It is an architecture that allows you to collect, store,
process, analyze and consume all data that f...
Why Data Lake?
• Leverage all data that flows into your organization
• Customer centricity
• Business agility
• Better pre...
Data Lake Enablers
• Big Data technology evolution
• Cloud services evolution/economics
• Big Data + Cloud architecture co...
Big Data Evolution
Batch
processing
Stream
processing
Artificial
Intelligence
Cloud Services Evolution
Virtual
machines
Managed
services
Serverless
Plethora of Tools
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Lambda Amazon ML
SQS
Ela...
Data Lake Challenges
Why?
How?
What tools should I use?
Is there a reference architecture?
Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool f...
Simplify Data Lake
COLLECT STORE PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Types of DataCOLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Applications
In-memory data structures
D...
What Is the Temperature of Your Data ?
Hot Warm Cold
Volume MB–GB GB–TB PB–EB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low–high High Ver...
Store
STORE
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
COLLECT
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS...
In-memory
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Apache Kafka
Amazon DynamoDB
Streams
Amazon SQS
Amazon SQS
• Mana...
Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Par...
What About Amazon SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client orderin...
Which Stream/Message Storage Should I Use?
Amazon
DynamoDB
Streams
Amazon
Kinesis
Streams
Amazon
Kinesis
Firehose
Apache
K...
In-memory
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
AWS Import/Export
Snowball
L...
Why Is Amazon S3 the Fabric of Data Lake?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decoup...
What About HDFS & Data Tiering?
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard for frequently...
In-memory
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS Database
AWS Import/Export
Snowball
L...
Anti-Pattern
Data Tier
Best Practice: Use the Right Tool for the Job
Data Tier
Search
Amazon Elasticsearch
Service
In-memory
Amazon ElastiCache
R...
Materialized Views & Immutable Log
Views
Immutable log
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
Cloud...
Which Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format y...
Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) In-memory, NoSQL
Simple relationships...
In-memory
SQL
Request rate
High Low
Cost/GB
High Low
Latency
Low High
Data volume
Low High
Amazon
Glacier
Structure
NoSQL
...
Amazon ElastiCache Amazon
DynamoDB
Amazon
RDS/Aurora
Amazon
ES
Amazon S3 Amazon Glacier
Average
latency
ms ms ms, sec ms,s...
Cost-Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project. The design...
https://calculator.s3.amazonaws.com/index.html
Simple Monthly
Calculator
Cost-Conscious Design
Example: Should I use Amazo...
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
300 2,048 1,483 777,600,000
Amazon S...
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
Scenario
1
300 2,048 1,483 777,600,0...
PROCESS /
ANALYZE
Batch
Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Ta...
Which Stream & Message Processing Technology Should I Use?
Amazon
EMR (Spark
Streaming)
Apache
Storm
KCL Application Amazo...
Which Analysis Tool Should I Use?
Amazon Redshift Amazon Athena Amazon EMR
Presto Spark Hive
Use case Optimized for data
w...
What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
ETLSTORE PROCESS / ANALYZE
Data Integration Partners
Re...
CONSUME
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Elasticsearch
Service
Apache Kafka
Amazon SQS
Amazon Kinesis
Streams
Amazon ...
STORE CONSUMEPROCESS / ANALYZE
Amazon QuickSight
Apps & Services
Analysis&visualizationNotebooksIDEAPI
Applications & API
...
Putting It All Together
Streaming
Amazon Kinesis
Analytics
KCL
apps
AWS Lambda
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Elasticsearch
Service...
What about Metadata?
• Amazon Athena Catalog
• An internal data catalog for tables/schemas on S3
• Glue Catalog
• Hive Met...
Security & Governance
• AWS Identity and Access Management (IAM)
• Amazon Cognito
• Amazon CloudWatch & AWS CloudTrail
• A...
Security &
Governance IAM AWS
STS
Amazon
CloudWatch
AWS
CloudTrail
AWS
KMS
AWS
CloudHSM
AWS Directory
Service
Data
Catalog...
Design Patterns
Spark Streaming
Apache Storm
AWS Lambda
KCL apps
Amazon
Redshift
Amazon
Redshift
Hive
Spark
Presto
Amazon Kinesis Amazon
D...
Amazon EMR
Real-time Analytics
Amazon
Kinesis
KCL app
AWS Lambda
Spark
Streaming
Amazon
SNS
Amazon
AI
Notifications
Amazon...
Interactive
&
Batch
Analytics
Amazon S3
Amazon EMR
Hive
Pig
Spark
Amazon
AI
process
store
Consume
Amazon Redshift
Amazon E...
Interactive
&
Batch
Amazon S3
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
R...
Summary
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data...
Security
&
Governance IAM Amazon
CloudWatch
AWS
CloudTrail
AWS
KMS
AWS
CloudHSM
AWS Directory
Service
Data
Catalog
Amazon ...
Resources
• https://aws.amazon.com/blogs/big-data/introducing-the-data-
lake-solution-on-aws/
• AWS re:Invent 2016: Netfli...
Thank you!
Próxima SlideShare
Cargando en…5
×

Deploying a Data Lake on AWS - AWS Online Tech Talks March 2017

5.956 visualizaciones

Publicado el

Increasing demands to collect,store, and analyze massive amounts of data often mean that the same tools and approaches that worked in the past don’t work anymore. That’s why many organizations are shifting to a data lake, which is an architectural approach that allows you to store massive amounts of data into a central location, so it’s readily available to be categorized, processed, analyzed and consumed by diverse groups within an organization. In this tech talk, we introduce key concepts for a data lake and present aspects related to its implementation. We highlight the core components of a data lake, such as storage, compute, analytics, databases, stream processing, data management, and security. We discuss how to choose the right technologies for each component of the data lake, based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. We also provide a reference architecture and recommendations to get started with a data lake implementation on AWS.

Learning Objectives:
· Understand key concepts and architectural components of a data lake architecture
· Describe how and when to use a broad set of analytic and data management tools in a data lake architecture
· Get insights on how to get started with a data lake implementation on AWS

Publicado en: Tecnología
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Responder 
    ¿Estás seguro?    No
    Tu mensaje aparecerá aquí

Deploying a Data Lake on AWS - AWS Online Tech Talks March 2017

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Siva Raghupathy, Senior Manager Big Data Solutions Architecture, AWS March 21, 2017 Deploying a Data Lake in AWS
  2. 2. Agenda Data Lake Concepts Simplify Data Lake What technologies should you use? • Why? • How? Reference architecture Design patterns
  3. 3. What is a Data Lake? It is an architecture that allows you to collect, store, process, analyze and consume all data that flows into your organization.
  4. 4. Why Data Lake? • Leverage all data that flows into your organization • Customer centricity • Business agility • Better predictions via Machine Learning • Competitive advantage
  5. 5. Data Lake Enablers • Big Data technology evolution • Cloud services evolution/economics • Big Data + Cloud architecture convergence
  6. 6. Big Data Evolution Batch processing Stream processing Artificial Intelligence
  7. 7. Cloud Services Evolution Virtual machines Managed services Serverless
  8. 8. Plethora of Tools Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Data Pipeline Amazon Kinesis Lambda Amazon ML SQS ElastiCache DynamoDB Streams Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight
  9. 9. Data Lake Challenges Why? How? What tools should I use? Is there a reference architecture?
  10. 10. Architectural Principles Build decoupled systems • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Leverage AWS managed services • Scalable/elastic, available, reliable, secure, no/low admin Use log-centric design patterns • Immutable logs, materialized views Be cost-conscious • Big data ≠ big cost
  11. 11. Simplify Data Lake COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  12. 12. Types of DataCOLLECT Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications In-memory data structures Database records AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Search documents Log files Messaging Message MESSAGES Messaging Messages Devices Sensors & IoT platforms AWS IoT STREAMS IoT Data streams Transactions Files Events
  13. 13. What Is the Temperature of Your Data ?
  14. 14. Hot Warm Cold Volume MB–GB GB–TB PB–EB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low–high High Very high Request rate Very high High Low Cost/GB $$-$ $-¢¢ ¢ Hot data Warm data Cold data Data Characteristics: Hot, Warm, Cold
  15. 15. Store
  16. 16. STORE Devices Sensors & IoT platforms AWS IoT STREAMS IoT COLLECT AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES MessagingApplications Mobile apps Web apps Data centers AWS Direct Connect RECORDS Types of Data Stores Database SQL & NoSQL databases Search Search engines File store File systems Queue Message queues Stream storage Pub/sub message queues In-memory Caches, data structure servers
  17. 17. In-memory Amazon Kinesis Firehose Amazon Kinesis Streams Apache Kafka Amazon DynamoDB Streams Amazon SQS Amazon SQS • Managed message queue service Apache Kafka • High throughput distributed streaming pla Amazon Kinesis Streams • Managed stream storage + processing Amazon Kinesis Firehose • Managed data delivery Amazon DynamoDB • Managed NoSQL database • Tables can be stream-enabled Message & Stream Storage Devices Sensors & IoT platforms AWS IoT STREAMS IoT COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database Applications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES Search File store LoggingTransport Messaging Message MESSAGES Messaging Message Stream
  18. 18. Why Stream Storage? Decouple producers & consumers Persistent buffer Collect multiple streams Preserve client ordering Parallel consumption Streaming MapReduce 443322114 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 44332211 shard 1 / partition 1 shard 2 / partition 2 Consumer 1 Count of red = 4 Count of violet = 4 Consumer 2 Count of blue = 4 Count of green = 4 DynamoDB stream Amazon Kinesis stream Kafka topic
  19. 19. What About Amazon SQS? • Decouple producers & consumers • Persistent buffer • Collect multiple streams • No client ordering (Standard) • FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or ʎ functions) Publisher Amazon SNS topic function ʎ AWS Lambda function Amazon SQS queue queue Subscriber Consumers 4 3 2 1 12344 3 2 1 1234 2134 13342 Standard FIFO
  20. 20. Which Stream/Message Storage Should I Use? Amazon DynamoDB Streams Amazon Kinesis Streams Amazon Kinesis Firehose Apache Kafka Amazon SQS (Standard) Amazon SQS (FIFO) AWS managed Yes Yes Yes No Yes Yes Guaranteed ordering Yes Yes No Yes No Yes Delivery (deduping) Exactly-once At-least-once At-least-once At-least-once At-least-once Exactly-once Data retention period 24 hours 7 days N/A Configurable 14 days 14 days Availability 3 AZ 3 AZ 3 AZ Configurable 3 AZ 3 AZ Scale / throughput No limit / ~ table IOPS No limit / ~ shards No limit / automatic No limit / ~ nodes No limits / automatic 300 TPS / queue Parallel consumption Yes Yes No Yes No No Stream MapReduce Yes Yes N/A Yes N/A N/A Row/object size 400 KB 1 MB Destination row/object size Configurable 256 KB 256 KB Cost Higher (table cost) Low Low Low (+admin) Low-medium Low-medium Hot Warm New
  21. 21. In-memory COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES Search Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Streams Hot Stream Amazon S3 Amazon SQS Message Amazon S3 File LoggingIoTApplicationsTransportMessaging File Storage
  22. 22. Why Is Amazon S3 the Fabric of Data Lake? • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • Decouple storage and compute • No need to run compute clusters for storage (unlike HDFS) • Can run transient Hadoop clusters & Amazon EC2 Spot Instances • Multiple & heterogeneous analysis clusters can use the same data • Unlimited number of objects and volume of data • Very high bandwidth – no aggregate throughput limit • Designed for 99.99% availability – can tolerate zone failure • Designed for 99.999999999% durability • No need to pay for data replication • Native support for versioning • Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies • Secure – SSL, client/server-side encryption at rest • Low cost
  23. 23. What About HDFS & Data Tiering? • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard – IA for less frequently accessed data • Use Amazon Glacier for archiving cold data • Use Amazon S3 Analytics for storage class analysis New
  24. 24. In-memory COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES Search Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Streams Hot Stream Amazon SQS Message Amazon S3 File LoggingIoTApplicationsTransportMessaging In-memory, Database, Search
  25. 25. Anti-Pattern Data Tier
  26. 26. Best Practice: Use the Right Tool for the Job Data Tier Search Amazon Elasticsearch Service In-memory Amazon ElastiCache Redis Memcached SQL Amazon Aurora Amazon RDS MySQL PostgreSQL Oracle SQL Server NoSQL Amazon DynamoDB Cassandra HBase MongoDB
  27. 27. Materialized Views & Immutable Log Views Immutable log
  28. 28. COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Streams Hot Stream Amazon SQS Message Amazon Elasticsearch Service Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS SearchSQLNoSQLCacheFile LoggingIoTApplicationsTransportMessaging Amazon ElastiCache • Managed Memcached or Redis service Amazon DynamoDB • Managed NoSQL database service Amazon RDS • Managed relational database service Amazon Elasticsearch Service • Managed Elasticsearch service
  29. 29. Which Data Store Should I Use? Data structure → Fixed schema, JSON, key-value Access patterns → Store data in the format you will access it Data characteristics → Hot, warm, cold Cost → Right cost
  30. 30. Data Structure and Access Patterns Access Patterns What to use? Put/Get (key, value) In-memory, NoSQL Simple relationships → 1:N, M:N NoSQL Multi-table joins, transaction, SQL SQL Faceting, search Search Data Structure What to use? Fixed schema SQL, NoSQL Schema-free (JSON) NoSQL, Search (Key, value) In-memory, NoSQL
  31. 31. In-memory SQL Request rate High Low Cost/GB High Low Latency Low High Data volume Low High Amazon Glacier Structure NoSQL Hot data Warm data Cold data Low High
  32. 32. Amazon ElastiCache Amazon DynamoDB Amazon RDS/Aurora Amazon ES Amazon S3 Amazon Glacier Average latency ms ms ms, sec ms,sec ms,sec,min (~ size) hrs Typical data stored GB GB–TBs (no limit) GB–TB (64 TB max) GB–TB MB–PB (no limit) GB–PB (no limit) Typical item size B-KB KB (400 KB max) KB (64 KB max) B-KB (2 GB max) KB-TB (5 TB max) GB (40 TB max) Request Rate High – very high Very high (no limit) High High Low – high (no limit) Very low Storage cost GB/month $$ ¢¢ ¢¢ ¢¢ ¢ ¢4/10 Durability Low - moderate Very high Very high High Very high Very high Availability High 2 AZ Very high 3 AZ Very high 3 AZ High 2 AZ Very high 3 AZ Very high 3 AZ Hot data Warm data Cold data Which Data Store Should I Use?
  33. 33. Cost-Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  34. 34. https://calculator.s3.amazonaws.com/index.html Simple Monthly Calculator Cost-Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
  35. 35. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or DynamoDB?
  36. 36. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month Scenario 1 300 2,048 1,483 777,600,000 Scenario 2 300 32,768 23,730 777,600,000 Amazon S3 Amazon DynamoDB use use
  37. 37. PROCESS / ANALYZE
  38. 38. Batch Takes minutes to hours Example: Daily/weekly/monthly reports Amazon EMR (MapReduce, Hive, Pig, Spark) Interactive Takes seconds Example: Self-service dashboards Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark) Message Takes milliseconds to seconds Example: Message processing Amazon SQS applications on Amazon EC2 Stream Takes milliseconds to seconds Example: Fraud alerts, 1 minute metrics Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda Artificial Intelligence Takes milliseconds to minutes Example: Fraud detection, forecast demand, text to speech Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK and Caffe) Analytics Types & Frameworks PROCESS / ANALYZE Message Amazon SQS apps Amazon EC2 Streaming Amazon Kinesis Analytics KCL apps AWS Lambda Stream Amazon EC2 Amazon EMR Fast Amazon Redshift Presto Amazon EMR FastSlow Amazon Athena BatchInteractive Amazon AI AI
  39. 39. Which Stream & Message Processing Technology Should I Use? Amazon EMR (Spark Streaming) Apache Storm KCL Application Amazon Kinesis Analytics AWS Lambda Amazon SQS Application AWS managed Yes (Amazon EMR) No (Do it yourself) No ( EC2 + Auto Scaling) Yes Yes No (EC2 + Auto Scaling) Serverless No No No Yes Yes No Scale / throughput No limits / ~ nodes No limits / ~ nodes No limits / ~ nodes Up to 8 KPU / automatic No limits / automatic No limits / ~ nodes Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ Multi-AZ Programming languages Java, Python, Scala Almost any language via Thrift Java, others via MultiLangDaemo n ANSI SQL with extensions Node.js, Java, Python AWS SDK languages (Java, .NET, Python, …) Uses Multistage processing Multistage processing Single stage processing Multistage processing Simple event-based triggers Simple event based triggers Reliability KCL and Spark checkpoints Framework managed Managed by KCL Managed by Amazon Kinesis Analytics Managed by AWS Lambda Managed by SQS Visibility Timeout
  40. 40. Which Analysis Tool Should I Use? Amazon Redshift Amazon Athena Amazon EMR Presto Spark Hive Use case Optimized for data warehousing Ad-hoc Interactive Queries Interactive Query General purpose (iterative ML, RT, ..) Batch Scale/throughput ~Nodes Automatic / No limits ~ Nodes AWS Managed Service Yes Yes, Serverless Yes Storage Local storage Amazon S3 Amazon S3, HDFS Optimization Columnar storage, data compression, and zone maps CSV, TSV, JSON, Parquet, ORC, Apache Web log Framework dependent Metadata Amazon Redshift managed Athena Catalog Manager Hive Meta-store BI tools supports Yes (JDBC/ODBC) Yes (JDBC) Yes (JDBC/ODBC & Custom) Access controls Users, groups, and access controls AWS IAM Integration with LDAP UDF support Yes (Scalar) No Yes Slow
  41. 41. What About ETL? https://aws.amazon.com/big-data/partner-solutions/ ETLSTORE PROCESS / ANALYZE Data Integration Partners Reduce the effort to move, cleanse, synchronize, manage, and automatize data related processes. AWS Glue AWS Glue is a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores New
  42. 42. CONSUME
  43. 43. COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm FileMessage Stream Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENTS FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging ETL SearchSQLNoSQLCache Streaming Amazon Kinesis Analytics KCL apps AWS Lambda Fast Stream Amazon EC2 Amazon EMR Amazon SQS apps Amazon Redshift Presto Amazon EMR FastSlow Amazon EC2 Amazon Athena BatchMessageInteractiveAI Amazon AI Amazon S3
  44. 44. STORE CONSUMEPROCESS / ANALYZE Amazon QuickSight Apps & Services Analysis&visualizationNotebooksIDEAPI Applications & API Analysis and visualization Notebooks IDE Business users Data scientist, developers COLLECT ETL
  45. 45. Putting It All Together
  46. 46. Streaming Amazon Kinesis Analytics KCL apps AWS Lambda COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm Fast Stream SearchSQLNoSQLCacheFileMessageStream Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENTS FILES MESSAGES STREAMS Amazon QuickSight Apps & Services Analysis&visualizationNotebooksIDEAPI LoggingIoTApplicationsTransportMessaging ETL Amazon EMR Amazon SQS apps Amazon Redshift Presto Amazon EMR FastSlow Amazon EC2 Amazon Athena BatchMessageInteractiveAI Amazon AI Amazon S3
  47. 47. What about Metadata? • Amazon Athena Catalog • An internal data catalog for tables/schemas on S3 • Glue Catalog • Hive Metastore compliant • Crawlers - Detect new data, schema, partitions • Search - Metadata discovery • EMR Hive Metastore (Presto, Spark, Hive, Pig) • Can be hosted on Amazon RDS Data Catalog Amazon Athena Catalog RDS Amazon EMR Hive Metastore Amazon RDS Glue Catalog Presto Amazon EMR Amazon Athena
  48. 48. Security & Governance • AWS Identity and Access Management (IAM) • Amazon Cognito • Amazon CloudWatch & AWS CloudTrail • Amazon KMS • AWS Directory Service • Apache Ranger Security & Governance IAM Amazon CloudWatch AWS CloudTrail AWS KMS AWS CloudHSM AWS Directory Service Amazon Cognito
  49. 49. Security & Governance IAM AWS STS Amazon CloudWatch AWS CloudTrail AWS KMS AWS CloudHSM AWS Directory Service Data Catalog Amazon Athena Catalog RDS Hive Metastore EMR RDS Glue Catalog Data Lake Reference Architecture
  50. 50. Design Patterns
  51. 51. Spark Streaming Apache Storm AWS Lambda KCL apps Amazon Redshift Amazon Redshift Hive Spark Presto Amazon Kinesis Amazon DynamoDB Amazon S3data Hot Cold Data temperature Processingspeed Fast Slow Answers Hive Native apps KCL apps AWS Lambda Amazon Athena
  52. 52. Amazon EMR Real-time Analytics Amazon Kinesis KCL app AWS Lambda Spark Streaming Amazon SNS Amazon AI Notifications Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES Alerts App state or Materialized View Real-time prediction KPI process store Amazon Kinesis Analytics Amazon S3 Log Amazon KinesisFan out
  53. 53. Interactive & Batch Analytics Amazon S3 Amazon EMR Hive Pig Spark Amazon AI process store Consume Amazon Redshift Amazon EMR Presto Spark Batch Interactive Batch prediction Real-time prediction Amazon Kinesis Firehose Amazon Athena Files Amazon Kinesis Analytics
  54. 54. Interactive & Batch Amazon S3 Amazon Redshift Amazon EMR Presto Hive Pig Spark Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES AWS Lambda Storm Spark Streaming on Amazon EMR Applications Amazon Kinesis App state or Materialized View KCL Amazon AI Real-time Amazon DynamoDB Amazon RDS Change Data Capture Transactions Stream Files Data Lake Amazon Kinesis Analytics Amazon Athena Amazon Kinesis Firehose
  55. 55. Summary Build decoupled systems • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Leverage AWS managed services • Scalable/elastic, available, reliable, secure, no/low admin Use log-centric design patterns • Immutable log, batch, interactive & real-time views Be cost-conscious • Big data ≠ big cost
  56. 56. Security & Governance IAM Amazon CloudWatch AWS CloudTrail AWS KMS AWS CloudHSM AWS Directory Service Data Catalog Amazon Athena Catalog RDS Hive Metastore EMR RDS Glue Catalog Data Lake Reference Architecture Amazon Cognito
  57. 57. Resources • https://aws.amazon.com/blogs/big-data/introducing-the-data- lake-solution-on-aws/ • AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306) • AWS re:Invent 2016: Deep Dive on Amazon S3 (STG303) • https://aws.amazon.com/blogs/big-data/reinvent-2016-aws-big- data-machine-learning-sessions/ • https://aws.amazon.com/blogs/big-data/implementing- authorization-and-auditing-using-apache-ranger-on-amazon-emr/
  58. 58. Thank you!

×