© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
WWPS EMEA Tech Business Development
Abir Roychoudhury, TechBD Database and Analytics
Data Lifecycle
Preparing Your Data for Cloud Analytics & AI/ML
Agenda
• Public Sector Situation
• Data Lifecycle Walkthrough
• Demonstration around Redshift Analytics + Machine Learning
• Customer References
• Architectural Principles
• Q&A
What do we observe in Public Sector?
• Data is dispersed and difficult to access
• Limited views on what is going on in the business
• Resource constraints limit business value activities
• Governance and compliance
What is the Big Data Challenge?
Volume
• Characteristic: Ranges from TB to PB
• Use case: Large data set required for an accurate data model
• Solution requirements: Offline processing of large data sets; transportation; extraction (key/value pairs)
Variety
• Characteristic: Different sources and formats
• Use case: Bring siloed data sources in different formats together
• Solution requirements: Consolidate disparate sources (structured, unstructured, semi-structured, at rest and in motion)
Velocity
• Characteristic: Stringent requirements on the time from when data is generated to when insights are actionable
• Use case: Stream data created at high speed and only relevant for a short period
• Solution requirements: Capturing stream data; cataloguing the data and saving it for offline use; real-time analytics and ad-hoc queries
https://aws.amazon.com/big-data/what-is-big-data/
What do we observe in Public Sector?
According to Forbes:
82% of enterprises are prioritizing analytics and BI as part of their
budgets for new technologies and cloud-based services.
Data warehouse or mart in the cloud (41%), data lake in the cloud
(39%) and BI platform in the cloud (38%) are the top three types
of technologies enterprises are planning to use.
42% are seeking to improve user experiences by automating
discovery of data insights and 26% are using AI to provide user
recommendations.
Data Lifecycle
Data Ingest
Mechanism for data movement from external sources into your data system
Questions to ask:
a) What are my data sources?
b) What is the format of the data?
c) Is the data source immutable?
d) Is it real-time or batch?
e) Where is the destination?
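One way to make the ingest questions concrete is to record the answers per data source. The following is a minimal sketch; the `IngestProfile` structure, the `ingest_mode` helper, and the two example sources are hypothetical and not part of the original deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestProfile:
    """Answers to the ingest questions for one data source (hypothetical structure)."""
    source: str        # a) What is the data source?
    data_format: str   # b) What is the format of the data?
    immutable: bool    # c) Is the data source immutable?
    real_time: bool    # d) Is it real-time or batch?
    destination: str   # e) Where is the destination?


def ingest_mode(profile: IngestProfile) -> str:
    """Return a coarse ingest mode label based on the real-time/batch answer."""
    return "streaming" if profile.real_time else "batch"


# Illustrative sources only; names and values are assumptions.
clickstream = IngestProfile("web clickstream", "JSON", True, True, "data lake")
erp_export = IngestProfile("ERP nightly export", "CSV", False, False, "data warehouse")
print(ingest_mode(clickstream), ingest_mode(erp_export))
```

Writing the answers down this way keeps the choice of ingestion service a separate, later decision.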
Data Ingestion:
Traditional Data Sources
• Media and Log Files
• ERP Systems
• Databases (SQL/NoSQL)
• Data Warehouses (EDW)
Real-time Data Sources
• IoT Sensors
• Clickstream
• Telemetry
• Business Activities
Ingestion services
• AWS Direct Connect
• AWS Snowball
• AWS Snowmobile
• AWS Database Migration Service
• AWS IoT Core
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Streams
• Amazon Kinesis Video Streams
• Amazon Managed Streaming for Kafka
Destinations
• Data Lake
• Database
• Data Warehouse
Real-time data movement and Data Lakes on AWS
[Figure: diagram with steps labelled 1), 2), 3), 4a), 4b). Producers: Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library. Services: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, AWS Glue Data Catalog (data definition), Amazon S3 (data), Data Lake on AWS.]
Single / monolithic vs. Purpose-built / micro-services
Benefits of purpose-built architectures
• Better performance
• Better scale
• More functionality
• Easier to debug
• Independence between teams
How will the data be accessed?
Access Patterns: What to use?
• Put/Get (key, value): In-memory, NoSQL
• Simple relationships (1:N, M:N): NoSQL
• Multi-table joins, transactions, SQL: SQL
• Faceting, Search: Search
• Graph traversal: GraphDB
What is the data structure?
Data Structure: What to use?
• Fixed schema: SQL, NoSQL
• Schema-free (JSON): NoSQL, Search
• Key/Value: In-memory, NoSQL
• Graph: GraphDB
• Time Interval: Time Series
• Ledger: Ledger
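The two lookups on this slide (access pattern and data structure) can be combined into a simple selection helper. This is a sketch only: the dictionaries copy the slide's entries, while the `candidate_stores` function and its intersection rule are assumptions added for illustration.

```python
# Store categories per access pattern, as listed on the slide.
ACCESS_PATTERN_TO_STORE = {
    "put/get (key, value)": ["In-memory", "NoSQL"],
    "simple relationships (1:n, m:n)": ["NoSQL"],
    "multi-table joins, transactions, sql": ["SQL"],
    "faceting, search": ["Search"],
    "graph traversal": ["GraphDB"],
}

# Store categories per data structure, as listed on the slide.
DATA_STRUCTURE_TO_STORE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (json)": ["NoSQL", "Search"],
    "key/value": ["In-memory", "NoSQL"],
    "graph": ["GraphDB"],
    "time interval": ["Time Series"],
    "ledger": ["Ledger"],
}


def candidate_stores(access_pattern: str, data_structure: str) -> list:
    """Return store categories that fit both the access pattern and the data structure.

    Taking the intersection is an illustrative assumption, not a rule from the deck.
    """
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern.lower()]
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure.lower()]
    return [store for store in by_access if store in by_structure]


print(candidate_stores("Put/Get (key, value)", "Key/Value"))
print(candidate_stores("Faceting, Search", "Schema-free (JSON)"))
```

An empty result signals that the two answers point to different store types, which is a prompt to revisit the data model rather than a final verdict.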
Database Characteristics
Services: Amazon QLDB | Amazon DynamoDB | Amazon RDS / Aurora | Amazon Timestream | Amazon Elasticsearch | Amazon Neptune | Amazon S3 + Glacier
Use Cases: Immutable ledger | Key-value with GSI/LSI indexes | OLTP, transactional | Stores and processes data by time intervals | Log analysis, reverse indexing | Graph | Data lake / file and object store
Performance: Very high performance | Ultra-high request rate, ultra-low to low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput
Shape: Ledger | K/V and document | Relational | Time series | Documents | Nodes/edges | Files
Size: TB, PB (no limits) | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB (no limits)
Cost / GB: $ | ¢¢ - $$ | $ | $ | $$ | $ | ¢- ¢4/10
VPC Support: Inside VPC | VPC Endpoint | Inside VPC | Outside or inside VPC | Inside VPC | VPC Endpoint
Data Staging
Validate, verify, and catalog the incoming raw data
Perform common housekeeping tasks
Questions to ask:
Which validation checks?
How will the raw dataset catalog be populated?
Automated Tagging of data?
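The staging questions above (validation checks, catalog population, automated tagging) can be sketched in a few lines. The record schema (`station_id`, `observed_at`, `temperature_c`), the tag names, and the catalog entry layout are all hypothetical; a real pipeline would more likely rely on a managed catalog such as AWS Glue.

```python
from datetime import datetime, timezone

# Hypothetical schema for incoming raw records.
REQUIRED_FIELDS = ("station_id", "observed_at", "temperature_c")


def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    temp = record.get("temperature_c")
    if temp is not None and not isinstance(temp, (int, float)):
        errors.append("temperature_c is not numeric")
    return errors


def stage(records: list, dataset: str) -> dict:
    """Split records into valid and rejected, and build a tagged catalog entry."""
    valid = [r for r in records if not validate_record(r)]
    rejected = [r for r in records if validate_record(r)]
    catalog_entry = {
        "dataset": dataset,
        "row_count": len(valid),
        "rejected_count": len(rejected),
        "columns": sorted({key for r in valid for key in r}),
        # Automated tags: zone name and validation timestamp (illustrative).
        "tags": {"zone": "staged", "validated_at": datetime.now(timezone.utc).isoformat()},
    }
    return {"valid": valid, "rejected": rejected, "catalog": catalog_entry}


raw = [
    {"station_id": "A1", "observed_at": "2019-01-01", "temperature_c": 3.5},
    {"station_id": "A2", "temperature_c": "n/a"},
]
result = stage(raw, "climate_raw")
print(result["catalog"]["row_count"], result["catalog"]["rejected_count"])
```

Keeping rejected records rather than dropping them leaves the raw data intact and makes the validation rules auditable.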
Data Cleansing
Transform and process data for downstream analytics
Questions to ask:
Which users and analytics will consume data?
Is there a common data model?
Optimize for reads/queries or writes?
How will data cleanup over time be performed? (compaction, etc.)
ELT/ETL
Preparing Raw, Staging, and Cleansed Data Lakes
[Figure: Data Lake on AWS with ELT/ETL steps between Raw Ingestion, Staged Datasets, and Optimized ML Datasets; the cleansed datasets are labelled as “views” of the data.]
Demonstration Setting
• Use data from AWS Open Data:
https://aws.amazon.com/opendata/
• Cornell University has created a public data lake of climate data in ORC* format
• Get the data into S3 and the AWS Glue Data Catalog
• Look at the structure
• Move the data to the Redshift data warehouse to analyse temperature development by min/max and location
• Analyse and make a basic prediction with advanced analytics using ML in SageMaker (using DeepAR forecasting)
* Redshift supports ORC and Parquet
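The two analysis steps in this demonstration (min/max temperature by location, then a DeepAR forecast) can be illustrated offline. The sketch below uses made-up observations and assumes daily readings with no gaps; it computes min/max per location and serializes each series in the JSON Lines layout that SageMaker DeepAR expects for training input (a `start` timestamp and a `target` array per line). The function names are hypothetical, and in the demo itself the aggregation would run as a Redshift query.

```python
import json
from collections import defaultdict

# Hypothetical daily observations: (location, date, temperature in degrees C).
observations = [
    ("Ithaca", "2019-01-01", -4.0),
    ("Ithaca", "2019-01-02", -1.5),
    ("Ithaca", "2019-01-03", 2.0),
    ("Geneva", "2019-01-01", 1.0),
    ("Geneva", "2019-01-02", 3.5),
]


def min_max_by_location(rows):
    """Return {location: (min_temp, max_temp)}."""
    result = {}
    for location, _, temp in rows:
        low, high = result.get(location, (temp, temp))
        result[location] = (min(low, temp), max(high, temp))
    return result


def to_deepar_lines(rows):
    """Build one JSON line per location with 'start' and 'target' fields.

    Assumes one reading per day with no missing days.
    """
    series = defaultdict(list)
    for location, date, temp in sorted(rows, key=lambda r: (r[0], r[1])):
        series[location].append((date, temp))
    lines = []
    for location in sorted(series):
        points = series[location]
        record = {"start": f"{points[0][0]} 00:00:00", "target": [t for _, t in points]}
        lines.append(json.dumps(record))
    return lines


print(min_max_by_location(observations))
for line in to_deepar_lines(observations):
    print(line)
```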
Demonstration Setting
1. Cornell Open Data provides climate data
2. Data is copied to local S3 or can be queried directly from the Cornell Data Lake
3. Glue catalogues the data, giving early insight into the data structure
4. Redshift loads data for queries on temperature by period and location
5. Data is enriched by an ML model (DeepAR) for forecasting
6. Users can query the report with QuickSight visualisation
Demonstration
Data Analytics & Visualization
Deliver to decision makers the insights to transform an organization, by identifying unmet customer needs or by optimizing operational processes
Questions to ask:
What business question is being answered?
Does the data support answering it?
Who are the users driving the insights?
What skills do those users have?
Customer References
FINRA: Migrating to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
[Figure labels: Flexible Interactive Queries; Predefined Queries; Surveillance Analytics; Web Applications; Analysts; Regulators]
Hearst’s Serverless Data Pipeline
[Figure: sites (cosmopolitan.com, caranddriver.com, sfchronicle.com, elle.com), Ingestion proxy (Node.js), Serverless data pipeline, Real-time analysis, Offline analysis and archive]
Common Types of Data Analytics
1. Process high-variety or high-volume structured or unstructured datasets
• Big Data Processing
2. Power business users to drive insights
• Data Warehousing
3. Interactively query and explore datasets
• Ad Hoc Querying
4. Analyze what’s happening now
• Streaming Analytics
5. Drive operational and security understanding
• Log Analysis
Which Analytics Should I Use? PROCESS / ANALYZE
Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, 1-minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, AWS Lambda, etc.
Predictive
• Takes milliseconds (real-time) to hours (batch)
• Example: Fraud detection, forecasting demand, speech recognition
• Amazon SageMaker, Polly, Rekognition, Transcribe, Translate, Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK and Caffe)
[Figure: Batch, Interactive, Predictive, and Stream services (Amazon EMR, Presto, Amazon Redshift & Spectrum, Amazon Athena, Amazon ES, Amazon ML, KCL apps, AWS Lambda, Amazon Kinesis Analytics) placed on a slow-to-fast axis]
Which Analytics Tool Should I Use?
Tools: Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case: Optimized for data warehousing | Query S3 data from Redshift | Interactive queries over S3 data | Presto: interactive query; Spark: general purpose; Hive: batch
Scale/Throughput: ~Nodes | ~Nodes | Automatic | ~Nodes
Managed Service: Yes | Yes | Yes, serverless | Yes
Storage: Local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS
Optimization: Columnar storage, data compression, and zone maps | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | Framework dependent
Metadata: Redshift Catalog | Glue Catalog | Glue Catalog | Glue Catalog or Hive Meta-store
Auth/Access controls: IAM, users, groups, and access controls | IAM, users, groups, and access controls | IAM | IAM, LDAP & Kerberos
UDF support: Yes (Scalar) | Yes (Scalar) | No | Yes
Which Stream Processing Technology Should I Use?
Technologies: Amazon EMR (Spark Streaming) | KCL Application | Amazon Kinesis Analytics | AWS Lambda
Managed Service: Yes | No (EC2 + Auto Scaling) | Yes | Yes
Serverless: No | No | Yes | Yes
Scale / Throughput: No limits / ~nodes | No limits / ~nodes | No limits / automatic | No limits / automatic
Availability: Single AZ | Multi-AZ | Multi-AZ | Multi-AZ
Programming Languages: Java, Python, Scala | Java, others via MultiLangDaemon | ANSI SQL or Java/Flink | Node.js, Java, Python, .Net Core
Sliding Window Functions: Built-in | App needs to implement | Built-in | No
Reliability: KCL and Spark checkpoints | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda
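Where the comparison says the application needs to implement sliding window functions itself, the logic is small. The following is a minimal sketch of a time-based sliding-window sum; the class name is hypothetical, it is not tied to the KCL API, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowSum:
    """Sum of values whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0

    def add(self, timestamp: int, value: int) -> int:
        """Record an event and return the current window sum.

        Assumes timestamps are non-decreasing across calls.
        """
        self._events.append((timestamp, value))
        self._total += value
        cutoff = timestamp - self.window_seconds
        # Evict events that are at or before the start of the window.
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total


window = SlidingWindowSum(window_seconds=60)
sums = [window.add(0, 1), window.add(30, 2), window.add(61, 4), window.add(100, 1)]
print(sums)
```

Out-of-order events and checkpointing of the window state are deliberately left out; those are the parts the managed options in the table handle for you.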
AWS Lake Formation
Build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
Data Archiving
Makes the archival process easy
to manage, and allows you to
focus on the storage of your
data, rather than the
management of your tape
systems and library.
Securing, Protecting and Managing Data
• Access policy options and AWS IAM (resource-based and user-based policies)
• Data encryption with Amazon S3 and AWS KMS
• S3 protects against corruption, loss, and accidental overwrites, modifications, or deletions
• Managing data with object tagging
• S3 compliance certifications include PCI-DSS, SOC 1/2/3, HIPAA/HITECH, FedRAMP, SEC Rule 17, FISMA, EU Data Protection Directive
https://docs.aws.amazon.com/en_pv/whitepapers/latest/building-data-lakes/securing-protecting-managing-data
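As an illustration of a resource-based policy combined with KMS encryption, the sketch below builds an S3 bucket policy document that denies uploads not requesting SSE-KMS. The bucket name and the helper function are hypothetical, and the policy is only constructed as JSON here, not applied; whether such a policy suits a given data lake depends on how default bucket encryption is configured.

```python
import json


def deny_unencrypted_uploads_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies PutObject requests without the SSE-KMS header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutKmsEncryption",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            }
        ],
    }


# "example-data-lake-raw" is a placeholder bucket name.
policy = deny_unencrypted_uploads_policy("example-data-lake-raw")
print(json.dumps(policy, indent=2))
```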
Architectural Principles
1. Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
2. Use the right tool for the job
• Data structure, latency, throughput, access patterns
3. Leverage managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
4. Use event-journal design patterns
• Immutable datasets (data lake), materialized views
5. Be cost-conscious
• Big data ≠ big cost
6. Enable your applications with Machine Learning (ML)
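Principle 4 (event-journal design with immutable datasets and materialized views) can be shown in miniature. In this sketch the journal is an append-only tuple and the view is recomputed from it; the `Event` shape, the sensor names, and the function names are hypothetical.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One immutable fact recorded in the journal (hypothetical shape)."""
    sensor_id: str
    reading: float


def append(journal: tuple, event: Event) -> tuple:
    """Return a new journal with the event added; existing entries are never modified."""
    return journal + (event,)


def latest_reading_view(journal: tuple) -> dict:
    """Materialized view: the most recent reading per sensor, rebuilt from the journal."""
    view = {}
    for event in journal:
        view[event.sensor_id] = event.reading
    return view


journal = ()
journal = append(journal, Event("s1", 20.5))
journal = append(journal, Event("s2", 18.0))
journal = append(journal, Event("s1", 21.0))
print(latest_reading_view(journal))
```

Because the journal is never rewritten, a new view (for example, an average per sensor) can be added later by replaying the same events.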
Thank you
& Questions

Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Preparing Your Data for Cloud Analytics & AI/ML

  • 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. WWPS EMEA Tech Business Development. Abir Roychoudhury, TechBD Database and Analytics. Data Lifecycle: Preparing Your Data for Cloud Analytics & AI/ML
  • 2. Agenda
      • Public Sector Situation
      • Data Lifecycle Walkthrough
      • Demonstration around Redshift Analytics + Machine Learning
      • Customer References
      • Architectural Principles
      • Q&A
  • 3. What do we observe in Public Sector?
      • Data is dispersed and difficult to access
      • Limited views on what is going on in the business
      • Resource constraints limit business value activities
      • Governance and compliance
  • 4. What is the Big Data Challenge?
      Challenge: Volume
        Characteristic: Ranges from TB to PB
        Use case: Large data set required for accurate data model
        Solution requirements: Offline processing of large data set; Transportation; Extraction (key/value pairs)
      Challenge: Variety
        Characteristic: Different sources and formats
        Use case: Bring siloed data sources in different formats together
        Solution requirements: Consolidate disparate sources (structured, unstructured, semi-structured, at rest and in motion)
      Challenge: Velocity
        Characteristic: Stringent requirements on the time from data generation to actionable insights
        Use case: Stream data created at high speed, only relevant for a short period
        Solution requirements: Capturing stream data; Cataloguing the data, save for offline; Real-time analytics, ad-hoc queries
      Source: https://aws.amazon.com/big-data/what-is-big-data/
  • 5. What do we observe in Public Sector? According to Forbes:
      • 82% of enterprises are prioritizing analytics and BI as part of their budgets for new technologies and cloud-based services.
      • Data warehouse or mart in the cloud (41%), data lake in the cloud (39%) and BI platform in the cloud (38%) are the top three types of technologies enterprises are planning to use.
      • 42% are seeking to improve user experiences by automating discovery of data insights and 26% are using AI to provide user recommendations.
  • 6. Data Lifecycle
  • 7. Data Ingest
      Mechanism for data movement from external sources into your data system.
      Questions to ask:
      a) What are my data sources?
      b) What is the format of the data?
      c) Is the data source immutable?
      d) Is it real-time or batch?
      e) Where is the destination?
  • 8. Data Ingestion
      Traditional data sources: Media and Log Files; ERP Systems; Databases (SQL/NoSQL); Data Warehouses (EDW)
      Real-time data sources: IoT Sensors; Clickstream; Telemetry; Business Activities
      Ingestion services: AWS Direct Connect; AWS Snowball; AWS Snowmobile; AWS Database Migration Service; AWS IoT Core; Amazon Kinesis Data Firehose; Amazon Kinesis Data Streams; Amazon Kinesis Video Streams; Amazon Managed Streaming for Kafka
      Destinations: Data Lake; Database; Data Warehouse
  • 9. Real-time data movement and Data Lakes on AWS
      Producers: Kinesis Agent; Apache Kafka; AWS SDK; LOG4J; Flume; Fluentd; AWS Mobile SDK; Kinesis Producer Library
      Streaming services: Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose
      Data Lake on AWS: Amazon S3 (data); AWS Glue Data Catalog (data definition)
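As an illustration of the producer side of slide 9, the sketch below sends one JSON event to a Kinesis Data Firehose delivery stream. It is a minimal example under assumptions, not the presenter's code: the stream name `climate-events`, the event fields, and the `RecordingClient` stand-in are hypothetical. Only the `put_record(DeliveryStreamName=..., Record={"Data": ...})` call shape follows the boto3 Firehose client.

```python
import json
from typing import Any, Dict, List, Protocol, Tuple


class FirehoseLike(Protocol):
    """Anything that exposes put_record the way the boto3 Firehose client does."""

    def put_record(self, *, DeliveryStreamName: str, Record: Dict[str, bytes]) -> Dict[str, Any]: ...


def encode_event(event: Dict[str, Any]) -> bytes:
    """Serialise one event as a newline-terminated JSON line (one object per line in S3)."""
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


def send_event(client: FirehoseLike, stream_name: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Hand a single event to the delivery stream; Firehose then delivers it to its destination."""
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_event(event)},
    )


class RecordingClient:
    """Offline stand-in (hypothetical) so the sketch can be tried without AWS credentials."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, bytes]]] = []

    def put_record(self, *, DeliveryStreamName: str, Record: Dict[str, bytes]) -> Dict[str, Any]:
        self.calls.append((DeliveryStreamName, Record))
        return {"RecordId": str(len(self.calls))}
```

With real credentials, `boto3.client("firehose")` could be passed in place of `RecordingClient()`; injecting the client keeps the encoding logic testable offline.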
  • 10. Single / monolithic VS Purpose-built / micro-services
  • 11. Benefits of purpose-built architectures
      • Better performance
      • Better scale
      • More functionality
      • Easier to debug
      • Independence between teams
  • 12. What is the data structure? How will the data be accessed?
      Data structure: what to use
        Fixed schema: SQL, NoSQL
        Schema-free (JSON): NoSQL, Search
        Key/Value: In-memory, NoSQL
        Graph: GraphDB
        Time Interval: Time Series
        Ledger: Ledger
      Access pattern: what to use
        Put/Get (key, value): In-memory, NoSQL
        Simple relationships (1:N, M:N): NoSQL
        Multi-table joins, transaction, SQL: SQL
        Faceting, Search: Search
        Graph traversal: GraphDB
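The two lookup tables on slide 12 can be written down as plain dictionaries. The sketch below copies the slide's entries and adds one assumption of its own: that a reasonable shortlist is the intersection of what the data structure and the access pattern each suggest. The function name `candidate_stores` is hypothetical.

```python
from typing import Dict, List

# Entries copied from the slide; keys are lower-cased for lookup.
DATA_STRUCTURE_TO_STORE: Dict[str, List[str]] = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (json)": ["NoSQL", "Search"],
    "key/value": ["In-memory", "NoSQL"],
    "graph": ["GraphDB"],
    "time interval": ["Time Series"],
    "ledger": ["Ledger"],
}

ACCESS_PATTERN_TO_STORE: Dict[str, List[str]] = {
    "put/get (key, value)": ["In-memory", "NoSQL"],
    "simple relationships (1:n, m:n)": ["NoSQL"],
    "multi-table joins, transaction, sql": ["SQL"],
    "faceting, search": ["Search"],
    "graph traversal": ["GraphDB"],
}


def candidate_stores(data_structure: str, access_pattern: str) -> List[str]:
    """Return store categories suggested by both tables (intersection is an assumption)."""
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure.lower()]
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern.lower()]
    return [store for store in by_structure if store in by_access]
```

An empty result, for example graph-shaped data combined with faceted search, signals that the two questions point at different purpose-built stores rather than at one.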
  • 13. Database Characteristics
      Services: Amazon QLDB; Amazon DynamoDB; Amazon RDS / Aurora; Amazon Timestream; Amazon Elasticsearch; Amazon Neptune; Amazon S3 + Glacier
      Use cases: Immutable Ledger; Key Value with GSI/LSI Indexes; OLTP, Transactional; Stores and processes this data by time intervals; Log Analysis, Reverse Indexing; Graph; Data Lake / File and Object store
      Performance: Very High Performance; Ultra High request rate, Ultra low to low latency; Very high request rate, low latency; High request rate, low latency; Medium request rate, low latency; Medium request rate, low latency; High Throughput
      Shape: Ledger; K/V and Document; Relational; Time Series; Documents; Node/Edges; Files
      Size: TB, PB (no limits); GB, Mid TB; GB, Low TB; GB, TB; GB, Mid TB; GB, TB, PB, EB (no limits)
      Cost / GB: $ ¢¢ - $$ $ $ $$ $ ¢- ¢4/10
      VPC Support: Inside VPC; VPC Endpoint; Inside VPC; Outside or Inside VPC; Inside VPC; VPC Endpoint
  • 14. Data Staging
      Validate, verify, and catalog the incoming raw data. Perform common housekeeping tasks.
      Questions to ask:
      • Which validation checks?
      • How will the raw dataset catalog be populated?
      • Automated tagging of data?
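The staging questions on slide 14 (validation checks, catalog population, automated tagging) can be made concrete with a small sketch. Everything named here is an assumption for illustration: the schema in `EXPECTED_SCHEMA`, the field names, the tag keys, and the shape of the catalog entry. In the deck the catalog role is played by the AWS Glue Data Catalog, which this plain-Python example does not call.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema for an incoming raw climate record.
EXPECTED_SCHEMA = {"station_id": str, "date": str, "temp_c": float}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return problems


def stage(
    records: List[Dict[str, Any]], dataset_name: str
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]], Dict[str, Any]]:
    """Split records into valid and rejected, then describe the valid set in a catalog entry."""
    valid, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    catalog_entry = {
        "name": dataset_name,
        "columns": {name: t.__name__ for name, t in EXPECTED_SCHEMA.items()},
        "row_count": len(valid),
        # Automated tagging: tags are derived from the staging outcome, not typed by hand.
        "tags": {"stage": "raw", "has_rejects": str(bool(rejected)).lower()},
    }
    return valid, rejected, catalog_entry
```

Keeping rejected records together with their reasons, rather than dropping them, leaves room for the housekeeping step to report or repair them later.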
  • 15. Data Cleansing
      Transform and process data for downstream analytics.
      Questions to ask:
      • Which users and analytics will consume data?
      • Is there a common data model?
      • Optimize for reads/queries or writes?
      • How will data cleanup over time be performed? (compaction, etc.)
  • 16. Preparing Raw, Staging, and Cleansed Data Lakes
      Diagram: within the Data Lake on AWS, ELT/ETL steps move data from Raw Ingestion to Staged Datasets to Optimized ML Datasets (cleansed “views” of the data).
  • 17. Demonstration Setting
      • Use data from AWS Open Data: https://aws.amazon.com/opendata/
      • Cornell University has created a public data lake of climate data in ORC* format
      • Get data into S3 and the AWS Glue Data Catalog
      • Look at the structure
      • Move to the Redshift data warehouse; analyse temperature development by min/max and location
      • Analyse and make a basic prediction with advanced analytics using ML in SageMaker (DeepAR forecasting)
      *Redshift supports ORC and Parquet
  • 18. Demonstration Setting
      • Cornell Open Data provides climate data
      • Data is copied to local S3 or can be queried directly from the Cornell Data Lake
      • Glue catalogues the data, giving early insight into the data structure
      • Redshift loads data for queries on temperature by period and location
      • Data is enriched by an ML model (DeepAR) for forecasting
      • User can query the report with QuickSight visualisation
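The query step of the demonstration (temperature by min/max, period and location) comes down to a grouped aggregate. The sketch below is not the demo itself: it swaps Redshift for an in-memory SQLite database so it runs anywhere, and the table name `observations`, the column names, and the sample rows are all made up for illustration.

```python
import sqlite3
from typing import List, Tuple

# Made-up sample rows: (location, year, temperature in degrees Celsius).
SAMPLE_ROWS = [
    ("north", 2018, -5.0),
    ("north", 2018, 27.5),
    ("north", 2019, -8.0),
    ("north", 2019, 30.0),
    ("south", 2019, -3.5),
    ("south", 2019, 29.0),
]

# Minimum and maximum temperature per location and period.
QUERY = """
SELECT location, year, MIN(temp_c) AS min_temp_c, MAX(temp_c) AS max_temp_c
FROM observations
GROUP BY location, year
ORDER BY location, year
"""


def temperature_range(rows: List[Tuple[str, int, float]]) -> List[Tuple[str, int, float, float]]:
    """Load rows into a throwaway table and return min/max temperature per group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE observations (location TEXT, year INTEGER, temp_c REAL)")
        conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

The SELECT uses only standard SQL aggregates, so the same statement shape should carry over to a warehouse table once the catalogued data has been loaded.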
  • 19. Demonstration
  • 20. Data Analytics & Visualization
      Deliver to decision makers the insights to transform an organization, by identifying unmet customer needs or by optimizing operational processes.
      Questions to ask:
      • What business question is being answered?
      • Does the data support answering it?
      • Who are the users driving the insights?
      • What skills do those users have?
  • 21. Customer References
  • 22. FINRA: Migrating to AWS
      • Petabytes of data generated on-premises, brought to AWS, and stored in S3
      • Thousands of analytical queries performed on EMR and Amazon Redshift
      • Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing
      Diagram labels: Flexible Interactive Queries; Predefined Queries; Surveillance Analytics; Web Applications; Analysts; Regulators
  • 23. Hearst’s Serverless Data Pipeline
      Diagram labels: cosmopolitan.com, caranddriver.com, sfchronicle.com, elle.com; Ingestion proxy (Node.js); Serverless data pipeline; Offline analysis and archive; Real-time analysis
  • 24. Common Types of Data Analytics
      1. Process high variety or volume structured or unstructured datasets
         • Big Data Processing
      2. Power Business Users to drive Insights
         • Data Warehousing
      3. Interactively query and explore datasets
         • Ad Hoc Querying
      4. Analyze what’s happening now
         • Streaming Analytics
      5. Drive operational and security understanding
         • Log Analysis
  • 25. Which Analytics Should I Use? (PROCESS / ANALYZE)
      Batch
        Takes minutes to hours
        Example: Daily/weekly/monthly reports
        Amazon EMR (MapReduce, Hive, Pig, Spark)
      Interactive
        Takes seconds
        Example: Self-service dashboards
        Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
      Stream
        Takes milliseconds to seconds
        Example: Fraud alerts, 1 minute metrics
        Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, AWS Lambda, etc.
      Predictive
        Takes milliseconds (real-time) to hours (batch)
        Example: Fraud detection, Forecasting demand, Speech recognition
        Amazon SageMaker, Polly, Rekognition, Transcribe, Translate, Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK and Caffe)
      Diagram: services arranged from slow to fast across Batch, Interactive, Predictive and Stream (Amazon Redshift & Spectrum, Amazon Athena, Amazon ES, Presto, Amazon EMR, Amazon ML, KCL Apps, AWS Lambda, Amazon Kinesis Analytics)
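The four processing styles on slide 25 can be kept as a small lookup structure, which makes it easy to ask which styles a given service appears under. The latency, example and service strings below are copied from the slide; the names `ProcessingStyle`, `STYLES` and `styles_using` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProcessingStyle:
    latency: str
    example: str
    services: Tuple[str, ...]


STYLES: Dict[str, ProcessingStyle] = {
    "batch": ProcessingStyle(
        "minutes to hours",
        "Daily/weekly/monthly reports",
        ("Amazon EMR (MapReduce, Hive, Pig, Spark)",),
    ),
    "interactive": ProcessingStyle(
        "seconds",
        "Self-service dashboards",
        ("Amazon Redshift", "Amazon Athena", "Amazon EMR (Presto, Spark)"),
    ),
    "stream": ProcessingStyle(
        "milliseconds to seconds",
        "Fraud alerts, 1 minute metrics",
        ("Amazon EMR (Spark Streaming)", "Amazon Kinesis Analytics", "KCL", "AWS Lambda"),
    ),
    "predictive": ProcessingStyle(
        "milliseconds (real-time) to hours (batch)",
        "Fraud detection, Forecasting demand, Speech recognition",
        ("Amazon SageMaker", "Polly", "Rekognition", "Transcribe", "Translate",
         "Amazon EMR (Spark ML)", "Deep Learning AMI"),
    ),
}


def styles_using(service_keyword: str) -> List[str]:
    """List the processing styles whose services mention the keyword (case-insensitive)."""
    keyword = service_keyword.lower()
    return [
        name
        for name, style in STYLES.items()
        if any(keyword in service.lower() for service in style.services)
    ]
```

Searching for `"EMR"` returns all four styles, which matches the slide's picture of Amazon EMR as the general-purpose option, while `"Athena"` appears only under interactive.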
  • 26. Which Analytics Tool Should I Use?
      Amazon Redshift
        Use case: Optimized for data warehousing
        Scale/Throughput: ~Nodes
        Managed service: Yes
        Storage: Local storage
        Optimization: Columnar storage, data compression, and zone maps
        Metadata: Redshift Catalog
        Auth/Access controls: IAM, users, groups, and access controls
        UDF support: Yes (Scalar)
      Amazon Redshift Spectrum
        Use case: Query S3 data from Redshift
        Scale/Throughput: ~Nodes
        Managed service: Yes
        Storage: Amazon S3
        Optimization: AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc.
        Metadata: Glue Catalog
        Auth/Access controls: IAM, users, groups, and access controls
        UDF support: Yes (Scalar)
      Amazon Athena
        Use case: Interactive queries over S3 data
        Scale/Throughput: Automatic
        Managed service: Yes, Serverless
        Storage: Amazon S3
        Optimization: AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc.
        Metadata: Glue Catalog
        Auth/Access controls: IAM
        UDF support: No
      Amazon EMR (Presto, Spark, Hive)
        Use case: Interactive query (Presto), general purpose (Spark), batch (Hive)
        Scale/Throughput: ~Nodes
        Managed service: Yes
        Storage: Amazon S3, HDFS
        Optimization: Framework dependent
        Metadata: Glue Catalog or Hive Meta-store
        Auth/Access controls: IAM, LDAP & Kerberos
        UDF support: Yes
  • 27. Which Stream Processing Technology Should I Use?
      Amazon EMR (Spark Streaming)
        Managed service: Yes
        Serverless: No
        Scale/Throughput: No limits / ~nodes
        Availability: Single AZ
        Programming languages: Java, Python, Scala
        Sliding window functions: Built-in
        Reliability: KCL and Spark checkpoints
      KCL Application
        Managed service: No (EC2 + Auto Scaling)
        Serverless: No
        Scale/Throughput: No limits / ~nodes
        Availability: Multi-AZ
        Programming languages: Java, others via MultiLangDaemon
        Sliding window functions: App needs to implement
        Reliability: Managed by KCL
      Amazon Kinesis Analytics
        Managed service: Yes
        Serverless: Yes
        Scale/Throughput: No limits / automatic
        Availability: Multi-AZ
        Programming languages: ANSI SQL or Java/Flink
        Sliding window functions: Built-in
        Reliability: Managed by Amazon Kinesis Analytics
      AWS Lambda
        Managed service: Yes
        Serverless: Yes
        Scale/Throughput: No limits / automatic
        Availability: Multi-AZ
        Programming languages: Node.js, Java, Python, .NET Core
        Sliding window functions: No
        Reliability: Managed by AWS Lambda
  • 28. AWS Lake Formation: Build a secure data lake in days
      • Identify, ingest, clean, and transform data
      • Enforce security policies across multiple services
      • Gain and manage new insights
  • 29. Data Archiving
      Makes the archival process easy to manage, and allows you to focus on the storage of your data, rather than the management of your tape systems and library.
  • 30. Securing, Protecting and Managing Data
      • Access policy options and AWS IAM (resource-based and user-based policies)
      • Data encryption with Amazon S3 and AWS KMS
      • S3 protects against corruption, loss and accidental overwrites, modifications or deletions
      • Managing data with object tagging
      • S3 compliance coverage includes PCI-DSS, SOC 1/2/3, HIPAA/HITECH, FedRAMP, SEC Rule 17, FISMA, EU Data Protection Directive
      Source: https://docs.aws.amazon.com/en_pv/whitepapers/latest/building-data-lakes/securing-protecting-managing-data
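Two of the points on slide 30, encryption with AWS KMS and object tagging, meet in a single S3 upload request. The sketch below only builds the keyword arguments for boto3's `put_object`, so it runs without AWS access. The bucket, key, KMS key alias and tag names are hypothetical; the parameter names (`ServerSideEncryption`, `SSEKMSKeyId`, `Tagging`) are those of the boto3 S3 client.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def encrypted_put_kwargs(
    bucket: str, key: str, body: bytes, kms_key_id: str, tags: Dict[str, str]
) -> Dict[str, Any]:
    """Build put_object arguments that request SSE-KMS encryption and attach object tags."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # encrypt at rest with a KMS-managed key
        "SSEKMSKeyId": kms_key_id,
        "Tagging": urlencode(tags),  # S3 expects tags as a URL-encoded query string
    }
```

With credentials in place, the result could be passed as `boto3.client("s3").put_object(**kwargs)`. Tags set at upload time can later drive access policies or lifecycle rules.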
  • 31. Architectural Principles
      1. Build decoupled systems
         • Data → Store → Process → Store → Analyze → Answers
      2. Use the right tool for the job
         • Data structure, latency, throughput, access patterns
      3. Leverage managed and serverless services
         • Scalable/elastic, available, reliable, secure, no/low admin
      4. Use event-journal design patterns
         • Immutable datasets (data lake), materialized views
      5. Be cost-conscious
         • Big data ≠ big cost
      6. ML-enable your applications with Machine Learning (ML)
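Principle 4 (event-journal design patterns: immutable datasets plus materialized views) can be sketched in a few lines. This is a toy illustration under assumptions, not an AWS API: `Reading`, `Journal` and `max_temp_view` are hypothetical names, and the in-memory list stands in for an append-only dataset in the data lake.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Reading:
    """One immutable event; frozen so it cannot be changed after it is recorded."""
    station: str
    temp_c: float


class Journal:
    """Append-only event log: events are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Reading] = []

    def append(self, event: Reading) -> None:
        self._events.append(event)

    def replay(self) -> Tuple[Reading, ...]:
        # A tuple copy, so callers cannot mutate the journal through the result.
        return tuple(self._events)


def max_temp_view(journal: Journal) -> Dict[str, float]:
    """Materialized view: highest temperature per station, rebuilt by replaying the journal."""
    view: Dict[str, float] = {}
    for event in journal.replay():
        view[event.station] = max(event.temp_c, view.get(event.station, event.temp_c))
    return view
```

Because the view is derived entirely from the journal, it can be discarded and rebuilt at any time, and new views can be added later without touching the stored events.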
  • 32. Thank you & Questions