SlideShare una empresa de Scribd logo
1 de 46
Descargar para leer sin conexión
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
2019.06.18
U T R E C H T
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building a modern data platform
on AWS
Javier Ramirez
AWS Tech Evangelist
@supercoco9
ANT001
2019.06.18
UTRECHT
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Solution
My reports make
my database
server very slow
Before 2009
The DBA years
Overnight DB dump
Read-only replica
My data doesn’t fit in
one machine
And it’s not only
transactional
2009-2011
The Hadoop epiphany
Hadoop
Map/Reduce all the
things
My data is very
fast
Map/Reduce is
hard to use
2012-2014
The Message Broker
and NoSQL Age
Kafka/RabbitMQ
Cassandra/HBASE
/STORM
Basic ETL
Hive
Duplicating batch/stream is inefficient
I need to cleanse my source data
Hadoop ecosystem is hard to manage
My data scientists don’t like JAVA
I am not sure which data we are
already processing
2015-2017
The Spark kingdom and
the spreadsheet wars
Kafka/Spark
Complex ETL
Create new departments for data
governance
Spreadsheet all the things
Streaming is hard
My schemas have evolved
I cannot query old and new
data together
My cluster is running old
versions. Upgrading is hard
I want to use ML
2017-2018
The myth of DataOps
Kafka/Flink (JAVA or Scala
required)
Complex ETL with a pinch of
ML
Apache Atlas
Commercial distributions
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Some problems during all periods
• My team spends more time maintaining the cluster than adding functionality
• Security and monitoring are hard
• Most of my time my cluster is sitting idle; Then it’s a bottleneck
• I don’t have the time to experiment
• Data preparation, cleansing, and basic transformations take a
disproportionally high amount of my time. And it’s so frustrating
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Some simple things that scare me (and eat my productivity)
• Text encodings
• Empty strings. Literal ”NULL” strings
• Uppercase and Lowercase
• Date and time formats: which date would you say this is 1/4/19? And this? 1553589297
• CSV, especially if uploaded by end users
• JSON files with a single array and 200.000 records inside
• The same JSON file when row 176.543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The downfall of the data engineer
Watching paint dry is exciting in comparison to writing and maintaining Extract
Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors
or issues tend to happen at runtime or are post-runtime assertions. Since the
development time to execution time ratio is typically low, being productive means
juggling with multiple pipelines at once and inherently doing a lot of context
switching. By the time one of your 5 running “big data jobs” has finished, you have to
get back in the mind space you were in many hours ago and craft your next iteration.
Depending on how caffeinated you are, how long it’s been since the last iteration, and
how systematic you are, you may fail at restoring the full context in your short term
memory. This leads to systemic, stupid errors that waste hours.
“
”Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset.
Ex-Facebook, Ex-Yahoo!, Ex-Airbnb
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale.
Modern data analytics 101
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A good data lake allows self-service and can
easily plug-in new analytical engines.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
It’s all about democratizing access to your data
across the organization.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The concept of a Data Lake
• All data in one place, a single source of truth
• Handles structured/semi-structured/unstructured/raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A Possible Open Source solution
• Hadoop Cluster (static/multi tenant)
• Apache NiFi for ingestion workflows
• Sqoop to ingest data from RDBMS
• HDFS to store the data (tied to the Hadoop cluster)
• Hive/HCatalog for data Catalog
• Apache Atlas for a more human data catalog and governance
• Apache Spark for complex ETL –with Apache Livy for REST
• Hive for batch workloads with SQL
• Presto for interactive queries with SQL
• Kafka for streaming ingest
• Apache Spark/Apache Flink for streaming analytics
• Apache Hbase (or maybe Cassandra) to store streaming data
• Apache Phoenix to run SQL queries on top of Hbase
• Prometheus (or fluentd/collectd/ganglia/Nagios…) for logs and monitoring. Maybe with Elastic Search/Kibana
• Airflow/Oozie to schedule workflows
• Superset for business dashboards
• Jupyter/JupyterHub/Zeppelin for data science
• Security (Apache Sentry for Roles, Ranger for configuration, Knox as a firewall)
• YARN to coordinate resources
• Ambari for cluster administration
• Terraform/chef/puppet for provisioning
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Or a cloud native Solution on AWS
Amazon
DynamoDB
Amazon Elasticsearch
Service
AWS
AppSync
Amazon
API Gateway
Amazon
Cognito
AWS
KMS
AWS
CloudTrail
AWS
IAM
Amazon
CloudWatch
AWS
Snowball
AWS Storage
Gateway
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
AWS Database
Migration
Service
Amazon
Athena
Amazon
EMR
AWS
Glue
Amazon
Redshift
Amazon
DynamoDB
Amazon
QuickSight
Amazon
Kinesis
Amazon
Elasticsearch
Service
Amazon
Neptune
Amazon
RDS
AWS
Glue
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
More data lakes & analytics on AWS than anywhere else
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Movement From On-premises Datacenters
AWS Snowball,
Snowball Edge and
Snowmobile
Petabyte and Exabyte-
scale data transport
solution that uses secure
appliances to transfer
large amounts of data
into and out of the AWS
cloud
AWS Direct Connect
Establish a dedicated
network connection from
your premises to AWS;
reduces your network
costs, increase bandwidth
throughput, and provide a
more consistent network
experience than Internet-
based connections
AWS Storage
Gateway
Lets your on-premises
applications to use AWS
for storage; includes a
highly-optimized data
transfer mechanism,
bandwidth management,
along with local cache
AWS Database
Migration Service
Migrate database from
the most widely-used
commercial and open-
source offerings to AWS
quickly and securely with
minimal downtime to
applications
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Movement From Real-time Sources
Amazon Kinesis
Video Streams
Securely stream video
from connected devices
to AWS for analytics,
machine learning (ML),
and other processing
Amazon Kinesis Data
Firehose
Capture, transform, and
load data streams into
AWS data stores for near
real-time analytics with
existing business
intelligence tools.
Amazon Kinesis Data
Streams
Build custom, real-time
applications that process
data streams using
popular stream
processing frameworks
AWS IoT Core
Supports billions of
devices and trillions of
messages, and can
process and route those
messages to AWS
endpoints and to other
devices reliably and
securely
Managed Streaming
For Kafka
Fully managed open-
source platform for
building real-time
streaming data pipelines
and applications.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon S3—Object Storage
Security and
Compliance
Three different forms of
encryption; encrypts data
in transit when
replicating across regions;
log and monitor with
CloudTrail, use ML to
discover and protect
sensitive data with Macie
Flexible Management
Classify, report, and
visualize data usage
trends; objects can be
tagged to see storage
consumption, cost, and
security; build lifecycle
policies to automate
tiering, and retention
Durability, Availability
& Scalability
Built for eleven nine’s of
durability; data
distributed across 3
physical facilities in an
AWS region;
automatically replicated
to any other AWS region
Query in Place
Run analytics & ML on
data lake without data
movement; S3 Select can
retrieve subset of data,
improving analytics
performance by 400%
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Glacier—Backup and Archive
Durability, Availability
& Scalability
Built for eleven nine’s of
durability; data
distributed across 3
physical facilities in an
AWS region;
automatically replicated
to any other AWS region
Secure
Log and monitor with
CloudTrail, Vault Lock
enables WORM storage
capabilities, helping
satisfy compliance
requirements
Retrieves data in
minutes
Three retrieval options to
fit your use case;
expedited retrievals with
Glacier Select can return
data in minutes
Inexpensive
Lowest cost AWS object
storage class, allowing
you to archive large
amounts of data at a very
low cost
$
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Preparation Accounts for ~80% of the Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Use AWS Glue to cleanse, prep, and catalog
AWS Glue Data Catalog - a single view
across your data lake
Automatically discovers data and stores schema
Makes data searchable, and available for ETL
Contains table definitions and custom metadata
Use AWS Glue ETL jobs to cleanse,
transform, and store processed data
Serverless Apache Spark environment
Use Glue ETL libraries or bring your own code
Write code in Python or Scala
Call any AWS API using the AWS boto3 SDK
Amazon S3
(Raw data)
Amazon S3
(Staging
data)
Amazon S3
(Processed data)
AWS Glue Data Catalog
Crawlers Crawlers Crawlers
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch service
Amazon Kinesis
Amazon QuickSight
Analytics
Machine Learning
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS Storage Gateway
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Real-time
Data Movement
On-premises
Data Movement
Data Lake on AWS
Storage | Archival Storage | Data Catalog
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon EMR—Big Data Processing
Low cost
Flexible billing with per-
second billing, EC2 spot,
reserved instances and
auto-scaling to reduce
costs 50–80%
$
Easy
Launch fully managed
Hadoop & Spark in
minutes; no cluster
setup, node provisioning,
cluster tuning
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Use S3 storage
Process data directly in
the S3 data lake securely
with high performance
using the EMRFS
connector
Data Lake
100110000100101011100
101010111001010100000
111100101100101010001
100001
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon EMR— More than just managed Hadoop
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Redshift—Data Warehousing
Fast at scale
Columnar storage
technology to improve
I/O efficiency and scale
query performance
Secure
Audit everything; encrypt
data end-to-end;
extensive certification
and compliance
Open file formats
Analyze optimized data
formats on the latest
SSD, and all open data
formats in Amazon S3
Inexpensive
As low as $1,000 per
terabyte per year, 1/10th
the cost of traditional
data warehouse
solutions; start at $0.25
per hour
$
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
S3 data lakeRedshift data
Redshift Spectrum
query engine • Exabyte Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Avro, & Parquet data formats
• Pay only for the amount of data scanned
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Let’s play a game
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Let’s play a game
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Numbers are fun
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Numbers are fun
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query Instantly
Zero setup cost; just
point to S3 and
start querying
SQL
Open
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types,
and complex joins and
data types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with
QuickSight
Pay per query
Pay only for queries
run; save 30–90% on
per-query costs
through compression
$
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon QuickSight
easy
Empower
everyone
Seamless
connectivity
Fast analysis Serverless
Now with ML superpowers!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Provides Highest Levels of Security
Secure
Compliance
AWS Artifact
Amazon Inspector
Amazon Cloud HSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS WAF
Amazon Macie
VPC
Encryption
AWS Certification Manager
AWS Key Management
Service
Encryption at rest
Encryption in transit
Bring your own keys, HSM
support
Identity
AWS IAM
AWS SSO
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customer need to have multiple levels of security, identity and access management,
encryption, and compliance to secure their data lake
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Compliance: Virtually Every Regulatory Agency
CSA
Cloud Security
Alliance Controls
ISO 9001
Global Quality
Standard
ISO 27001
Security Management
Controls
ISO 27017
Cloud Specific
Controls
ISO 27018
Personal Data
Protection
PCI DSS Level 1
Payment Card
Standards
SOC 1
Audit Controls
Report
SOC 2
Security, Availability, &
Confidentiality Report
SOC 3
General Controls
Report
Global United States
CJIS
Criminal Justice
Information Services
DoD SRG
DoD Data
Processing
FedRAMP
Government Data
Standards
FERPA
Educational
Privacy Act
FIPS
Government Security
Standards
FISMA
Federal Information
Security Management
GxP
Quality Guidelines
and Regulations
ISO FFIEC
Financial Institutions
Regulation
HIPPA
Protected Health
Information
ITAR
International Arms
Regulations
MPAA
Protected Media
Content
NIST
National Institute of
Standards and Technology
SEC Rule 17a-4(f)
Financial Data
Standards
VPAT/Section 508
Accountability
Standards
Asia Pacific
FISC [Japan]
Financial Industry
Information Systems
IRAP [Australia]
Australian Security
Standards
K-ISMS [Korea]
Korean Information
Security
MTCS Tier 3 [Singapore]
Multi-Tier Cloud
Security Standard
My Number Act [Japan]
Personal Information
Protection
Europe
C5 [Germany]
Operational Security
Attestation
Cyber Essentials
Plus [UK]
Cyber Threat
Protection
G-Cloud [UK]
UK Government
Standards
IT-Grundschutz
[Germany]
Baseline Protection
Methodology
X P
G
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS databases and analytics
Broad and deep portfolio, built for builders
AWS Marketplace
Amazon Redshift
Data warehousing
Amazon EMR
Hadoop + Spark
Athena
Interactive analytics
Kinesis Analytics
Real-time
Amazon Elasticsearch service
Operational Analytics
RDS
MySQL, PostgreSQL, MariaDB,
Oracle, SQL Server
Aurora
MySQL, PostgreSQL
Amazon
QuickSight
Amazon
SageMaker
DynamoDB
Key value, Document
ElastiCache
Redis, Memcached
Neptune
Graph
Timestream
Time Series
QLDB
Ledger Database
S3/Amazon Glacier
AWS Glue
ETL & Data Catalog
Lake Formation
Data Lakes
Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
Data Movement
AnalyticsDatabases
Business Intelligence & Machine Learning
Data Lake
Managed
Blockchain
Blockchain
Templates
Blockchain
Amazon
Comprehend
Amazon
Rekognition
Amazon
Lex
Amazon
Transcribe
AWS DeepLens 250+ solutions
730+ Database
solutions
600+ Analytics
solutions
25+ Blockchain
solutions
20+ Data lake
solutions
30+ solutions
RDS on VMWare
CHALLENGE
Need to create constant feedback loop
for designers
Gain up-to-the-minute understanding
of gamer satisfaction to guarantee
gamers are engaged, thus resulting in
the most popular game played in the
world
Fortnite | 125+ million players
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Epic Games uses Data Lakes and analytics
Entire analytics platform running on AWS
S3 leveraged as a Data Lake
All telemetry data is collected with Kinesis
Real-time analytics done through Spark on EMR,
DynamoDB to create scoreboards and real-time queries
Use Amazon EMR for large batch data processing
Game designers use data to inform their decisions
Game
clients
Game
servers
Launcher
Game
services
N E A R R E A L T I M E P I P E L I N E
N E A R R E A L T I M E P I P E L I N E
Grafana
Scoreboards API
Limited Raw Data
(real time ad-hoc SQL)
User ETL
(metric definition)
Spark on EMR DynamoDB
NEAR REALTIME PIPELINES
BATCH PIPELINES
ETL using
EMR
Tableau/BI
Ad-hoc SQLS3
(Data Lake)
Kinesis
APIs
Databases
S3
Other
sources
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo Overview
https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-
from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Typical steps of building a data lake
Setup Storage1
Move data2
Cleanse, prep, and
catalog data
3
Configure and enforce
security and compliance
policies
4
Make data available
for analytics
5
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building data lakes can still take months
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
Build a data lake in days,
not months
Build and deploy a fully
managed data lake with a few
clicks
Enforce security policies
across multiple services
Centrally define security,
governance, and auditing policies in
one place and enforce those policies
for all users and all applications
Combine different
analytics approaches
Empower analyst and data scientist
productivity, giving them self-
service discovery and safe access to
all data from a single catalog
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How it works: AWS Lake Formation
S3
IAM KMS
OLTP
ERP
CRM
LOB
Devices
Web
Sensors
Social Kinesis
Build Data Lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit login
Enable self-service and combined analytics
• Analysts discover all data available for analysis
from a single data catalog
• Use multiple analytics tools over the same data
Athena
Amazon
Redshift
AI Services
Amazon
EMR
Amazon
QuickSight
Data
Catalog
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Customer interest in AWS Lake Formation
“We are very excited about the launch of AWS Lake
Formation, which provides a central point of control to
easily load, clean, secure, and catalog data from thousands of
clients to our AWS-based data lake, dramatically reducing
our operational load. … Additionally, AWS Lake Formation
will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare
“I can’t wait for my team to get our hands on AWS Lake
Formation. With an enterprise-ready option like Lake
Formation, we will be able to spend more time deriving
value from our data rather than doing the heavy lifting
involved in manually setting up and managing our data lake.”
- Joshua Couch, VP Engineering, Fender Digital
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thank you!
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thank you!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Javier Ramirez
AWS Tech Evangelist
@supercoco9
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Select AWS Glue customers

Más contenido relacionado

La actualidad más candente

Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETLAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL PipelinesAmazon Web Services
 
Innovation-at-Hyper-scale-Outlook-on-Emerging-Technologies
Innovation-at-Hyper-scale-Outlook-on-Emerging-TechnologiesInnovation-at-Hyper-scale-Outlook-on-Emerging-Technologies
Innovation-at-Hyper-scale-Outlook-on-Emerging-TechnologiesAmazon Web Services
 
How to Choose The Right Database on AWS - Berlin Summit - 2019
How to Choose The Right Database on AWS - Berlin Summit - 2019How to Choose The Right Database on AWS - Berlin Summit - 2019
How to Choose The Right Database on AWS - Berlin Summit - 2019Randall Hunt
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Amazon Web Services
 
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Amazon Web Services
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperVasu S
 
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS SummitScalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS SummitAmazon Web Services
 
ElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech Talks
ElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech TalksElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech Talks
ElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech TalksAmazon Web Services
 
Humans and Data Don't Mix- Best Practices to Secure Your Cloud
Humans and Data Don't Mix- Best Practices to Secure Your CloudHumans and Data Don't Mix- Best Practices to Secure Your Cloud
Humans and Data Don't Mix- Best Practices to Secure Your CloudAmazon Web Services
 

La actualidad más candente (20)

Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Innovation-at-Hyper-scale-Outlook-on-Emerging-Technologies
Innovation-at-Hyper-scale-Outlook-on-Emerging-TechnologiesInnovation-at-Hyper-scale-Outlook-on-Emerging-Technologies
Innovation-at-Hyper-scale-Outlook-on-Emerging-Technologies
 
How to Choose The Right Database on AWS - Berlin Summit - 2019
How to Choose The Right Database on AWS - Berlin Summit - 2019How to Choose The Right Database on AWS - Berlin Summit - 2019
How to Choose The Right Database on AWS - Berlin Summit - 2019
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS SummitScalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
 
ElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech Talks
ElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech TalksElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech Talks
ElastiCache: Deep Dive Best Practices and Usage Patterns - AWS Online Tech Talks
 
Humans and Data Don't Mix- Best Practices to Secure Your Cloud
Humans and Data Don't Mix- Best Practices to Secure Your CloudHumans and Data Don't Mix- Best Practices to Secure Your Cloud
Humans and Data Don't Mix- Best Practices to Secure Your Cloud
 

Similar a Building a modern data platform on AWS. Utrecht AWS Dev Day

Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
Building a Modern Data Platform in the Cloud. AWS Initiate PortugalBuilding a Modern Data Platform in the Cloud. AWS Initiate Portugal
Building a Modern Data Platform in the Cloud. AWS Initiate Portugaljavier ramirez
 
Building Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWSBuilding Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWSAWS Summits
 
The Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
The Modern Tech Stack: Data Analytics in the Cloud for Developers and FoundersThe Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
The Modern Tech Stack: Data Analytics in the Cloud for Developers and FoundersAggregage
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with ZopaAmazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...Amazon Web Services
 
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWSAWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWSSteven Hsieh
 
Migrando seus dados para nuvem: Explore as opções da nuvem AWS
Migrando seus dados para nuvem: Explore as opções da nuvem AWSMigrando seus dados para nuvem: Explore as opções da nuvem AWS
Migrando seus dados para nuvem: Explore as opções da nuvem AWSAmazon Web Services LATAM
 
Big Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeBig Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeAmazon Web Services
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...AWS Riyadh User Group
 
AWS Initiate Day Manchester 2019 – AWS Migrating Data to the Cloud
AWS Initiate Day Manchester 2019 – AWS Migrating Data to the CloudAWS Initiate Day Manchester 2019 – AWS Migrating Data to the Cloud
AWS Initiate Day Manchester 2019 – AWS Migrating Data to the CloudAmazon Web Services
 
AWS Initiate Day Dublin 2019 – Migrating Data to the Cloud
AWS Initiate Day Dublin 2019 – Migrating Data to the CloudAWS Initiate Day Dublin 2019 – Migrating Data to the Cloud
AWS Initiate Day Dublin 2019 – Migrating Data to the CloudAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
AWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWS
AWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWSAWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWS
AWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWSAmazon Web Services LATAM
 
Initiate Edinburgh 2019 - Migrating Data to the Cloud
Initiate Edinburgh 2019 - Migrating Data to the CloudInitiate Edinburgh 2019 - Migrating Data to the Cloud
Initiate Edinburgh 2019 - Migrating Data to the CloudAmazon Web Services
 
Data Driven Enterprise with Apache Kafka
Data Driven Enterprise with Apache KafkaData Driven Enterprise with Apache Kafka
Data Driven Enterprise with Apache Kafkaconfluent
 

Similar a Building a modern data platform on AWS. Utrecht AWS Dev Day (20)

Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
Building a Modern Data Platform in the Cloud. AWS Initiate PortugalBuilding a Modern Data Platform in the Cloud. AWS Initiate Portugal
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
 
Building Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWSBuilding Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWS
 
The Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
The Modern Tech Stack: Data Analytics in the Cloud for Developers and FoundersThe Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
The Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...
 
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWSAWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
 
Migrando seus dados para nuvem: Explore as opções da nuvem AWS
Migrando seus dados para nuvem: Explore as opções da nuvem AWSMigrando seus dados para nuvem: Explore as opções da nuvem AWS
Migrando seus dados para nuvem: Explore as opções da nuvem AWS
 
Big Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeBig Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_Singapore
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
 
AWS Initiate Day Manchester 2019 – AWS Migrating Data to the Cloud
AWS Initiate Day Manchester 2019 – AWS Migrating Data to the CloudAWS Initiate Day Manchester 2019 – AWS Migrating Data to the Cloud
AWS Initiate Day Manchester 2019 – AWS Migrating Data to the Cloud
 
AWS Initiate Day Dublin 2019 – Migrating Data to the Cloud
AWS Initiate Day Dublin 2019 – Migrating Data to the CloudAWS Initiate Day Dublin 2019 – Migrating Data to the Cloud
AWS Initiate Day Dublin 2019 – Migrating Data to the Cloud
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
AWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWS
AWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWSAWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWS
AWS Initiate - Migrando Dados Para a Nuvem: Explorando suas opções com AWS
 
Initiate Edinburgh 2019 - Migrating Data to the Cloud
Initiate Edinburgh 2019 - Migrating Data to the CloudInitiate Edinburgh 2019 - Migrating Data to the Cloud
Initiate Edinburgh 2019 - Migrating Data to the Cloud
 
Data Driven Enterprise with Apache Kafka
Data Driven Enterprise with Apache KafkaData Driven Enterprise with Apache Kafka
Data Driven Enterprise with Apache Kafka
 

Más de javier ramirez

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfestjavier ramirez
 
QuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databaseQuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databasejavier ramirez
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...javier ramirez
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBjavier ramirez
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)javier ramirez
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Databasejavier ramirez
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728javier ramirez
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022javier ramirez
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...javier ramirez
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragónjavier ramirez
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessjavier ramirez
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloudjavier ramirez
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMjavier ramirez
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analyticsjavier ramirez
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelinejavier ramirez
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Divejavier ramirez
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)javier ramirez
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSjavier ramirez
 

Más de javier ramirez (20)

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest
 
QuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databaseQuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series database
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDB
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Database
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragón
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverless
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloud
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analytics
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipeline
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Dive
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWS
 

Último

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 

Último (20)

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 

Building a modern data platform on AWS. Utrecht AWS Dev Day

  • 1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. 2019.06.18 U T R E C H T
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building a modern data platform on AWS Javier Ramirez AWS Tech Evangelist @supercoco9 ANT001 2019.06.18 UTRECHT
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Solution My reports make my database server very slow Before 2009 The DBA years Overnight DB dump Read-only replica My data doesn’t fit in one machine And it’s not only transactional 2009-2011 The Hadoop epiphany Hadoop Map/Reduce all the things My data is very fast Map/Reduce is hard to use 2012-2014 The Message Broker and NoSQL Age Kafka/RabbitMQ Cassandra/HBASE /STORM Basic ETL Hive Duplicating batch/stream is inefficient I need to cleanse my source data Hadoop ecosystem is hard to manage My data scientists don’t like JAVA I am not sure which data we are already processing 2015-2017 The Spark kingdom and the spreadsheet wars Kafka/Spark Complex ETL Create new departments for data governance Spreadsheet all the things Streaming is hard My schemas have evolved I cannot query old and new data together My cluster is running old versions. Upgrading is hard I want to use ML 2017-2018 The myth of DataOps Kafka/Flink (JAVA or Scala required) Complex ETL with a pinch of ML Apache Atlas Commercial distributions
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Some problems during all periods • My team spends more time maintaining the cluster than adding functionality • Security and monitoring are hard • Most of my time my cluster is sitting idle; Then it’s a bottleneck • I don’t have the time to experiment • Data preparation, cleansing, and basic transformations take a disproportionally high amount of my time. And it’s so frustrating
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Some simple things that scare me (and eat my productivity) • Text encodings • Empty strings. Literal ”NULL” strings • Uppercase and Lowercase • Date and time formats: which date would you say this is 1/4/19? And this? 1553589297 • CSV, especially if uploaded by end users • JSON files with a single array and 200.000 records inside • The same JSON file when row 176.543 has a column never seen before • The same JSON file when all the numbers are strings • XML
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The downfall of the data engineer Watching paint dry is exciting in comparison to writing and maintaining Extract Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running “big data jobs” has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short term memory. This leads to systemic, stupid errors that waste hours. “ ”Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset. Ex-Facebook, Ex-Yahoo!, Ex-Airbnb https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Modern data analytics 101
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. A good data lake allows self-service and can easily plug-in new analytical engines.
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. It’s all about democratizing access to your data across the organization.
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. The concept of a Data Lake • All data in one place, a single source of truth • Handles structured/semi-structured/unstructured/raw data • Supports fast ingestion and consumption • Schema on read • Designed for low-cost storage • Decouples storage and compute • Supports protection and security rules
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. A Possible Open Source solution • Hadoop Cluster (static/multi tenant) • Apache NiFi for ingestion workflows • Sqoop to ingest data from RDBMS • HDFS to store the data (tied to the Hadoop cluster) • Hive/HCatalog for data Catalog • Apache Atlas for a more human data catalog and governance • Apache Spark for complex ETL –with Apache Livy for REST • Hive for batch workloads with SQL • Presto for interactive queries with SQL • Kafka for streaming ingest • Apache Spark/Apache Flink for streaming analytics • Apache Hbase (or maybe Cassandra) to store streaming data • Apache Phoenix to run SQL queries on top of Hbase • Prometheus (or fluentd/collectd/ganglia/Nagios…) for logs and monitoring. Maybe with Elastic Search/Kibana • Airflow/Oozie to schedule workflows • Superset for business dashboards • Jupyter/JupyterHub/Zeppelin for data science • Security (Apache Sentry for Roles, Ranger for configuration, Knox as a firewall) • YARN to coordinate resources • Ambari for cluster administration • Terraform/chef/puppet for provisioning
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Or a cloud native Solution on AWS Amazon DynamoDB Amazon Elasticsearch Service AWS AppSync Amazon API Gateway Amazon Cognito AWS KMS AWS CloudTrail AWS IAM Amazon CloudWatch AWS Snowball AWS Storage Gateway Amazon Kinesis Data Firehose AWS Direct Connect AWS Database Migration Service Amazon Athena Amazon EMR AWS Glue Amazon Redshift Amazon DynamoDB Amazon QuickSight Amazon Kinesis Amazon Elasticsearch Service Amazon Neptune Amazon RDS AWS Glue
  • 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. More data lakes & analytics on AWS than anywhere else
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Movement From On-premises Datacenters AWS Snowball, Snowball Edge and Snowmobile Petabyte and Exabyte- scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud AWS Direct Connect Establish a dedicated network connection from your premises to AWS; reduces your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet- based connections AWS Storage Gateway Lets your on-premises applications to use AWS for storage; includes a highly-optimized data transfer mechanism, bandwidth management, along with local cache AWS Database Migration Service Migrate database from the most widely-used commercial and open- source offerings to AWS quickly and securely with minimal downtime to applications
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Movement From Real-time Sources Amazon Kinesis Video Streams Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing Amazon Kinesis Data Firehose Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. Amazon Kinesis Data Streams Build custom, real-time applications that process data streams using popular stream processing frameworks AWS IoT Core Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely Managed Streaming For Kafka Fully managed open- source platform for building real-time streaming data pipelines and applications.
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3—Object Storage Security and Compliance Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail, use ML to discover and protect sensitive data with Macie Flexible Management Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering, and retention Durability, Availability & Scalability Built for eleven nine’s of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region Query in Place Run analytics & ML on data lake without data movement; S3 Select can retrieve subset of data, improving analytics performance by 400%
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Glacier—Backup and Archive Durability, Availability & Scalability Built for eleven nine’s of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region Secure Log and monitor with CloudTrail, Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements Retrieves data in minutes Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes Inexpensive Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost $
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Preparation Accounts for ~80% of the Work Building training sets Cleaning and organizing data Collecting data sets Mining data for patterns Refining algorithms Other
  • 20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Use AWS Glue to cleanse, prep, and catalog AWS Glue Data Catalog - a single view across your data lake Automatically discovers data and stores schema Makes data searchable, and available for ETL Contains table definitions and custom metadata Use AWS Glue ETL jobs to cleanse, transform, and store processed data Serverless Apache Spark environment Use Glue ETL libraries or bring your own code Write code in Python or Scala Call any AWS API using the AWS boto3 SDK Amazon S3 (Raw data) Amazon S3 (Staging data) Amazon S3 (Processed data) AWS Glue Data Catalog Crawlers Crawlers Crawlers
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lakes, Analytics, and ML Portfolio from AWS Broadest, deepest set of analytic services Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch service Amazon Kinesis Amazon QuickSight Analytics Machine Learning AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS Storage Gateway AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR—Big Data Processing Low cost Flexible billing with per- second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80% $ Easy Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning Latest versions Updated with the latest open source frameworks within 30 days of release Use S3 storage Process data directly in the S3 data lake securely with high performance using the EMRFS connector Data Lake 100110000100101011100 101010111001010100000 111100101100101010001 100001
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR— More than just managed Hadoop
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift—Data Warehousing Fast at scale Columnar storage technology to improve I/O efficiency and scale query performance Secure Audit everything; encrypt data end-to-end; extensive certification and compliance Open file formats Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3 Inexpensive As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour $
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift Spectrum Extend the data warehouse to exabytes of data in S3 data lake S3 data lakeRedshift data Redshift Spectrum query engine • Exabyte Redshift SQL queries against S3 • Join data across Redshift and S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • CSV, ORC, Avro, & Parquet data formats • Pay only for the amount of data scanned
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Let’s play a game Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963 © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Let’s play a game Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Numbers are fun Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963
  • 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Numbers are fun Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963
  • 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena—Interactive Analysis Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Ability to run SQL queries on data archived in Amazon Glacier (coming soon) Query Instantly Zero setup cost; just point to S3 and start querying SQL Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: zero infrastructure, zero administration Integrated with QuickSight Pay per query Pay only for queries run; save 30–90% on per-query costs through compression $
  • 31. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon QuickSight easy Empower everyone Seamless connectivity Fast analysis Serverless Now with ML superpowers!
  • 32. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Provides Highest Levels of Security Secure Compliance AWS Artifact Amazon Inspector Amazon Cloud HSM Amazon Cognito AWS CloudTrail Security Amazon GuardDuty AWS Shield AWS WAF Amazon Macie VPC Encryption AWS Certification Manager AWS Key Management Service Encryption at rest Encryption in transit Bring your own keys, HSM support Identity AWS IAM AWS SSO Amazon Cloud Directory AWS Directory Service AWS Organizations Customer need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
  • 33. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Compliance: Virtually Every Regulatory Agency CSA Cloud Security Alliance Controls ISO 9001 Global Quality Standard ISO 27001 Security Management Controls ISO 27017 Cloud Specific Controls ISO 27018 Personal Data Protection PCI DSS Level 1 Payment Card Standards SOC 1 Audit Controls Report SOC 2 Security, Availability, & Confidentiality Report SOC 3 General Controls Report Global United States CJIS Criminal Justice Information Services DoD SRG DoD Data Processing FedRAMP Government Data Standards FERPA Educational Privacy Act FIPS Government Security Standards FISMA Federal Information Security Management GxP Quality Guidelines and Regulations ISO FFIEC Financial Institutions Regulation HIPPA Protected Health Information ITAR International Arms Regulations MPAA Protected Media Content NIST National Institute of Standards and Technology SEC Rule 17a-4(f) Financial Data Standards VPAT/Section 508 Accountability Standards Asia Pacific FISC [Japan] Financial Industry Information Systems IRAP [Australia] Australian Security Standards K-ISMS [Korea] Korean Information Security MTCS Tier 3 [Singapore] Multi-Tier Cloud Security Standard My Number Act [Japan] Personal Information Protection Europe C5 [Germany] Operational Security Attestation Cyber Essentials Plus [UK] Cyber Threat Protection G-Cloud [UK] UK Government Standards IT-Grundschutz [Germany] Baseline Protection Methodology X P G
  • 34. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS databases and analytics Broad and deep portfolio, built for builders AWS Marketplace Amazon Redshift Data warehousing Amazon EMR Hadoop + Spark Athena Interactive analytics Kinesis Analytics Real-time Amazon Elasticsearch service Operational Analytics RDS MySQL, PostgreSQL, MariaDB, Oracle, SQL Server Aurora MySQL, PostgreSQL Amazon QuickSight Amazon SageMaker DynamoDB Key value, Document ElastiCache Redis, Memcached Neptune Graph Timestream Time Series QLDB Ledger Database S3/Amazon Glacier AWS Glue ETL & Data Catalog Lake Formation Data Lakes Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect Data Movement AnalyticsDatabases Business Intelligence & Machine Learning Data Lake Managed Blockchain Blockchain Templates Blockchain Amazon Comprehend Amazon Rekognition Amazon Lex Amazon Transcribe AWS DeepLens 250+ solutions 730+ Database solutions 600+ Analytics solutions 25+ Blockchain solutions 20+ Data lake solutions 30+ solutions RDS on VMWare
  • 35. CHALLENGE Need to create constant feedback loop for designers Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world Fortnite | 125+ million players
  • 36. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Epic Games uses Data Lakes and analytics Entire analytics platform running on AWS S3 leveraged as a Data Lake All telemetry data is collected with Kinesis Real-time analytics done through Spark on EMR, DynamoDB to create scoreboards and real-time queries Use Amazon EMR for large batch data processing Game designers use data to inform their decisions Game clients Game servers Launcher Game services N E A R R E A L T I M E P I P E L I N E N E A R R E A L T I M E P I P E L I N E Grafana Scoreboards API Limited Raw Data (real time ad-hoc SQL) User ETL (metric definition) Spark on EMR DynamoDB NEAR REALTIME PIPELINES BATCH PIPELINES ETL using EMR Tableau/BI Ad-hoc SQLS3 (Data Lake) Kinesis APIs Databases S3 Other sources
  • 37. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 38. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo Overview https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data- from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
  • 39. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 40. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Typical steps of building a data lake Setup Storage1 Move data2 Cleanse, prep, and catalog data 3 Configure and enforce security and compliance policies 4 Make data available for analytics 5
  • 41. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building data lakes can still take months
  • 42. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Lake Formation (join the preview) Build, secure, and manage a data lake in days Build a data lake in days, not months Build and deploy a fully managed data lake with a few clicks Enforce security policies across multiple services Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications Combine different analytics approaches Empower analyst and data scientist productivity, giving them self- service discovery and safe access to all data from a single catalog
  • 43. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. How it works: AWS Lake Formation S3 IAM KMS OLTP ERP CRM LOB Devices Web Sensors Social Kinesis Build Data Lakes quickly • Identify, crawl, and catalog sources • Ingest and clean data • Transform into optimal formats Simplify security management • Enforce encryption • Define access policies • Implement audit login Enable self-service and combined analytics • Analysts discover all data available for analysis from a single data catalog • Use multiple analytics tools over the same data Athena Amazon Redshift AI Services Amazon EMR Amazon QuickSight Data Catalog
  • 44. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Customer interest in AWS Lake Formation “We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. … Additionally, AWS Lake Formation will be HIPAA compliant from day one …” - Aaron Symanski, CTO, Change Healthcare “I can’t wait for my team to get our hands on AWS Lake Formation. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.” - Joshua Couch, VP Engineering, Fender Digital
  • 45. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you! S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you! © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Javier Ramirez AWS Tech Evangelist @supercoco9
  • 46. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Select AWS Glue customers