© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adam Savitzky, Yahoo!
Tina Adams, AWS
October 2015
DAT308
How Yahoo! Analyzes Billions of Events with
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
Amazon Redshift: a lot faster, a lot cheaper, a lot simpler
Amazon Redshift architecture
Leader node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries,
loads, backups, restores, resizes
Start at $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
Amazon Redshift is priced to analyze all your data
Pricing is simple
# of nodes X hourly price
No charge for leader node
3x data compression on avg
Three copies of data
DS2 (HDD)           Price per hour for      Effective annual price
                    smallest single node    per TB (compressed)
On-Demand           $0.850                  $3,725
1 Year Reservation  $0.500                  $2,190
3 Year Reservation  $0.228                  $999

DC1 (SSD)           Price per hour for      Effective annual price
                    smallest single node    per TB (compressed)
On-Demand           $0.250                  $13,690
1 Year Reservation  $0.161                  $8,795
3 Year Reservation  $0.100                  $5,500
Amazon Redshift is easy to use
Provision in minutes
Monitor query performance
Point and click resize
Built-in security
Automatic backups
Selected Amazon Redshift customers
Analytics at Yahoo
What to expect from the session
• What does analytics mean for Yahoo?
• Learn how our extract, transform, load (ETL) process runs
• Learn about our Amazon Redshift architecture
• Do’s, don’ts, and best practices for working with
Amazon Redshift
• Deep dive into advanced analytics, featuring how we
define and report user retention
Setting the stage
“We are returning an iconic company
to greatness.”
—Marissa Mayer
Guiding principles
“You can’t grow a product that hasn’t
reached product market fit.”
—Arjun Sethi, @arjset
Guiding principles
Analytics is critical for growth
Overall volume
[Figure: bar chart, in billions (axis 0 to 90), comparing Yahoo Events, Auto Miles Driven, Google Searches, McDonald's Fries Served, and Babies Born]
Audience data breakdown
[Figure: breakdown by property: Desktop, Mail, Tumblr, Sports, Weather, Front Page, Aviate, Other]
Hadoop
Clusters  Nodes   Data centers  Data
14        42,000  3             500 PB
Hive
• Slow
• Hard to use
• Hard to share
• Hard to repeat
Hive
And many others…
Benchmarks (lower is better)
[Figure: log-scale bar chart of query time in seconds (1 to 10,000) for Count Distinct Devices, Count All Events, Filter Clauses, and Joins, comparing Amazon Redshift, Vertica, and Impala]
Amazon Redshift at Yahoo
Nodes         Events per Day  Queries per Day  Data
21 dc1.8xl    2B              1,200            27 TB
Architecture
Extract, transform, load (ETL)
Hadoop • Pig
S3 • Airflow
Amazon Redshift • Looker
ETL—upstream
Clickstream Data (Hadoop) -> Hourly Batch Process (Oozie) -> Intermediate Storage (HDFS) -> Custom Uploader (python/boto) -> AWS (S3)
ETL—downstream
[Diagram: hourly task graph]
• Data available?
• Copy to Amazon Redshift
• Sanitize
• Export new installs
• Process new installs
• Update hourly table
• Update install table
• Update params
• Subdivide params
• Clean up
• Subdivide events
Data flows in hourly from S3 to Amazon Redshift, where it’s processed
and subdivided
ETL—downstream
Visualization of running and
complete tasks
Schema
[Diagram: tables color-coded as raw tables, summary tables, and derived tables]
• event_raw tables: mail, flickr, homerun, stark, livetext (combined by the event_raw union view)
• param tables: mail, flickr, homerun, stark, livetext (combined by the param union view)
• Other tables shown: event hourly, event daily, install, install attribution, user retention, funnel, first_event date, is_active, param keys, telemetry daily, revenue daily
ETL—Nightly
• 24 hours available?
• Wipe old data
• Build daily table
• Build user retention
• Build funnel
• Vacuum
Runs all daily aggregations and cleans up/vacuums
Do’s and don’ts
DO
Summarize
user_id  event_date  action
1        2015-10-08  spam
1        2015-10-08  spam
1        2015-10-08  spam
1        2015-10-08  spam
1        2015-10-08  spam

user_id  event_date  action  event_count
1        2015-10-08  spam    5
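The summarize step above can be sketched as a single GROUP BY. This is a minimal illustration run against SQLite from Python's standard library rather than Amazon Redshift; the table names event_raw_sample and event_summary are hypothetical and only mirror the columns on the slide.

```python
import sqlite3

# In-memory SQLite database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_raw_sample (user_id INTEGER, event_date TEXT, action TEXT)"
)

# Five identical raw events, as in the slide's "before" table.
conn.executemany(
    "INSERT INTO event_raw_sample VALUES (?, ?, ?)",
    [(1, "2015-10-08", "spam")] * 5,
)

# Collapse duplicates into one row per (user, day, action) with a count.
conn.execute(
    """
    CREATE TABLE event_summary AS
    SELECT user_id, event_date, action, COUNT(*) AS event_count
    FROM event_raw_sample
    GROUP BY user_id, event_date, action
    """
)

summary_rows = conn.execute("SELECT * FROM event_summary").fetchall()
print(summary_rows)
```

Queries that only need counts can then read one summary row instead of five raw rows.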
DO
Choose good
sort keys
(and use them)
CREATE TABLE revenue (
  customer_id    BIGINT,
  transaction_id BIGINT,
  location       VARCHAR(64),
  event_date     DATE,
  event_ts       TIMESTAMP,
  -- A bare DECIMAL defaults to scale 0 and would drop cents
  revenue_usd    DECIMAL(18,2)
)
DISTKEY(customer_id)
SORTKEY(
  location,
  event_date,
  customer_id
);
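Why sort keys matter "when you use them" can be shown with a toy model. Amazon Redshift keeps the minimum and maximum value of each storage block (a zone map) and can skip blocks whose range cannot match a filter. The sketch below is an assumption-laden simplification: blocks hold 4 rows instead of the real, much larger size, and the function names are hypothetical.

```python
from datetime import date, timedelta

BLOCK_SIZE = 4  # rows per block in this toy model; real blocks are far larger


def build_zone_map(values, block_size=BLOCK_SIZE):
    """Return a (min, max) pair for each consecutive block of values."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_scan(zone_map, target):
    """Count blocks whose [min, max] range could contain the target value."""
    return sum(1 for low, high in zone_map if low <= target <= high)


start = date(2015, 10, 1)
# 16 rows spread over 4 days, stored in interleaved (unsorted) arrival order.
unsorted_dates = [start + timedelta(days=i % 4) for i in range(16)]
# The same rows stored in event_date order, as a sort key would arrange them.
sorted_dates = sorted(unsorted_dates)

target = date(2015, 10, 2)
unsorted_scan = blocks_to_scan(build_zone_map(unsorted_dates), target)
sorted_scan = blocks_to_scan(build_zone_map(sorted_dates), target)
print(unsorted_scan, sorted_scan)
```

In this model the unsorted layout forces a scan of every block, while the sorted layout touches only the block holding the requested day. The benefit appears only when the query filters on the sort key columns.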
DO
Vacuum nightly
(or weekly and tell people you do it nightly)
DO
Avoid joins
where possible—and learn mitigation strategies for when
you must join
Join mitigation strategies
Key distribution
• Records distributed by distkey
• Choose a field that you join on
• Avoid causing excess skew
All distribution
• All records distributed to all nodes
• Most robust, but most space-intensive
Fastest joins occur when records are colocated
Key distribution
Node 1: A.1 B.1, A.3 B.3, A.5 B.5
Node 2: A.2 B.2, A.4 B.4, A.6 B.6
All distribution
Node 1: A.1 B.1, A.2 B.2, A.3 B.3, A.4 B.4, A.5 B.5, A.6 B.6
Node 2: A.1 B.1, A.2 B.2, A.3 B.3, A.4 B.4, A.5 B.5, A.6 B.6
Even distribution
Node 1: A.1 B.6, A.5 B.2, A.3 B.3
Node 2: A.4 B.1, A.2 B.5, A.6 B.4
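The colocation idea in the diagrams above can be checked with a small simulation. This is a sketch under stated assumptions: two nodes, integer join keys, `key % num_nodes` as a stand-in for the real hash function, and round-robin placement as a stand-in for even distribution. All function names are hypothetical.

```python
NUM_NODES = 2


def key_distribution(keys, num_nodes=NUM_NODES):
    """Place each row on the node chosen by its join key (modulo stands in for a hash)."""
    return {k: k % num_nodes for k in keys}


def even_distribution(keys, num_nodes=NUM_NODES):
    """Place rows round-robin in arrival order, ignoring the key value."""
    return {k: i % num_nodes for i, k in enumerate(keys)}


def colocated_pairs(placement_a, placement_b):
    """Count join keys whose table A row and table B row sit on the same node."""
    return sum(1 for k in placement_a if placement_a[k] == placement_b[k])


keys = [1, 2, 3, 4, 5, 6]

# Both tables distributed on the join key: every matching pair shares a node.
key_colocated = colocated_pairs(key_distribution(keys), key_distribution(keys))

# Table B arrives in a different order, so round-robin scatters matching rows.
even_colocated = colocated_pairs(
    even_distribution(keys), even_distribution([6, 2, 3, 1, 5, 4])
)
print(key_colocated, even_colocated)
```

With key distribution all six pairs are colocated; with round-robin placement some pairs land on different nodes, and joining them would require moving rows between nodes.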
DO
Automate
DON’T
Fill the cluster
(leave more free space than you think you need)
DON’T
Run ETL in the default queue
Workload management (WLM) is your friend
Example WLM configuration
Queue  Concurrency  User Groups  Timeout (ms)  Memory (%)
1      1            etl          -             50
2      10           -            60,000        50
Two queues: ETL and ad hoc
Purpose: Insulate normal users from ETL and free up plenty of memory for big
batch jobs
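The two-queue setup above could be expressed as JSON for the cluster parameter group. The sketch below assumes the key names used by the `wlm_json_configuration` parameter (`user_group`, `query_concurrency`, `max_execution_time`, `memory_percent_to_use`); check them against the current Amazon Redshift documentation before use. Treating queue 2 as the default queue is also an assumption, based on it having no user group.

```python
import json

wlm_queues = [
    {   # Queue 1: ETL user group only, one query at a time, half the memory.
        "user_group": ["etl"],
        "query_concurrency": 1,
        "memory_percent_to_use": 50,
    },
    {   # Queue 2: everyone else (ad hoc), 10 slots, 60,000 ms timeout.
        "query_concurrency": 10,
        "max_execution_time": 60000,
        "memory_percent_to_use": 50,
    },
]

# Serialized form, suitable for pasting into a parameter value.
wlm_json = json.dumps(wlm_queues)
print(wlm_json)
```

Keeping ETL at concurrency 1 gives each batch statement the full 50% memory share, while the timeout keeps runaway ad hoc queries from holding slots.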
DON’T
Use CREATE TABLE AS
For permanent tables
DON’T
Email SQL around
Find a good reporting tool
Deep dive: user retention
User retention is…
The most important* quality metric for
your product
* kinda
Day-14 retention over time
User retention and growth
N-day retention
User retention and growth
[Figure: daily active users (0 to 6,000) versus product age in days (0 to 14) for Product A and Product B]
High churn = wasted ad dollars
[Figure: dollars ($0 to $25,000) versus product age in days (1 to 14) for Product A and Product B]
The Sputnik method
For generating a multidimensional user retention
analysis table
event_date  install_date  os_name  country  active_users  cohort_size
monday      monday        android  us       100           100
tuesday     monday        android  us       83            100
monday      monday        ios      us       75            75
tuesday     monday        ios      us       75            75
Get one-day retention
SELECT
  SUM(active_users) AS active_users,
  SUM(cohort_size) AS cohort_size,
  -- Cast to FLOAT so the ratio is not truncated by integer division
  SUM(active_users)::FLOAT / SUM(cohort_size) AS retention
FROM user_retention
WHERE
  event_date - install_date = 1 AND
  CURRENT_DATE - 1 > event_date;
Get one-day retention
event_date  install_date  os_name  country  active_users  cohort_size
monday      monday        android  us       100           100
tuesday     monday        android  us       83            100
monday      monday        ios      us       75            75
tuesday     monday        ios      us       75            75

Active Users: 83 + 75 = 158
Cohort Size: 100 + 75 = 175
-------------------------------
Pct Retention = 158 / 175 = 90%
Get one-day retention by OS
SELECT
  os_name,
  SUM(active_users) AS active_users,
  SUM(cohort_size) AS cohort_size,
  -- Cast to FLOAT so the ratio is not truncated by integer division
  SUM(active_users)::FLOAT / SUM(cohort_size) AS retention
FROM user_retention
WHERE
  event_date - install_date = 1 AND
  CURRENT_DATE - 1 > event_date
GROUP BY 1;
Get one-day retention
event_date  install_date  os_name  country  active_users  cohort_size
monday      monday        android  us       100           100
tuesday     monday        android  us       83            100
monday      monday        ios      us       75            75
tuesday     monday        ios      us       75            75

Android:
Active Users: 83
Cohort Size: 100
-------------------
Pct Retention = 83%

iOS:
Active Users: 75
Cohort Size: 75
--------------------
Pct Retention = 100%
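The two retention queries above can be checked against the sample rows. This is a minimal sketch run on SQLite through Python's standard library, not on Amazon Redshift, so the date arithmetic uses SQLite's `julianday` and the ratio uses `1.0 *` instead of a cast. The calendar dates are assumptions standing in for "monday" and "tuesday", and the slide's `CURRENT_DATE` filter (which excludes incomplete days) is omitted because the sample dates are fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_retention (
        event_date TEXT, install_date TEXT, os_name TEXT, country TEXT,
        active_users INTEGER, cohort_size INTEGER)
    """
)
conn.executemany(
    "INSERT INTO user_retention VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-10-05", "2015-10-05", "android", "us", 100, 100),
        ("2015-10-06", "2015-10-05", "android", "us", 83, 100),
        ("2015-10-05", "2015-10-05", "ios", "us", 75, 75),
        ("2015-10-06", "2015-10-05", "ios", "us", 75, 75),
    ],
)

# Keep only rows observed exactly one day after install.
DAY_ONE = "julianday(event_date) - julianday(install_date) = 1"

overall = conn.execute(
    f"""
    SELECT SUM(active_users), SUM(cohort_size),
           1.0 * SUM(active_users) / SUM(cohort_size)
    FROM user_retention
    WHERE {DAY_ONE}
    """
).fetchone()

by_os = conn.execute(
    f"""
    SELECT os_name, 1.0 * SUM(active_users) / SUM(cohort_size)
    FROM user_retention
    WHERE {DAY_ONE}
    GROUP BY os_name
    ORDER BY os_name
    """
).fetchall()

print(overall)
print(by_os)
```

Because the table stores active users and cohort sizes per dimension combination, any subset of dimensions can be rolled up by summing both columns before dividing.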
The Sputnik method
You will need:
Daily event summary
User
user_id
The Sputnik method
Calculate cohort sizes
• Count users by all dimensions
• For example: male, iOS, in USA, who installed today
Determine user activity
• For each day, for each user, were they active?
• Create a table with user_id and event_date
Join and aggregate
• Join user table to user_activity on user_id
• SUM active users by cohort and join to cohort sizes
Calculate cohort sizes
user_id  install_date  os_name  country
1        2015-10-02    iOS      us
2        2015-10-01    android  ca
3        2015-10-01    android  ca

-- "user" is a reserved word in Amazon Redshift, so the table name is quoted
SELECT
  install_date, os_name, country,
  COUNT(*) AS cohort_size
FROM "user"
GROUP BY 1,2,3;
Calculate cohort sizes
install_date  os_name  country  cohort_size
2015-10-02    iOS      us       1
2015-10-01    android  ca       2
Determine user activity
user_id  event_date  action
1        2015-10-02  app_open
1        2015-10-02  spam
1        2015-10-03  app_open

CREATE TEMP TABLE user_activity AS
SELECT
  DISTINCT user_id, event_date
FROM event_daily
WHERE action = 'app_open';
Determine user activity
CREATE TEMP TABLE all_users AS
SELECT DISTINCT user_id FROM event_daily;
-- Named all_dates to match the later queries that reference it
CREATE TEMP TABLE all_dates AS
SELECT DISTINCT event_date FROM event_daily;
Determine user activity
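CREATE TABLE active_users_by_day AS
SELECT
  xproduct.user_id, xproduct.event_date
FROM (
  SELECT * FROM all_users CROSS JOIN all_dates
) xproduct
-- Match on the date as well as the user; otherwise every date is kept
-- for any user who was ever active
INNER JOIN user_activity u ON
  u.user_id = xproduct.user_id AND
  u.event_date = xproduct.event_date;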
Determine cohort activity
user_id  event_date
1        2015-10-02
1        2015-10-03

CREATE TEMP TABLE cohort_activity AS
SELECT
  u.*, all_dates.event_date,
  -- 1 if the user was active on this date (hit), 0 otherwise (miss)
  CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
FROM "user" AS u
LEFT JOIN all_dates ON all_dates.event_date >= u.install_date
LEFT JOIN active_users_by_day AS au ON
  au.user_id = u.user_id AND
  au.event_date = all_dates.event_date
WHERE all_dates.event_date >= u.install_date;

user_id  install_date  os_name  country
1        2015-10-02    iOS      us
Determine cohort activity
user_id  event_date  install_date  os_name  country  is_active
1        2015-10-02  2015-10-02    iOS      us       1
1        2015-10-03  2015-10-02    iOS      us       1
1        2015-10-04  2015-10-02    iOS      us       0

CREATE TEMP TABLE active_users AS
SELECT
  event_date,
  install_date, os_name, country,
  -- Named active_users to match the column in user_retention
  SUM(is_active) AS active_users
FROM cohort_activity
GROUP BY 1, 2, 3, 4;
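The cohort-activity step can be traced on the single sample user. This is a sketch run on SQLite through Python's standard library rather than Amazon Redshift. Two things are assumptions for illustration: the user table is named `users` to avoid the reserved word, and `all_dates` is filled by hand with a three-day calendar so that a day without activity exists. It also joins directly to `user_activity`, which holds the same (user_id, event_date) pairs as the slide's sample active_users_by_day table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT);
    CREATE TABLE event_daily (user_id INTEGER, event_date TEXT, action TEXT);
    INSERT INTO users VALUES (1, '2015-10-02', 'iOS', 'us');
    INSERT INTO event_daily VALUES
        (1, '2015-10-02', 'app_open'),
        (1, '2015-10-02', 'spam'),
        (1, '2015-10-03', 'app_open');

    -- Hypothetical hand-filled calendar; includes a day with no events.
    CREATE TABLE all_dates (event_date TEXT);
    INSERT INTO all_dates VALUES ('2015-10-02'), ('2015-10-03'), ('2015-10-04');

    -- One row per user per day on which the app was opened.
    CREATE TABLE user_activity AS
    SELECT DISTINCT user_id, event_date
    FROM event_daily
    WHERE action = 'app_open';

    -- One row per user per day since install, flagged active or not.
    CREATE TABLE cohort_activity AS
    SELECT u.user_id, d.event_date, u.install_date, u.os_name, u.country,
           CASE WHEN ua.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
    FROM users AS u
    JOIN all_dates AS d ON d.event_date >= u.install_date
    LEFT JOIN user_activity AS ua
      ON ua.user_id = u.user_id AND ua.event_date = d.event_date;
    """
)

cohort_rows = conn.execute(
    "SELECT event_date, is_active FROM cohort_activity ORDER BY event_date"
).fetchall()
print(cohort_rows)
```

The LEFT JOIN is what produces the 0 rows: days with no matching activity still appear, which is what lets SUM(is_active) count active users per cohort per day.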
Determine cohort activity
event_date  install_date  os_name  country  active_users
2015-10-03  2015-10-02    iOS      us       100
2015-10-03  2015-10-02    android  us       350
2015-10-03  2015-10-02    iOS      ca       50

Join these two tables on matching cohort dimensions

install_date  os_name  country  cohort_size
2015-10-02    iOS      us       200
2015-10-02    android  us       400
2015-10-02    iOS      ca       60
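The final join is not spelled out in SQL on the slides; one way it might look is sketched below, run on SQLite through Python's standard library. The table name `cohort_sizes` and the column name `active_users` are assumptions chosen to line up with the user_retention table shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE active_users (
        event_date TEXT, install_date TEXT, os_name TEXT, country TEXT,
        active_users INTEGER);
    CREATE TABLE cohort_sizes (
        install_date TEXT, os_name TEXT, country TEXT, cohort_size INTEGER);

    INSERT INTO active_users VALUES
        ('2015-10-03', '2015-10-02', 'iOS', 'us', 100),
        ('2015-10-03', '2015-10-02', 'android', 'us', 350),
        ('2015-10-03', '2015-10-02', 'iOS', 'ca', 50);
    INSERT INTO cohort_sizes VALUES
        ('2015-10-02', 'iOS', 'us', 200),
        ('2015-10-02', 'android', 'us', 400),
        ('2015-10-02', 'iOS', 'ca', 60);

    -- Attach each cohort's size to its daily active-user count.
    CREATE TABLE user_retention AS
    SELECT a.event_date, a.install_date, a.os_name, a.country,
           a.active_users, c.cohort_size
    FROM active_users AS a
    JOIN cohort_sizes AS c
      ON c.install_date = a.install_date
     AND c.os_name = a.os_name
     AND c.country = a.country;
    """
)

retention_rows = conn.execute(
    """
    SELECT os_name, country, active_users, cohort_size
    FROM user_retention
    ORDER BY os_name, country
    """
).fetchall()
print(retention_rows)
```

The result has the same shape as the user_retention table used by the one-day retention queries, so those queries can run on it unchanged.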
Big wins for Yahoo
• Real-time insights
• Easier deployment and maintenance
• Data-driven product development
• Cutting edge analytics
Thank you!
Related sessions
Hear from other customers discussing their Amazon Redshift use cases:
• DAT201—Introduction to Amazon Redshift (RetailMeNot)
• ISM303—Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless
and Edmunds)
• ARC303—Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon)
• ARC305—Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise
Workloads
• BDT306—The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with
AWS
• DAT311—Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity)
• BDT314—Running a Big Data and Analytics Application on Amazon EMR and Amazon
Redshift with a Focus on Security (Nasdaq)
• BDT316—Offloading ETL to Amazon Elastic MapReduce (Amgen)
• BDT401—Amazon Redshift Deep Dive (TripAdvisor)
• Building a Mobile App using Amazon EC2, Amazon S3, Amazon DynamoDB, and Amazon
Redshift (Tinder)

 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Adam Savitzky, Yahoo! Tina Adams, AWS October 2015 DAT308 How Yahoo! Analyzes Billions of Events with Amazon Redshift
  • 2. Fast, simple, petabyte-scale data warehousing for $1,000/TB/year
      Amazon Redshift: a lot faster, a lot cheaper, a lot simpler
  • 3. Amazon Redshift architecture
      Leader node: simple SQL endpoint; stores metadata; optimizes query plan; coordinates query execution
      Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, resizes
      Start at $0.25/hour, grow to 2 PB (compressed)
      DC1: SSD; scale from 160 GB to 326 TB
      DS2: HDD; scale from 2 TB to 2 PB
      Diagram labels: 10 GigE (HPC); Ingestion; Backup; Restore; JDBC/ODBC
  • 4. Amazon Redshift is priced to analyze all your data
      Pricing is simple: number of nodes × hourly price
      No charge for leader node
      3x data compression on average
      Three copies of data

      DS2 (HDD)             Price per hour for       Effective annual price
                            smallest single node     per TB compressed
      On-Demand             $0.850                   $3,725
      1-Year Reservation    $0.500                   $2,190
      3-Year Reservation    $0.228                   $999

      DC1 (SSD)             Price per hour for       Effective annual price
                            smallest single node     per TB compressed
      On-Demand             $0.250                   $13,690
      1-Year Reservation    $0.161                   $8,795
      3-Year Reservation    $0.100                   $5,500
  • 5. Amazon Redshift is easy to use
      Provision in minutes; monitor query performance; point-and-click resize; built-in security; automatic backups
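The "effective annual price per TB" column on slide 4 appears to follow from the hourly price, the hours in a year, and the storage of the smallest node. The sketch below reproduces that arithmetic. The node sizes (2 TB for DS2, 0.16 TB for DC1) are an assumption taken from the lower bounds on slide 3, and the function name is hypothetical. The DS2 results match the slide after rounding; the DC1 results land close to, but not exactly on, the slide's figures.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def effective_annual_price_per_tb(hourly_price, node_tb):
    """Annual cost of one node divided by that node's storage in TB."""
    return hourly_price * HOURS_PER_YEAR / node_tb


# Assumed smallest-node storage, read from slide 3 (not stated on slide 4).
DS2_NODE_TB = 2.0
DC1_NODE_TB = 0.16

for label, hourly, node_tb in [
    ("DS2 on-demand", 0.850, DS2_NODE_TB),
    ("DS2 1-year", 0.500, DS2_NODE_TB),
    ("DS2 3-year", 0.228, DS2_NODE_TB),
    ("DC1 on-demand", 0.250, DC1_NODE_TB),
]:
    print(label, round(effective_annual_price_per_tb(hourly, node_tb)))
```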
  • 8. What to expect from the session
      • What does analytics mean for Yahoo?
      • Learn how our extract, transform, load (ETL) process runs
      • Learn about our Amazon Redshift architecture
      • Do’s, don’ts, and best practices for working with Amazon Redshift
      • Deep dive into advanced analytics, featuring how we define and report user retention
  • 9. Setting the stage “We are returning an iconic company to greatness.” —Marissa Mayer
  • 11. Guiding principles “You can’t grow a product that hasn’t reached product market fit.” —Arjun Sethi, @arjset
  • 12. Guiding principles Analytics is critical for growth
  • 13. Overall volume [Bar chart, in billions (axis 0 to 90): Yahoo Events, Auto Miles Driven, Google Searches, McDonald's Fries Served, Babies Born]
  • 15. Hadoop: 14 clusters; 42,000 nodes; 3 data centers; 500 PB of data
  • 16. Hive: slow; hard to use; hard to share; hard to repeat
  • 17. Hive
  • 19. Benchmarks (lower is better) [Bar chart, seconds on a log scale from 1 to 10,000: Count Distinct Devices, Count All Events, Filter Clauses, Joins; series: Amazon Redshift, Vertica, Impala]
  • 20. Amazon Redshift at Yahoo: 21 dc1.8xl nodes; 2B events per day; 1,200 queries per day; 27 TB of data
  • 22. Extract, transform, load (ETL)
      Hadoop (Pig); S3 (Airflow); Amazon Redshift (Looker)
  • 24. ETL: downstream
      Data flows in hourly from S3 to Amazon Redshift, where it’s processed and subdivided.
      Flowchart steps: Data available?; Copy to Amazon Redshift; Sanitize; Export new installs; Process new installs; Update hourly table; Update install table; Update params; Subdivide params; Clean up; Subdivide events
  • 27. ETL: nightly
      Runs all daily aggregations and cleans up/vacuums.
      Flowchart steps: 24 hours available?; Wipe old data; Build daily table; Build user retention; Build funnel; Vacuum
  • 29. DO: Summarize
      Before:
      user_id   event_date   action
      1         2015-10-08   spam
      1         2015-10-08   spam
      1         2015-10-08   spam
      1         2015-10-08   spam
      1         2015-10-08   spam

      After:
      user_id   event_date   action   event_count
      1         2015-10-08   spam     5
  • 30. DO: Choose good sort keys (and use them)
      CREATE TABLE revenue (
        customer_id BIGINT,
        transaction_id BIGINT,
        location VARCHAR(64),
        event_date DATE,
        event_ts TIMESTAMP,
        revenue_usd DECIMAL
      )
      DISTKEY(customer_id)
      SORTKEY(
        location,
        event_date,
        customer_id
      )
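The summarization on slide 29 amounts to collapsing identical rows into one row with a count. Here is a minimal sketch of that roll-up, run against an in-memory SQLite database rather than Amazon Redshift; the table name `event_raw` is hypothetical, and only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_raw (user_id INTEGER, event_date TEXT, action TEXT)")

# Five identical raw events, as in the slide's "before" table.
conn.executemany(
    "INSERT INTO event_raw VALUES (?, ?, ?)",
    [(1, "2015-10-08", "spam")] * 5,
)

# One output row per (user, day, action), carrying the number of raw events.
summary = conn.execute(
    """
    SELECT user_id, event_date, action, COUNT(*) AS event_count
    FROM event_raw
    GROUP BY user_id, event_date, action
    """
).fetchall()
print(summary)
```

The summarized table keeps everything needed for daily reporting while storing one row where the raw table stored five.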
  • 31. DO: Vacuum nightly (or weekly and tell people you do it nightly)
  • 32. DO: Avoid joins where possible, and learn mitigation strategies for when you must join
  • 33. Join mitigation strategies
      Fastest joins occur when records are colocated.
      Key distribution: records distributed by distkey; choose a field that you join on; avoid causing excess skew
      All distribution: all records distributed to all nodes; most robust, but most space-intensive
      Diagram (rows of tables A and B, grouped by node):
        Key distribution:   A.1 B.1 A.3 B.3 A.5 B.5 | A.2 B.2 A.4 B.4 A.6 B.6
        All distribution:   A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6 | A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6
        Even distribution:  A.1 B.6 A.5 B.2 A.3 B.3 | A.4 B.1 A.2 B.5 A.6 B.4
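To make the colocation point on slide 33 concrete, here is a toy simulation of placing rows from tables A and B on two nodes. It is only a sketch: the modulo placement stands in for whatever hash Amazon Redshift actually applies to a distkey, and all function names are hypothetical.

```python
def key_distribution(rows, n_nodes):
    """Place each row on the node chosen by its join key (modulo as a stand-in hash)."""
    nodes = [[] for _ in range(n_nodes)]
    for table, key in rows:
        nodes[key % n_nodes].append((table, key))
    return nodes


def even_distribution(rows, n_nodes):
    """Place rows round-robin, ignoring the join key."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


def colocated_pairs(nodes):
    """Count join keys whose A row and B row sit on the same node."""
    count = 0
    for node in nodes:
        a_keys = {key for table, key in node if table == "A"}
        b_keys = {key for table, key in node if table == "B"}
        count += len(a_keys & b_keys)
    return count


# Six A rows and six B rows sharing join keys 1..6; B arrives in a different order.
rows = [("A", k) for k in range(1, 7)] + [("B", k) for k in (6, 2, 3, 1, 5, 4)]

print("key: ", colocated_pairs(key_distribution(rows, 2)))
print("even:", colocated_pairs(even_distribution(rows, 2)))
```

With key placement every matching pair shares a node, so the join needs no data movement; with round-robin placement some pairs end up split across nodes.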
  • 35. DON’T: Fill the cluster (leave more than you think)
  • 36. DON’T: Run ETL in the default queue
      Workload management (WLM) is your friend
  • 37. Example WLM configuration
      Queue   Concurrency   User Groups   Timeout (ms)   Memory (%)
      1       1             etl           (none)         50
      2       10            (none)        60,000         50
      Two queues: ETL and ad hoc
      Purpose: Insulate normal users from ETL and free up plenty of memory for big batch jobs
  • 38. DON’T: Use CREATE TABLE AS for permanent tables
  • 39. DON’T: Email SQL around
      Find a good reporting tool
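A two-queue setup like the one on slide 37 could be written as a JSON document for the cluster's WLM parameter. The sketch below builds such a document. The key names (`user_group`, `query_concurrency`, `max_execution_time`, `memory_percent_to_use`) reflect my understanding of the `wlm_json_configuration` parameter and should be treated as assumptions to check against the current Amazon Redshift documentation; the assignment of the timeout to the ad hoc queue is read from the flattened slide table.

```python
import json

# Queue 1: ETL only, one query at a time, half of the memory.
# Queue 2: everyone else (ad hoc), ten concurrent queries, 60-second timeout.
wlm_config = [
    {
        "user_group": ["etl"],
        "query_concurrency": 1,
        "memory_percent_to_use": 50,
    },
    {
        "query_concurrency": 10,
        "max_execution_time": 60000,  # milliseconds
        "memory_percent_to_use": 50,
    },
]

wlm_json = json.dumps(wlm_config)
print(wlm_json)
```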
  • 40. Deep dive: user retention
  • 42. User retention is… The most important* quality metric for your product * kinda
  • 43. User retention and growth [Charts: Day-14 retention over time; N-day retention]
  • 44. User retention and growth [Line chart: daily active users (0 to 6,000) by product age (days 0 to 14), Product A vs. Product B]
  • 45. High churn = wasted ad dollars [Chart: dollars ($0 to $25,000) by product age (days 1 to 14), Product A vs. Product B]
  • 46. The Sputnik method
      For generating a multidimensional user retention analysis table
      event_date   install_date   os_name   country   active_users   cohort_size
      monday       monday         android   us        100            100
      tuesday      monday         android   us        83             100
      monday       monday         ios       us        75             75
      tuesday      monday         ios       us        75             75
  • 47. Get one-day retention
      SELECT
        SUM(active_users) AS active_users,
        SUM(cohort_size) AS cohort_size,
        SUM(active_users)::FLOAT / SUM(cohort_size) AS retention  -- cast avoids integer division on integer columns
      FROM user_retention
      WHERE event_date - install_date = 1
        AND CURRENT_DATE - 1 > event_date;
  • 48. Get one-day retention
      event_date   install_date   os_name   country   active_users   cohort_size
      monday       monday         android   us        100            100
      tuesday      monday         android   us        83             100
      monday       monday         ios       us        75             75
      tuesday      monday         ios       us        75             75

      Active users: 83 + 75 = 158
      Cohort size: 100 + 75 = 175
      Pct retention = 158 / 175 = 90%
  • 49. Get one-day retention by OS
      SELECT
        os_name,
        SUM(active_users) AS active_users,
        SUM(cohort_size) AS cohort_size,
        SUM(active_users)::FLOAT / SUM(cohort_size) AS retention  -- cast avoids integer division on integer columns
      FROM user_retention
      WHERE event_date - install_date = 1
        AND CURRENT_DATE - 1 > event_date
      GROUP BY 1;
  • 50. Get one-day retention by OS
      event_date   install_date   os_name   country   active_users   cohort_size
      monday       monday         android   us        100            100
      tuesday      monday         android   us        83             100
      monday       monday         ios       us        75             75
      tuesday      monday         ios       us        75             75

      Android: active users 83, cohort size 100, pct retention = 83%
      iOS: active users 75, cohort size 75, pct retention = 100%
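The one-day retention queries from slides 47 to 50 can be checked against the slide's sample rows. The sketch below runs them in an in-memory SQLite database rather than Amazon Redshift, so the date arithmetic uses SQLite's `julianday` and `date('now', ...)` in place of date subtraction and `CURRENT_DATE`. The concrete dates are placeholders standing in for the slide's "monday" and "tuesday".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE user_retention (
           event_date TEXT, install_date TEXT, os_name TEXT, country TEXT,
           active_users INTEGER, cohort_size INTEGER)"""
)
conn.executemany(
    "INSERT INTO user_retention VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-10-05", "2015-10-05", "android", "us", 100, 100),
        ("2015-10-06", "2015-10-05", "android", "us", 83, 100),
        ("2015-10-05", "2015-10-05", "ios", "us", 75, 75),
        ("2015-10-06", "2015-10-05", "ios", "us", 75, 75),
    ],
)

# Rows exactly one day after install, excluding days that are not yet complete.
DAY_ONE = (
    "julianday(event_date) - julianday(install_date) = 1 "
    "AND date('now', '-1 day') > event_date"
)

# Multiplying by 1.0 forces floating-point division; integer sums would truncate.
overall = conn.execute(
    f"""SELECT SUM(active_users), SUM(cohort_size),
               1.0 * SUM(active_users) / SUM(cohort_size)
        FROM user_retention WHERE {DAY_ONE}"""
).fetchone()

by_os = dict(
    conn.execute(
        f"""SELECT os_name, 1.0 * SUM(active_users) / SUM(cohort_size)
            FROM user_retention WHERE {DAY_ONE} GROUP BY os_name"""
    ).fetchall()
)

print(overall)
print(by_os)
```

Because the table stores additive counts per cohort, the same rows answer both the overall question and the per-OS question; only the GROUP BY changes.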
  • 51. The Sputnik method
      You will need: a daily event summary table and a user table, linked by user_id
  • 52. The Sputnik method
      Calculate cohort sizes
        • Count users by all dimensions
        • For example: Male, iOS, in USA, who installed today
      Determine user activity
        • For each day, for each user, were they active
        • Create a table with user_id and event_date
      Join and aggregate
        • Join user table to user_activity on user_id
        • SUM active users by cohort and join to cohort sizes
  • 53. Calculate cohort sizes
      Input (user):
      user_id   install_date   os_name   country
      1         2015-10-02     iOS       us
      2         2015-10-01     android   ca
      3         2015-10-01     android   ca

      SELECT
        install_date,
        os_name,
        country,
        COUNT(*) AS cohort_size
      FROM user
      GROUP BY 1,2,3;
  • 54. Calculate cohort sizes
      Result:
      install_date   os_name   country   cohort_size
      2015-10-02     iOS       us        1
      2015-10-01     android   ca        2

      SELECT
        install_date,
        os_name,
        country,
        COUNT(*) AS cohort_size
      FROM user
      GROUP BY 1,2,3;
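The cohort-size step on slides 53 and 54 can be reproduced with the slide's three sample users. This is a sketch using an in-memory SQLite database in place of Amazon Redshift; the table name is quoted because `user` is a reserved word in some SQL dialects, and the ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "user" (user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT)'
)
conn.executemany(
    'INSERT INTO "user" VALUES (?, ?, ?, ?)',
    [
        (1, "2015-10-02", "iOS", "us"),
        (2, "2015-10-01", "android", "ca"),
        (3, "2015-10-01", "android", "ca"),
    ],
)

# One row per cohort: every distinct combination of the dimensions, with its size.
cohorts = conn.execute(
    """
    SELECT install_date, os_name, country, COUNT(*) AS cohort_size
    FROM "user"
    GROUP BY install_date, os_name, country
    ORDER BY install_date
    """
).fetchall()
print(cohorts)
```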
  • 55. Determine user activity
      Input (event_daily):
      user_id   event_date   action
      1         2015-10-02   app_open
      1         2015-10-02   spam
      1         2015-10-03   app_open

      CREATE TEMP TABLE user_activity AS
      SELECT DISTINCT user_id, event_date
      FROM event_daily
      WHERE action = 'app_open';
  • 56. Determine user activity
      Input (event_daily):
      user_id   event_date   action
      1         2015-10-02   app_open
      1         2015-10-02   spam
      1         2015-10-03   app_open

      CREATE TEMP TABLE all_users AS
      SELECT DISTINCT user_id
      FROM event_daily;

      CREATE TEMP TABLE all_dates AS
      SELECT DISTINCT event_date
      FROM event_daily;
  • 57. Determine user activity
      Input (event_daily):
      user_id   event_date   action
      1         2015-10-02   app_open
      1         2015-10-02   spam
      1         2015-10-03   app_open

      CREATE TABLE active_users_by_day AS
      SELECT xproduct.user_id, xproduct.event_date
      FROM (
        SELECT * FROM all_users CROSS JOIN all_dates
      ) xproduct
      INNER JOIN user_activity u
        ON u.user_id = xproduct.user_id
       AND u.event_date = xproduct.event_date;  -- match on date as well, so only days with activity are kept
  • 58. Determine cohort activity
      Input (active_users_by_day):
      user_id   event_date
      1         2015-10-02
      1         2015-10-03

      Input (user):
      user_id   install_date   os_name   country
      1         2015-10-02     iOS       us

      CREATE TEMP TABLE cohort_activity AS
      SELECT
        u.*,
        all_dates.event_date,
        CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active  -- 1 if hit, 0 if miss
      FROM user AS u
      LEFT JOIN all_dates
        ON all_dates.event_date >= u.install_date
      LEFT JOIN active_users_by_day AS au
        ON au.user_id = u.user_id
       AND au.event_date = all_dates.event_date
      WHERE all_dates.event_date >= u.install_date;
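The cohort-activity step pairs each user with every date on or after their install date, then flags whether the user was active that day. Below is a minimal sketch of that join on the slide's sample user, run in an in-memory SQLite database instead of Amazon Redshift. The hit-or-miss flag is written as a CASE expression, which is one way to express the slide's "1 if hit, 0 if miss"; the extra date 2015-10-01 is added only to show the install-date filter at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "user" (user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT);
    CREATE TABLE all_dates (event_date TEXT);
    CREATE TABLE active_users_by_day (user_id INTEGER, event_date TEXT);

    INSERT INTO "user" VALUES (1, '2015-10-02', 'iOS', 'us');
    INSERT INTO all_dates VALUES ('2015-10-01'), ('2015-10-02'), ('2015-10-03'), ('2015-10-04');
    INSERT INTO active_users_by_day VALUES (1, '2015-10-02'), (1, '2015-10-03');
    """
)

# Every (user, date) pair from install day onward; is_active is 1 when the
# left join finds a matching activity row and 0 when it does not.
cohort_activity = conn.execute(
    """
    SELECT u.user_id, d.event_date,
           CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
    FROM "user" AS u
    JOIN all_dates AS d
      ON d.event_date >= u.install_date
    LEFT JOIN active_users_by_day AS au
      ON au.user_id = u.user_id AND au.event_date = d.event_date
    ORDER BY u.user_id, d.event_date
    """
).fetchall()
print(cohort_activity)
```

Keeping the inactive days as explicit zero rows is what lets the later SUM(is_active) report a cohort with no returning users as 0 rather than dropping it.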
  • 59. Determine cohort activity
      Input (cohort_activity):
      user_id   event_date   install_date   os_name   country   is_active
      1         2015-10-02   2015-10-02     iOS       us        1
      1         2015-10-03   2015-10-02     iOS       us        1
      1         2015-10-04   2015-10-02     iOS       us        0

      CREATE TEMP TABLE active_users AS
      SELECT
        event_date,
        install_date,
        os_name,
        country,
        SUM(is_active) AS count
      FROM cohort_activity
      GROUP BY 1, 2, 3, 4;
  • 60. Determine cohort activity
      Join these two tables on matching cohort dimensions.

      Active users (sum of is_active per cohort and day):
      event_date   install_date   os_name   country   count
      2015-10-03   2015-10-02     iOS       us        100
      2015-10-03   2015-10-02     android   us        350
      2015-10-03   2015-10-02     iOS       ca        50

      Cohort sizes:
      install_date   os_name   country   cohort_size
      2015-10-02     iOS       us        200
      2015-10-02     android   us        400
      2015-10-02     iOS       ca        60
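The final step joins the per-day active counts to the cohort sizes on the shared cohort dimensions. The sketch below does this with the numbers from slide 60 in an in-memory SQLite database. The table name `cohort_sizes`, the column name `active_count`, and the derived `retention` column are hypothetical additions for illustration; the slides do not show the join query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE active_users (event_date TEXT, install_date TEXT, os_name TEXT,
                               country TEXT, active_count INTEGER);
    CREATE TABLE cohort_sizes (install_date TEXT, os_name TEXT, country TEXT,
                               cohort_size INTEGER);

    INSERT INTO active_users VALUES
        ('2015-10-03', '2015-10-02', 'iOS', 'us', 100),
        ('2015-10-03', '2015-10-02', 'android', 'us', 350),
        ('2015-10-03', '2015-10-02', 'iOS', 'ca', 50);
    INSERT INTO cohort_sizes VALUES
        ('2015-10-02', 'iOS', 'us', 200),
        ('2015-10-02', 'android', 'us', 400),
        ('2015-10-02', 'iOS', 'ca', 60);
    """
)

# Match each active-user row to its cohort on every cohort dimension.
rows = conn.execute(
    """
    SELECT a.os_name, a.country, a.active_count, c.cohort_size,
           1.0 * a.active_count / c.cohort_size AS retention
    FROM active_users AS a
    JOIN cohort_sizes AS c
      ON c.install_date = a.install_date
     AND c.os_name = a.os_name
     AND c.country = a.country
    """
).fetchall()

retention = {(os_name, country): r for os_name, country, _, _, r in rows}
print(retention)
```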
  • 61. Big wins for Yahoo
      Real-time insights; easier deployment and maintenance; data-driven product development; cutting-edge analytics
  • 63. Related sessions
      Hear from other customers discussing their Amazon Redshift use cases:
      • DAT201: Introduction to Amazon Redshift (RetailMeNot)
      • ISM303: Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless and Edmunds)
      • ARC303: Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon)
      • ARC305: Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise Workloads
      • BDT306: The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS
      • DAT311: Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity)
      • BDT314: Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security (Nasdaq)
      • BDT316: Offloading ETL to Amazon Elastic MapReduce (Amgen)
      • BDT401: Amazon Redshift Deep Dive (TripAdvisor)
      • Building a Mobile App using Amazon EC2, Amazon S3, Amazon DynamoDB, and Amazon Redshift (Tinder)