Building a Sustainable
Data Platform on AWS
Takumi Sakamoto
2016.01.27
Takumi Sakamoto
@takus
😍 = ⚽ ✈ 📷
http://bit.ly/1MCOyBX
JAWSDAYS 2015
Mentioned by @jeffbarr
https://twitter.com/jeffbarr/status/649575575787454464
http://www.slideshare.net/smartnews/smart-newss-journey-into-microservices
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
Data Platform at
SmartNews
What is SmartNews?
• News Discovery App
• Launched in 2012
• 15M+ downloads worldwide
https://www.smartnews.com/en/
Our Mission
the world's quality information?
the people who need it?
How?
Machine Learning pipeline (diagram):
Internet → URLs Found (100,000+ /day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → Trending Stories (1000+ /day) → Deliver (with a Feedback loop)
Data Platform Use Cases
• Product development
• track KPI such as DAU and MAU
• A/B tests for new features, on-boarding, etc.
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking of news articles
• CTR prediction for the ads system
• dashboard service for media partners
Data & Its Numbers
• User activities
• ~100 GBs per day (compressed)
• 60+ record types
• User demographics or configurations etc...
• 15M+ records
• Articles metadata
• 100K+ records per day
Sustainable
Data Platform?
Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running cost
• Be open to uncertain future
Lambda Architecture
http://lambda-architecture.net/
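The core idea of the Lambda Architecture, answering a query by combining a precomputed batch view with a realtime (speed) view, can be sketched as follows. This is a minimal illustration, not SmartNews code; the article IDs and counts are hypothetical.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Combine a precomputed batch view with a realtime (speed) view."""
    merged = Counter(batch_view)
    # Add the recent counts that the last batch run has not covered yet
    merged.update(speed_view)
    return dict(merged)


# Hypothetical page-view counts per article
batch_view = {"article-1": 1000, "article-2": 250}
speed_view = {"article-2": 5, "article-3": 12}
combined = merge_views(batch_view, speed_view)
```

The batch view is recomputed periodically from the full data set, while the speed view only has to cover the events that arrived since the last batch run.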
Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid wasting too much time
• Empower brilliant engineers in SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness
System Design
λ Architecture at SmartNews
(Diagram layers: Input, Batch, Speed, Serving, Output)
Design Principles
• Decouple "Computation" and "Storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data loss with minimal effort
• Use the right tool for the job
• leverage AWS managed services as much as possible
• fill in the missing pieces with Presto & PipelineDB
An Example
(Diagram: Amazon S3 shared by separate Amazon EMR clusters: AMI 3.x, AMI 4.x, Hive, Spark)
• General Users: "We’re satisfied with current version"
• Application Engineer: "I wanna upgrade hive"
• Ad Engineer: "I wanna combine news data with ad data"
• Data Scientist: "I wanna test my algorithm with the latest spark"
Batch Layer
Run a separate EMR cluster for each use case
(Diagram: Kinesis Stream consumed by Spark on EMR and by AWS Lambda)
• Data Scientist: "I wanna consume streaming data by Spark"
• Application Engineer: "I wanna add a streaming monitor by Lambda"
Speed Layer
Consume the same data for each use case
• AWS managed services
• Data replicated across multiple AZs
• High availability
Input Data
Collect Events by Fluentd
• Forwarder (running on each instance)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• input events into Kinesis & PipelineDB
• other reporting tasks (not mentioned today)
Forwards to S3
# td-agent.conf
<source>
  @type tail
  format json
  path /data/log/user_activity.log
  pos_file /data/log/pos/user_activity.pos
  tag smartnews.user_activity
  time_key timestamp
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type relabel
    @label @s3
  </store>
  <store>
    # relabel (not forward) so that events are routed to <label @forward>
    @type relabel
    @label @forward
  </store>
</match>
@include conf.d/s3.conf
@include conf.d/forward.conf

# conf.d/s3.conf (ERB template)
<label @s3>
<% node[:td_agent][:s3].each do |c| -%>
  <match <%= c[:tag] %>>
    @id s3.<%= c[:tag] %>
    @type s3
    ...
    path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
    time_slice_format dt=%Y-%m-%d/hh=%H
    time_key timestamp
    include_time_key
    time_as_epoch
    reduced_redundancy true
    format json
    utc
    buffer_chunk_limit 2048m
  </match>
<% end -%>
</label>
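To make the partition layout concrete, here is a small sketch of how an event timestamp maps to a Hive-style S3 prefix under `time_slice_format dt=%Y-%m-%d/hh=%H` with `utc`. The function name and the sample `env` / `role` values are hypothetical, and the "/" between the path and the time slice is an assumption: the real object key depends on settings elided ("...") in the config.

```python
from datetime import datetime, timezone


def s3_prefix(env: str, role: str, tag: str, epoch_seconds: int) -> str:
    """Build the dt=/hh= partition prefix for an event timestamp (UTC)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # Same strftime pattern as time_slice_format in the config
    time_slice = ts.strftime("dt=%Y-%m-%d/hh=%H")
    # Assumption: path and time slice are joined with "/"
    return f"fluentd/{env}/{role}/{tag}/{time_slice}"


# 1453899600 is 2016-01-27 13:00:00 UTC
example = s3_prefix("prd", "api", "smartnews.user_activity", 1453899600)
```

Because the prefix ends in `dt=.../hh=...`, each hour of data can later be registered as a Hive partition without moving files.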
Capture DynamoDB Streams
<source>
type dynamodb_streams
stream_arn YOUR_DDB_STREAMS_ARN
pos_file /path/to/table.pos
fetch_interval 1
fetch_size 100
</source>
https://github.com/takus/fluent-plugin-dynamodb-streams
(Diagram: DynamoDB, DynamoDB Streams, Fluentd, AWS Lambda, Amazon S3)
Recommended Practices
• Keep configuration as simple as possible
• fluentd can cover everything, but shouldn't
• keep it stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration
Monitor Fluentd Status
• Monitor traffic volume & retry count by Datadog
• Datadog's fluentd integration
• fluent-plugin-flowcounter
• fluent-plugin-dogstatsd
Archive to Amazon S3
• I have 2 recommended settings
• versioning
• lets you recover from human error
• lifecycle policy
• minimizes storage cost
• archive to IA or Glacier xx days after the creation date
• keep previous versions for xx days
They will save you in the future!!
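The two lifecycle settings can be expressed as an S3 lifecycle configuration document. The sketch below only builds the dictionary; the rule ID is hypothetical, and the day counts are placeholders because the slide leaves them as "xx".

```python
def lifecycle_rules(archive_after_days: int, keep_versions_days: int,
                    storage_class: str = "STANDARD_IA") -> dict:
    """Build an S3 lifecycle configuration with two behaviors:
    archive current objects, and expire noncurrent (previous) versions."""
    return {
        "Rules": [{
            "ID": "archive-and-expire-old-versions",  # hypothetical rule name
            "Filter": {"Prefix": ""},                 # apply to the whole bucket
            "Status": "Enabled",
            "Transitions": [
                {"Days": archive_after_days, "StorageClass": storage_class}
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": keep_versions_days},
        }]
    }


# Placeholder values; choose them to fit your own retention needs
config = lifecycle_rules(archive_after_days=30, keep_versions_days=7)
```

A dictionary of this shape is what boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`; use `"GLACIER"` as the storage class to archive to Glacier instead of IA.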
Batch Layer
Various ETL Tasks
• Extract
• dump MySQL records by Embulk
• make files on S3 readable to Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora
Hive
• Most popular project in the Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• converts SQL-like queries into MR jobs
• Tez engine not adopted yet
• Amazon EMR doesn't support it for now
• limited improvement for our queries
How to process JSON?
A. Transform into columnar tables periodically
• requires a conversion job
• better performance
B. Use JSON-SerDe for temporary analysis
• easy way to query raw JSON text files
• requires "drop table" to change the schema
• performance is not good
Transform Tables
-- Make S3 files readable by Hive
ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION
(dt='${DATE}', hh='${HOUR}');
-- Transform text files into columnar files (Flatten JSON)
INSERT OVERWRITE TABLE activities
PARTITION (dt='${DATE}', action)
SELECT
user_id, timestamp, os, country,
data,
action
FROM raw_activities
LATERAL VIEW json_tuple(
raw_activities.json,
'userId','timestamp','platform','country','action','data'
) a as user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
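For readers less familiar with `json_tuple`, the flattening step above can be sketched in Python: pick a fixed set of top-level JSON keys and rename them to column names. The key and column lists mirror the query; the sample record is hypothetical. Note that Hive's `json_tuple` returns strings, while this sketch keeps the parsed JSON types.

```python
import json

# JSON keys and the column names they map to, as in the Hive query
FIELDS = ["userId", "timestamp", "platform", "country", "action", "data"]
COLUMNS = ["user_id", "timestamp", "os", "country", "action", "data"]


def flatten(raw_json: str) -> dict:
    """Extract top-level keys from one JSON line into a flat row."""
    record = json.loads(raw_json)
    # Missing keys become None, similar to NULL in Hive
    return {col: record.get(field) for field, col in zip(FIELDS, COLUMNS)}


row = flatten(
    '{"userId": 42, "timestamp": 1453899600, "platform": "ios",'
    ' "country": "JP", "action": "viewArticle"}'
)
```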
JSON-SerDe
-- Define table with SERDE
CREATE TABLE json_table (
country string,
languages array<string>,
religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Result: 10
SELECT religions['catholic'][0] FROM json_table;
cf. hive-ruby-scripting
-- Define your ruby (JRuby) script
SET rb.script=
require 'json'
def parse (json)
j = JSON.load(json)
j['profile']['attribute1']
end
;
-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
https://github.com/gree/hive-ruby-scripting
Spark
http://www.slideshare.net/smartnews/aws-meetupapache-spark-on-emr
Self-Serve via AWS CLI
# Create an EMR cluster that runs Hive & Spark & Ganglia
aws emr create-cluster \
  --name "My Cluster" \
  --release-label emr-4.2.0 \
  --applications Name=Hive Name=Spark Name=GANGLIA \
  --ec2-attributes KeyName=myKey \
  --instance-type c3.4xlarge \
  --instance-count 4 \
  --use-default-roles
Minimize expenses
• Use Spot Instances as much as possible
• typically a 50-90% discount
• select instance types with stable prices
• C3 families spike often :(
• Dynamic cluster resizing
• x2 capacity during daily batch job
• 1/2 capacity around midnight
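The resizing rule can be sketched as a function of the hour of day. This is an illustration only: the hour ranges and the base size below are hypothetical, since the slides do not say when the daily batch runs or how the resize is triggered.

```python
def target_capacity(base: int, hour: int,
                    batch_hours: range, night_hours: range) -> int:
    """Return the desired instance count for a given hour of the day."""
    if hour in batch_hours:
        return base * 2            # x2 capacity during the daily batch job
    if hour in night_hours:
        return max(1, base // 2)   # 1/2 capacity when the cluster is quiet
    return base


# Hypothetical schedule (hours of the day)
BATCH_HOURS = range(2, 5)
NIGHT_HOURS = range(15, 18)
capacities = [target_capacity(8, h, BATCH_HOURS, NIGHT_HOURS) for h in (3, 12, 16)]
```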
Handle Data Dependencies
Typical Anti-Pattern
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
0 * * * * app hive -f query_4.hql
1 * * * * app hive -f query_5.hql
Workflow Management
• Define dependencies
• task E is executed after finishing task C and task D
• Scheduling
• task A is kicked off after 09:00 AM
• throttle concurrent running of the same task
• Monitoring
• notification on failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
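The dependency part is what cron cannot express. A minimal sketch with the standard library's `graphlib` (Python 3.9+) shows how declared dependencies yield a valid execution order. Only "E runs after C and D" comes from the slide; the edges from A and B are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Map each task to the set of tasks it depends on
dependencies = {
    "C": {"A"},       # hypothetical
    "D": {"B"},       # hypothetical
    "E": {"C", "D"},  # task E is executed after task C and task D finish
}

# static_order() yields every task after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
```

A workflow manager adds scheduling, retries, and monitoring on top of this ordering, which is why timing-based cron entries like the ones above are fragile.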
Airflow
• A workflow management system
• define workflows in Python
• built-in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/
Define Tasks
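# Imports follow the Airflow 1.x tutorial; newer releases import
# BashOperator from airflow.operators.bash instead
from airflow import DAG
from airflow.operators import BashOperator

# default_args (owner, start_date, ...) is assumed to be defined beforehand
dag = DAG('tutorial', default_args=default_args)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

t3 = BashOperator(
    task_id='templated',
    bash_command="""
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

# t2 and t3 both run after t1
t2.set_upstream(t1)
t3.set_upstream(t1)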
(Slide labels: Task, Dependencies, Python code, DAG)
Workflow as Code
Code is deployed automatically after merging into master
Visualize Dependencies
What Is Done and What Is Not?
Alerting to Slack
• SLA Violation
• task A should be done by 00:00 PM
• another team's task K depends on task A
• Output validation failure
• stop the following tasks if the output is doubtful
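Output validation can be as simple as a sanity check that gates the downstream tasks. The sketch below is one hypothetical rule (row count compared with the previous run), not the check SmartNews actually uses; the function names and the 50% tolerance are assumptions.

```python
def output_looks_valid(row_count: int, previous_row_count: int,
                       tolerance: float = 0.5) -> bool:
    """Flag an output as doubtful if it is empty or its size changed too much."""
    if row_count <= 0:
        return False
    if previous_row_count <= 0:
        return True  # nothing to compare against
    change = abs(row_count - previous_row_count) / previous_row_count
    return change <= tolerance


def run_pipeline(row_count, previous_row_count, downstream_tasks):
    """Run downstream tasks only when the upstream output passes validation."""
    if not output_looks_valid(row_count, previous_row_count):
        return []  # stop here; this is where an alert would be sent
    return [task() for task in downstream_tasks]
```

For example, `run_pipeline(10, 100, tasks)` runs nothing because the output shrank by 90%, while `run_pipeline(100, 110, tasks)` runs every downstream task.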
Retry from Web UI
Once the history is cleared, the Airflow scheduler backfills the cleared runs
Retry from CLI
# Clear some histories from 2016-01-01
airflow clear etl_smartnews \
  --task_regex user_ \
  --downstream \
  --start_date 2016-01-01
# Backfill uncompleted tasks
airflow backfill etl_smartnews \
  --start_date 2016-01-01
Check Rendered Query
How Long Does Each Task Take?
Pluggable Architecture
• Built-in plugins
• operator: bash, hive, presto, mysql
• transfer: hive_to_mysql
• sensor: wait_hive_partition, wait_s3_file
• Wrote our own plugin
• mysql_partition
Examples
# dag=... and other required arguments are omitted, as on the slide

# 1. Wait for the S3 file to be created
user_sensor = S3KeySensor(
    task_id='wait_user',
    bucket_name='smartnews',
    bucket_key='user/dt={{ ds }}/dump.csv',
)

# 2. After the file is created, run the ETL query
etl = HiveOperator(
    task_id="task1",
    hql="INSERT OVERWRITE INTO...."
)
etl.set_upstream(user_sensor)

# 3. After that, import the result into MySQL
#    ("import" is a reserved word in Python, so the task is named load;
#    name and table are variables defined elsewhere)
load = HiveToMySqlTransfer(
    task_id=name,
    mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table,
    sql="SELECT country, count(*) FROM %s" % table,
    mysql_table=table
)
load.set_upstream(etl)
Serving Layer
Provides batch views in a low-latency, ad-hoc way
Presto
• A distributed SQL query engine
• join multiple data sources (Hive + MySQL)
• supports standard ANSI SQL
• designed to handle TB- to PB-scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
Presto Architecture
(Diagram: a Client, one Presto Coordinator, and six Presto Workers; data sources are Amazon S3, Kinesis Stream ※, Amazon RDS, and Amazon Aurora; Amazon EMR (Hive Metastore) provides Hive table metadata (S3 access only))
1. Query with Standard SQL
2. Generate execution plan
3. Dispatch tasks into multiple workers
4. Scan data concurrently
5. Aggregate data without disk I/O
6. Return result to client
※ https://github.com/qubole/presto-kinesis
Why Presto?
• Join multiple data sources
• skip large parts of ETL process
• can merge data from Hive/MySQL/Kinesis/PipelineDB
• Low latency
• ~30s to scan billions of records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling
Use case: A/B Test
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action varchar
abtest array<map<varchar, bigint>>
url varchar
-- Summarize page view per A/B Test identifier
-- for comparing two algorithms v1 & v2
SELECT
  dt,
  t['behaviorId'],
  count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;
2015-12-01 | algorithm_v1 | 40000
2015-12-01 | algorithm_v2 | 62000
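What `CROSS JOIN UNNEST(abtest)` does may be easier to see in plain Python: each element of the `abtest` array becomes its own row before filtering and grouping. This is a sketch with hypothetical in-memory rows, not a Presto client example.

```python
from collections import Counter


def page_views_per_behavior(rows, definition_id, dt_prefix):
    """Count viewArticle events per (dt, behaviorId) for one A/B test definition."""
    counts = Counter()
    for row in rows:
        if row["action"] != "viewArticle" or not row["dt"].startswith(dt_prefix):
            continue
        # UNNEST: every map in the abtest array is treated as a separate row
        for t in row["abtest"]:
            if t.get("definitionId") == definition_id:
                counts[(row["dt"], t["behaviorId"])] += 1
    return dict(counts)


# Hypothetical rows shaped like hive.default.user_activities
rows = [
    {"dt": "2016-01-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": 1}]},
    {"dt": "2016-01-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": 2},
                {"definitionId": 7, "behaviorId": 1}]},
    {"dt": "2016-01-01", "action": "openApp",
     "abtest": [{"definitionId": 163, "behaviorId": 1}]},
]
pv = page_views_per_behavior(rows, definition_id=163, dt_prefix="2016-01-")
```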
Use case: Troubleshoot
-- Store access logs to S3, and query to them
-- Summarize access & 95pct response time by SQL
SELECT
from_unixtime(timestamp),
count(*) as access,
approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx'
GROUP BY timestamp ORDER BY timestamp
;
2015-11-04 22:00:00.000 | 6377 | 0.522
2015-11-04 22:00:01.000 | 3580 | 0.422
Scheduled Auto Scaling
$ aws autoscaling describe-scheduled-actions
{
"ScheduledUpdateGroupActions": [
{
"DesiredCapacity": 2,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "59 14 * * *",
"ScheduledActionName": "scalein-2359-jst"
},
{
"DesiredCapacity": 20,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "45 0 * * 1-5",
"ScheduledActionName": "scaleout-0945-jst"
}
]
}
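The `Recurrence` fields above are cron expressions in UTC, while the action names refer to JST (UTC+9). A small sketch of the conversion, assuming plain numeric minute and hour fields; the helper name is hypothetical, and it ignores any day-of-week shift when the time crosses midnight.

```python
def utc_cron_to_jst(recurrence: str) -> str:
    """Convert the minute/hour of a UTC cron expression to a JST HH:MM string."""
    minute, hour = recurrence.split()[:2]
    jst_hour = (int(hour) + 9) % 24  # JST is UTC+9, no daylight saving time
    return f"{jst_hour:02d}:{int(minute):02d}"


times = [utc_cron_to_jst("59 14 * * *"), utc_cron_to_jst("45 0 * * 1-5")]
```

So the workers scale in to 2 at 23:59 JST every day and scale out to 20 at 09:45 JST on weekdays, matching the names `scalein-2359-jst` and `scaleout-0945-jst`.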
Does Presto Cover Everything? No!
• Fixed system on Amazon Aurora (or other RDB)
• provides KPIs for products & business
• requires high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all datasets on the data platform
• requires high scalability
• has flexibility (join various data sources)
Why Fixed vs Ad-hoc?
• Difficulties with an ad-hoc-only solution
• difficult to prevent heavy queries
• large distinct counts exhaust computing resources
• reduces Presto maintainability
Output Data
Chartio
• Dashboard as a Service
• helps businesses analyze and track their critical data
• one of AWS partners (※)
• Combine multiple data sources in one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• can join BigQuery + MySQL internally
• Easy for everyone to use
• everyone can make their own dashboard
• write SQL directly / generate queries by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
Creating a dashboard
1. Build a query (drag & drop / SQL)
2. Add steps (filter, sort, modify)
3. Select a visualization (table, graph)
Examples
Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintained an in-house dashboard written in Rails
• everyone got tired of maintaining it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• very important factor for dashboard tool
Missing Pieces of Chartio
• No programmable API provided
• need to edit dashboards / charts manually
• No rollback feature
• all changes are recorded, but you cannot roll back to a previous state
• workaround: clone => edit => rename
Speed Layer
Why Does Speed Matter?
Today’s News is Wrapping
Tomorrow’s Fish and Chips
↑
Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
How Does News Behave?
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Use cases
• Re-rank news articles by user feedback
• track user's positive/negative signal
• consider gender, age, location, interests
• Realtime article monitoring
• detect high bounce rates (the article may be broken)
• make realtime reporting dashboard for A/B test
Realtime Re-Ranking
ref. Integrating stream processing (Spark Streaming + Kinesis) with offline processing (Hive)
www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
(Diagram: Amazon CloudSearch, Search API, API Gateway, Kinesis Stream, Amazon S3, Amazon EMR, and DynamoDB; flows labeled Realtime Feedback, Re-rank Articles, Article Metadata, User Interests, and User Behaviors; Offline Process by Hive / Spark)
Realtime Monitoring
(Diagram components: API Gateway, Stream, three Continuous Views inside PipelineDB, Chartio, AWS Lambda ※1, Slack)
• Record: discarded soon after being consumed by a Continuous View
• Continuous View: incrementally updated in realtime
• Access Continuous Views with a PostgreSQL client
※1 Use cron as of 26 Feb. 2016; migrate soon after VPC is supported
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day,hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[0.90, 0.95, 0.99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
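The `max_age = '5 minutes'` view keeps only recent events in its result. As a conceptual sketch (not how PipelineDB implements it), a sliding-window counter can be written with a deque; the class name and sample timestamps are hypothetical, and events are assumed to arrive in time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last max_age_seconds."""

    def __init__(self, max_age_seconds: float):
        self.max_age = max_age_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, timestamp: float) -> None:
        self.events.append(timestamp)

    def count(self, now: float) -> int:
        # Drop events that have aged out of the window
        while self.events and self.events[0] <= now - self.max_age:
            self.events.popleft()
        return len(self.events)


imps = SlidingWindowCounter(max_age_seconds=300)  # 5 minutes
for t in (0, 100, 250, 400):
    imps.add(t)
recent = imps.count(now=420)  # only events newer than t=120 remain
```

Raw events are discarded once they fall out of the window, which mirrors the "discard raw record soon after consumed" behavior described for continuous views.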
Summary
Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running cost
• be open to uncertain future
My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL to DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region
Thank you!!

 
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel PartnersCraeg Strong
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB
 

Similar a Building a Sustainable Data Platform on AWS (20)

Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql database
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.js
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
 
React state management with Redux and MobX
React state management with Redux and MobXReact state management with Redux and MobX
React state management with Redux and MobX
 
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
 
Orchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsOrchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functions
 
DW on AWS
DW on AWSDW on AWS
DW on AWS
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
 

Más de SmartNews, Inc.

SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNews, Inc.
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤SmartNews, Inc.
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへSmartNews, Inc.
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNews, Inc.
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysSmartNews, Inc.
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側SmartNews, Inc.
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews, Inc.
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews, Inc.
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews, Inc.
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews, Inc.
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews, Inc.
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合SmartNews, Inc.
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews, Inc.
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」SmartNews, Inc.
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager NightSmartNews, Inc.
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews, Inc.
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法SmartNews, Inc.
 

Más de SmartNews, Inc. (19)

SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへ
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdays
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解
 
NLP in SmartNews
NLP in SmartNewsNLP in SmartNews
NLP in SmartNews
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservices
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォーム
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager Night
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
 

Último

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Building a Sustainable Data Platform on AWS

  • 1. Building a Sustainable Data Platform on AWS Takumi Sakamoto 2016.01.27
  • 7. What is SmartNews? • News Discovery App • Launched in 2012 • 15M+ downloads worldwide https://www.smartnews.com/en/
  • 8. Our Mission the world's quality information? the people who need it? How?
  • 9. Machine Learning URLs Found Structure Analysis Semantics Analysis Importance Estimation Diversification Internet 100,000+ /day 1000+ /day Feedback Deliver Trending Stories
  • 10. Data Platform Use Cases • Product development • track KPIs such as DAU and MAU • A/B tests for new features, onboarding, etc. • ad-hoc analysis • Provide data to applications • real-time re-ranking of news articles • CTR prediction for the ads system • dashboard service for media partners
  • 11. Data & Its Numbers • User activities • ~100 GB per day (compressed) • 60+ record types • User demographics, configurations, etc. • 15M+ records • Article metadata • 100K+ records per day
  • 13. Sustainable Data Platform • Provide a reliable and scalable "Lambda Architecture" • Minimize both operation & running costs • Be open to an uncertain future
  • 15. Why Sustainable? • Do a lot with a few engineers • no one is a full-time maintainer • avoid wasting too much time • Empower brilliant engineers in SmartNews • everything should be as self-serve as possible • don't ask for permission, beg for forgiveness
  • 17. λ Architecture at SmartNews Input Batch Serving Speed Output
  • 18. Design Principles • Decoupled "Computation" and "Storage" layers • multiple consumers can use the same data • run consumers on Spot Instances • prevent serious data loss with minimum effort • Use the right tool for the job • leverage AWS managed services as much as possible • fill in the missing pieces with Presto & PipelineDB
  • 19. An Example • Batch Layer: run multiple EMR clusters, one per use case (diagram components: Amazon S3, Amazon EMR AMI 3.x, Amazon EMR Hive, Amazon EMR AMI 4.x, Amazon EMR Spark) • General Users: "We’re satisfied with current version" • Application Engineer: "I wanna upgrade hive" • Ad Engineer: "I wanna combine news data with ad data" • Data Scientist: "I wanna test my algorithm with the latest spark" • Speed Layer: consume the same data for each use case (diagram components: Kinesis Stream, Spark on EMR, AWS Lambda) • Data Scientist: "I wanna consume streaming data by Spark" • Application Engineer: "I wanna add a streaming monitor by Lambda" • AWS managed services • Data replicated across multiple AZs • High availability
  • 21. Collect Events with Fluentd • Forwarder (runs on each instance) • store JSON events in S3 • forward events to aggregators • collect metrics and post them to Datadog • Aggregator • send events to Kinesis & PipelineDB • other reporting tasks (not covered today)
  • 22. Forward to S3

    td-agent.conf:

      <source>
        @type tail
        format json
        path /data/log/user_activity.log
        pos_file /data/log/pos/user_activity.pos
        tag smartnews.user_activity
        time_key timestamp
      </source>

      <match smartnews.user_activity>
        @type copy
        <store>
          @type relabel
          @label @s3
        </store>
        <store>
          @type forward
          @label @forward
        </store>
      </match>

      @include conf.d/s3.conf
      @include conf.d/forward.conf

    conf.d/s3.conf:

      <label @s3>
        <% node[:td_agent][:s3].each do |c| -%>
        <match <%= c[:tag] %>>
          @id s3.<%= c[:tag] %>
          @type s3
          ...
          path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
          time_slice_format dt=%Y-%m-%d/hh=%H
          time_key timestamp
          include_time_key
          time_as_epoch
          reduced_redundancy true
          format json
          utc
          buffer_chunk_limit 2048m
        </match>
        <% end -%>
      </label>
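
The `path` and `time_slice_format` settings on this slide decide where each hour of events lands in S3. A minimal Python sketch of that mapping follows. It assumes the time slice sits directly under the path, separated by a slash. The function name and the `env` and `role` values are hypothetical, and the object-name suffix that the S3 output plugin appends is left out.

```python
from datetime import datetime, timezone


def s3_partition_prefix(env, role, tag, epoch_seconds):
    """Return the hourly S3 prefix for an event (illustrative only)."""
    # `utc` in the config means the time slice is formatted in UTC.
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # Mirrors: path fluentd/<env>/<role>/<tag>
    #          time_slice_format dt=%Y-%m-%d/hh=%H
    time_slice = ts.strftime("dt=%Y-%m-%d/hh=%H")
    return "fluentd/{}/{}/{}/{}".format(env, role, tag, time_slice)


# Hypothetical env and role; the tag is the one used on the slide.
prefix = s3_partition_prefix(
    "production", "api", "smartnews.user_activity", 1453852800
)
print(prefix)
```

The `dt=.../hh=...` layout follows the Hive partition naming convention, which is what lets a later job register each hour as a table partition.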
  • 23. Capture DynamoDB Streams

      <source>
        type dynamodb_streams
        stream_arn YOUR_DDB_STREAMS_ARN
        pos_file /path/to/table.pos
        fetch_interval 1
        fetch_size 100
      </source>

    https://github.com/takus/fluent-plugin-dynamodb-streams
    Diagram components: DynamoDB, DynamoDB Streams, Amazon S3, AWS Lambda, Fluentd
  • 24. Recommended Practices • Keep the configuration as simple as possible • fluentd can cover everything, but shouldn't • keep it stateless • Use v0.12 or later • "Filter": better performance • "Label": eliminates 'output_tag' configuration
  • 25. Monitor Fluentd Status • Monitor traffic volume & retry count by Datadog • Datadog's fluentd integration • fluent-plugin-flowcounter • fluent-plugin-dogstatsd
  • 26. Archive to Amazon S3 • Two recommended settings • versioning • makes it possible to recover from human error • lifecycle policy • minimizes storage cost • archive to IA or Glacier xx days after the creation date • keep previous versions for xx days • These settings will save you in the future!
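
The lifecycle policy described on this slide can be written down as data. The sketch below builds a rule set as a plain Python dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument, without calling AWS. The slide leaves the day counts as "xx", so the numbers used here are assumptions, as is the rule ID. The sketch also chains IA and then Glacier, whereas the slide says "IA or Glacier".

```python
def build_lifecycle_configuration(ia_days, glacier_days, noncurrent_days):
    """Build an S3 lifecycle configuration dict (no AWS call is made)."""
    if not ia_days < glacier_days:
        raise ValueError("objects should move to IA before Glacier")
    return {
        "Rules": [
            {
                "ID": "archive-and-expire-old-versions",  # hypothetical name
                "Filter": {"Prefix": ""},  # empty prefix: whole bucket
                "Status": "Enabled",
                # Move objects to cheaper storage N days after creation.
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                # With versioning on, drop old versions after a grace period.
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": noncurrent_days
                },
            }
        ]
    }


# Assumed values; the slide does not give the actual day counts.
config = build_lifecycle_configuration(
    ia_days=30, glacier_days=90, noncurrent_days=30
)
print(config["Rules"][0]["Transitions"])
```

Keeping the policy as a reviewed data structure makes the retention choices visible in code review instead of living only in the console.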
  • 28. Various ETL Tasks • Extract • dump MySQL records with Embulk • make files on S3 readable by Hive • Transform • transform text files into columnar files (RCFile, ORC) • generate features for machine learning • aggregate records (by country, by channel) • Load • load aggregated metrics into Amazon Aurora
  • 29. Hive • Most popular project in the Hadoop ecosystem • famous for its lovely logo :) • HiveQL and MapReduce • converts SQL-like queries into MapReduce jobs • Tez engine not adopted yet • Amazon EMR doesn't support it at the moment • it would bring only limited improvement to our queries
  • 30. How to process JSON? A. Transform into a columnar table periodically • requires a conversion job • better performance B. Use JSON-SerDe for temporary analysis • an easy way to query raw JSON text files • requires "drop table" to change the schema • performance is not good
  • 31. Transform Tables

      -- Make S3 files readable by Hive
      ALTER TABLE raw_activities
        ADD IF NOT EXISTS PARTITION (dt='${DATE}', hh='${HOUR}');

      -- Transform text files into columnar files (flatten JSON)
      INSERT OVERWRITE TABLE activities PARTITION (dt='${DATE}', action)
      SELECT user_id, timestamp, os, country, data, action
      FROM raw_activities
      LATERAL VIEW json_tuple(
        raw_activities.json,
        'userId', 'timestamp', 'platform', 'country', 'action', 'data'
      ) a AS user_id, timestamp, os, country, action, data
      WHERE dt = '${DATE}'
      CLUSTER BY os, country, action, user_id
      ;
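
To make the flattening step concrete, here is a small Python sketch of what the `json_tuple` call on this slide does to a single raw line: it picks six top-level keys and renames them to column names. The sample event is hypothetical. Unlike Hive, which returns the extracted values as strings, this sketch keeps the JSON types.

```python
import json

# Source JSON key -> output column, mirroring the json_tuple call.
KEY_TO_COLUMN = [
    ("userId", "user_id"),
    ("timestamp", "timestamp"),
    ("platform", "os"),
    ("country", "country"),
    ("action", "action"),
    ("data", "data"),
]


def flatten(raw_json):
    """Turn one raw JSON line into a flat dict of named columns."""
    record = json.loads(raw_json)
    # Keys that are absent become None (NULL in the Hive table).
    return {column: record.get(key) for key, column in KEY_TO_COLUMN}


# Hypothetical event; the real record layout is not shown in the slides.
row = flatten(
    '{"userId": 1, "timestamp": 1453852800, '
    '"platform": "ios", "country": "JP", "action": "open"}'
)
print(row)
```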
  • 32. JSON-SerDe

      -- Define table with SERDE
      CREATE TABLE json_table (
        country string,
        languages array<string>,
        religions map<string,array<int>>
      )
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      STORED AS TEXTFILE;

      -- Result: 10
      SELECT religions['catholic'][0] FROM json_table;
  • 33. cf. hive-ruby-scripting

      -- Define your ruby (JRuby) script
      SET rb.script=
      require 'json'
      def parse (json)
        j = JSON.load(json)
        j['profile']['attribute1']
      end
      ;

      -- Use the script in HQL
      SELECT rb_exec('&parse', json) FROM user;

    https://github.com/gree/hive-ruby-scripting
  • 35. Self-Serve via AWS CLI

      # Create an EMR cluster that runs Hive, Spark & Ganglia
      aws emr create-cluster \
        --name "My Cluster" \
        --release-label emr-4.2.0 \
        --applications Name=Hive Name=Spark Name=GANGLIA \
        --ec2-attributes KeyName=myKey \
        --instance-type c3.4xlarge \
        --instance-count 4 \
        --use-default-roles
  • 36. Minimize expenses • Use Spot Instances as much as possible • typically a 50-90% discount • select instance types with stable prices • C3 family prices spike often :( • Dynamic cluster resizing • 2x capacity during the daily batch job • 1/2 capacity around midnight
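
The effect of the two levers on this slide can be estimated with simple arithmetic. The sketch below is a rough model only: the node count, hourly price, and the number of batch and quiet hours are all hypothetical, and the 50% discount is simply the low end of the range quoted on the slide.

```python
def daily_cost(base_nodes, hourly_price, batch_hours, quiet_hours,
               spot_discount):
    """Estimated cost of one day with resizing (all inputs hypothetical)."""
    normal_hours = 24 - batch_hours - quiet_hours
    node_hours = (
        base_nodes * 2 * batch_hours      # 2x capacity during the batch
        + base_nodes * 0.5 * quiet_hours  # 1/2 capacity around midnight
        + base_nodes * normal_hours       # baseline the rest of the day
    )
    return node_hours * hourly_price * (1 - spot_discount)


# Fixed-size on-demand cluster: 4 nodes * 24 hours * 1.0 per node-hour.
fixed_on_demand = 4 * 24 * 1.0

# Same baseline with resizing and a 50% Spot discount.
resized_spot = daily_cost(
    base_nodes=4, hourly_price=1.0,
    batch_hours=2, quiet_hours=6, spot_discount=0.5,
)
print(fixed_on_demand, resized_spot)  # prints 96.0 46.0
```

In this toy case, resizing alone trims node-hours from 96 to 92, and the Spot discount accounts for most of the saving.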
  • 38. Typical Anti-Pattern

      5  * * * * app hive -f query_1.hql
      15 * * * * app hive -f query_2.hql
      30 * * * * app hive -f query_3.hql
      0  * * * * app hive -f query_4.hql
      1  * * * * app hive -f query_5.hql
  • 39. Workflow Management • Define dependencies • task E is executed after finishing task C and task D • Scheduling • task A is kicked after 09:00 AM • throttle concurrent running of the same task • Monitoring • notification in failure • task C must finish before 01:00 PM (SLA) cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
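
Declaring dependencies, rather than guessing start minutes in cron, lets a scheduler derive a valid run order. The sketch below shows the idea with the Python standard library. Only the rule that E waits for C and D comes from the slide, and the rest of the graph is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# task -> set of tasks it must wait for.
# Only E's dependencies come from the slide; the others are made up.
dependencies = {
    "B": {"A"},
    "C": {"A"},
    "D": {"B"},
    "E": {"C", "D"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A workflow manager performs the same ordering continuously, and adds scheduling, throttling, and monitoring on top.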
  • 40. Airflow • A workflow management system • define workflows in Python • built-in shiny UI & CLI • pluggable architecture http://nerds.airbnb.com/airflow/
  • 41. Define Tasks dag = DAG('tutorial', default_args=default_args) t1 = BashOperator( task_id='print_date', bash_command='date', dag=dag) t2 = BashOperator( task_id='sleep', bash_command='sleep 5', retries=3, dag=dag) t3 = BashOperator( task_id='templated', bash_command=""" {% for i in range(5) %} echo "{{ ds }}" echo "{{ macros.ds_add(ds, 7)}}" echo "{{ params.my_param }}" {% endfor %} """, params={'my_param': 'Parameter I passed in'}, dag=dag) t2.set_upstream(t1) t3.set_upstream(t1) Task Dependencies Python code DAG
  • 42. Workflow as Code Deploy code automatically after merging into master
  • 44. What Is Done and What Is Not?
  • 45. Alerting to Slack • SLA violation • task A should be done by 00:00 PM • another team's task K depends on task A • Output validation failure • stop the downstream tasks if the output is doubtful
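The output-validation idea can be sketched as follows: compare today's row count with the previous run, and raise an exception when the change looks doubtful so that downstream tasks do not run. The 50% tolerance, the function names, and the message format are assumptions for illustration; the deck does not describe the actual checks. The payload uses the `{"text": ...}` shape accepted by Slack incoming webhooks, but this sketch only builds it and does not send anything.

```python
import json


class OutputValidationError(Exception):
    """Raised when a task's output looks doubtful."""


def validate_output(row_count, previous_row_count, tolerance=0.5):
    """Return True when the row count is within `tolerance` of the previous run."""
    if previous_row_count <= 0:
        # No baseline yet: accept any non-empty output.
        return row_count > 0
    change = abs(row_count - previous_row_count) / previous_row_count
    return change <= tolerance


def build_slack_payload(task_id, reason):
    """Build the JSON body for a Slack incoming webhook (not sent here)."""
    return json.dumps({"text": f":warning: {task_id}: {reason}"})


def check_or_fail(task_id, row_count, previous_row_count):
    """Raise so the workflow manager marks the task failed and skips downstream tasks."""
    if not validate_output(row_count, previous_row_count):
        reason = f"row count changed from {previous_row_count} to {row_count}"
        raise OutputValidationError(build_slack_payload(task_id, reason))
```

Failing loudly here is the point: with default settings, a failed task in a workflow manager such as Airflow prevents its dependents from running on suspicious data.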
  • 46. Retry from Web UI Once the history is cleared, the Airflow scheduler backfills it
  • 47. Retry from CLI # Clear task history from 2016-01-01 airflow clear etl_smartnews --task_regex user_ --downstream --start_date 2016-01-01 # Backfill uncompleted tasks airflow backfill etl_smartnews --start_date 2016-01-01
  • 49. How Long Does Each Task Take?
  • 50. Pluggable Architecture • Built-in plugins • operator: bash, hive, presto, mysql • transfer: hive_to_mysql • sensor: wait_hive_partition, wait_s3_file • Wrote our own plugin • mysql_partition
  • 51. Examples user_sensor = S3KeySensor( task_id='wait_user', bucket_name='smartnews', bucket_key='user/dt={{ ds }}/dump.csv', ) etl = HiveOperator( task_id="task1", hql="INSERT OVERWRITE INTO...." ) etl.set_upstream(user_sensor) import_to_mysql = HiveToMySqlTransfer( task_id=name, mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table, sql="SELECT country, count(*) FROM %s" % table, mysql_table=table ) import_to_mysql.set_upstream(etl) Wait for the S3 file to be created After the file is created, run the ETL query After that, import into MySQL
  • 53. Provides batch views in a low-latency, ad-hoc way
  • 54. Presto • A distributed SQL query engine • joins multiple data sources (Hive + MySQL) • supports standard ANSI SQL • designed to handle TB- to PB-scale data cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
  • 55. Presto Architecture [Diagram] Components: Client, Presto Coordinator, Presto Workers; data sources: Amazon S3, Kinesis Stream, Amazon RDS, Amazon Aurora; Amazon EMR (Hive Metastore) provides Hive table metadata (S3 access only). Steps: 1. Query with standard SQL 2. Generate execution plan 3. Dispatch tasks to multiple workers 4. Scan data concurrently 5. Aggregate data without disk I/O 6. Return result to client ※ https://github.com/qubole/presto-kinesis
  • 56. Why Presto? • Join multiple data sources • skip large parts of the ETL process • enables merging Hive/MySQL/Kinesis/PipelineDB • Low latency • ~30s to scan billions of records in S3 • Low maintenance cost • stateless, and easy to integrate with Auto Scaling
  • 57. Use case: A/B Test -- Suppose that this table exists DESC hive.default.user_activities; user_id bigint action varchar abtest array<map<varchar, bigint>> url varchar -- Summarize page view per A/B Test identifier -- for comparing two algorithms v1 & v2 SELECT   dt,   t['behaviorId'],   count(*) as pv FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t) WHERE dt like '2016-01-%' AND action = 'viewArticle' AND t['definitionId'] = 163 GROUP BY dt, t['behaviorId'] ORDER BY dt ; 2015-12-01 | algorithm_v1 | 40000 2015-12-01 | algorithm_v2 | 62000
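For readers unfamiliar with `CROSS JOIN UNNEST`, the following Python sketch reproduces the logic of the query on in-memory rows: each element of the `abtest` array becomes its own row, rows are filtered by `definitionId`, and page views are counted per `(dt, behaviorId)`. The sample rows are hypothetical, the `dt` filter is omitted for brevity, and behavior identifiers are shown as integers to match the declared `map<varchar, bigint>` type.

```python
from collections import Counter


def page_views_by_behavior(rows, definition_id):
    """Count 'viewArticle' rows per (dt, behaviorId) for one A/B test definition."""
    counts = Counter()
    for row in rows:
        if row["action"] != "viewArticle":
            continue
        # Equivalent of CROSS JOIN UNNEST(abtest): one iteration per array element.
        for test in row["abtest"]:
            if test.get("definitionId") == definition_id:
                counts[(row["dt"], test["behaviorId"])] += 1
    return counts


# Hypothetical rows (assumption: not real data from the deck).
rows = [
    {"dt": "2016-01-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": 1}]},
    {"dt": "2016-01-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": 2},
                {"definitionId": 200, "behaviorId": 9}]},
    {"dt": "2016-01-01", "action": "openApp",
     "abtest": [{"definitionId": 163, "behaviorId": 1}]},
    {"dt": "2016-01-02", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": 1}]},
]
result = page_views_by_behavior(rows, 163)
```

Because a user can be enrolled in several tests at once, the array must be expanded before filtering; filtering on the whole array would not isolate the test under study.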
  • 58. Use case: Troubleshoot -- Store access logs in S3 and query them -- Summarize access & 95pct response time by SQL SELECT from_unixtime(timestamp), count(*) as access, approx_percentile(reqtime, 0.95) as pct95_reqtime FROM hive.default.access_log WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx' GROUP BY timestamp ORDER BY timestamp ; 2015-11-04 22:00:00.000 | 6377 | 0.522 2015-11-04 22:00:01.000 | 3580 | 0.422
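The per-second summary in this query can be sketched in plain Python as a group-by on the timestamp followed by a count and a 95th percentile. Note that this sketch computes an exact nearest-rank percentile, whereas Presto's `approx_percentile` returns an approximation; the function names and the sample records are hypothetical.

```python
import math
from collections import defaultdict


def percentile_nearest_rank(values, fraction):
    """Exact nearest-rank percentile of a non-empty list (fraction in (0, 1])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def summarize_per_second(records, fraction=0.95):
    """Map each timestamp to (access count, percentile of request time)."""
    by_second = defaultdict(list)
    for timestamp, reqtime in records:
        by_second[timestamp].append(reqtime)
    return {
        timestamp: (len(times), percentile_nearest_rank(times, fraction))
        for timestamp, times in sorted(by_second.items())
    }


# Hypothetical (unix timestamp, request time in seconds) pairs.
records = [(1, 0.1), (1, 0.5), (2, 0.3)]
summary = summarize_per_second(records)
```

An exact percentile needs all values for a group in memory, which is why an approximate aggregate is the practical choice when scanning large access logs.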
  • 59. Scheduled Auto Scaling $ aws autoscaling describe-scheduled-actions { "ScheduledUpdateGroupActions": [ { "DesiredCapacity": 2, "AutoScalingGroupName": "presto-worker-prd", "Recurrence": "59 14 * * *", "ScheduledActionName": "scalein-2359-jst" }, { "DesiredCapacity": 20, "AutoScalingGroupName": "presto-worker-prd", "Recurrence": "45 0 * * 1-5", "ScheduledActionName": "scaleout-0945-jst" } ] }
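The `Recurrence` fields above are cron expressions in UTC, while the action names refer to JST. A small sketch makes the relationship explicit by converting the minute and hour fields to JST (UTC+9). The helper name is hypothetical, and the sketch only converts the time of day; it does not adjust the day-of-week field if the conversion crosses midnight.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def utc_cron_time_in_jst(recurrence):
    """Convert the minute/hour fields of a UTC cron expression to 'HH:MM' in JST."""
    minute, hour = recurrence.split()[:2]
    # The date is arbitrary; only the time of day matters for this conversion.
    utc_time = datetime(2016, 1, 27, int(hour), int(minute), tzinfo=timezone.utc)
    return utc_time.astimezone(JST).strftime("%H:%M")


scale_in = utc_cron_time_in_jst("59 14 * * *")     # scalein-2359-jst
scale_out = utc_cron_time_in_jst("45 0 * * 1-5")   # scaleout-0945-jst
```

So the group shrinks to 2 workers late every night and grows to 20 workers on weekday mornings, just before working hours in Japan.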
  • 60. Presto Covers Everything? No! • Fixed system on Amazon Aurora (or other RDB) • provides KPIs for products & business • requires high availability & low latency • has no flexibility • Ad-hoc system on Presto • provides access to all datasets on the data platform • requires high scalability • has flexibility (join various data sources)
  • 61. Why Fixed vs Ad-hoc? • Difficulties with an ad-hoc-only solution • difficult to prevent heavy queries • large distinct counts exhaust computing resources • decreases Presto maintainability
  • 63. Chartio • Dashboard as a Service • helps businesses analyze and track their critical data • one of the AWS partners (※) • Combine multiple data sources in one dashboard • Presto, MySQL, Redshift, BigQuery, Elasticsearch ... • can join BigQuery + MySQL internally • Easy to use for everyone • everyone can make their own dashboard • write SQL directly / generate queries by drag & drop ※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
  • 64. Creating a dashboard 1. Build a query (drag & drop / SQL) 2. Add steps (filter, sort, modify) 3. Select a visualization (table, graph)
  • 66. Why Chartio? • Chartio saves a lot of engineering resources • before • maintained an in-house dashboard written in Rails • everyone got tired of maintaining it • after • everyone can build their own dashboard easily • Chartio's UI is cool • a very important factor for a dashboard tool
  • 67. Missing Pieces of Chartio • No programmable API is provided • need to edit dashboards / charts manually • No rollback feature • all changes are recorded, but cannot be rolled back to a previous state • workaround: clone => edit => rename
  • 69. Why Does Speed Matter?
  • 70. Today’s News is Wrapping Tomorrow’s Fish and Chips
  • 73. Use cases • Re-rank news articles by user feedback • track user's positive/negative signal • consider gender, age, location, interests • Realtime article monitoring • detect high bounce rate (may be broken?) • make realtime reporting dashboard for A/B test
  • 74. Realtime Re-Ranking ref. Integrating stream processing (Spark Streaming + Kinesis) with offline processing (Hive) www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive Amazon CloudSearch Search API API Gateway Kinesis Stream Amazon S3 Amazon EMR Amazon S3 Amazon EMR DynamoDB Realtime Feedback Re-rank Articles Article Metadata User Interests User Behaviors Offline Process by Hive / Spark
  • 75. Realtime Monitoring API Gateway Stream Continuous View Continuous View Continuous View Raw records are discarded soon after being consumed by a Continuous View Incrementally updated in realtime PipelineDB Chartio AWS Lambda Slack Access Continuous Views via a PostgreSQL client Record ※1 ※1 Using cron as of 26 Feb. 2016; will migrate soon after VPC is supported
  • 76. PipelineDB • OSS & enterprise streaming SQL database • PostgreSQL compatible • connect to Chartio 😍 • join stream to normal PostgreSQL table • Support probabilistic data structures • e.g. HyperLogLog https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
  • 77. Continuous View -- Calculate unique users seen per media each day -- Using only a constant amount of space (HyperLogLog) CREATE CONTINUOUS VIEW uniques AS SELECT day(arrival_timestamp), substring(url from '.*://([^/]*)') as hostname, COUNT(DISTINCT user_id::integer) FROM activity_stream GROUP BY day,hostname; -- How many impressions have we served in the last five minutes? CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS SELECT COUNT(*) FROM imps_stream; -- What are the 90th, 95th, 99th percentiles of request latency? CREATE CONTINUOUS VIEW latency AS SELECT percentile_cont(array[90, 95, 99]) WITHIN GROUP (ORDER BY latency::integer) FROM latency_stream;
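To illustrate the semantics of the `max_age = '5 minutes'` view, here is a minimal Python sketch of a sliding-window counter: events older than the window are dropped, so the count always reflects only recent records. This is a conceptual model only and makes no claim about how PipelineDB implements continuous views; the class name is hypothetical, and it assumes events arrive in non-decreasing timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last `max_age_seconds`."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Register one event; timestamps must be non-decreasing."""
        self._timestamps.append(timestamp)

    def count(self, now):
        """Drop events at or before the cutoff, then return how many remain."""
        cutoff = now - self.max_age_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Hypothetical impression events at t = 0 s, 100 s, and 290 s.
imps = SlidingWindowCounter(max_age_seconds=300)
for t in (0, 100, 290):
    imps.record(t)
recent = imps.count(now=300)
```

The same trade-off applies as in the views above: the raw events are not kept forever, only enough state to answer the one question the view was defined for.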
  • 79. Sustainable Data Platform • build a reliable and scalable lambda architecture • minimize operation & running cost • be open to uncertain future
  • 80. My Wishlist to AWS • Support Reduced Redundancy Storage (RRS) on EMR • Faster EMR launch • Set TTL on DynamoDB records • Auto-scale Kinesis Streams • Launch Kinesis Analytics in the Tokyo region