A Record of How the Hyperconnect Data Team Has Coped with Data Growth (Jaehyeuk Oh)
3. , ,
2014 3 , 33 3
“ (Azar)”
+ !
18 180
( 40%)
+ /
94
363
624
‘14 ‘15 ‘16 ‘17
90% !
21
5. 2 +
6 +
7 AppStore / Google Play
“ 73 5 ”
400 +
(Google Play, SensorTower Q1 2018)
230
(2 )
6. - 2017 150
( : 170 )
- 6 WebRTC
( 2 + )
- (3G/LTE)
- CPU ,
-
- ,
13. Intro | ?
[Chart: y-axis from 0 to 3,200,000,000 in steps of 200,000,000; x-axis monthly from 2013-11 to 2018-08]
14. Intro | ! Azar
[Chart: Azar (Kbytes / Day); y-axis from 0 to 3,200,000,000 in steps of 200,000,000; x-axis monthly from 2013-11 to 2018-08]
Annotations: 7 matches, 20 events
Labels: 3T bytes, 2017/6, 2T bytes, 2017/1, 1T bytes, 2016/3, 500G bytes, 2015/5, 100G bytes, 2014/11, 50G bytes
Phase I Phase II Phase III Phase IV Phase V Phase VI
15. Intro | ? Data Pipeline (2016-02, Phase III)
API Logs
MySQL
Google Analytics
S3
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
16. Intro | ! Data Pipeline (2018-08, Phase VI)
Update Lifetime Storage
Update / Use Lifetime Storage
EventLogs
MySQL
3rdParty (Firebase / Adjust)
S3
Spark Streaming
Kafka Cluster
ElasticSearch/Kibana
API/Redis
Hive Cluster
Presto Cluster
Batch Dashboard
Analytics with Queries
Realtime Dashboard
HBase
Automated Personalized Operation
Realtime Processing System
Serve Batched Results
Spark Cluster
Spark Batch
Airflow
Batch Processing System
Superset
Zeppelin
Redash
20. Phase I (~ 50G) | ? Pipeline
MySQL
(+ServiceLog)
Google Analytics (+Goal, Ecommerce)
Unknown
GA
+Ecommerce
+Goal = Conversion
MySQL
+ Service Monitoring Log (table)
22. Phase II (~ 100G) | ?
MySQL
(+SLAVE)
GA
(+Realtime ACU)
PgSQL
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
1. 50G => 100G
2. / /
3. KPI
23. Phase II (~ 100G) | !
MySQL
(+SLAVE)
GA
(+Realtime ACU)
PgSQL
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
/ /
=> Batch
KPI
=> SLAVE
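The Phase II answer above moves aggregation into a scheduled Python batch that reads from a MySQL slave and serves precomputed results to the dashboard. A minimal sketch of that pattern follows. It is an illustration under assumptions, not the team's code: `sqlite3` stands in for both the MySQL slave and the PgSQL results store, and the `matches` and `daily_kpi` tables and their columns are hypothetical.

```python
import sqlite3
from datetime import date


def make_results_db():
    """Create the results store the dashboard reads from (stand-in for PgSQL)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE daily_kpi ("
        "day TEXT PRIMARY KEY, active_users INTEGER, match_count INTEGER)"
    )
    return db


def run_daily_kpi_batch(source, results, day):
    """Aggregate one day from the source DB and upsert a single KPI row.

    `source` plays the role of the read-only slave, so the heavy query
    never touches the service master. The upsert keeps re-runs idempotent.
    """
    active_users, match_count = source.execute(
        "SELECT COUNT(DISTINCT user_id), COUNT(*) "
        "FROM matches WHERE match_date = ?",
        (day.isoformat(),),
    ).fetchone()
    results.execute(
        "INSERT OR REPLACE INTO daily_kpi (day, active_users, match_count) "
        "VALUES (?, ?, ?)",
        (day.isoformat(), active_users, match_count),
    )
    results.commit()
    return (active_users, match_count)
```

A cron entry would then invoke a script wrapping `run_daily_kpi_batch` once per day. Because the write is an upsert keyed on the day, re-running a failed job does not duplicate rows.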
25. Phase III (~ 500G) | ?
MySQL
(+SLAVE)
GA
(+Realtime ACU)
PgSQL
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
1. 100G => 500G
2.
3. ( )
4. ( )
5. , /
( )
26. Phase III (~ 500G) | !
MySQL
GA
PgSQL
(+S3)
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
API Logs
Scale Out
OperationLog => PgSQL
API Logs => S3
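Moving API logs to S3 usually means writing them as batched objects under a partitioned key layout, so that later batch jobs can read only the dates they need. The sketch below shows one such layout. The `dt=`/`hour=` key scheme, the `api-logs` prefix, and the gzip JSON Lines encoding are assumptions for illustration; the slides do not say which layout was used, and the upload call itself is left out.

```python
import gzip
import json
from datetime import datetime, timezone


def partition_key(prefix, ts, part):
    """Build a date/hour-partitioned object key for one batch of logs."""
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}/part-{part:05d}.json.gz"


def encode_batch(records):
    """Serialize records as gzip-compressed JSON Lines, one object per batch."""
    lines = "\n".join(json.dumps(record, sort_keys=True) for record in records)
    return gzip.compress(lines.encode("utf-8"))


def decode_batch(payload):
    """Inverse of encode_batch, as a batch job reading the object would do."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

With this layout, a daily batch only lists keys under a single `dt=` prefix instead of scanning the whole log history.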
28. Phase IV (~ 1T) | ?
MySQL
(+SLAVE)
GA
(+Realtime ACU)
PgSQL
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
1. 500G => 1000G
2. ServiceLog MySQL
3.
4. ( )
5.
6. ,
Dimension ,
7. ( )
8. GA 0 , Peak 0
( GA )
29. Phase IV (~ 1T) | !
MySQL
(+Dump)
GA
(+branch.io,
Attribution)
PgSQL
S3
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
API Logs
(+EVENT Logs)
+REDSHIFT
( +session / match fact,
segment / cohort )
+Zeppelin,
Redash
ServiceLog MySQL
=> REDSHIFT
=> MySQL DUMP
( )
=> EVENT Logs
=> branch.io, Attribution
Dimension
=> (Big Flat) Fact Table
Segment
=> Segment / Cohort
GA Session
=> session_fact_table
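The last mapping above replaces the GA session with a `session_fact_table` built from the team's own logs. One common way to derive such a table is to split each user's events into sessions wherever the gap between consecutive events exceeds a timeout. The sketch below assumes a 30-minute gap, which mirrors Google Analytics' default session timeout; the event shape and the output columns are hypothetical, not the actual table schema.

```python
from datetime import datetime, timedelta

# Assumption: same 30-minute inactivity timeout that GA uses by default.
SESSION_GAP = timedelta(minutes=30)


def build_session_facts(events):
    """Turn (user_id, timestamp) events into one row per session.

    A new session starts on a user's first event, or when more than
    SESSION_GAP has passed since that user's previous event.
    """
    sessions = []
    open_session = {}  # user_id -> the session row currently being extended
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        current = open_session.get(user_id)
        if current is None or ts - current["end"] > SESSION_GAP:
            current = {"user_id": user_id, "start": ts, "end": ts, "event_count": 0}
            sessions.append(current)
            open_session[user_id] = current
        current["end"] = ts
        current["event_count"] += 1
    return sessions
```

Each output row carries the session boundaries and an event count, which is the kind of flat record that segment and cohort queries can aggregate directly.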
31. Phase V (~ 2T) | ?
MySQL
(+SLAVE)
GA
(+Realtime ACU)
PgSQL
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
1. 1T => 2T
2. Redshift ,
, , retention
3. Redshift Maintenance pipeline ,
,
4. , , , 1 ...
5. Team , Dashboard ,
6. ( )
7. Branch.io Attribution ,
8. , ,
9. Status all green! , ,
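Item 2 above names retention among the Redshift workloads. As a reminder of what that computation involves, the sketch below computes day-N retention per signup cohort in plain Python. The input shapes (`first_seen`, `activity`) are hypothetical, and this is not the team's query; it only shows that every cohort has to be matched against activity on a later day, which is the kind of work that becomes expensive as data grows.

```python
from datetime import date, timedelta


def day_n_retention(first_seen, activity, n):
    """Return {cohort_date: share of the cohort active exactly n days later}.

    first_seen: {user_id: date of first activity}
    activity:   set of (user_id, date) pairs on which the user was active
    """
    cohort_size = {}
    retained = {}
    for user_id, start in first_seen.items():
        cohort_size[start] = cohort_size.get(start, 0) + 1
        if (user_id, start + timedelta(days=n)) in activity:
            retained[start] = retained.get(start, 0) + 1
    return {day: retained.get(day, 0) / size for day, size in cohort_size.items()}
```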
35. Phase VI (~ now) | ?
MySQL
(+SLAVE)
GA
(+Realtime ACU)
PgSQL
Dashboard
Serve
Batched Results
Python Batch
CRON
Batch Processing System
1. 2T => now
2. Task , , Data pipeline
3.
4. Data Infra
5. API Server
6. Firebase Funnel
7. ,
8. / Query Tool
9.
10. Fact Table , Table Column ,
11. , ,
12. Operation , Operation
13. Window Size 3 Action
Server
14. , Action
15. Segment /
36. Phase VI (~ now) | ! Pipeline
Update Lifetime Storage
Update / Use Lifetime Storage
EventLogs (+DataQA)
MySQL
3rdParty (FIREBASE + / Adjust)
S3
Spark Streaming
Kafka Cluster
ElasticSearch/Kibana
API/Redis
Hive Cluster
Presto (DataMarts)
Dashboard
Analytics with Queries
Realtime Dashboard
HBase
Automated Personalized Operation
Realtime Processing System
Serve Batched Results
Spark Cluster
Spark Batch
Airflow/Ganglia
Batch Processing System
Superset
Zeppelin
Redash
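In the batch side of this diagram, Airflow replaces CRON as the scheduler. The practical difference is that tasks are declared with their dependencies and run in dependency order, rather than at fixed clock times. The sketch below illustrates that idea with Python's standard `graphlib` module. It is not Airflow code, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks; each one maps to the set of tasks it depends on.
PIPELINE = {
    "dump_mysql": set(),
    "load_event_logs": set(),
    "build_fact_table": {"dump_mysql", "load_event_logs"},
    "build_data_marts": {"build_fact_table"},
    "refresh_dashboard": {"build_data_marts"},
}


def execution_order(dependencies):
    """Return task names so that every task comes after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

With fixed cron times, a slow upstream job silently feeds stale data downstream; with declared dependencies, a downstream task simply does not start until its inputs are done.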
37. Phase VI (~ now) | !
Data pipeline
=> Airflow
Pipeline /
=> Resource Manager / Ganglia
API Server
=> FIREBASE
=> DataQA
Fact Table, Column
=> Metadata
=> Application PoC
Firebase Funnel
=>
,
=> Retention , , DataMarts
Query Infra
=> Data User VOC
Hourly Window Action
=> Spark Streaming
Action
=> HBase (lifetime storage)
Operation
=>
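The "Hourly Window Action => Spark Streaming" mapping above amounts to keeping per-user action counts over a sliding window of hourly buckets. The sketch below shows that bookkeeping in plain Python as a stand-in. It does not use the Spark Streaming API, the window length is left as a parameter, and it assumes events arrive in time order for each user.

```python
from collections import defaultdict, deque
from datetime import datetime


class HourlyWindowCounter:
    """Counts actions per user over the most recent `window_hours` hourly buckets."""

    def __init__(self, window_hours):
        if window_hours < 1:
            raise ValueError("window_hours must be at least 1")
        self.window_hours = window_hours
        self.buckets = defaultdict(deque)  # user_id -> deque of [hour_start, count]

    def add(self, user_id, ts):
        """Record one action and return the user's count inside the window."""
        hour = ts.replace(minute=0, second=0, microsecond=0)
        queue = self.buckets[user_id]
        if queue and queue[-1][0] == hour:
            queue[-1][1] += 1
        else:
            queue.append([hour, 1])
        # Evict buckets that fell out of the window ending at the current hour.
        while (hour - queue[0][0]).total_seconds() >= self.window_hours * 3600:
            queue.popleft()
        return sum(count for _, count in queue)
```

In a streaming job the same state would live in the engine's windowed aggregation, while longer-lived per-user history would go to a store such as the HBase lifetime storage named above.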
40. Summary |
[Chart: Azar (Kbytes / Day); y-axis from 0 to 3,200,000,000 in steps of 200,000,000; x-axis monthly from 2013-11 to 2018-08]
Annotations: 7 matches, 20 events
Labels: 3T bytes, 2017/6, 2T bytes, 2017/1, 1T bytes, 2016/3, 500G bytes, 2015/5, 100G bytes, 2014/11, 50G bytes
Phase I Phase II Phase III Phase IV Phase V Phase VI
41. Summary | Pipeline
[Diagram: the Phase VI pipeline shown on slide 36]
44. ( , , , , , , )
, , , PC , VR, ,
14 HyperSpace