IEEE International Conference on Data Engineering 2015

SKT Hadoop DW
SK telecom
Corporate R&D Center 
Yousun Jeong

Copyright © 2015 by SK Telecom. All rights reserved.
Table of Contents
1. Big Data in SKT
2. What is Hadoop DW?
3. SQL on Hadoop - TAJO
4. Hadoop DW Commercialization Cases

1. Big Data in SKT

High TCO for Data Management
- 250 TB/day (91.25 PB/year)
- 4 Hadoop clusters with various commercial MPP databases for analytics
- Platform: Hadoop + Hive, MPP DBMS
- Too much data is loaded into the MPP DBMS

One Unified Solution
- 30+ PB (compressed) on 1000+ nodes
- 10+ Hadoop clusters with Tajo & Spark for all purposes
- Platform: Hadoop + Tajo + Spark
- Affordable & faster (unified framework for Big Data)

[Diagram, shown for both architectures. Layers: Operational Systems, Integration Layer, Data Warehouse, Marts. Components: Marketing, Sales, ERP, SCM; ODS; Staging Area (x2); Mart A, Mart B, Mart C, Mart D.]

1. Big Data in SKT
SKT Hadoop Clusters
✓ Optimized configuration of a large-scale cluster
✓ Operational know-how from managing 1000+ nodes
✓ Fault-tolerant and effective resource management system
[Diagram: T-Hadoop. Components: Data Collector (data collection & pre-processing, ~250 TB/day), Main Cluster (analysis), R&D Cluster, Repository, Service Cluster (150+ nodes), App. 1 … App. N. Other labels: Service Logic; node counts 700+, 200+, 100+. Flows: Data Feeding, Develop., Commercialize.]

2. What is Hadoop DW?

【 Legacy IT Infrastructure 】 “High-price, High-performance Proprietary IT Infrastructure System”
【 Hadoop DW Infrastructure 】 “Hadoop S/W and Commodity H/W Based Cost-effective IT Infrastructure System”

      Legacy IT Infrastructure            Hadoop DW Infrastructure
Data  Structured data                     Structured/unstructured data
      Scale-up structure (terabyte)       Scale-out structure (petabyte, exabyte)
Cost  High price ($5,000~$50,000 / TB)    Low price ($200~$1,000 / TB)
H/W   High-performance H/W                Commodity H/W (x86 server)
      (MPP, fabric switch, etc.)
S/W   Proprietary S/W (RDBMS, etc.)       Hadoop architecture, SQL on Hadoop

Other diagram labels: Transaction/Batch Processing (SQL); Hadoop File System

※ MPP: Massively Parallel Processing; SAN: Storage Area Network; NAS: Network Attached Storage; RDBMS: Relational DB Management System; SQL: Structured Query Language
Hadoop DW provides a data warehouse built on the Hadoop architecture for
enterprise environments, so that users can accommodate massive, steadily
growing volumes of data at low cost.
Solution: SKT Hadoop DW

3. SQL on Hadoop - TAJO
High-speed SQL-on-Hadoop processing engine
• 3~5x faster processing than Hive on TPC-H
• 80~100% of Impala's response speed, without a data size limit
• Full ANSI SQL support for easy RDBMS migration

[ Legacy Approach (MR) ]
Hadoop Cluster: Hive, MapReduce, HDFS
- Partially distributed
- Sequential processing

[ Tajo Approach ]
Hadoop Cluster + Tajo: Tajo, HDFS
- Fully distributed
- Vector processing

Processing speed: process more data on the same clusters with improved processing speed
Response speed: try more queries for analysis with improved response speed
[Diagram labels: query response in minutes on the Hadoop cluster; a few seconds to minutes with + Tajo; up to 10x]

3. SQL on Hadoop - TAJO
SQL Support
▪ ANSI SQL support
▪ Partition Type
▪ Meta Store
Service Stability
▪ High Availability
▪ Resource Manager
▪ Fair Scheduler
Performance
▪ High-speed processing
▪ Shuffling
▪ Dynamic Query Optimizer
▪ Query Rewriting
System Integration
▪ BI Connector
▪ Proxy Support
▪ Tajo-R
Function Support
▪ Analytic Function
▪ Hive Function
[ Tajo Features ]
[ Performance Comparison ]
[ Apache Top-Level Project ]

3.1 Tajo Architecture
[Diagram components]
Clients: Tajo CLI, JDBC Driver, ODBC Driver
Tajo Master: Client Service Handler, SQL Parser, Query Rewriter, Logical Planner, Logical Optimizer, Query Manager, Resource Manager
Catalog: Tajo Catalog, HCatalog
Persistent Storage: Derby Store, MySQL Store, PostgreSQL Store
Worker:
1. Query Master: Client Service Handler, Global Planner
2. TaskRunner: Local Query Engine, Storage Manager
Storage: Local, HDFS/HBase, S3 / Swift

3.1 Technical Characteristic - Logical Flow of Data Processing
Tajo Master
- SQL Parser: query parsing
- Logical/Global Planner: decomposition of the query into work units
- Resource Manager: delivery of work units to the servers
Tajo Worker (several workers shown)
- Physical Planner: decomposition of a work unit into task operations
- Query Engine: execution of unit operations
- Storage Manager: disk data I/O control
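
The flow on this slide (decompose a query into work units, then deliver them to workers) can be sketched in a few lines of Python. This is an illustration only, not Tajo's planner or scheduler: the names `WorkUnit`, `split_into_work_units`, and `assign_round_robin`, the file path, the fixed fragment size, and the round-robin placement policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WorkUnit:
    """Hypothetical work unit: one byte range of an input file."""
    path: str
    offset: int
    length: int


def split_into_work_units(path: str, file_size: int, fragment_size: int) -> List[WorkUnit]:
    """Decompose a scan over one file into fixed-size fragments."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    units = []
    offset = 0
    while offset < file_size:
        length = min(fragment_size, file_size - offset)
        units.append(WorkUnit(path, offset, length))
        offset += length
    return units


def assign_round_robin(units: List[WorkUnit], workers: List[str]) -> Dict[str, List[WorkUnit]]:
    """Deliver work units to workers in round-robin order (an assumed policy)."""
    if not workers:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[WorkUnit]] = {worker: [] for worker in workers}
    for index, unit in enumerate(units):
        plan[workers[index % len(workers)]].append(unit)
    return plan


# Hypothetical input: a 1000-byte file split into 256-byte fragments.
units = split_into_work_units("/data/logs.csv", file_size=1000, fragment_size=256)
plan = assign_round_robin(units, ["worker-1", "worker-2", "worker-3"])
for worker, assigned in plan.items():
    print(worker, [(u.offset, u.length) for u in assigned])
```

A real scheduler would also weigh data locality and available resources; round-robin is used here only to keep the sketch short.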

3.1 Technical Characteristic - JIT Query Engine
Behavior depends on the operand characteristics:
- 1 + 2 = 3
- "a" + "b" = "ab"
- {1,2} + {3,4} = {4,6}
- 1 + {1,2} = {2,3}

<Existing methods>
A precompiled binary must account for every possible case
-> performance degradation (call, if, switch below 50%)
switch(operand)
  case numeric: add numeric
  case string: add string

<JIT methods>
Real-time code generation based on operand type
Combined operations can be processed by compiler optimization
Four operators in a single operation (2 additions, 1 subtraction, 1 multiplication)
Result = A x (1-B) + (1+C)
[Diagram: expression tree for the formula above]
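
The contrast between the two approaches can be sketched in Python. This is a loose analogy, not Tajo's JIT engine: Python's built-in `compile` and `eval` stand in for machine-code generation, and the tuple-based expression tree and the names `interpret`, `to_source`, and `generate` are assumptions made for the example. The interpreter branches on the node kind for every row, while the generated function fuses the whole formula into one expression that is built once.

```python
from typing import Callable

# Expression tree nodes: ("col", name), ("const", value), or (op, left, right).
Expr = tuple

# Result = A x (1 - B) + (1 + C)
EXPR: Expr = ("+",
              ("*", ("col", "A"), ("-", ("const", 1), ("col", "B"))),
              ("+", ("const", 1), ("col", "C")))


def interpret(expr: Expr, row: dict) -> float:
    """Interpreter style: walk the tree and branch on the node kind for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    left = interpret(expr[1], row)
    right = interpret(expr[2], row)
    if kind == "+":
        return left + right
    if kind == "-":
        return left - right
    if kind == "*":
        return left * right
    raise ValueError(f"unknown operator: {kind}")


def to_source(expr: Expr) -> str:
    """Translate the tree into the source text of a single Python expression."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    if kind not in ("+", "-", "*"):
        raise ValueError(f"unknown operator: {kind}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def generate(expr: Expr) -> Callable[[dict], float]:
    """Code-generation style: build one function for the whole expression, once."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


row = {"A": 2.0, "B": 0.25, "C": 3.0}
compiled = generate(EXPR)
print(interpret(EXPR, row), compiled(row))
```

Both paths compute the same value; the difference is that the generated function performs no per-node dispatch at evaluation time.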

3.1 Technical Characteristic - Vectorized Query Engine
<Tuple at a time>
- DB
- 1 operation/record
[Diagram: a1 + b1, a2 + b2, ..., a6 + b6, one addition per record]

<Vectorized engine>
- Vectorized data
- 1 operation/vector
A[] = {a1, a2, a3, a4, a5, a6}
B[] = {b1, b2, b3, b4, b5, b6}

C[] = A[] + B[]
[Diagram: vector (a1 ... a6) + vector (b1 ... b6), one addition per vector]
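
The difference between "1 operation/record" and "1 operation/vector" can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not Tajo's engine: the function names, the `calls` counter, the sample values, and the vector size of 4 are all hypothetical. In a real vectorized engine the benefit comes from running a tight loop over a batch (fewer calls, better cache and SIMD use); pure Python only shows the reduced number of operator invocations.

```python
from typing import List

calls = {"scalar": 0, "vector": 0}  # counts operator invocations, for illustration


def add_scalar(a: float, b: float) -> float:
    """Operator invoked once per record."""
    calls["scalar"] += 1
    return a + b


def add_vector(a_vec: List[float], b_vec: List[float]) -> List[float]:
    """Operator invoked once per vector; the loop runs inside the primitive."""
    calls["vector"] += 1
    return [a + b for a, b in zip(a_vec, b_vec)]


def tuple_at_a_time(a_col: List[float], b_col: List[float]) -> List[float]:
    """Tuple-at-a-time: one operator call for each (a, b) pair."""
    return [add_scalar(a, b) for a, b in zip(a_col, b_col)]


def vectorized(a_col: List[float], b_col: List[float], vector_size: int = 4) -> List[float]:
    """Vectorized: one operator call for each batch of vector_size values."""
    if len(a_col) != len(b_col):
        raise ValueError("columns must have the same length")
    if vector_size <= 0:
        raise ValueError("vector_size must be positive")
    out: List[float] = []
    for start in range(0, len(a_col), vector_size):
        end = start + vector_size
        out.extend(add_vector(a_col[start:end], b_col[start:end]))
    return out


A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
c_scalar = tuple_at_a_time(A, B)
c_vector = vectorized(A, B)
print(c_vector, calls)
```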

3.1 Technical Characteristic - Storage Manager
Tajo Worker (scan) -> Storage Manager
Storage Manager components:
- Scan Scheduler
- Request queues (three shown)
- Disk Scanners (three shown)
- Pre-fetching Buffer
Diagram labels: fine-granularity file requests; bulk read
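
One way to picture how fine-granularity file requests could become bulk reads is sketched below. The slide does not describe the actual algorithm, so everything here is an assumption made for illustration: the `ScanRequest` type, one request queue per disk, and merging only exactly contiguous byte ranges.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScanRequest:
    """Hypothetical file request: a byte range on a given disk."""
    disk: str
    offset: int
    length: int


def schedule(requests: List[ScanRequest]) -> Dict[str, List[ScanRequest]]:
    """Queue requests per disk, then merge contiguous ranges into bulk reads."""
    queues: Dict[str, List[ScanRequest]] = defaultdict(list)
    for request in requests:
        queues[request.disk].append(request)

    bulk_reads: Dict[str, List[ScanRequest]] = {}
    for disk, queue in queues.items():
        merged: List[ScanRequest] = []
        for request in sorted(queue, key=lambda r: r.offset):
            last = merged[-1] if merged else None
            if last is not None and last.offset + last.length == request.offset:
                # The request starts where the previous one ends: extend the bulk read.
                merged[-1] = ScanRequest(disk, last.offset, last.length + request.length)
            else:
                merged.append(request)
        bulk_reads[disk] = merged
    return bulk_reads


requests = [
    ScanRequest("disk0", 64, 64),
    ScanRequest("disk1", 128, 32),
    ScanRequest("disk0", 0, 64),
    ScanRequest("disk0", 256, 64),
]
for disk, reads in schedule(requests).items():
    print(disk, [(r.offset, r.length) for r in reads])
```

In this sketch the two adjacent `disk0` requests collapse into a single 128-byte read, while the request at offset 256 stays separate because it is not contiguous.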

4. Hadoop DW Commercialization Cases: Telco
[ SK Telecom ]
Business Challenge
• Explosion of log data with LTE service
• Increase in the types of data to be analyzed
• Insufficient DW capacity due to high cost
How SKT Hadoop DW Helped
✓ 3x storage capacity at the same price,
or an 80% reduction in unit price
✓ Enabled daily ad-hoc analysis of unstructured text
data sets
✓ Reduced content-based analysis time from a few
hours to at most 20 minutes
Category         MPP DBMS            Hadoop DW
Raw Data Size    0.5 TB/day          4 TB/day
Total ETL Time   Average of 3 hours  Average of 6 hours
DW Creation      30 minutes          40 minutes
Mart Creation    1 hour              1 hour 40 minutes
Report Creation  1 hour 30 minutes   2 hours 4 minutes
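
The comparison is easier to read as throughput, since the Hadoop DW column handles eight times the raw data per day. The short calculation below uses only the "Raw Data Size" and "Total ETL Time" figures from the comparison; the helper name `throughput_tb_per_hour` is made up for the example, and dividing daily volume by average ETL time is a simplification.

```python
def throughput_tb_per_hour(raw_tb_per_day: float, etl_hours: float) -> float:
    """Raw data processed per hour of ETL time."""
    return raw_tb_per_day / etl_hours


mpp = throughput_tb_per_hour(0.5, 3)    # MPP DBMS: 0.5 TB/day in about 3 hours
hadoop = throughput_tb_per_hour(4, 6)   # Hadoop DW: 4 TB/day in about 6 hours

print(f"MPP DBMS:  {mpp:.3f} TB/h")
print(f"Hadoop DW: {hadoop:.3f} TB/h")
print(f"Ratio:     {hadoop / mpp:.1f}x")
```

By this measure the Hadoop DW processes roughly four times as much raw data per ETL hour, even though its total ETL time is longer.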

4. Hadoop DW Commercialization Cases: Manufacturer
[ Global Top-5 Semiconductor Player ]
Business Challenge
• Immense amounts of unstructured measurement data
are collected during manufacturing
• RDBMS and BI tools cannot handle this type of data
• Data loading alone can take up to 20 minutes
How SKT Hadoop DW Helped
✓ Support for unstructured data through a variable-column
schema
✓ 100x increase in data processing capacity
✓ 10x shorter data loading time (2 min)
✓ Minimized user action for pivot/unpivot

Thank you.
