SK Telecom developed a Hadoop data warehouse (DW) solution to address the high costs and limitations of traditional DW systems for handling big data. The Hadoop DW provides a scalable architecture using Hadoop, Tajo and Spark to cost-effectively store and analyze over 30PB of data across 1000+ nodes. It offers SQL analytics through Tajo for faster querying and easier migration from RDBMS systems. The Hadoop DW has helped SK Telecom and other customers such as semiconductor manufacturers to more affordably store and process massive volumes of both structured and unstructured data for advanced analytics.
2. Copyright@ 2015 by SK Telecom All rights reserved.
Table of Contents
1. Big Data in SKT
2. What is Hadoop DW?
3. SQL on Hadoop - TAJO
4. Hadoop DW Commercialization Cases
1. Big Data in SKT

Before: High TCO for Data Management
- 250 TB/day (91.25 PB/year)
- 4 Hadoop clusters with various commercial MPP databases for analytics
- Platform: Hadoop+Hive, MPP DBMS
- Result: high TCO for data management (too much data is loaded into the MPP DBMS)

After: One Unified Solution
- 30 PB+ (compressed) on 1000+ nodes
- 10+ Hadoop clusters with Tajo & Spark for all purposes
- Platform: Hadoop+Tajo+Spark
- Result: affordable & faster (unified framework for Big Data)

[Figure, shown for both cases: Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer (ODS, Staging Area), Data Warehouse, Marts (Mart A, Mart B, Mart C, Mart D)]
1. Big Data in SKT
SKT Hadoop Clusters
✓ Optimized configuration of a large-scale cluster
✓ Operational know-how from managing 1000+ nodes
✓ Fault-tolerant and effective resource management system
[Figure: T-Hadoop cluster layout. Labels: Data Collector, Data Collect & pre-processing, Main Cluster, Analysis, R&D Cluster, Service Cluster, Service Logic, Repository, App. 1 … App. N, Data Feeding, Commercialize, Develop. Figures shown: ~250 TB/day; 700+, 200+, 100+ and 150+ nodes.]
2. What is Hadoop DW?
Solution: SKT Hadoop DW
The Hadoop DW provides a data warehouse based on the Hadoop architecture for enterprise environments, so that users can accommodate massive, ever-increasing amounts of data at low cost.

【 Legacy IT Infrastructure 】 “High-price, high-performance proprietary IT infrastructure system”
【 Hadoop DW Infrastructure 】 “Hadoop S/W and commodity H/W based cost-effective IT infrastructure system”

Category  Legacy IT Infrastructure            Hadoop DW Infrastructure
Data      Structured data                     Structured/unstructured data
          Scale-up structure (terabyte)       Scale-out structure (petabyte, exabyte)
Cost      High price ($5,000~$50,000 / TB)    Low price ($200~$1,000 / TB)
H/W       High-performance H/W                Commodity H/W (x86 server)
          (MPP, fabric switch, etc.)
S/W       Proprietary S/W (RDBMS, etc.)       Hadoop architecture, SQL on Hadoop

Other labels in the figure: Transaction/Batch Processing (SQL); Hadoop File System

※ MPP: Massively Parallel Processing, SAN: Storage Area Network, NAS: Network Attached Storage, RDBMS: Relational DB Management System, SQL: Structured Query Language
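As a rough worked example of the per-terabyte prices on this slide, the sketch below estimates what storing the 30 PB quoted for SKT's clusters would cost at each price range. It assumes decimal units (1 PB = 1,000 TB) and that cost scales linearly with capacity; both are simplifying assumptions rather than figures from the deck, and the function name is illustrative only.

```python
# Per-TB price ranges taken from the comparison on this slide (USD per TB).
LEGACY_USD_PER_TB = (5_000, 50_000)
HADOOP_DW_USD_PER_TB = (200, 1_000)

TB_PER_PB = 1_000  # assumption: decimal units


def storage_cost_range(capacity_pb, usd_per_tb_range):
    """Return (low, high) cost in USD, assuming cost scales linearly with capacity."""
    low, high = usd_per_tb_range
    capacity_tb = capacity_pb * TB_PER_PB
    return capacity_tb * low, capacity_tb * high


legacy_cost = storage_cost_range(30, LEGACY_USD_PER_TB)
hadoop_cost = storage_cost_range(30, HADOOP_DW_USD_PER_TB)

print(f"Legacy:    ${legacy_cost[0]:,} to ${legacy_cost[1]:,}")
print(f"Hadoop DW: ${hadoop_cost[0]:,} to ${hadoop_cost[1]:,}")
```

Under these assumptions the legacy range is $150 million to $1.5 billion, against $6 million to $30 million for the Hadoop DW range.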
3. SQL on Hadoop - TAJO
High-speed SQL-on-Hadoop processing engine
• 3~5x improvement in processing speed over Hive under the TPC-H procedure
• 80~100% of Impala's response speed, without a data size limit
• Full ANSI SQL support for easy RDBMS migration

[ Legacy Approach (MR) ] Hadoop Cluster: Hive on MapReduce over HDFS
- Partially distributed
- Sequential processing
[ Tajo Approach ] Hadoop Cluster + Tajo: Tajo over HDFS
- Fully distributed
- Vector processing

Processing speed: process more data on the same clusters with improved processing speed.
Response speed: try more queries for analysis with improved response speed.
[Figure annotations, Hadoop Cluster query vs. Hadoop Cluster + Tajo query: up to 10x; min -> few sec~min]
3. SQL on Hadoop - TAJO
SQL Support
▪ ANSI SQL support
▪ Partition Type
▪ Meta Store
Service Stability
▪ High Availability
▪ Resource Manager
▪ Fair Scheduler
Performance
▪ High-speed processing
▪ Shuffling
▪ Dynamic Query Optimizer
▪ Query Rewriting
System Integration
▪ BI Connector
▪ Proxy Support
▪ Tajo-R
Function Support
▪ Analytic Function
▪ Hive Function
[Figure panels: Tajo Features (the list above), Performance Comparison, Apache Top-Level Project]
3.1 Tajo Architecture
[Figure: Tajo architecture. Components shown:]
- Clients: Tajo CLI, JDBC Driver, ODBC Driver
- Tajo Master: Client Service Handler, SQL Parser, Query Rewriter, Logical Planner, Logical Optimizer, Query Manager, Resource Manager, Tajo Catalog, HCatalog
- Persistent Storage: Derby Store, MySQL Store, PostgreSQL Store
- Worker: 1. Query Master (Global Planner, Client Service Handler), 2. TaskRunner (Local Query Engine, Storage Manager)
- Storage: Local, HDFS/HBase, S3/Swift
3.1 Technical Characteristic - Logical Flow Data Processing
Tajo Master
- SQL Parser: query parsing
- Logical/Global Planner: decomposition of the query into work units
- Resource Manager: delivery of work units to the servers (Tajo Workers)
Tajo Worker (several shown)
- Physical Planner: decomposition of a work unit into task operation units
- Query Engine: execution of unit operations
- Storage Manager: disk data I/O control
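The master/worker flow on this slide can be illustrated with a toy sketch: a master splits the input into work units and hands them to workers, each worker executes its units, and the partial results are merged. This is not Tajo's code; the function names, the fixed-size split, and the round-robin assignment are all assumptions made for illustration, since the slide does not say how the Resource Manager actually places work units.

```python
from collections import defaultdict


def plan_work_units(rows, unit_size):
    """Master side: decompose the input into fixed-size work units."""
    return [rows[i:i + unit_size] for i in range(0, len(rows), unit_size)]


def assign_to_workers(work_units, worker_count):
    """Master side: deliver work units to workers (round-robin is an assumption)."""
    assignments = defaultdict(list)
    for index, unit in enumerate(work_units):
        assignments[index % worker_count].append(unit)
    return assignments


def run_worker(units):
    """Worker side: run each unit operation and return a partial sum."""
    return sum(sum(unit) for unit in units)


# Toy "query": a sum over ten rows, executed by three hypothetical workers.
rows = list(range(1, 11))
work_units = plan_work_units(rows, unit_size=3)
assignments = assign_to_workers(work_units, worker_count=3)
partial_sums = [run_worker(units) for units in assignments.values()]
total = sum(partial_sums)

print(work_units)
print(partial_sums, total)
```

The final merge step (summing the partial sums) stands in for whatever stage collects worker output; the slide itself only covers parsing, planning, delivery, and per-worker execution.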
3.1 Technical Characteristic - JIT Query Engine
Behavior depends on the operand characteristic:
- 1 + 2 = 3
- “a” + “b” = “ab”
- {1,2} + {3,4} = {4,6}
- 1 + {1,2} = {2,3}

<Existing methods>
Implemented as a precompiled binary that must consider every possible case -> performance degradation (call, if, switch below 50%)
    switch(operand)
        case numeric: add numeric
        case string: add string

<JIT methods>
Real-time code generation based on operand type; combined operations can be handled by compiler optimization.
Four functions in a single operation (+ twice, - once, x once):
    Result = A x (1-B) + (1+C)
[Figure: expression tree for the formula above]
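The contrast between the two methods on this slide can be sketched in a few lines. The first function re-checks operand types on every call, as in the switch-based approach; the second generates one function for the whole expression at query time, so the four operators in Result = A x (1-B) + (1+C) run as a single call. Python's `exec` is used here only as a stand-in for real code generation, and both function names are hypothetical; the deck does not describe how Tajo's engine generates code.

```python
def interpreted_add(left, right):
    """Existing method: branch on the operand types for every single value."""
    if isinstance(left, str) and isinstance(right, str):
        return left + right  # string concatenation
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return left + right  # numeric addition
    raise TypeError("unsupported operand types")


def generate_fused_function(expression, column_names):
    """JIT-style method: generate one function for the whole expression, once."""
    source = f"def fused({', '.join(column_names)}):\n    return {expression}\n"
    namespace = {}
    exec(source, namespace)  # code is generated per query, not per row
    return namespace["fused"]


# Result = A x (1 - B) + (1 + C), fused into a single generated function.
fused = generate_fused_function("a * (1 - b) + (1 + c)", ["a", "b", "c"])
rows = [(2, 0, 3), (10, 1, 1)]
results = [fused(a, b, c) for a, b, c in rows]

print(interpreted_add(1, 2), interpreted_add("a", "b"))
print(results)
```

The generated function is built once and then reused for every row, which is the point of the approach: the type-dependent decisions are made at generation time rather than inside the per-row loop.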
3.1 Technical Characteristic - Storage Manager
[Figure: Storage Manager. Labels: Tajo Worker (scan), File request, Storage Manager, Scan Scheduler, Request queue (three shown), Disk Scanner (three shown), Pre-fetching Buffer, Bulk Read, Fine granularity]
4. Hadoop DW Commercialization Cases - Telco
[ SK Telecom ]
Business Challenge
• Explosion of log data with LTE service
• Increase in types of data to be analyzed
• Insufficient DW capacity due to high cost
How SKT Hadoop DW Helped
✓ 3x storage expansion at the same price, or 80% reduction in unit price
✓ Enabled ad-hoc analysis of unstructured text data sets for daily
✓ Reduced content-based analysis time from a few hours to 20 minutes at most

Category          MPP DBMS             Hadoop DW
Raw Data Size     0.5 TB/day           4 TB/day
Total ETL Time    Average of 3 hours   Average of 6 hours
DW Creation       30 minutes           40 minutes
Mart Creation     1 hour               1 hour 40 minutes
Report Creation   1 hour 30 minutes    2 hours 4 minutes
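The two columns of the table describe workloads of different sizes, so the raw times are not directly comparable. A small sketch that normalizes the average total ETL time by daily raw data volume makes the comparison explicit. It assumes ETL time scales roughly linearly with volume, which the deck does not claim; the variable names are illustrative.

```python
# Figures from the table: raw data per day (TB) and average total ETL time (hours).
systems = {
    "MPP DBMS": {"raw_tb_per_day": 0.5, "etl_hours": 3},
    "Hadoop DW": {"raw_tb_per_day": 4, "etl_hours": 6},
}

# ETL hours spent per terabyte of raw data.
hours_per_tb = {
    name: figures["etl_hours"] / figures["raw_tb_per_day"]
    for name, figures in systems.items()
}
speedup = hours_per_tb["MPP DBMS"] / hours_per_tb["Hadoop DW"]

for name, value in hours_per_tb.items():
    print(f"{name}: {value} ETL hours per TB")
print(f"Per-TB ETL speedup of Hadoop DW: {speedup}x")
```

On these numbers the MPP DBMS spends 6 hours per terabyte and the Hadoop DW 1.5 hours per terabyte, a 4x difference per unit of data even though the Hadoop DW run takes longer in absolute terms.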
4. Hadoop DW Commercialization Cases - Manufacturer
[ Global Top-5 Semiconductor Player ]
Business Challenge
• Collects an immense amount of unstructured measurement data during manufacturing
• RDBMS & BI cannot handle this type of data
• Even data loading can take up to 20 min
How SKT Hadoop DW Helped
✓ Support for unstructured data through a variable column schema
✓ 100x increase in data processing capacity
✓ Decreased data loading time by 10x (to 2 min)
✓ Minimized user action for pivot/unpivot