Southwest Power Pool big data case study

First Steps Towards a Data Lake:
Insight from Southwest Power Pool
Srinivas Kolluru and
Russell Mason
10/26/2016

Background
• Data is stored in a historical data store for analysis and reporting
• Near real-time data feeds
• ETL data feeds
• More data than initially expected
• Almost 50% of the data is less frequently used
9/23/2016 World of Watson 2016

Proof-of-Concept (PoC) Evaluation Criteria
Near-term requirements – Offload less frequently used data
• SQL query access to the data
• Minimal changes to the business queries
Long-term vision – Data Lake infrastructure
• ETL and real-time data feeds
• Transactional and analytics workloads
• Cost-effective model
• Ability to add compute and storage incrementally
• Long-term partnership

Proof-of-Concept (PoC) Evaluation
• Evaluated three different vendor technologies
• On-site PoC with three months of data and a sample of business queries
• Six-week evaluation of each vendor technology
• Chose the BigInsights product
  • Minimal business query modifications
  • Support for Netezza functions
  • Federated query capabilities between Netezza and BigInsights
  • Partnership

Data Lake Vision and Phase 1 Implementation

Data Lake Vision
[Diagram. Data sources: ETL Data, DB Data, Streaming, API Data, BLOB Data. Zones: Landing Zone, Data Zone. Stages: Data Extraction, Data Refinement, Data Discovery, Data Analytics. Consumers: Business Users, IT Users. Cross-cutting services: Metadata, Security, Quality, Governance, Monitoring.]

Data Lake Implementation
• Phase 1 – Offload less frequently used data; provide SQL query and federated query capabilities
• Phase 2 – Query performance improvements, transactional capabilities, and security controls
• Phase 3 – ETL data, streaming data, and governance
• Phase 4 – Real-time and BI analytics

Phase 1 - Data Offload Process
[Diagram. Data is exported from the Source System with CRC check validation into the Landing Zone, then imported into ORC tables in the Data Zone with CRC check validation. Cross-cutting services: Metadata, Security, Monitoring.]
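Both steps in the offload process include CRC check validation. The following is a minimal sketch of what a row-level check could look like, assuming rows are serialized as delimited text and checksummed with Python's `zlib.crc32`. The delimiter, the sample rows, and the helper names `row_crc` and `find_mismatches` are illustrative assumptions, not details of the SPP implementation.

```python
import zlib

DELIMITER = "|"  # assumed field delimiter for the exported text rows


def row_crc(fields):
    """Return the CRC-32 of one row, serialized as delimited UTF-8 text."""
    line = DELIMITER.join("" if f is None else str(f) for f in fields)
    return zlib.crc32(line.encode("utf-8"))


def find_mismatches(exported_rows, imported_rows):
    """Compare per-row CRCs from the export side with the import side.

    Returns the indexes of rows whose checksums differ. A difference in
    row count is reported as mismatches for the unmatched positions.
    """
    mismatches = []
    for i in range(max(len(exported_rows), len(imported_rows))):
        if i >= len(exported_rows) or i >= len(imported_rows):
            mismatches.append(i)
        elif row_crc(exported_rows[i]) != row_crc(imported_rows[i]):
            mismatches.append(i)
    return mismatches


# Hypothetical sample data: the second imported row has a corrupted value.
exported = [(1, "2016-01-15", 42.5), (2, "2016-01-16", 17.0)]
imported = [(1, "2016-01-15", 42.5), (2, "2016-01-16", 17.1)]
print(find_mismatches(exported, imported))
```

Computing the checksum once at export time and again after import lets the two sides be compared without shipping the full rows back to the source system.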

Architecture and Design Decisions
• Custom export process instead of using Fluid Query or Sqoop for data transfer
  • Fluid Query
    • Additional burden on the source system
  • Sqoop
    • Performance challenges with the JDBC approach
    • Limitations on incremental data pulls with the external table approach
    • Limitations with using database views
    • Difficulty with data validation checks
    • Difficulty partitioning data based on a record’s last update date
• Operational controls to minimize the impact on the source system
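One way a custom export could handle incremental pulls while limiting load on the source system is to select only rows updated since the previous run and hand them over in small batches. The sketch below assumes each record carries a `last_update` timestamp. The watermark value, the batch size, and the function names are hypothetical and are not taken from the slides.

```python
from datetime import datetime


def rows_since(rows, watermark):
    """Select rows whose last update is strictly later than the watermark."""
    return [r for r in rows if r["last_update"] > watermark]


def batches(rows, batch_size):
    """Yield rows in fixed-size batches so each export step stays small."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical source rows and watermark.
source_rows = [
    {"id": 1, "last_update": datetime(2016, 1, 10)},
    {"id": 2, "last_update": datetime(2016, 2, 3)},
    {"id": 3, "last_update": datetime(2016, 2, 20)},
    {"id": 4, "last_update": datetime(2016, 3, 1)},
]
watermark = datetime(2016, 2, 1)  # assumed time of the previous export run

pending = rows_since(source_rows, watermark)
for batch in batches(pending, batch_size=2):
    print([r["id"] for r in batch])

# The next run would start from the latest update time that was exported.
new_watermark = max(r["last_update"] for r in pending)
```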

Architecture and Design Decisions
• Data integrity
  • Row-level CRC checks in the data export and import processes
• ORC format
  • Monthly data partitioning
  • Observed better compression with ORC than with Parquet (query performance on large data sets has not been compared)
• Separate HDFS storage cluster
  • Scale storage independently of compute
  • Align with the IT strategy and processes
  • Expose the same set of files through multiple file system protocols
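Monthly partitioning can be illustrated with a small sketch that derives a partition label from each record's last update date. The `YYYY-MM` label format, the field names, and the sample rows are assumptions made for illustration. The slides do not state how partition values are named.

```python
from collections import defaultdict
from datetime import date


def partition_key(last_update):
    """Return a monthly partition label such as '2016-03'."""
    return f"{last_update.year:04d}-{last_update.month:02d}"


def group_by_month(rows):
    """Group row ids into monthly partitions keyed by last update date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["last_update"])].append(row["id"])
    return dict(partitions)


# Hypothetical rows spanning two months.
rows = [
    {"id": 1, "last_update": date(2016, 1, 31)},
    {"id": 2, "last_update": date(2016, 2, 1)},
    {"id": 3, "last_update": date(2016, 2, 29)},
]
print(group_by_month(rows))
```

Keying partitions on the last update date means a query restricted to a date range only needs to read the months that overlap it.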

Phase 1 Deployment Model
[Diagram. IBM BigInsights Cluster: Hadoop Nodes #1 to #5 with a Postgres metadata database. HDFS Storage Cluster: Nodes #1 to #3.]
BigInsights Component Deployment (Tentative)
Node #1: ZooKeeper Server, Hive Metastore, HiveServer2, Knox Gateway, WebHCat Server, BigInsights Home Server, BigSheets Master, Big SQL Worker, Metrics Monitoring, and Node Manager.
Node #2: ZooKeeper Server, History Server, Hive Metastore, HiveServer2, Metrics Collector, Active Resource Manager, WebHCat Server, Big SQL Worker, Metrics Collector, and Node Manager.
Node #3: ZooKeeper Server, App Timeline Server, Active HBase Master, Hive Metastore, HiveServer2, Oozie Server, Standby Resource Manager, WebHCat Server, Big SQL Worker, Metrics Monitor, and Node Manager.
Node #4: Big SQL Head, Data Server Manager, Hive Metastore, HiveServer2, WebHCat Server, Metrics Monitor, and Node Manager.
Node #5: Big SQL Secondary Head, ZooKeeper Server, Hive Metastore, HiveServer2, WebHCat Server, Region Server, HBase REST Server, Metrics Monitor, and Node Manager.

Phase 1 Production Deployment Model
[Diagram. Data Center 1 and Data Center 2 each contain Hadoop Nodes #1 to #5 with a Postgres metadata database, plus storage Nodes #1 to #3.]

Challenges
• Active-active deployment across the data centers
• Evaluating the IBM Big Replicate product
• Training for support teams
• Daylight saving time challenges
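The slides do not detail the daylight saving time challenges. One well-known source of trouble with local timestamps is the repeated hour at the autumn transition, sketched below with Python's `zoneinfo`. The `America/Chicago` zone and the sample timestamp are assumptions chosen for illustration, and the snippet needs Python 3.9 or later with a time zone database available.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

central = ZoneInfo("America/Chicago")  # assumed zone, for illustration only

# On 2016-11-06 the local clock moved from 02:00 back to 01:00, so the
# wall-clock time 01:30 occurred twice. The fold attribute selects which one.
first = datetime(2016, 11, 6, 1, 30, tzinfo=central)           # before the change
second = datetime(2016, 11, 6, 1, 30, fold=1, tzinfo=central)  # after the change

# The same local wall-clock value maps to two different UTC instants.
first_utc = first.astimezone(timezone.utc)
second_utc = second.astimezone(timezone.utc)
print(first_utc.isoformat(), second_utc.isoformat())
print(second_utc - first_utc)
```

Storing timestamps in UTC, or storing the UTC offset alongside the local time, keeps such records distinguishable.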
