3. • Data stored in historical data store for data analysis and reporting
• Near real-time data feeds
• ETL data feeds
• More data than initial expectations
• Almost 50% of data is less frequently used
Background
9/23/2016 World of Watson 2016
4. Near-term requirements – Offload less frequently used data
• SQL query access to the data
• Minimal changes to the business queries
Long-term vision – Data Lake infrastructure
• ETL and real-time data feeds
• Transactional and analytics workloads
• Cost-effective model
• Should be able to add compute and storage incrementally
• Long-term partnership
Proof-of-Concept (PoC) Evaluation Criteria
5. • Evaluated three different vendor technologies
• On-site PoC with three months of data and with a sample of business queries
• Six-week evaluation of each vendor technology
• Chose BigInsights product
• Minimal business query modifications
• Support for Netezza functions
• Federated query capabilities between Netezza and BigInsights
• Partnership
Proof-of-Concept (PoC) Evaluation
7.
Data Lake Vision
[Diagram: Data Lake Vision. Labels shown:]
• Data types: ETL data, DB data, streaming, API data, BLOB data
• Zones: Landing Zone, Data Zone
• Stages: Data Extraction, Data Refinement, Data Discovery, Data Analytics
• Supporting services: Metadata, Security, Quality, Governance, Monitoring
• Consumers: Business Users, IT Users
8. • Phase 1 – Offload less frequently used data; provide SQL query and federated query capabilities
• Phase 2 – Query performance improvements, transactional capabilities, and security controls
• Phase 3 – ETL data, streaming data, and governance
• Phase 4 – Real-time and BI analytics
Data Lake Implementation
9.
Phase 1 - Data Offload Process
[Diagram: data offload flow. Labels shown:]
• Components: Source System, Landing Zone (ORC), Data Zone (ORC)
• Steps: data export and CRC check validation; ORC table import and CRC check validation
• Supporting services: Metadata, Security, Monitoring
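The CRC check validation named in the offload flow can be illustrated with a minimal Python sketch. It is not the presenters' implementation: the field delimiter, the function names, and the choice of CRC32 from the standard `zlib` module are all assumptions made for illustration. The idea is only that the export side attaches a checksum to each row and the import side recomputes it.

```python
import zlib

# Assumed field delimiter (ASCII unit separator); the real export format is not given.
FIELD_SEP = "\x1f"


def row_crc(fields):
    """Return the CRC32 of a row's fields joined with FIELD_SEP.

    None is serialized as an empty string, so this sketch cannot tell
    NULL apart from an empty value.
    """
    payload = FIELD_SEP.join("" if f is None else str(f) for f in fields)
    return zlib.crc32(payload.encode("utf-8"))


def export_rows(rows):
    """Export side (hypothetical): pair every row with its CRC."""
    return [(tuple(row), row_crc(row)) for row in rows]


def validate_rows(exported):
    """Import side (hypothetical): return indexes of rows whose CRC mismatches."""
    return [i for i, (row, crc) in enumerate(exported) if row_crc(row) != crc]
```

An empty list from `validate_rows` means every imported row matches the checksum computed at export time; any index it returns marks a row to re-export.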
10. • Custom export process instead of using Fluid Query or Sqoop for data transfer
• Fluid Query
• Additional burden on source system
• Sqoop
• Performance challenges with JDBC approach
• Limitations with incremental data pulls with external table approach
• Limitations with using database views
• Difficulty with data validation checks
• Difficulty with partitioning data based on record’s last update date
• Operational controls to minimize the impact on the source system
Architecture and Design Decisions
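The slide does not describe how the custom export process works. As a hedged illustration of two of its stated concerns, incremental data pulls and operational controls that limit the impact on the source system, the Python sketch below pulls rows in bounded batches and pauses between them. Everything in it is hypothetical: `fetch_batch`, the `(last_update, id)` watermark, and the in-memory stand-in for the source system are assumptions, not part of the original design.

```python
import time
from datetime import datetime

# Assumed starting watermark: earlier than any real (last_update, id) pair.
START_KEY = (datetime.min, -1)


def export_in_batches(fetch_batch, watermark=START_KEY, batch_size=1000,
                      pause_seconds=0.0):
    """Yield batches of rows newer than `watermark`.

    `fetch_batch(after_key, limit)` is a hypothetical callable that returns at
    most `limit` rows whose (last_update, id) is greater than `after_key`,
    sorted ascending. Pausing between batches throttles load on the source.
    """
    while True:
        batch = fetch_batch(watermark, batch_size)
        if not batch:
            return
        yield batch
        last = batch[-1]
        # A compound key avoids skipping rows that share a timestamp.
        watermark = (last["last_update"], last["id"])
        time.sleep(pause_seconds)


def make_in_memory_source(rows):
    """Stand-in for the source system, used only to exercise the sketch."""
    ordered = sorted(rows, key=lambda r: (r["last_update"], r["id"]))

    def fetch_batch(after_key, limit):
        newer = [r for r in ordered if (r["last_update"], r["id"]) > after_key]
        return newer[:limit]

    return fetch_batch
```

A later run can pass the last watermark it reached instead of `START_KEY`, so only rows updated since the previous export are pulled.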
11. • Data Integrity
• Row level CRC checks with data export and import processes
• ORC format
• Monthly data partitioning
• Observed better compression with ORC than with Parquet (query performance on large data sets has not been compared)
• Separate HDFS storage cluster
• Scale storage independent of compute
• Align with the IT strategy and processes
• Expose the same set of files through multiple file system protocols
Architecture and Design Decisions
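Monthly data partitioning can be sketched as a small Python example. It assumes, based on the mention of a record's last update date elsewhere in the deck, that the partition key is derived from that date. The `YYYY-MM` key format, the `last_update` field name, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date


def month_partition(last_update):
    """Return an assumed partition key of the form YYYY-MM for a date."""
    return f"{last_update.year:04d}-{last_update.month:02d}"


def group_by_month(records):
    """Group records (dicts with a 'last_update' date) by monthly partition."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[month_partition(rec["last_update"])].append(rec)
    return dict(partitions)


if __name__ == "__main__":
    sample = [
        {"id": 1, "last_update": date(2016, 3, 14)},
        {"id": 2, "last_update": date(2016, 3, 30)},
        {"id": 3, "last_update": date(2016, 4, 2)},
    ]
    for key, recs in sorted(group_by_month(sample).items()):
        print(key, [r["id"] for r in recs])
```

Each key would then map to one partition of the ORC table, so a month of data can be offloaded or validated as a unit.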
12.
Phase 1 Deployment Model
[Diagram: an IBM BigInsights Cluster (Nodes #1 to #5, Hadoop, Postgres metadata database) alongside an HDFS Storage Cluster (Nodes #1 to #3)]
BigInsights Component Deployment (Tentative)
• Node #1: ZooKeeper Server, Hive Metastore, HiveServer2, Knox Gateway, WebHCat Server, BigInsights Home Server, BigSheets Master, Big SQL Worker, Metrics Monitor, and Node Manager
• Node #2: ZooKeeper Server, History Server, Hive Metastore, HiveServer2, Metrics Collector, Active Resource Manager, WebHCat Server, Big SQL Worker, and Node Manager
• Node #3: ZooKeeper Server, App Timeline Server, Active HBase Master, Hive Metastore, HiveServer2, Oozie Server, Standby Resource Manager, WebHCat Server, Big SQL Worker, Metrics Monitor, and Node Manager
• Node #4: Big SQL Head, Data Server Manager, Hive Metastore, HiveServer2, WebHCat Server, Metrics Monitor, and Node Manager
• Node #5: Big SQL Secondary Head, ZooKeeper Server, Hive Metastore, HiveServer2, WebHCat Server, Region Server, HBase REST Server, Metrics Monitor, and Node Manager
13.
Phase 1 Production Deployment Model
[Diagram: the Phase 1 deployment repeated in Data Center 1 and Data Center 2. Each data center contains an IBM BigInsights Cluster (Nodes #1 to #5, Hadoop, Postgres metadata database) and an HDFS Storage Cluster (Nodes #1 to #3).]
14. • Active-Active deployment across the data centers
• Evaluating IBM Big Replicate product
• Training for support teams
• Daylight saving time challenges
Challenges