Time series data at Capital One consists of infrastructure, application, and business process metrics. Internal stakeholders rely on the combination of these metrics for observability, which allows them to deliver better service and uptime for their customers. Protecting this critical data with a proven and tested recovery plan is therefore not a “nice to have” but a “must have.”
In this talk, IT staff members Saravanan Krisharaju, Rajeev Tomer, and Karl Daman share how they built a fault-tolerant solution based on InfluxEnterprise and AWS that collects and stores metrics and events. They then added machine learning, which uses the collected time series to model predictions that are brought back into the InfluxDB time series database for real-time access. The Capital One team also shares the journey they took to architect and build this solution and to plan and execute their disaster recovery plan.
Why Architecting for Disaster Recovery is Important for Your Time Series Data by Capital One
1. Confidential
Time Series Data Management Evolution at Capital One
Building a highly resilient Enterprise Influx across multiple regions
Monitoring Intelligence
October 23, 2018
2. Confidential
Agenda
1. Influx at Capital One
2. Architecture
How Influx Architecture evolved
3. Resiliency
Complete protection against entire region failure
4. Performance Metrics
Critical Platform Metrics
5. Q & A
3. Confidential
Influx at Capital One
• Business Transaction Metrics
• Infrastructure Health metrics
• Application Performance Metrics
• Service Adoption Metrics
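As an illustration of how metrics in these categories could be represented for InfluxDB, here is a minimal sketch that formats one point in InfluxDB line protocol. The measurement, tag, and field names are hypothetical and are not taken from the talk; the escaping covers only the common cases.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` wherever it appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build 'measurement,tag=value field=value timestamp' for a single point."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    # Tags are sorted by key, which is the recommended ordering for writes.
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={format_field(v)}" for k, v in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical application performance metric.
line = to_line_protocol(
    "app_latency",
    {"region": "region1", "app": "checkout"},
    {"p99_ms": 12.5, "count": 3},
    1540252800000000000,
)
print(line)
```

The same function would apply to any of the four categories; only the measurement name and the tag and field sets would differ.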
5. Confidential
Architecture – Gen1
[Diagram labels: Splunk, Direct API, Telegraf; https; LB; Primary Site with InfluxDB (Active); DR Site with InfluxDB (Standby); Backup/Restore; Visualization]
Features
1. Grafana for visualization
Challenges
1. High Data Retention ( > 400 days )
2. Unstable DR Solution
6. Confidential
Architecture – Gen2
[Diagram labels: Splunk, Direct API, Telegraf; https; LB; Primary Site with InfluxDB (Active); DR Site with InfluxDB (Standby); Backup/Restore; Visualization; Daily batch; AWS S3 Data Lake; ML Model Dev/Train, Model Governance, ML Model Execution; workflow steps A, B. Review, C. Deploy, D, E]
Features
1. Grafana for visualization
2. Raw Data Exported Daily to One Lake
3. Raw Data available for ML
Challenges
High Data Retention ( > 400 days )
2. Unstable DR Solution
7. Confidential
Architecture – Current
[Diagram labels: Splunk, Direct API, Telegraf; https; LB; Primary Site with InfluxDB (Active); DR Site with InfluxDB (Passive); AWS S3 with Export / Import; Visualization; Mini batch; AWS S3 Data Lake; ML Model Dev/Train, Model Governance, ML Model Execution; workflow steps A, B. Review, C. Deploy, D, E]
Features
1. Grafana for visualization
2. Raw Data Exported every 30 minutes to One Lake
3. Raw Data available for ML
4. Stable DR Solution with Passive Cluster
Challenges
High Data Retention ( > 400 days )
Unstable DR Solution
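The 30-minute mini-batch export could be driven by a window calculation like the one below. This is only a sketch under stated assumptions: the talk does not show the export script, so the function names and the object key layout (`exports/YYYY/MM/DD/HHMM.lp`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def batch_window(now: datetime, interval_minutes: int = 30):
    """Return (start, end) of the most recent fully elapsed export window.

    Windows are aligned to multiples of the interval since the Unix epoch,
    so a run at 10:47 UTC exports the half-open window [10:00, 10:30).
    `now` must be timezone-aware.
    """
    interval = timedelta(minutes=interval_minutes)
    completed = (now - EPOCH) // interval  # number of whole windows elapsed
    end = EPOCH + completed * interval
    return end - interval, end


def object_key(window_start: datetime) -> str:
    """Hypothetical S3 object key for one exported window."""
    return f"exports/{window_start:%Y/%m/%d/%H%M}.lp"


start, end = batch_window(datetime(2018, 10, 23, 10, 47, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat(), object_key(start))
```

Exporting only fully elapsed, epoch-aligned windows keeps consecutive runs from overlapping or leaving gaps, even if a run starts a few minutes late.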
9. Confidential
Resiliency – Region 1 Active
[Diagram labels: Splunk, Direct API, Telegraf; https; Route53 (DNS Switch); Region 1 (Active) with Cluster 1, LB, S3, ASG, and Data and Meta nodes across Zone 1, Zone 2, Zone 3; Region 2 (DR) with Cluster 2, LB, S3, ASG, and Data and Meta nodes across Zone 1, Zone 2, Zone 3; Admin node; Cross Region Replication]
Legend: M = Meta Node, D = Data Node, A = Admin Node
1. All Traffic routed to Region 1
2. Influx Export Script every 15 min
3. Data Replicated to Region 2
4. Influx Import Script every 15 min
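To reason about how stale the DR cluster can be under this export, replicate, import cycle, the following sketch estimates when a point written in the active region becomes available in the DR region. Only the 15-minute intervals come from the slide; the replication delay, the schedule offsets, and the function names are assumptions, and script run times are ignored.

```python
import math


def next_run(t: float, interval: float, offset: float = 0.0) -> float:
    """First scheduled run at or after time t, for runs at offset + k * interval."""
    runs_needed = math.ceil((t - offset) / interval)
    return offset + runs_needed * interval


def minutes_until_in_dr(
    write_time: float,
    export_interval: float = 15.0,
    import_interval: float = 15.0,
    replication_delay: float = 1.0,  # assumed cross-region replication time
    import_offset: float = 0.0,      # assumed start offset of the import schedule
) -> float:
    """Minutes between a write in the active region and its import in the DR region."""
    exported = next_run(write_time, export_interval)                 # step 2
    replicated = exported + replication_delay                        # step 3
    imported = next_run(replicated, import_interval, import_offset)  # step 4
    return imported - write_time


print(minutes_until_in_dr(0.5))   # written just after an export run
print(minutes_until_in_dr(14.0))  # written just before an export run
```

Under these assumptions the lag can approach the sum of the two intervals plus the replication time, which suggests that staggering the import schedule relative to the export schedule would reduce it.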
10. Confidential
Resiliency – Region 2 Active
[Diagram labels: Splunk, Direct API, Telegraf; https; Route53 (DNS Switch); Region 1 (DR) with Cluster 1, LB, S3, ASG, and Data and Meta nodes across Zone 1, Zone 2, Zone 3; Region 2 (Active) with Cluster 2, LB, S3, ASG, and Data and Meta nodes across Zone 1, Zone 2, Zone 3; Admin node; Cross Region Replication]
Legend: M = Meta Node, D = Data Node, A = Admin Node
1. All Traffic routed to Region 2
2. Influx Export Script every 15 min
3. Data Replicated to Region 1
4. Influx Import Script every 15 min
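The two resiliency slides describe the same four steps with the regions' roles swapped. A minimal sketch of that role swap follows; the `Topology` type and function names are hypothetical, and the actual switch on the slides is performed through Route53 (DNS Switch), which this code does not call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    active: str  # region currently receiving all traffic
    dr: str      # region kept in sync through export, replication, and import


def fail_over(topology: Topology) -> Topology:
    """Swap roles: the DR region becomes active and the old active region becomes DR."""
    return Topology(active=topology.dr, dr=topology.active)


def describe_steps(topology: Topology, interval_min: int = 15) -> list:
    """Render the four numbered steps from the slides for a given topology."""
    return [
        f"1. All Traffic routed to {topology.active}",
        f"2. Influx Export Script every {interval_min} min",
        f"3. Data Replicated to {topology.dr}",
        f"4. Influx Import Script every {interval_min} min",
    ]


before = Topology(active="Region 1", dr="Region 2")
after = fail_over(before)
for step in describe_steps(after):
    print(step)
```

Because the swap is symmetric, failing over twice returns the original topology, which matches the slides showing the same cycle running in either direction.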