This presentation describes the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration in the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus to modernize its legacy Netezza analytics platform. This involved the Impetus Workload Migration solution, a sophisticated translation engine that automated the migration of Fannie Mae's complex Netezza stored procedures and shell and scheduler scripts to Apache Spark-compatible scripts. This delivered substantial savings in time, effort, and cost while reducing overall project risk.
The scope of the automation project included an automated assessment capability that performed detailed profiling of the current workloads. The output of the assessment stage was a data-driven offloading blueprint and a roadmap for which workloads to migrate; a hybrid, cloud-based big data solution was designed from that blueprint. In addition to fulfilling the essential requirements of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performs SCD Type 1 and Type 2 processing for mission-critical attributes, and reloads the transformed data for reporting and analytical consumption.
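The SCD handling mentioned above can be illustrated with a short, hypothetical sketch. The function below is not Fannie Mae's or Impetus's implementation; it is a minimal plain-Python version of a Type 2 merge (Type 1 would simply overwrite the changed attributes in place), with invented field names such as `valid_from` and `is_current`.

```python
from datetime import date

def scd2_merge(dim_rows, incoming, key, tracked, today=None):
    """Type 2 merge sketch: expire changed rows and append new versions.

    dim_rows: existing dimension rows, dicts carrying 'valid_from',
              'valid_to', and 'is_current' (expired rows are mutated in place).
    incoming: new source rows, dicts keyed by `key` with tracked attributes.
    """
    today = today or date.today().isoformat()
    current = {r[key]: r for r in dim_rows if r["is_current"]}
    out = list(dim_rows)
    for row in incoming:
        live = current.get(row[key])
        if live and all(live[c] == row[c] for c in tracked):
            continue  # no tracked attribute changed; keep the current version
        if live:
            live["is_current"] = False  # close out the superseded version
            live["valid_to"] = today
        out.append({**row, "valid_from": today, "valid_to": None,
                    "is_current": True})
    return out
```

In a Spark-based migration the same pattern would typically be expressed as a join-and-union (or a `MERGE INTO` on a table format that supports it) rather than a row loop; the sketch only shows the versioning logic.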
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
7. Analytics and reporting tools will continue to propagate
AI libraries, data federation & virtualization, data integration & ETL, analytics LOB applications, self-serve analytics & visualization, data catalogs & metadata, data science platforms, self-serve data prep, data & compute platforms, and traditional BI.
Rather than fight the changes and limit choice, we need a platform that enables choice and manages the complexity.
8. Opportunities to drive efficiency and sharing…
[Diagram: an Active Analytic Catalog spanning 3rd-party, cloud, and on-prem data. Data science tools, BI tools, line-of-business tools, and new/custom applications each repeat the same first three steps: (1) connect to data tables; (2) join, massage, aggregate, or shape the data; (3) create calculations, derivations, expressions, and aggregations. Only step (4), tool-specific functions such as sending a campaign or viewing a model, differs by tool. Capturing steps 1–3 once (1x) in the catalog enables unlimited (∞) analytic reuse across tools.]
10. Fannie Mae's experience with Data Lakes
2014: open-source Hadoop
2015: analytics cluster using a proprietary Hadoop distribution
2016: Data Lake using a proprietary Hadoop distribution
2017: Data Lake using cloud-native technologies
2018: driving Data Lake adoption
Fannie Mae has been at the forefront of adopting cloud industry advancements.
11. Approach #1: Take a Governance View
[Diagram: the Enterprise Data Lake takes in business transaction data, 3rd-party data, reference data, and deal and delivery documents (structured, semi-structured, and unstructured) and serves BI reports and dashboards, ad-hoc and what-if queries, data as a service, and data science results.]
Governance follows the data life cycle: what goes in (Ingested), what's done with it (Processed, via preparation and transformation), and what goes out (Consumed).
Focus areas to automate, or to enable with tools, in managing the data lake: metadata, data security, data lineage, data zones (app, user, enterprise), data usage, data standards, access control, platform utilization, data certification, and compliance requirements.
12. Approach #2: Think about Personas
[Diagram: personas mapped onto Enterprise Data Lake (EDL) zones; the Enterprise Zone and App Zone each carry the same data layers: Landing, Prep, and Insight.]
Data Scientist / Analyst (User Zone): data discovery across the EDL; data reads or copies from other zones into the User Zone; data is contained (no outward movement); user data and results fall under local governance; no NPI.
Developer: data ingestion from external sources (catalog provided); NPI classification of external data; schema design and data catalog; data reads and movement between zones (controls and metadata); external movement and disclosures (controls and catalog); access governed by EDL RBAC; governance of the process and insight layer via extended metadata.
* Not all personas shown
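The containment rules on this slide (no outward movement from the User Zone, NPI classification of external data) could be expressed as a simple policy check. The sketch below is an illustrative assumption, not the actual EDL controls; the zone names and the allow-list entries are invented for the example.

```python
# Zone-to-zone moves explicitly permitted (illustrative allow-list).
ALLOWED_MOVES = {
    ("enterprise", "user"),  # analysts read/copy data into their User Zone
    ("app", "user"),
    ("enterprise", "app"),   # assumed for the sketch; not stated on the slide
}

def movement_allowed(src_zone, dst_zone, contains_npi):
    """Return True if a data move between zones passes the policy sketch."""
    if src_zone == "user":
        return False  # User Zone data is contained: no outward movement
    if dst_zone == "user" and contains_npi:
        return False  # keep NPI out of the self-service User Zone
    return (src_zone, dst_zone) in ALLOWED_MOVES
```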
13. Approach #3: How can we bring two worlds together?
[Diagram: SDLC-driven tech platforms (traditional BI, AI libraries, data catalogs & metadata, data & compute platforms, data federation & virtualization, data integration & ETL) and business analytics (analytics LOB applications, self-serve analytics & visualization, data science platforms, self-serve data prep) converge on a data collaboration platform: a centralized service catalog with federated delivery and maintained lineage.]
14. Approach #4: It's new and evolving, so leverage partners who can think end-to-end
Worked with Impetus to establish new patterns for analytics data provisioning. The use case involved a retirement project and a cloud transition, and implementation required full production context (real production, real users).
Solution included:
• One-time historical data migration (on-prem to cloud)
• Migration of existing base tables and snapshots
• New build for cloud-hosted dimensions and snapshots
• New build for ongoing data flows (end-to-end)
Framework steps:
1. Establish data extraction and ingestion framework
2. Job orchestration
3. Data transformation and change capture
4. Establish audit framework (operations, controls)
5. Capture reusable utilities and build the library
6. Monitor and report performance for each step
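The audit and performance-monitoring steps above can be sketched as a wrapper applied to each pipeline stage. This is a hypothetical illustration, not the framework that was built: the decorator name `audited` and the audit-entry fields are invented for the example.

```python
import functools
import time

def audited(step_name, audit_log):
    """Record status, row count, and elapsed time for one pipeline step."""
    def wrap(fn):
        @functools.wraps(fn)
        def run(*args, **kwargs):
            entry = {"step": step_name, "status": "running",
                     "started": time.time()}
            try:
                result = fn(*args, **kwargs)
                entry["status"] = "succeeded"
                # Row count for list-like results; None for anything else.
                entry["rows"] = (len(result)
                                 if hasattr(result, "__len__") else None)
                return result
            except Exception as exc:
                entry["status"] = "failed"
                entry["error"] = str(exc)
                raise
            finally:
                entry["elapsed_s"] = time.time() - entry["started"]
                audit_log.append(entry)
        return run
    return wrap
```

Each extraction, transformation, or load step wrapped this way contributes one audit record, which gives the operations and controls view of the run without touching the step's own logic.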
16. Challenges
• Cloud adoption and Data Lake development can require manual processes, hand-coding, and reliance on command-line tools
• Keeping track of your data and its lineage, and making it easy to find
• Coupling of ingestion and processing drives architecture decisions
• Operationalizing processes for production and maintaining SLAs
• Ensuring data is in canonical forms with a shared schema usable by others
• Coding or filing tickets to perform new ingestion and processing tasks
• Multiple architectures and technologies used by different teams on different clusters
• Guaranteeing compliance in a system designed for schema-on-read and raw data
• Sharing infrastructure in a multi-tenant "self-service" environment
• Business awareness and buy-in
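One of the challenges above, keeping data in canonical forms with a shared schema, is often addressed with a conformance check at ingestion. The sketch below is hypothetical; the schema contents (field names and types) are invented for illustration.

```python
# Illustrative canonical schema: required field names mapped to Python types.
CANONICAL_SCHEMA = {"record_id": str, "amount": float, "as_of_date": str}

def conforms(record, schema=CANONICAL_SCHEMA):
    """True if the record has exactly the canonical fields, correctly typed."""
    if set(record) != set(schema):
        return False  # missing or extra fields
    return all(isinstance(record[field], ftype)
               for field, ftype in schema.items())
```

Records failing the check would be routed to a quarantine area rather than promoted out of the landing layer, so downstream consumers can rely on the shared schema.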
17. What we have learned
Review your development practices holistically
• You need new patterns for data movement
• Don't lift and shift!
Think governance first!
• Incorporate new processes into your data governance strategy
• Focus on sustainable practices that fully envision how the end-to-end process fits together
Engage strategic partners where it makes sense, and keep engaging your business partners to ensure alignment.
As the center of gravity of data moves toward the cloud, hybrid strategies will become increasingly important. This is a migration that, for seasoned companies, will take time. Don't migrate to the cloud for tech reasons alone; engage your business!