Using Hadoop for Enterprise Data Management
1.
2.
Marc Hebert
Chief Operating Officer
Estuate
510-468-7132
marc@estuate.com
Jeff Tuck
IBM Optim Product Manager
720-395-6032
jtuck@us.ibm.com
Peter Costigan
IBM Optim Product Manager
408-656-9161
costigan@us.ibm.com
3.
The Hadoop Data Management Challenge
Using Hadoop for Test Data Management
Making Archive Data Available to Big Data Analytics
Summary
4.
The Hadoop Data Management Challenge
Many IT shops are using Hadoop for serious analytic applications and accumulating large amounts of data
Hadoop is fast becoming a standard platform for analytics and other uses
As a result, managing data in and with Hadoop will pose challenges over the next few years
Hadoop can be a very useful tool for managing test data in an overall test data management context
Hadoop will also likely become a data archive repository of choice for many IT shops that have application archiving and retirement initiatives
Subsetting, masking and archiving Hadoop data itself also needs attention
IBM’s Optim platform leverages Hadoop for these purposes
5.
Diagram labels: Channels, Big data, Enterprise Applications
Process steps: Discover, Subset, Mask, Refresh & Analyze
Discover
• Identify sensitive data
• Understand data relationships
• Identify proper test data
Subset
• Automatically extract test data required for each test case
• Test only on the required values to keep environments efficient
Mask
• Enforce data integrity while masking
• Support context- and application-aware masking
Refresh & Analyze
• On-demand access to and refresh of test data
• Automate test result comparisons to reduce errors
Diagram labels: Divisional, Customer, Network, Billing analysis
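The Subset step on this slide, extracting only the rows a test case needs while keeping related rows together, can be sketched roughly as follows. This is a minimal illustration in plain Python and not how Optim implements subsetting; the tables, column names, and the `subset_by_region` helper are all hypothetical.

```python
# Hypothetical parent and child tables, held in memory as lists of dicts.
customers = [
    {"cust_id": 1, "region": "WEST"},
    {"cust_id": 2, "region": "EAST"},
    {"cust_id": 3, "region": "WEST"},
]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 3},
    {"order_id": 13, "cust_id": 1},
]


def subset_by_region(region):
    """Pick the parent rows for one region, then only the child rows that reference them."""
    parents = [c for c in customers if c["region"] == region]
    parent_keys = {c["cust_id"] for c in parents}
    # Following the foreign key keeps the subset referentially intact:
    # no order in the result points at a customer that was left out.
    children = [o for o in orders if o["cust_id"] in parent_keys]
    return {"customers": parents, "orders": children}


subset = subset_by_region("WEST")
```

Selecting the parent rows first and then following the key to the child rows is what keeps each extracted record a complete business object rather than a set of orphaned rows.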
6.
Requirements
Protect confidential data used in big data platforms
Mask data on screen in applications and reports
Implement proven data masking techniques
Support compliance with privacy regulations
Benefits
Protect sensitive information from misuse and fraud
Prevent data breaches and associated fines
Achieve better information governance
De-identify sensitive information with realistic but fictional data
Personally identifiable information is masked with realistic but fictional data
JASON MICHAELS → ROBERT SMITH
Mask data on demand
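One common way to replace personally identifiable values with realistic but fictional ones is deterministic substitution: the same input always maps to the same fictional value, so joins across tables still line up. The sketch below is an assumption-laden illustration, not Optim's masking algorithm; the name lists, the `mask_name` helper, and the demo key are hypothetical.

```python
import hashlib
import hmac

# Hypothetical lookup lists of fictional replacement values.
FIRST_NAMES = ["ROBERT", "LINDA", "DAVID", "MARIA", "JAMES", "SUSAN"]
LAST_NAMES = ["SMITH", "JONES", "BROWN", "CLARK", "LEWIS", "YOUNG"]


def mask_name(full_name, key=b"demo-key-change-me"):
    """Map a real name to a fictional one, consistently for the same input."""
    # A keyed hash (HMAC) makes the mapping repeatable but hard to reverse
    # without the key. Upper-casing first makes the match case-insensitive.
    digest = hmac.new(key, full_name.upper().encode("utf-8"), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


masked = mask_name("JASON MICHAELS")
```

Because the mapping depends only on the input and the key, a customer masked in one table receives the same fictional name in every other table, which is one way to preserve data integrity while masking.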
7.
Diagram labels: InfoSphere Optim, InfoSphere BigInsights, BigSheets; Dev, QA, Integration
Scalable and Cost Effective
• Leverage the scalability of Hadoop to grow your test environment to support all test data needs
• Benefit from high performance at a low TCO
Trusted
• Mask sensitive data on the way in and out
• Process test data as a complete business object and maintain relationship integrity
Open
• Leverages the open and flexible Hadoop architecture
• Built-in connectors to move data in and out
• Query and analyze test data using Big SQL & Hive
• Visualize test data with BigSheets, Watson Explorer or other Hadoop analytic tools
8. A fully functional test data management offering for Hadoop
– Supports Hive as a native source and target data store
– Optim Primary Keys and Relationships
– Access Definitions, Table Maps and Column Maps
– Extract, Convert and Load
» New Load service designed specifically for Hadoop
» Insert not supported due to Hive limitations
– Browse, Edit, Compare and Create
9. A Test Data Management solution that uses Hadoop as a test data management warehouse to store, analyze, search and retrieve structured test data, satisfying testing use case data requirements throughout all phases of the application development lifecycle
Business Objectives
1. Store and catalog data in Hadoop and use it as a test data warehouse
2. Explore cataloged data residing in a Hadoop test data warehouse
3. Search for cataloged data residing in a Hadoop test data warehouse
4. Retrieve cataloged data from a Hadoop test data warehouse
5. Copy cataloged data from a Hadoop test data warehouse into other, non-Hadoop relational data stores
10.
New Capabilities and Benefits
• Hadoop as a test data landing zone: highly scalable at low cost; more data can be under control of testers; higher agility to adjust & create test data sets; open to access & manipulate data
• BigSheets (BigInsights tooling*): visualization and manipulation of data
• BigSQL (BigInsights tooling*): rich SQL + standard access + security … and more
Expanded control for developers & testers to retrieve & create test data
Diagram labels: Production DB, Optim Technology ((Subset) & Mask, Load/Refresh), Optim v11.3 Load, Hadoop BigInsights (open + managed), Test Data
*restricted license
11.
Making Archive Data Available to Big Data Analytics
Optim enables clients to take historical data from production systems and place that data into an archive file
That archive file can have retention applied in support of corporate and regulatory compliance requirements
Data from archive files can be easily made available in Hadoop in support of analytic initiatives, while the Optim archive file remains the system of record
12.
Benefits of Using Optim Archive as the System of Record
Ensures data is kept in the original business context without modification
Provides the ability to restore information to production systems (selectively if required), including recreation of schemas and database objects as needed
Enables retention and disposition of information based on legal and corporate policy (for example, delete after 7 years)
Enables eDiscovery and Legal Hold workflows
Imposes access controls on archived data for data consumers
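The interaction between a retention period and a legal hold can be illustrated with a short sketch. It only shows the decision logic under stated assumptions and is not how Optim evaluates policies: the `ArchiveRecord` type and both helper functions are hypothetical, and the 7-year period is taken from the slide's example.

```python
from dataclasses import dataclass
from datetime import date

RETENTION_YEARS = 7  # taken from the "delete after 7 years" example


@dataclass
class ArchiveRecord:
    """Hypothetical metadata kept for one archive file."""
    archive_id: str
    archived_on: date
    legal_hold: bool = False


def disposal_date(archived_on, years=RETENTION_YEARS):
    """Return the first date on which the archive may be disposed of."""
    try:
        return archived_on.replace(year=archived_on.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; use Mar 1 instead.
        return date(archived_on.year + years, 3, 1)


def is_disposable(record, today):
    """A record is disposable only if retention has elapsed and no hold applies."""
    return (not record.legal_hold) and today >= disposal_date(record.archived_on)
```

A legal hold overrides the retention clock here: a held record is never reported as disposable, however old it is.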
13.
Considerations when Leveraging Hadoop as a System of Record
How will data access mechanisms be secured?
Are audit records required for data access?
Will the data set stored in Hadoop be immutable and guaranteed not to be altered?
How will retention policies be executed and explained when required?
In the event of audit requests, are there processes in place to leverage Hadoop as a source?
14. Apply Retention / Hold Policies
Capture complete business object
Preserve Data Integrity
Preserve Schema Metadata
Load data into Hadoop as needed
Archive Cold Data
Queryable analytical data store, using Hadoop
Archive & Purge Data
InfoSphere Optim
Compressed, immutable, auditable & restorable archives
Database, IMS, VSAM, More…
Archive files, Hadoop
15.
Optim – Hadoop Integration:
• The Optim Hadoop Loader converts an Optim archive file into CSV and loads it into HDFS
• Data is accessible via query engines like BigSQL, Hive, or Impala (depending on the Hadoop distribution)
Diagram labels: Application, Database, Complete object, Optim Data Archive, Data Archive files, Optim Hadoop Loader, CSV Files, Hadoop Cluster, Hive warehouse, HCatalog metadata, BigSQL .. query processing
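The flow on this slide, archive rows converted to CSV and then exposed to a query engine through Hive metadata, can be approximated with the sketch below. It does not use or reproduce the Optim Hadoop Loader: it only writes CSV text and builds a Hive `CREATE EXTERNAL TABLE` statement. The helper names, table name, columns, and HDFS path are all hypothetical.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize a list of row dicts to CSV text, one line per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in columns])
    return buf.getvalue()


def hive_external_table_ddl(table, columns, hdfs_dir):
    """Build Hive DDL that points an external table at a directory of CSV files."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


# Hypothetical archived rows and target location.
rows = [{"id": 1, "name": "ACME"}, {"id": 2, "name": "GLOBEX"}]
csv_text = rows_to_csv(rows, ["id", "name"])
ddl = hive_external_table_ddl(
    "archive_customers", [("id", "INT"), ("name", "STRING")], "/archive/customers"
)
```

An external table leaves the files where they are, so the archive file can remain the system of record while Hadoop serves queries. Note that the simple delimited row format does not interpret quoted fields, so values that contain commas would need a CSV-aware SerDe or a different delimiter.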
16.
Manage Your Hadoop Data with Help from Your Friends: Estuate and IBM Optim
Estuate is the world’s leading specialist in IBM Optim
Deep product development relationship with IBM
Over 250 Optim implementations
IBM Optim is the world’s leading data archiving platform with 76% market share, per Gartner
Optim customers are starting to leverage Hadoop platforms in their Information Lifecycle Governance initiatives
Estuate brings deep Optim and Hadoop experience and best practices to help you advance your Hadoop strategy and projects
You can do this with either an on-premises or a hosted service
17.
IBM Optim - A Platform for Enterprise Data Management
Integrated Data Management
Production Databases, Test & Development Databases
Data Growth Solution
Value: Improve Application Performance, Reduce Infrastructure Costs & Improve Compliance
• Retain only needed data, move the rest to archives
• Deploy Tiered Storage Strategies
• Retain Data According to Value
• Simplify Infrastructure
Decommissioning Solution
Value: Reduce Infrastructure Cost & Compliance
• Decommission redundant or obsolete applications
• Retain Access to historical data
Data Privacy Solution
Value: Risk Management
• Protect PII Data
• Apply Single Data Masking Solution
• Leverage realistic data
Test Data Management Solution
Value: Speed Application Delivery
• Create realistic and manageable test environments
• Speed application delivery
• Improve Test Coverage
• Improve Quality
IBM InfoSphere Discovery
Value: Automates analysis of data and data relationships for complete understanding of data assets
• Discover undocumented business rules used to transform data from existing systems
• Prototype and test new transformations for the target system
• Define the business objects for archiving and subsetting
• Identify all instances of private data so that they can be fully protected
18.
Enterprise Architecture
InfoSphere
An integrated, modular environment to manage enterprise application data and optimize data-driven applications from requirements to retirement across heterogeneous environments.
Discovery
Data Growth, Application Retirement, Test Data Management, Data Privacy
19.
Summary: The Benefits of Hadoop for Enterprise Data Management
Hadoop is state-of-the-art as a Test Data Management platform
Makes testing more agile and nimble
Leverages the power of Optim Data Privacy as well for PCI compliance
Hadoop will gradually become a powerful repository for corporate archived data
Supporting ILG initiatives and compliance