Data Asset Catalog & Metadata Management - Is It a Fad or Is It the Future?
Many have dubbed metadata “the new black,” but is this accurate?
How to leverage metadata management to streamline data governance and ensure transparency
Improving data quality and ensuring consistency and accuracy of data across various reporting systems
Looking at the flip side: what are the additional training requirements and value-added for the business?
Chief Data & Analytics Officer Fall Boston - Presentation
1. CDAO Fall Boston
October 27th-28th, 2021
Data Catalog and Metadata Management
- Is It a Fad or Is It the Future?
Presented by Srinivasan Sankar
2. Disclaimer
Please note this presentation is for general informational and discussion
purposes only. The opinions expressed in this presentation are those of the
presenter and not necessarily those of their employer. The presenter does not
guarantee the accuracy or reliability of the information provided herein. No
representation or warranty, express or implied, is provided by the presenter in
relation to the fairness, accuracy, correctness, completeness or reliability of
the information, opinions or conclusions expressed herein. The information
contained in this presentation is subject to change without notice. This
presentation does not, and is not designed to, provide legal advice and
meeting participants should consult with an attorney concerning the use of
data to ensure all legal and regulatory requirements are satisfied.
3. AGENDA TOPICS
• Are Data Catalogs / Metadata “the new black”?
• Leverage metadata management to streamline data governance and ensure
transparency
• Improve insights by extracting value from unstructured data utilizing a machine
learning augmented data catalog
• Let the insights come to you with AI-augmentation
• Practical steps to deal with the onslaught of data and learn how to implement an
effective data catalog
• Multi-source data to increase the potential of data value
• Data Catalog – key enabler of a Data Mesh
6. Definition
A data catalog creates and maintains an
inventory of data assets through the
discovery, description and organization of
distributed datasets. The data catalog
provides context to enable data stewards,
data/business analysts, data engineers, data
scientists and other line of business (LOB) data
consumers to find and understand
relevant datasets for the purpose of extracting
business value.
In a nutshell, a data catalog is a place that shows what data assets you have and where they are
located. You might be asking, what is a data asset? It is any entity (e.g., reports, databases,
websites) that contains data.
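As a rough illustration of the definition above, the sketch below models a catalog as an inventory of data assets that can be located and searched. The names `DataAsset`, `DataCatalog`, `locate` and `search`, and the sample assets and locations, are hypothetical and are not taken from the presentation or from any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One entry in the inventory (all fields are illustrative)."""
    name: str
    kind: str                       # e.g. "report", "database", "website"
    location: str                   # where the asset lives
    description: str = ""
    tags: set = field(default_factory=set)


class DataCatalog:
    """A toy in-memory catalog: what assets exist and where they are."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        self._assets[asset.name] = asset

    def locate(self, name):
        return self._assets[name].location

    def search(self, term):
        """Return names of assets whose name, description or tags match."""
        term = term.lower()
        return sorted(
            a.name for a in self._assets.values()
            if term in a.name.lower()
            or term in a.description.lower()
            or term in {t.lower() for t in a.tags}
        )


catalog = DataCatalog()
catalog.register(DataAsset("claims_db", "database",
                           "postgres://warehouse/claims",
                           "Policy claims history", {"claims"}))
catalog.register(DataAsset("loss_ratio_report", "report",
                           "bi://reports/loss-ratio",
                           "Monthly loss ratio by region",
                           {"claims", "finance"}))
```

With this in place, `catalog.locate("claims_db")` answers "where is it?" and `catalog.search("claims")` answers "what do we have on this topic?".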
Data Catalogs Are the “New Black” in Data Management and Analytics
7. METADATA / CATALOG ROLE
IN DATA MANAGEMENT
• To understand Metadata’s vital role in data
management, imagine a large library, with
hundreds of thousands of books and
magazines, but no card catalog. Without a card
catalog, readers might not even know how to
start looking for a specific book or even a
specific topic. The card catalog not only
provides the necessary information (which
books and materials the library owns and
where they are shelved), it also enables
patrons to find materials using different
starting points (subject area, author, or title).
Without the catalog, finding a specific book
would be difficult if not impossible.
An organization without Metadata is like a library without a card catalog
9. THE CASE FOR DATA CATALOGS
Analyze data, not chase data – Many data scientists spend over two-thirds of their time finding and
understanding the data. The main reason for this problem is a poor mechanism for handling and tracking
all the data in an organization. A good catalog helps the data scientist or business analyst understand
the data and answer the questions they have.
Efficient access control – As an organization grows, role-based policies are needed; not everybody
should be able to modify the data. Access control should be implemented while building the data lake.
Roles are assigned to users, and data access is controlled according to those roles.
Eliminate data redundancies – A good catalog tool helps us find data redundancies and eliminate
them. This saves storage and data management costs.
Follow laws and regulations – Different data protection laws and regulations apply depending on the
data, such as GDPR, BASEL, GDSN, HIPAA, and many more. They must be followed when dealing with any
data. But they cover different use cases and do not apply to every data set; to know which ones apply,
we need to know about the data set. A good catalog helps ensure data compliance by giving a view of
data lineage and by using access control.
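One way a catalog tool might surface redundant data sets is by comparing content fingerprints. The sketch below is an assumption for illustration only: the helper names `fingerprint` and `find_redundant` and the sample data are hypothetical, and real tools typically work from profiling statistics or samples rather than hashing full contents.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Order-independent content hash of a data set given as a list of tuples."""
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def find_redundant(datasets):
    """Group data set names whose contents are identical."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical data sets: the "copy" holds the same rows in a different order.
datasets = {
    "sales_2021": [(1, "east", 100), (2, "west", 250)],
    "sales_2021_copy": [(2, "west", 250), (1, "east", 100)],
    "customers": [(1, "Acme"), (2, "Globex")],
}
duplicates = find_redundant(datasets)
```

Here `duplicates` groups `sales_2021` with `sales_2021_copy`, which a steward could then review before retiring one of them.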
10. HOW TO ADOPT DATA CATALOGS INTO GOVERNANCE
Enterprise Data Asset Catalog: Phased approach
Data Cataloging is a journey……
Phase 1: Catalog and Lineage
• Infrastructure and Installation of Catalog tool
• Data Architects to initiate the collection of data assets, catalog and identify lineage
Phase 2: Data Stewardship, Business Glossary
• Appoint Part-time Governance Lead role (cross-functional business facing)
• Supporting Analyst
• Manage Governance activities
Phase 3: Operationalize Governance activities
• Accountability, Ownership of Data
• Operationalize Data Governance activities
• Report Metrics
• Iterate activities for all information / data projects
Improve / Enhance Data Governance
[Diagram labels: Manage Data Lifecycle; Establish Data Governance; Sustain Data Governance; Communicate; Manage Return On Investment; Maintain Organization & Sponsorship; Review/Update Processes; Review/Update Scope (Quarterly Workshop); Business Change Management; Review & Approve New Projects; Maintain Data Definitions; Maintain Metrics; Identify Data Stewards; Conflict Resolution, Escalation; Plan; Organize; Organize; Define; Deploy; Core Foundation]
11. DATA CATALOG BEST PRACTICES
Assigning ownership for the data set – Ownership of each data set must be defined. There must be a
person whom users can contact in case of an issue. A good catalog must also name the owner of every
data set.
Human touch – After a catalog is built, users must verify the data set entries to make them more
accurate.
Searchability – The catalog should support search. Searchability enables data asset discovery: data
consumers easily find assets that meet their needs.
Data protection – Define access policies to prevent unauthorized data access.
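The ownership and data protection practices above can be sketched as a per-data-set policy table checked before any access. Everything here is hypothetical (the `POLICIES` and `USER_ROLES` tables, the role names, the users and the `is_allowed` helper); it only illustrates the role-based idea, not how any particular catalog enforces it.

```python
# Hypothetical policy table: data set -> owner and permitted actions per role.
POLICIES = {
    "claims_db": {
        "owner": "asha",
        "roles": {"data_steward": {"read", "write"}, "analyst": {"read"}},
    },
}

# Hypothetical assignment of users to roles.
USER_ROLES = {"asha": "data_steward", "ben": "analyst"}


def is_allowed(user, dataset, action):
    """Return True only if the user's role grants the action on the data set."""
    policy = POLICIES.get(dataset)
    if policy is None:
        return False                      # unknown data sets are denied
    role = USER_ROLES.get(user)
    return action in policy["roles"].get(role, set())


def owner_of(dataset):
    """The person to contact when there is an issue with the data set."""
    return POLICIES[dataset]["owner"]
```

Denying by default (unknown user, role or data set) keeps the sketch on the safe side: access has to be granted explicitly.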
12. HIGH ROI FOR MULTI-SOURCE DATA WITH DATA CATALOG
External data empowers teams to make better data-driven decisions, especially when it’s integrated with first-party data.
Single-source data has value in relation to other data in the organization, and the ability to search and analyze across
multiple information sources provides tremendous insight.
[Graphic labels: Data Catalog; Traditional DW; Enterprise Data Integration and Data Lake; Weather, Highway safety, Industry; Driving Tracker, Nest Protect, GPS Fleet Tracking. Graphic source: CEB analysis]
13. BUILD AN INTELLIGENT DATA CATALOG BY INTEGRATING ARTIFICIAL INTELLIGENCE INTO IT
What is it?
AI powered process for curating, verifying, and classifying data that enhances speed and usability
• Applicable to large, complex and often streaming data sets
• 3rd party data, sensor data, customer data, transactions
How does it work?
Use Algorithms (Advanced Statistics and Deep Learning) to learn from the large scale data to:
• Identify patterns
• Business rules
• Quality issues and anomalies across massive, complex and often streaming data sets
• Algorithmic sampling of data to identify key patterns and business rules
• Continuous monitoring to alert Data Stewards of exceptions for timely resolution
• Correlation of data concepts across domains and data sources to track usage and establish lineage
• Ability to ingest and apply quality rules to third party and unstructured data sources
• Establishes feedback loop that refines the machine learning models to improve data quality over time
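To make the continuous monitoring idea on this slide concrete, here is a deliberately simple sketch that flags exceptions with a z-score test. It uses plain statistics as a stand-in for the advanced statistics and deep learning the slide mentions; the function name `find_exceptions`, the threshold of 3.0 and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev


def find_exceptions(history, new_values, threshold=3.0):
    """Return new values that deviate strongly from the historical pattern.

    A value is flagged when it lies more than `threshold` sample standard
    deviations from the historical mean (a basic z-score rule).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is constant: anything different is an exception.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]


# Hypothetical daily row counts for a data set, then three new observations.
history = [100, 102, 98, 101, 99, 100, 103, 97]
alerts = find_exceptions(history, [101, 250, 99])
```

In this example only the value 250 is flagged, and it would be routed to a Data Steward for review. The steward's verdict on each alert is the kind of feedback that could later be used to tune the threshold or retrain a model.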
14. DATA CATALOG
THE NUCLEUS OF A DATA MESH*
• A data product must be easily discoverable,
especially through a data catalog, along with its
metadata such as owners, source of origin,
lineage, sample datasets, etc. This centralized
discoverability service allows data consumers,
engineers and scientists in an organization to find
a dataset of interest easily. Each domain data
product must register itself with this centralized
data catalog for easy discoverability.
• Note that the perspective shift here is from a single
platform extracting and owning the data for its use
to each domain providing its data as a product in a
discoverable fashion.
• Data catalog platforms provide central
discoverability, access control and governance of
distributed domain datasets.
*Data Mesh (a concept founded by Zhamak Dehghani) is a sociotechnical approach to sharing, accessing and managing analytical data in complex and large-scale environments, within or across organizations.
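The registration step described on this slide, where each domain data product registers itself with the central catalog, might look roughly like the sketch below. The class `CentralCatalog`, the `REQUIRED_FIELDS` tuple and the sample domains and products are hypothetical; the required fields simply mirror the metadata the slide lists (owner, source of origin, lineage).

```python
# Metadata every data product must supply (an assumption based on the slide).
REQUIRED_FIELDS = ("owner", "source", "lineage")


class CentralCatalog:
    """Toy central discoverability service for domain data products."""

    def __init__(self):
        self._products = {}

    def register(self, domain, name, metadata):
        """Register a product; reject it if required metadata is missing."""
        missing = [f for f in REQUIRED_FIELDS if f not in metadata]
        if missing:
            raise ValueError(f"missing metadata: {', '.join(missing)}")
        self._products[f"{domain}.{name}"] = dict(metadata)

    def discover(self, domain=None):
        """List registered products, optionally limited to one domain."""
        return sorted(
            key for key in self._products
            if domain is None or key.startswith(domain + ".")
        )

    def upstream(self, key):
        """Return the declared lineage (upstream sources) of a product."""
        return list(self._products[key]["lineage"])


mesh_catalog = CentralCatalog()
mesh_catalog.register("claims", "settled_claims", {
    "owner": "claims-team",
    "source": "claims_db",
    "lineage": ["claims_db.raw_claims"],
})
mesh_catalog.register("fleet", "gps_trips", {
    "owner": "fleet-team",
    "source": "gps_fleet_tracking",
    "lineage": ["gps_fleet_tracking.pings"],
})
```

Each domain keeps ownership of its own data; the central catalog holds only the metadata needed to find it, which is the perspective shift the slide describes.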