1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Top Three Big Data Governance
Issues and How Apache ATLAS
resolves it for the Enterprise
June 28, 2016
Apache Atlas
2.
Disclaimer
This document may contain product features and technology directions that are under development, may be
under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation
project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release
through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache
Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual
commitment, promise or obligation from Hortonworks to deliver these features in any generally available
product.
Product features and technology directions are subject to change, and must not be included in contracts,
purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it
when making purchasing decisions.
3.
Atlas Data Governance
Organizations need data governance to understand their information and answer
questions such as:
• What do we know about our information?
• Where did this data come from and who can use it?
• Does this data adhere to company policies and rules?
4.
Vision - Enterprise Data Governance Across Platforms
[Diagram labels: Structured, Unstructured, Traditional RDBMS, MPP Appliances, Metadata, Data Lake, Projects 1, 3, 4, 5 and 6]
Atlas: Metadata Truth in Hadoop
Data Management
along the entire data lifecycle with integrated
provenance and lineage capability
Modeling with Metadata
enables comprehensive data lineage through
a hybrid approach with enhanced tagging and
attribute capabilities
Interoperable Solutions
across the Hadoop ecosystem, through a
common metadata store
5.
Apache Atlas Overview
6.
Atlas Data Governance
Data governance practices provide a holistic approach to managing,
improving and leveraging information to help you gain insight and build
confidence in business decisions and operations.
Atlas helps customers discover information about data objects, their
meaning, location, characteristics, and usage.
7.
Atlas timeline: from DGI to present
• Dec 2014 – DGI group kickoff
• May 2015 – Apache Atlas incubation
• July 2015 – HDP 2.3 Foundation GA release (first kickoff to GA in 7 months)
• Jan 2016 – HDP 2.4: Kafka/Storm, Sqoop, Falcon, tag-based security
• Summer 2016 – HDP 2.5: Business Catalog, AD integration, versioning
Global Financial Company
* DGI: Data Governance Initiative
Key Benefits:
• Co-Dev = Built for real customer use cases
• Faster & Safer = Customers know business + HWX knows Hadoop
8.
Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant
data lakes. Many enterprises have siloed data and metadata stores that collide
in the data lake. This is compounded by the ability to have very large
windows (years). Can traditional EDW tools manage 100 million entities
effectively with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata
is the only viable solution. This allows quick integration with automation
and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship,
attribute based security and self-service.
Key Benefits:
Modern Data Lakes
need new ways to
govern because:
• Cost – Traditional staff ratio
to data size not possible
• Diversity – Only way to
manage velocity of new
datasets
• Agility – Quick change based
on tags / taxonomy
9.
High Level Architecture: 4 Key points
[Diagram labels: Type System, Repository, Search DSL, Bridge (Hive, Storm, Falcon, Custom), REST API, Graph DB, Search, Kafka, Sqoop, Connectors, Messaging Framework]
1 Data Lineage: Only product that captures lineage across Hadoop components at platform level.
2 Agile Data Modeling: Type system allows custom metadata structures in a hierarchical taxonomy.
3 REST API: Modern, flexible access to Atlas services, HDP components, UI & external tools.
4 Exchange: Leverage existing metadata / models by importing them from current tools. Export metadata to downstream systems.
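The "Agile Data Modeling" point on this slide, custom metadata structures arranged in a hierarchy, can be illustrated with a small sketch. This is not the Atlas type system or its API: `TypeDef`, `TypeRegistry` and the type names `asset`, `table` and `claims_table` are hypothetical, chosen only to show how a subtype could inherit attributes from its supertypes.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """A hypothetical metadata type: a name, its own attributes, and supertypes."""
    name: str
    attributes: set = field(default_factory=set)
    supertypes: list = field(default_factory=list)


class TypeRegistry:
    """Illustrative registry; not part of any Atlas client library."""

    def __init__(self):
        self._types = {}

    def register(self, typedef):
        # Supertypes must already be registered, which keeps the hierarchy acyclic.
        for parent in typedef.supertypes:
            if parent not in self._types:
                raise KeyError(f"unknown supertype: {parent}")
        self._types[typedef.name] = typedef

    def all_attributes(self, name):
        """Attributes declared on the type plus everything it inherits."""
        typedef = self._types[name]
        attrs = set(typedef.attributes)
        for parent in typedef.supertypes:
            attrs |= self.all_attributes(parent)
        return attrs


registry = TypeRegistry()
registry.register(TypeDef("asset", {"name", "owner"}))
registry.register(TypeDef("table", {"columns"}, ["asset"]))
registry.register(TypeDef("claims_table", {"retention_days"}, ["table"]))

print(sorted(registry.all_attributes("claims_table")))
# prints ['columns', 'name', 'owner', 'retention_days']
```

A custom type such as `claims_table` only declares what is new; everything else comes from the hierarchy, which is the property the slide attributes to the type system.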
10.
Governance Ready Certification Program
Discovery
Tagging
Prep / Cleanse
ETL
Governance
BPM
Self Service
Visualization
Choice: Customers choose the features they want to deploy, a la carte versus vendor lock-in
Curated & Fast: Selected group of vendor partners to
provide rich, complementary and complete features ready
to deploy
Agile: Low switching costs, faster deployment and
innovation
Centralized: Common SLA & common open metadata
store
Flexibility: Interoperability of products through Atlas
metadata
Safe: HDP at core to provide stability and interoperability
11.
Governance Ready Certification Program
Completed:
• Waterline
• Dataguise
• Attivo
Next:
• SAP ILM, VORA
• IBM IGC
Work in progress:
• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Trifacta
12.
Near Term Roadmap:
Summer 2016
13.
Summer 2016 Release Summary
• Dynamic Access Policies
• Cross component lineage
• Enterprise Readiness
• Business Catalog
Differentiator
Differentiator
Differentiator
14.
Dynamic Access Policy
Apache Ranger + Atlas Integration
15.
Summary of Dynamic Access Policies
• Basic Tag policy – PII example. Permission
mapped to a reusable tag, not a resource
• Geo-based policy – Policy based on IP address
mappings. Rule enforcement is dynamically
geo-aware.
• Time-based policy – Timer for data access for
resource management, compliance reporting
• Prohibitions – Prevention of toxic combinations
of Hive tables or columns that may pose a risk
together.
Key Benefits:
New scalable metadata
based security paradigm
Dynamic, real-time
policy
Automatically updates to
changes in metadata
Centralized and simple
to manage policy
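The geo-based, time-based and prohibition rules summarized on this slide can be sketched as one decision function. This is only an illustration of the three rule kinds, not how Ranger evaluates policies: the network range, expiry date, column names and the function `is_allowed` are all hypothetical.

```python
from datetime import datetime
from ipaddress import ip_address, ip_network

# Hypothetical policy inputs; none of these names come from Ranger or Atlas.
ALLOWED_NETWORKS = [ip_network("10.20.0.0/16")]                # geo rule: permitted source ranges
ACCESS_EXPIRES = datetime(2016, 12, 31, 23, 59, 59)            # time rule: access ends here
TOXIC_COMBINATIONS = [{"patients.name", "claims.diagnosis"}]   # prohibition rule


def is_allowed(client_ip, now, requested_columns):
    """Return (decision, reason) for one request under the three rule kinds."""
    addr = ip_address(client_ip)
    if not any(addr in net for net in ALLOWED_NETWORKS):
        return False, "geo: source address outside permitted ranges"
    if now > ACCESS_EXPIRES:
        return False, "time: access window has expired"
    requested = set(requested_columns)
    for combo in TOXIC_COMBINATIONS:
        # A combination is toxic only when all of its columns are requested together.
        if combo <= requested:
            return False, "prohibition: columns may not be read together"
    return True, "ok"


print(is_allowed("10.20.3.4", datetime(2016, 7, 1), ["patients.name"]))
# prints (True, 'ok')
print(is_allowed("10.20.3.4", datetime(2016, 7, 1),
                 ["patients.name", "claims.diagnosis"])[0])
# prints False
```

Each column in the prohibited pair is readable on its own; only the combination is denied, which is the "toxic combination" idea from the last bullet.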
16.
How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business Classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects:
a sensitive “PII” tag on department HR will be inherited by group
HR > Driver
• Atlas will notify Ranger of changes via a Kafka topic
Apache Atlas
Hive
Ranger
Falcon
Kafka
Storm
Atlas provides the
metadata tag to
create policies
Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mapping for performance
• Ranger will have a policy based on tags instead of roles.
• Example: PII = <group>. This can work for many assets.
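The inheritance example on this slide (a "PII" tag on HR reaching HR > Driver) can be sketched as a walk up a taxonomy tree. The class `TaxonomyNode` and its methods are hypothetical and exist only for illustration; they are not Atlas classes.

```python
class TaxonomyNode:
    """A node in a hypothetical business taxonomy, e.g. Company > HR > Driver."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.own_tags = set()

    def path(self):
        """Render the node's position, root first."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))

    def effective_tags(self):
        """Tags set on this node plus those inherited from every ancestor."""
        tags = set(self.own_tags)
        if self.parent is not None:
            tags |= self.parent.effective_tags()
        return tags


company = TaxonomyNode("Company")
hr = TaxonomyNode("HR", parent=company)
driver = TaxonomyNode("Driver", parent=hr)

hr.own_tags.add("PII")

print(driver.path(), sorted(driver.effective_tags()))
# prints Company > HR > Driver ['PII']
print(company.path(), sorted(company.effective_tags()))
# prints Company []
```

The tag flows downward only: Driver picks up "PII" from HR, while Company, which sits above HR, does not.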
17.
Scalable Access Control – Reusable Tag Policy
User group
• AD
• Linux
Resources:
• Files
• Tables
• Topologies
Atlas Tag
• PII
ANY asset PII
• Files
• Tables
• Topologies
Single Admin Group
Assigns
Many Stewards Tag +
Single point of
enforcement and
audit
All future tagging
is covered by
existing policy
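The claim that all future tagging is covered by an existing policy can be shown with a small sketch: the policy is keyed by tag, so tagging a new asset changes access without touching the policy. The dictionaries, asset names, group names and `can_read` below are hypothetical, and the sketch ignores resource-based policies that a real deployment would also evaluate.

```python
# Hypothetical data: one tag policy written once, and a tag-to-asset mapping
# maintained separately by data stewards.
tag_policies = {"PII": {"hr-admins"}}          # tag -> groups allowed to read
asset_tags = {
    "hdfs:/data/hr/payroll.csv": {"PII"},      # file
    "hive:hr.employees": {"PII"},              # table
}


def can_read(user_groups, asset):
    """Allow access only if the user is in an allowed group for every tag on the asset.

    Assets with no tags are allowed here because this sketch models tag policies only.
    """
    for tag in asset_tags.get(asset, set()):
        if not (tag_policies.get(tag, set()) & set(user_groups)):
            return False
    return True


# A steward tags a new asset later; the policy table is not touched.
asset_tags["storm:hr-events-topology"] = {"PII"}

print(can_read({"hr-admins"}, "storm:hr-events-topology"))  # prints True
print(can_read({"marketing"}, "storm:hr-events-topology"))  # prints False
```

Keeping the two tables separate mirrors the split on the slide: many stewards assign tags, while a single admin group owns the policy.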
18.
Automatic update of policies – active protection
Metastore
• Tags
• Assets
• Entities
Notification
Framework
Kafka Topics
Atlas
Atlas Client
• Subscribes to
Topic
• Gets Metadata
Updates
PDP
Resource Cache
Ranger
Notification Metadata
updates
Message
durability
Optimized
for Speed
Event driven
updates
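The event-driven flow on this slide (metadata changes published to a topic, a subscriber keeping a resource cache current) can be sketched as follows. An in-memory `queue.Queue` stands in for the Kafka topic, and the message shape, `publish_tag_change` and `ResourceCache` are assumptions made for illustration; the real Atlas notification format and the Ranger cache are not shown here.

```python
import json
import queue

topic = queue.Queue()  # in-memory stand-in for a Kafka topic


def publish_tag_change(entity, tags):
    """Metadata side: emit one message per change to an entity's tags."""
    topic.put(json.dumps({"entity": entity, "tags": sorted(tags)}))


class ResourceCache:
    """Policy side: keeps an entity -> tags map current from topic messages."""

    def __init__(self):
        self.tags_by_entity = {}

    def drain(self, source):
        """Apply every pending message and return how many were applied."""
        applied = 0
        while True:
            try:
                message = source.get_nowait()
            except queue.Empty:
                return applied
            event = json.loads(message)
            if event["tags"]:
                self.tags_by_entity[event["entity"]] = set(event["tags"])
            else:
                # An empty tag list means the entity no longer carries any tag.
                self.tags_by_entity.pop(event["entity"], None)
            applied += 1


cache = ResourceCache()
publish_tag_change("hive:hr.employees", {"PII"})
publish_tag_change("hive:sales.orders", {"FINANCE"})
publish_tag_change("hive:sales.orders", set())

print(cache.drain(topic))      # prints 3
print(cache.tags_by_entity)    # prints {'hive:hr.employees': {'PII'}}
```

Policy decisions read only from the local cache, so lookups stay fast, while the topic carries the updates that keep the cache in step with the metadata store.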
19.
Hadoop Cross Component
Data Lineage
20.
Apache Atlas Component Integration
• Cross-component dataset lineage. Centralized
location for all metadata inside HDP
• Single Interface point for Metadata Exchange with
platforms outside of HDP
Apache Atlas
Hive
Ranger
Falcon
Sqoop
Storm
Kafka
Spark
NiFi
HBase
HDP 2.3
HDP 2.5
Beyond HDP 2.5
21.
Users in the upcoming release of HDP 2.5 will be able to
track lineage across the following components using
Atlas:
Sqoop – Import from and export to relational databases, and
any additional package that leverages Sqoop. ATLAS-184, SQOOP-2609
Hive – Dataset lineage with entity versioning (including schema
changes). ATLAS-75, ATLAS-183, ATLAS-492
Kafka/Storm – IoT event-level processing, such as syslogs or
sensor data. ATLAS-181, ATLAS-183, STORM-1381
Falcon – Data lifecycle at Feed and Process entity level for
replication and repeating workflows. Tracks periodicity,
throttling, eviction. ATLAS-69, FALCON-1570
Summary of Data Lineage
Key Benefits:
Enterprises need open
solutions, not single app
vendor
More native connectors
than anyone else with
more coming
Hardened metadata
infrastructure
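Cross-component lineage as summarized on this slide amounts to a graph of datasets connected by the processes that read and write them. The sketch below is a minimal illustration with hypothetical dataset and process names; it does not use the Atlas lineage API, and `upstream` is a name invented here.

```python
from collections import defaultdict

# Hypothetical lineage edges: (input dataset, process, output dataset).
edges = [
    ("rdbms:crm.customers", "sqoop:import_customers", "hive:staging.customers"),
    ("kafka:clickstream", "storm:sessionize", "hive:staging.sessions"),
    ("hive:staging.customers", "hive:join_sessions", "hive:marts.customer_360"),
    ("hive:staging.sessions", "hive:join_sessions", "hive:marts.customer_360"),
]

# Index the edges by output so each dataset can look up what fed it.
inputs_of = defaultdict(set)
for source, process, target in edges:
    inputs_of[target].add((source, process))


def upstream(dataset):
    """Return every dataset that feeds `dataset`, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for source, _process in inputs_of.get(current, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


for name in sorted(upstream("hive:marts.customer_360")):
    print(name)
```

Because the edges come from different components (Sqoop, Storm, Hive in this made-up example), a single traversal crosses component boundaries, which is what makes tracing an error back to its source possible.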
22.
Sqoop
Teradata
Connector
Apache
Kafka
Expanded Native Connector: Dataset Lineage
Custom
Activity
Reporter
Metadata
Repository
RDBMS
Any process
using Sqoop is
covered
No other tool
tracks IoT out of
the box
25.
Security/Enterprise Readiness
• Highly reliable and scalable components
• Authorization with AD via Ranger
• Rolling upgrade support for HDP 2.5+
• BC & DR capabilities
• 5x performance improvement over the previous version
26.
Enterprise Readiness:
Scalable and Highly Reliable Components
[Diagram labels: Solr Cloud, Kafka Quorum, Type System, Repository, Search DSL, Bridge (Hive, Storm, Falcon, Custom), REST API, Graph DB, Search, Kafka, Sqoop, Connectors, Messaging Framework, HBase]
28.
Business Taxonomy (Catalog)
29.
Key Concepts
Business Taxonomy (Catalog)
The practice and science of classification of things or
concepts, including the principles that underlie such
classification. The business organization model is
hierarchical, making it authoritative with no duplication.
Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the
data's origins and where it moves over time. It describes
what happens to data as it goes through diverse processes. It
helps provide visibility into the analytics pipeline and
simplifies tracing errors back to their sources
Tags: Traits vs. Labels vs. Business Taxonomy
Atlas has tags that are authoritative and prevent duplication.
A tag can span different parts of the business taxonomy. A tag
PII can be used in HR as well as Finance or Sales.
Benefits:
A view of data assets
organized by business
language
Impact analysis, Compliance,
Acceptable use
Common tag through Hadoop
components
30.
Taxonomies Benefits:
• Search / Discovery – Business catalog of
conceptual, logical and physical assets
• Security – Dynamic metadata-based
access control
31.
We conduct open-ended user interviews so that we can learn more
about who our users are and what their needs are. This helps us
validate whether or not we’re solving the right problem.
Research: Focused on Hadoop
32.
We test our prototype in InVision - a click through prototyping tool
that allows users to interact with static mockups.
Usability Testing
33.
Principal Roles & Activities
• Data Steward – Curator, responsible
for catalog veracity
• Data Scientist – Analyst, primary
consumer of Business Catalog
• Administrator – Role management
only
• Data Engineer – Data ingress and
egress, semantic data quality
• 50% - 80%+ of time spent looking for data
• Profit Center
• Primary User of Atlas
• Enables Scientist
Goal: < 25% of time spent on finding data = empowering scientists to spend their time uncovering insights, faster
34.
Atlas Value
• Designed for Hadoop at platform, not application level
• High Confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – ensures information is current through native connectors
• Dynamic protection with Ranger in simple audited policies
35.
Additional Atlas Sessions
• Extend Governance in Hadoop with the Atlas Ecosystem:
integrations with partners Waterline, Trifacta and Attivo:
Thursday 4:10PM @ Room 210A
• BOF: Apache Knox and Apache Ranger provide Hadoop security
while Atlas provides a Hadoop metadata store and enterprise
compliance. Come learn and discuss security & governance
innovations and future directions.
Thursday 5-7 PM @ Room 210A
36.
Learn More:
• Hortonworks links: http://hortonworks.com/solutions/security-and-governance/
• Tutorials: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
Editor's Notes
How fast? 7 months!
Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy
Which vendors would you be interested in?
The point of Atlas is to leverage metadata to drive exchange, agility and scalability in the HDP governance solution. The paradigm shift is that in a true multi-tenant data lake with 10K+ objects, conventional management of entitlement and enforcement will not work and new patterns must be used. One group cannot both understand the data and manage policy efficiently; the domain is too large. These activities must be decoupled. The data stewards curate the data (tagging), as they are the SMEs, and the policy folks create a policy once based on tags (access rules). In our thinking, this is the ONLY scalable solution. We have it and CDH does not. Apache Atlas = low-level service like YARN. It will be common to the whole HDP platform, providing core metadata services and enriching the whole HDP stack. We start with Hive in HDP 2.3 and will extend to Ranger and Falcon in M10 and continue with Kafka and Storm by the end of 2015.
Yellow + Atlas = governance features.
Show – clearly identify customer metadata. Change
Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis
** bring meta from external systems into hadoop – keep it together
- Learn who our users are and what their needs are, to validate whether we are solving the right problem
Open-ended half-hour discussions about processes, challenges and current tools
We record the interviews so that we can focus on the conversation and analyze them afterward
- Test our prototype in Invision - A click through prototyping tool
- Walk users through scenarios and watch how they respond
- Remind our participants that we aren’t testing them, we’re testing the design and encourage thinking aloud
Was the product well understood?
Is the product something they would use?
Where is the value?