The document discusses Apache Atlas, an open source project aimed at solving data governance challenges in Hadoop. Atlas is proposed to provide capabilities such as data classification, metadata exchange, centralized auditing, search and lineage tracking, and security policies. The proposed architecture involves a type system to define metadata, a graph database to store it, and search and lineage services built on top. A governance certification program is also proposed to ensure that partner solutions integrate well with Atlas and Hadoop.
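The type system is the foundation of that architecture: every piece of metadata in Atlas is an instance of a defined type, persisted in the graph database. As a minimal sketch of what working with the type system can look like, the Python snippet below registers a new entity type through the Atlas REST API. The endpoint, credentials, and the sales_table type name are assumptions for illustration, and the v2 API shown here reflects later Atlas releases rather than the original proposal.

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # assumed local Atlas endpoint
AUTH = ("admin", "admin")                      # default sandbox credentials

# Register a new entity type in the Atlas type system. Extending the
# built-in DataSet supertype lets instances participate in lineage.
typedef = {
    "entityDefs": [{
        "name": "sales_table",          # hypothetical type name
        "superTypes": ["DataSet"],
        "attributeDefs": [{
            "name": "retention_days",   # illustrative governance attribute
            "typeName": "int",
            "isOptional": True,
            "cardinality": "SINGLE",
            "isUnique": False,
            "isIndexable": True,
        }],
    }]
}

resp = requests.post(f"{ATLAS}/types/typedefs", json=typedef, auth=AUTH)
resp.raise_for_status()
print(resp.json()["entityDefs"][0]["name"], "registered")
```

Once a type is registered, entities of that type can be created, tagged, searched, and traced through lineage, which is the capability set the rest of this document motivates.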
A data governance framework in any organization comprises a combination of people, process, and technology in place to establish decision rights and accountabilities for information. A governance policy defines who can take what actions with what information, when, under what circumstances, and using what methods.
The technology goals for a data governance framework are to provide a platform for a common approach across all systems and data within the organization. Explicitly, the framework needs to be:
- Transparent: Governance standards & protocols must be clearly defined and available to all
- Reproducible: Recreate the relevant data landscape at a point in time
- Auditable: All relevant events and assets must be traceable with appropriate historical lineage
- Consistent: Compliance practices must be consistent
One of the most important and challenging goals is consistency, and when we inspect how data governance is implemented in Hadoop, this consistency seems almost impossible to achieve.
While there are constructs within the various Hadoop-related projects to aid governance, there is no single, consistent approach to implementing such a framework. Disparate tools such as HCatalog, Ranger, and Falcon provide pieces of the overall solution, but none of them takes a holistic approach across the stack.
Further, as Hadoop gains traction in the enterprise, it has only recently been asked to integrate with existing operations, security, and now governance frameworks. Governance has become a foundational concern.
Consistent with our approach, Hortonworks recently led the formation of the Data Governance Initiative, which brings together experts from some of the largest enterprises, traditional vendors, and the Hadoop community to address governance within Hadoop.
The members have already defined a comprehensive solution for data governance in Hadoop. It will provide the base functions of metadata services, a deep audit store, and an advanced policy rules engine. More importantly, it will interoperate with and extend existing third-party data governance and management tools by shedding light on how data is accessed within Hadoop.
This approach allows organizations to apply a comprehensive governance policy across all of their data.
In order to provide these functions for Hadoop, much work will need to be accomplished throughout all of the Hadoop projects. The initiative is unique in that it is not a spot solution, but will build on top of existing projects such as Apache Falcon for data lifecycle management and Apache Ranger for global security policies.
The founding members of this initiative include Aetna, Target, Merck, SAS, and Hortonworks, and many more are lined up to contribute. This is a broad, community-based initiative set to deliver the right features for the right requirements, and it is perfectly representative of the Hortonworks open community approach.
Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the Data Governance Initiative and others from the Hadoop community. The core functionality defined by the project includes the following (a short usage sketch follows the list):
Data Classification – create an understanding of the data within Hadoop and expose that classification to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or a specific data set was constructed.
Security and Policy Engine – implement engines to protect data and rationalize data access according to compliance policy.
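To make these functions concrete, the sketch below exercises three of them through the Atlas v2 REST API: it registers a PII classification, runs a DSL search for Hive tables carrying that tag, and fetches lineage for each hit. The endpoint, credentials, and the PII tag name are assumptions for illustration; hive_table is the entity type populated by the Atlas Hive hook.

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # assumed local Atlas endpoint
AUTH = ("admin", "admin")                      # default sandbox credentials

# Data Classification: register a "PII" tag in the type system so that
# sensitive data sets can be labeled and discovered consistently.
# (Re-running this returns an error once the tag exists; fine for a sketch.)
pii_def = {"classificationDefs": [{"name": "PII", "attributeDefs": []}]}
requests.post(f"{ATLAS}/types/typedefs", json=pii_def, auth=AUTH).raise_for_status()

# Search: use the Atlas DSL endpoint to find Hive tables carrying the PII tag.
hits = requests.get(
    f"{ATLAS}/search/dsl",
    params={"typeName": "hive_table", "classification": "PII", "limit": 10},
    auth=AUTH,
).json()

# Lineage: for each hit, walk upstream and downstream to see how the
# data set was constructed and what depends on it.
for entity in hits.get("entities", []):
    lineage = requests.get(
        f"{ATLAS}/lineage/{entity['guid']}",
        params={"direction": "BOTH", "depth": 3},
        auth=AUTH,
    ).json()
    name = entity["attributes"].get("qualifiedName", entity["guid"])
    print(name, "->", len(lineage.get("relations", [])), "lineage edges")
```

As a usage note, the same classification can also drive the policy side: in current releases of the stack, Apache Ranger supports tag-based access policies fed from Atlas classifications, so a tag such as PII can translate directly into an enforced compliance rule.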