SlideShare una empresa de Scribd logo
1 de 22
Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HDP: Data Governance
We Do Hadoop
Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Enterprise Data Governance Goals
GOAL: Provide a common approach to
data governance across all systems
and data within the organization
• Transparent
Governance standards & protocols must be
clearly defined and available to all
• Reproducible
Recreate the relevant data landscape at a
point in time
• Auditable
All relevant events and assets but be
traceable with appropriate historical lineage
• Consistent
Compliance practices must be consistent
ETL/DQ
BPM
Business
Analytics
Visualization
& Dashboards
ERP
CRM
SCM
MDM
ARCHIVE
Governance
Framework
Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Challenges WITHIN Hadoop
• No comprehensive governance within
the Hadoop stack
• Mostly disjoint as each project defines its own
future and there is no common framework
• Disparate tools, such as HCatalog, Ranger and
Falcon provide pieces of the overall solution
• No integration with external governance
frameworks
• Difficult to get right because each
project is autonomous and you need
insight into traditional IT
ApachePig
ApacheHive
ApacheHBase
ApacheAccumulo
ApacheSolr
ApacheSpark
ApacheStorm
Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Initiative for Hadoop
ETL/DQ
BPM
Business
Analytics
Visualization
& Dashboards
ERP
CRM
SCM
MDM
ARCHIVE
Data Governance Initiative
Common
Governance
Framework
1 ° ° ° ° ° ° °
° ° ° ° ° ° ° °
° °
° °
°
°
ApachePig
ApacheHive
ApacheHBase
ApacheAccumulo
ApacheSolr
ApacheSpark
ApacheStorm
TWO Requirements
1. Hadoop must snap in to
the existing frameworks
and be a good citizen
2. Hadoop must also provide
governance within its own
stack of technologies
A group of companies dedicated to meeting
these requirements in the open
Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
We Do Hadoop
Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
New Project Proposal: Apache Atlas
Apache Atlas
Proposed open source project
aimed at solving the Hadoop
data governance challenge in
the open.
Key Capabilities
• Data Classification
• Metadata Exchange
• Centralized Auditing
• Search & Lineage (Browse)
• Security & Policy Engine
Apache Atlas
Knowledge Store
Audit Store
ModelsType-System
Policy RulesTaxonomies
Tag Based
Policies
Data Lifecycle
Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA
HL7
Financial
SOX
Dodd-Frank
Energy
PPDM
Retail
PCI
PII
Other
CWM
Essential Timeline
Phase-3
• Collaboration Features
• Self Service
• Steward Delegation
• Profiling & Pattern Analysis
• Visualization
Phase-2
• Advance audit reporting
• Advanced Policy Engine
• Row / Column Masking
• 3rd party Metadata exchange
1H 2015 GA
• Rest API
• Centralized Taxonomy
• Import / export metadata
• Basic Policy Rules Engine
• Real-time access control
• Column Level Tagging
Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview
Data Classification
• Import or define taxonomy business-oriented annotations for data
• Define, annotate, and automate capture of relationships between data sets and underlying
elements including source, target, and derivation processes
• Export metadata to third-party systems
Centralized Auditing
• Capture security access information for every application, process, and interaction with data
• Capture the operational information for execution, steps, and activities
Search & Lineage (Browse)
• Pre-defined navigation paths to explore the data classification and audit information
• Text-based search features locates relevant data and audit event across Data Lake quickly
and accurately
• Browse visualization of data set lineage allowing users to drill-down into operational, security,
and provenance related information
Security & Policy Engine
• Rationalize compliance policy at runtime based on data classification schemes
• Advanced definition of policies for preventing data derivation based on classification (i.e. re-
identification)
Apache Atlas
Knowledge Store
Audit Store
ModelsType-System
Policy RulesTaxonomies
Tag Based
Policies
Data Lifecycle
Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA
HL7
Financial
SOX
Dodd-Frank
Energy
PPDM
Retail
PCI
PII
Other
CWM
Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Governance Ready Certification Program
Curated group of vendor partners to provide
rich & complete features
Customers choose features that they want to
deploy – a la carte.
Low switching costs !
HDP at core to provide stability and
interoperability
Discovery
Tagging
Prep /
Cleanse
ETL
Governance
BPM
Self
Service
Visual-
ization
Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Architecture
Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
High Level Architecture
Type System
Repository
Search DSL
Bridge
Hive Storm OthersSqoop
REST API
Titan / HBase
Solr/Elastic
Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Type System – Overview of Types
• Class
• Struct
• Trait
• Primitives
• Collections
• Map
• Array
• Instances (Entity)
• Referenceable
Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Type System – Data Types
Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Type System - Examples
HierarchicalTypeDefinition<ClassType> tblClsDef =
TypesUtil.createClassTypeDef(TABLE_TYPE, null,
attrDef("name", DataTypes.STRING_TYPE),
attrDef("description", DataTypes.STRING_TYPE),
new AttributeDefinition("db", DATABASE_TYPE,
Multiplicity.REQUIRED, false, null),
new AttributeDefinition("sd", STORAGE_DESC_TYPE,
Multiplicity.REQUIRED, true, null),
attrDef("owner", DataTypes.STRING_TYPE),
attrDef("createTime", DataTypes.INT_TYPE),
attrDef("lastAccessTime", DataTypes.INT_TYPE),
attrDef("retention", DataTypes.INT_TYPE),
attrDef("viewOriginalText", DataTypes.STRING_TYPE),
attrDef("viewExpandedText", DataTypes.STRING_TYPE),
attrDef("tableType", DataTypes.STRING_TYPE),
attrDef("temporary", DataTypes.BOOLEAN_TYPE),
new AttributeDefinition("columns",
DataTypes.arrayTypeName(COLUMN_TYPE),
Multiplicity.COLLECTION, true, null)
);
Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Repository
• Graph Database
• Titan with storage backed by HBase
• Types and instances are mapped to the Graph DB
• Classes, Structs and Traits map to a vertex
• Relationships are mapped as edges
• Search - plugin enabled
• Indexing based on type annotations
• Solr
• Elastic search
Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Search
• DSL with SQL Like Syntax
• from $type is $trait where $clause select|has $attributes loop $loopExpression withPath, repeat
• Examples
• from DB
• DB where name="Reporting" select name, owner
• DB has name
• DB is JdbcAccess
• Column where Column isa PII
• Table where name="sales_fact", columns
• Table where name="sales_fact", columns as column select column.name, column.dataType,
column.comment
• Full-text search
Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Lineage
• Uses Search DSL Loop expression
• Everything results in search
• Named Queries
• inputs
• Table where (name = "sales_fact_monthly_mv") as src loop (LoadProcess->outputTable
inputTables) as dest select src.name as src_name, dest.name as dest_name withPath
• outputs
• Table where (name = "sales_fact") as src loop (LoadProcess->inputTables outputTables) as dest
select src.name as src_name, dest.name as dest_name withPath
• schema
• Table where name="sales_fact", columns
Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive Integration
Apache Atlas
Hive Bridge
(Client)
Hive Hook
(Post-execution)
REST API
Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Prototype Demonstration
Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo – Class Diagram
Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo – Instance Graph
Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo
• http://162.212.133.230:21000/dashboard/test1/Search.html
Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Questions and Answers

Más contenido relacionado

La actualidad más candente

Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsBoris Otto
 
How to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data QualityHow to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data QualityDATAVERSITY
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityDATAVERSITY
 
Data Audit Approach To Developing An Enterprise Data Strategy
Data Audit Approach To Developing An Enterprise Data StrategyData Audit Approach To Developing An Enterprise Data Strategy
Data Audit Approach To Developing An Enterprise Data StrategyAlan McSweeney
 
Master Data Management – Aligning Data, Process, and Governance
Master Data Management – Aligning Data, Process, and GovernanceMaster Data Management – Aligning Data, Process, and Governance
Master Data Management – Aligning Data, Process, and GovernanceDATAVERSITY
 
Data Governance
Data GovernanceData Governance
Data GovernanceRob Lux
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
Data Governance Best Practices, Assessments, and Roadmaps
Data Governance Best Practices, Assessments, and RoadmapsData Governance Best Practices, Assessments, and Roadmaps
Data Governance Best Practices, Assessments, and RoadmapsDATAVERSITY
 
How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...Christopher Bradley
 
Implementing Effective Data Governance
Implementing Effective Data GovernanceImplementing Effective Data Governance
Implementing Effective Data GovernanceChristopher Bradley
 
The Importance of Metadata
The Importance of MetadataThe Importance of Metadata
The Importance of MetadataDATAVERSITY
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata ManagementDATAVERSITY
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureDATAVERSITY
 
Master Data Management - Gartner Presentation
Master Data Management - Gartner PresentationMaster Data Management - Gartner Presentation
Master Data Management - Gartner Presentation303Computing
 
The Role of Data Governance in a Data Strategy
The Role of Data Governance in a Data StrategyThe Role of Data Governance in a Data Strategy
The Role of Data Governance in a Data StrategyDATAVERSITY
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...DATAVERSITY
 
Becoming a Data-Driven Organization - Aligning Business & Data Strategy
Becoming a Data-Driven Organization - Aligning Business & Data StrategyBecoming a Data-Driven Organization - Aligning Business & Data Strategy
Becoming a Data-Driven Organization - Aligning Business & Data StrategyDATAVERSITY
 

La actualidad más candente (20)

Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management Systems
 
How to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data QualityHow to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data Quality
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data Quality
 
Data Audit Approach To Developing An Enterprise Data Strategy
Data Audit Approach To Developing An Enterprise Data StrategyData Audit Approach To Developing An Enterprise Data Strategy
Data Audit Approach To Developing An Enterprise Data Strategy
 
Master Data Management – Aligning Data, Process, and Governance
Master Data Management – Aligning Data, Process, and GovernanceMaster Data Management – Aligning Data, Process, and Governance
Master Data Management – Aligning Data, Process, and Governance
 
Data Governance
Data GovernanceData Governance
Data Governance
 
8 Steps to Creating a Data Strategy
8 Steps to Creating a Data Strategy8 Steps to Creating a Data Strategy
8 Steps to Creating a Data Strategy
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
Data Governance Best Practices, Assessments, and Roadmaps
Data Governance Best Practices, Assessments, and RoadmapsData Governance Best Practices, Assessments, and Roadmaps
Data Governance Best Practices, Assessments, and Roadmaps
 
How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...
 
Implementing Effective Data Governance
Implementing Effective Data GovernanceImplementing Effective Data Governance
Implementing Effective Data Governance
 
The Importance of Metadata
The Importance of MetadataThe Importance of Metadata
The Importance of Metadata
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data Architecture
 
Master Data Management - Gartner Presentation
Master Data Management - Gartner PresentationMaster Data Management - Gartner Presentation
Master Data Management - Gartner Presentation
 
The Role of Data Governance in a Data Strategy
The Role of Data Governance in a Data StrategyThe Role of Data Governance in a Data Strategy
The Role of Data Governance in a Data Strategy
 
Data Quality
Data QualityData Quality
Data Quality
 
Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...
 
Becoming a Data-Driven Organization - Aligning Business & Data Strategy
Becoming a Data-Driven Organization - Aligning Business & Data StrategyBecoming a Data-Driven Organization - Aligning Business & Data Strategy
Becoming a Data-Driven Organization - Aligning Business & Data Strategy
 

Similar a Data Governance Initiative

Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceHortonworks
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Hortonworks
 
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...DataWorks Summit/Hadoop Summit
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it DataWorks Summit/Hadoop Summit
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
Atlas and ranger epam meetup
Atlas and ranger epam meetupAtlas and ranger epam meetup
Atlas and ranger epam meetupAlex Zeltov
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?DataWorks Summit/Hadoop Summit
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in HadoopMadhan Neethiraj
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoopCraig Jordan
 
Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks
 
Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsDataWorks Summit/Hadoop Summit
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Seetharam Venkatesh
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Victor Holman
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies DataWorks Summit/Hadoop Summit
 
Building a data-driven authorization framework
Building a data-driven authorization frameworkBuilding a data-driven authorization framework
Building a data-driven authorization frameworkDataWorks Summit
 

Similar a Data Governance Initiative (20)

HDP Next: Governance
HDP Next: GovernanceHDP Next: Governance
HDP Next: Governance
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Enterprise Data Classification and Provenance
Enterprise Data Classification and ProvenanceEnterprise Data Classification and Provenance
Enterprise Data Classification and Provenance
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
 
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
 
Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
Atlas and ranger epam meetup
Atlas and ranger epam meetupAtlas and ranger epam meetup
Atlas and ranger epam meetup
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
 
Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration
 
Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
Building a data-driven authorization framework
Building a data-driven authorization frameworkBuilding a data-driven authorization framework
Building a data-driven authorization framework
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Data Governance Initiative

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP: Data Governance We Do Hadoop
  • 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Enterprise Data Governance Goals GOAL: Provide a common approach to data governance across all systems and data within the organization • Transparent Governance standards & protocols must be clearly defined and available to all • Reproducible Recreate the relevant data landscape at a point in time • Auditable All relevant events and assets but be traceable with appropriate historical lineage • Consistent Compliance practices must be consistent ETL/DQ BPM Business Analytics Visualization & Dashboards ERP CRM SCM MDM ARCHIVE Governance Framework
  • 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Governance Challenges WITHIN Hadoop • No comprehensive governance within the Hadoop stack • Mostly disjoint as each project defines its own future and there is no common framework • Disparate tools, such as HCatalog, Ranger and Falcon provide pieces of the overall solution • No integration with external governance frameworks • Difficult to get right because each project is autonomous and you need insight into traditional IT ApachePig ApacheHive ApacheHBase ApacheAccumulo ApacheSolr ApacheSpark ApacheStorm
  • 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Governance Initiative for Hadoop ETL/DQ BPM Business Analytics Visualization & Dashboards ERP CRM SCM MDM ARCHIVE Data Governance Initiative Common Governance Framework 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ApachePig ApacheHive ApacheHBase ApacheAccumulo ApacheSolr ApacheSpark ApacheStorm TWO Requirements 1. Hadoop must snap in to the existing frameworks and be a good citizen 2. Hadoop must also provide governance within its own stack of technologies A group of companies dedicated to meeting these requirements in the open
  • 5. Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Atlas Overview We Do Hadoop
  • 6. Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved New Project Proposal: Apache Atlas Apache Atlas Proposed open source project aimed at solving the Hadoop data governance challenge in the open. Key Capabilities • Data Classification • Metadata Exchange • Centralized Auditing • Search & Lineage (Browse) • Security & Policy Engine Apache Atlas Knowledge Store Audit Store ModelsType-System Policy RulesTaxonomies Tag Based Policies Data Lifecycle Management Real Time Tag Based Access Control REST API Services Search Lineage Exchange Healthcare HIPAA HL7 Financial SOX Dodd-Frank Energy PPDM Retail PCI PII Other CWM Essential Timeline Phase-3 • Collaboration Features • Self Service • Steward Delegation • Profiling & Pattern Analysis • Visualization Phase-2 • Advance audit reporting • Advanced Policy Engine • Row / Column Masking • 3rd party Metadata exchange 1H 2015 GA • Rest API • Centralized Taxonomy • Import / export metadata • Basic Policy Rules Engine • Real-time access control • Column Level Tagging
  • 7. Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Atlas Capabilities: Overview Data Classification • Import or define taxonomy business-oriented annotations for data • Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes • Export metadata to third-party systems Centralized Auditing • Capture security access information for every application, process, and interaction with data • Capture the operational information for execution, steps, and activities Search & Lineage (Browse) • Pre-defined navigation paths to explore the data classification and audit information • Text-based search features locates relevant data and audit event across Data Lake quickly and accurately • Browse visualization of data set lineage allowing users to drill-down into operational, security, and provenance related information Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes • Advanced definition of policies for preventing data derivation based on classification (i.e. re- identification) Apache Atlas Knowledge Store Audit Store ModelsType-System Policy RulesTaxonomies Tag Based Policies Data Lifecycle Management Real Time Tag Based Access Control REST API Services Search Lineage Exchange Healthcare HIPAA HL7 Financial SOX Dodd-Frank Energy PPDM Retail PCI PII Other CWM
  • 8. Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Governance Ready Certification Program Curated group of vendor partners to provide rich & complete features Customers choose features that they want to deploy – a la carte. Low switching costs ! HDP at core to provide stability and interoperability Discovery Tagging Prep / Cleanse ETL Governance BPM Self Service Visual- ization
  • 9. Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Architecture
  • 10. Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High Level Architecture Type System Repository Search DSL Bridge Hive Storm OthersSqoop REST API Titan / HBase Solr/Elastic
  • 11. Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Type System – Overview of Types • Class • Struct • Trait • Primitives • Collections • Map • Array • Instances (Entity) • Referenceable
  • 12. Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Type System – Data Types
  • 13. Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Type System - Examples HierarchicalTypeDefinition<ClassType> tblClsDef = TypesUtil.createClassTypeDef(TABLE_TYPE, null, attrDef("name", DataTypes.STRING_TYPE), attrDef("description", DataTypes.STRING_TYPE), new AttributeDefinition("db", DATABASE_TYPE, Multiplicity.REQUIRED, false, null), new AttributeDefinition("sd", STORAGE_DESC_TYPE, Multiplicity.REQUIRED, true, null), attrDef("owner", DataTypes.STRING_TYPE), attrDef("createTime", DataTypes.INT_TYPE), attrDef("lastAccessTime", DataTypes.INT_TYPE), attrDef("retention", DataTypes.INT_TYPE), attrDef("viewOriginalText", DataTypes.STRING_TYPE), attrDef("viewExpandedText", DataTypes.STRING_TYPE), attrDef("tableType", DataTypes.STRING_TYPE), attrDef("temporary", DataTypes.BOOLEAN_TYPE), new AttributeDefinition("columns", DataTypes.arrayTypeName(COLUMN_TYPE), Multiplicity.COLLECTION, true, null) );
  • 14. Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Repository • Graph Database • Titan with storage backed by HBase • Types and instances are mapped to the Graph DB • Classes, Structs and Traits map to a vertex • Relationships are mapped as edges • Search - plugin enabled • Indexing based on type annotations • Solr • Elastic search
  • 15. Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Search • DSL with SQL Like Syntax • from $type is $trait where $clause select|has $attributes loop $loopExpression withPath, repeat • Examples • from DB • DB where name="Reporting" select name, owner • DB has name • DB is JdbcAccess • Column where Column isa PII • Table where name="sales_fact", columns • Table where name="sales_fact", columns as column select column.name, column.dataType, column.comment • Full-text search
  • 16. Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Lineage • Uses Search DSL Loop expression • Everything results in search • Named Queries • inputs • Table where (name = "sales_fact_monthly_mv") as src loop (LoadProcess->outputTable inputTables) as dest select src.name as src_name, dest.name as dest_name withPath • outputs • Table where (name = "sales_fact") as src loop (LoadProcess->inputTables outputTables) as dest select src.name as src_name, dest.name as dest_name withPath • schema • Table where name="sales_fact", columns
  • 17. Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive Integration Apache Atlas Hive Bridge (Client) Hive Hook (Post-execution) REST API
  • 18. Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Prototype Demonstration
  • 19. Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Demo – Class Diagram
  • 20. Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Demo – Instance Graph
  • 21. Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Demo • http://162.212.133.230:21000/dashboard/test1/Search.html
  • 22. Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Questions and Answers

Notas del editor

  1. A data governance framework in any organization comprises a combination of people, process and technology that are in place to establish decision rights and accountabilities for information. A governance policy defines who can take what actions with what information, and when, under what circumstances, using what methods.   The technology goals for a data governance framework are to provide a platform for a common approach across all systems and data within the organization, Explicitly they need to be: - Transparent: Governance standards & protocols must be clearly defined and available to all - Reproducible: Recreate the relevant data landscape at a point in time - Auditable: All relevant events and assets but be traceable with appropriate historical lineage - Consistent: Compliance practices must be consistent    
  2. One of the most important and challenging goals is consistency and when we inspect implementation of data governance for a Hadoop implementation this consistency seems almost impossible.   While there are constructs within the various Hadoop related projects to aid governance, there is no single, consistent approach to implementation/architecture of this important framework. Disparate tools, such as HCatalog, Ranger and Falcon provide pieces of the overall solution, but are not holistic across the stack in approach.   Further, with Hadoop gaining traction in the enterprise, it has only recently been asked to integrate with existing frameworks such as operations, security and now governance. It is now a foundational concern.    
  3. Consistent with our approach, Hortonworks recently led the development Data Governance Initiative which brings together industry experts from some of the largest enterprise, traditional vendors and Hadoop experts to address governance within Hadoop.   The members have already defined a comprehensive solution for data governance in Hadoop that will provide the base functions of metadata services, deep audit store and an advanced policy rules engine, but also and more importantly, the solution will interoperate with and extend existing third-party data governance and management tools by shedding light on the access of data within Hadoop.   This approach allows organizations to apply comprehensive governance policy across their data.   In order to provide these functions for Hadoop, much work will need to be accomplished throughout all of the Hadoop projects. The initiative is unique in that it is not a spot solution, but will build on top of existing projects such as Apache Falcon for data lifecycle management and Apache Ranger for global security policies. The founding members of this initiative include, Aetna, Target, Merck, SAS and Hortonworks and many more are lined up to contribute. This is a broad community based initiative set to deliver the right features for the right requirements and is perfectly representative of the Hortonworks open community approach.  
  4. Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance intiative and others from the Hadoop community. The core functionality defined by the project includes the following: Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed. Security and Policy Engine – implement engines to protect and rationalize data access and according to compliance policy.
  5. Roll tax into data disco