Enabling Data Management in a Big Data World

Hadoop has enabled a new scale of data processing that is paving the way for data-driven businesses. However, business data is often riddled with compliance and regulatory requirements that can be easily lost as data is manipulated, transformed, and re-written within the Hadoop eco-system. Furthermore, enterprise data is often scattered across a wide array of systems, each with its own techniques for policy management. As data from these disparate systems is loaded into Hadoop, all of the carefully crafted policy is immediately lost, creating a potential risk for the business. Data provenance is widely recognized as a technique for applying policy in more traditional industries such as storage, databases and high-performance computing. By tracking data from its origin and across various transformations and computations, provenance tracking systems can answer questions such as: Who has seen a given piece of data? Where did this data come from? What policies existed on this data? In this talk, we will discuss traditional data management solutions, the challenges of bringing them to an eco-system like Hadoop, and approaches to enable data management in the growing Big Data world.

Enabling data management in a big data world
Craig Soules
Garth Goodson
Tanya Shastri

The problem with data management
• Hadoop is a collection of tools
– Not tightly integrated
– Everyone’s stack looks a little different
– Everything falls back to files

Agenda
• Traditional data management
• Hadoop’s eco-system
• Natero’s approach to data management

What is data management?
• What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
• Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?

Traditional data management
[Diagram: external data sources; extract, transform, load; data warehouse (integrated storage and data processing); users; SQL.]

Key lessons of traditional systems
• Data requires the right abstraction
– Schemas have value
– Tables are easy to reason about
• Referenced by name, not location
• Narrow interface
– SQL defines the data sources and the processing
• But not where and how the data is kept!
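
As a minimal sketch of "referenced by name, not location", the hypothetical catalog below lets callers look up a table's schema by name while the storage path stays internal. The names `Catalog` and `TableEntry` and the example paths are illustrative assumptions, not part of any system described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """Catalog record for one table (hypothetical structure)."""
    name: str
    columns: tuple   # column names: the schema users reason about
    location: str    # physical storage path, never returned to callers


class Catalog:
    """Maps table names to entries so callers never handle storage paths."""

    def __init__(self):
        self._tables = {}

    def register(self, name, columns, location):
        self._tables[name] = TableEntry(name, tuple(columns), location)

    def describe(self, name):
        """Return only the schema; the location stays internal."""
        return self._tables[name].columns

    def relocate(self, name, new_location):
        """Move the data without changing how users refer to it."""
        old = self._tables[name]
        self._tables[name] = TableEntry(old.name, old.columns, new_location)


catalog = Catalog()
catalog.register("orders", ["order_id", "customer", "total"], "/warehouse/a/orders")
catalog.relocate("orders", "/warehouse/b/orders")
print(catalog.describe("orders"))  # ('order_id', 'customer', 'total')
```

Because users only ever hold the name "orders", relocating the data changes nothing on their side, which is the property the slide attributes to the warehouse model.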

Hadoop eco-system
[Diagram: external data sources; Sqoop + Flume; HDFS storage layer; HBase; processing framework (Map-Reduce); Pig, HiveQL, Mahout; Hive Metastore (HCatalog); Oozie; Cloudera Navigator; users.]

Key challenges
• More varied data sources with many more access / retention requirements
[Diagram: the Hadoop eco-system components, annotated with the challenge above.]

Key challenges
• Data accessed through multiple entry points
[Diagram: the Hadoop eco-system components, annotated with the challenge above.]

Key challenges
• Lots of new consumers of the data
[Diagram: the Hadoop eco-system components, annotated with the challenge above.]

Steps to data management
1. Provide access at the right level
2. Limit the processing interfaces
3. Schemas and provenance provide control
4. Enforce policy
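
The last two steps can be sketched as follows: policies are attached only to source data sets, and every derived data set inherits the union of the policies on the sources it was computed from. This is a minimal illustration under stated assumptions; `ProvenanceGraph`, the data set names and the policy labels are hypothetical and do not describe Natero's actual implementation.

```python
class ProvenanceGraph:
    """Records which data sets each data set was derived from (hypothetical)."""

    def __init__(self):
        self._parents = {}    # data set name -> set of direct input names
        self._policies = {}   # source data set name -> set of policy labels

    def add_source(self, name, policies):
        """Register an external source and the policies that apply to it."""
        self._parents[name] = set()
        self._policies[name] = set(policies)

    def add_derived(self, name, inputs):
        """Register a data set computed from already-known inputs."""
        missing = [i for i in inputs if i not in self._parents]
        if missing:
            raise KeyError(f"unknown inputs: {missing}")
        self._parents[name] = set(inputs)

    def effective_policies(self, name):
        """Union of the policies on every source reachable from `name`."""
        result = set(self._policies.get(name, ()))
        for parent in self._parents[name]:
            result |= self.effective_policies(parent)
        return result


graph = ProvenanceGraph()
graph.add_source("crm_export", {"pii"})
graph.add_source("web_logs", {"retain_90_days"})
graph.add_derived("sessions", ["web_logs"])
graph.add_derived("customer_sessions", ["crm_export", "sessions"])
print(sorted(graph.effective_policies("customer_sessions")))
# ['pii', 'retain_90_days']
```

Inputs must be registered before anything is derived from them, so the graph stays acyclic and the recursive walk terminates. Users who build `customer_sessions` never declare a policy themselves; it follows from the sources.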

Case study: Natero
• Cloud-based analytics service
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
• Single shared Hadoop eco-system
– Need customer-level isolation and user-level access controls
• Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn’t have to think about policy
• Source-driven policy management

Natero application stack
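[Diagram: the Hadoop stack (external data sources; Sqoop + Flume; HDFS storage layer; HBase; processing framework (Map-Reduce); Pig, HiveQL, Mahout; users) with the Natero components added: access-aware workflow compiler, schema extraction, policy and metadata manager, provenance-aware scheduler. The components are labeled with step numbers 1 to 4.]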

Natero execution example
• Fine-grain access control
• Auditing
• Enforceable policy
• Easy for users
[Diagram: Natero UI; job; sources; job compiler; metadata manager; scheduler.]

The right level of abstraction
• Our abstraction comes with trade-offs
– More control, compliance
– No more raw Map-Reduce
• Possible to integrate with Pig/Hive
• What’s the right level of abstraction for you?
– Kinds of execution

Hadoop projects to watch
• HCatalog
– Data discovery / schema management / access
• Falcon
– Lifecycle management / workflow execution
• Knox
– Centralized access control
• Navigator
– Auditing / access management

Lessons learned
• If you want control over your data, you also need control over data processing
• File-based access control is not enough
• Metadata is crucial
• Users aren’t motivated by policy
– Policy shouldn’t get in the way of use
– But you might get IT to reason about the sources
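
To illustrate why file-based access control is not enough, the sketch below routes every read through a processing layer that checks column-level grants and records an audit entry. A file permission could only allow or deny the whole data set. The class `ProcessingLayer`, the user name and the columns are hypothetical, chosen only for illustration.

```python
class AccessDenied(Exception):
    """Raised when a user requests a column they have not been granted."""


class ProcessingLayer:
    """Single entry point to the data: enforces column grants and audits reads."""

    def __init__(self, rows, grants):
        self._rows = rows        # list of dicts, one per record
        self._grants = grants    # user -> set of readable column names
        self.audit_log = []      # (user, requested columns, allowed?) tuples

    def select(self, user, columns):
        allowed = self._grants.get(user, set())
        denied = [c for c in columns if c not in allowed]
        # Log every attempt, successful or not, before returning or raising.
        self.audit_log.append((user, tuple(columns), not denied))
        if denied:
            raise AccessDenied(f"{user} may not read {denied}")
        return [{c: row[c] for c in columns} for row in self._rows]


rows = [{"customer": "acme", "total": 10, "card_number": "4111-0000"}]
layer = ProcessingLayer(rows, {"analyst": {"customer", "total"}})

print(layer.select("analyst", ["customer", "total"]))
try:
    layer.select("analyst", ["card_number"])
except AccessDenied as err:
    print("denied:", err)
```

The audit log answers "who has accessed data?" and "what did they do with it?" only because no path to the rows bypasses `select`, which mirrors the first lesson above: control over data requires control over processing.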

Key challenges (slide 11)
• One access control mechanism: files
[Diagram: the Hadoop eco-system components, annotated with the challenge above.]

Editor's notes

Data is abstracted to tables.
SQL is a narrow interface that describes processing.
Taken a while for traditional systems to get here.
Structured data world has evolved into the enterprise… molded to fit its needs.
Pig, HiveQL, Mahout, MR: different processing interfaces.
Oozie: workflow dependency management and automation.
Cloudera Navigator: centralized access control and auditing.
Hive Metastore / HCatalog: centralized data access and schema management.
Don’t need to be so restrictive on the use, because it scales out… the DW might get over-provisioned.
More people doing different kinds of things… more shift towards exploration.
People are using Hadoop as an ETL into something else, but what policies should be in place for that data when it goes into the DW?
Make this slide about the fact that there is one access control mechanism: files. If you have access to the file, you have access to the data. Contrast that with the DW world, where having access to a table still does not give you access to the storage. There’s an abstraction that forces everything to pass through the processing layer.
These tools live within an open eco-system, which prevents them from being complete solutions (at least today). And the shared access control mechanisms mean that control beyond file-read access is difficult or impossible.
Workflows make the data very stepped.
Step 1: Integrate processing and storage
Prevents direct access to storage… all access has to go through the data processing tools.
Without doing this you can’t effectively track data as it moves through the system, or enable fine-grained or source-based access control.
Example: Oracle + NetApp w/ single Oracle user.
Step 2: Limit the interfaces
The many entry points to Hadoop increase the IT complexity of data management, and in some cases completely disable step 1. Reducing the entry points provides a way to control access and properly track behavior.
Step 3: Metadata collection and tracking
Understanding what is in your data enables the use of fine-grained access control and prevents the incorrect declassification of data.
Provenance is crucial to effective policy enforcement.
Example: table schemas, logging: column-level access control.
Step 4: Policy enforcement
We apply policy to the sources and allow those policies to cascade through the provenance to the result data. There’s an enormous body of work on this, but currently nothing in Hadoop enables it. A handful of companies (e.g., Cloudera) are working on some stuff, but it’s still under wraps.
Watch what’s said to make it clear that the devs still have control over who sees what and what runs; it’s more about helping them enforce it.
Put the numbers from the steps onto the diagram.
