SlideShare una empresa de Scribd logo
1 de 18
Enabling data management in a
big data world
Craig Soules
Garth Goodson
Tanya Shastri
The problem with data management
• Hadoop is a collection of tools
– Not tightly integrated
– Everyone’s stack looks a little different
– Everything falls back to files
Agenda
• Traditional data management
• Hadoop’s eco-system
• Natero’s approach to data management
What is data management?
• What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
• Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?
Traditional data management
External
Data
Sources
Extract
Transform
Load
DataWarehouse
Integrated storage
Data processing
Users
SQL
Key lessons of traditional systems
• Data requires the right abstraction
– Schemas have value
– Tables are easy to reason about
• Referenced by name, not location
• Narrow interface
– SQL defines the data sources and the processing
• But not where and how the data is kept!
Hadoop eco-system
External
Data
Sources
HDFS storage layer
Processing Framework
(Map-Reduce)
Users
HBase
Sqoop
+
Flume
Pig HiveQL Mahout
Hive
Metastore
(HCatalog)
Oozie
Cloudera
Navigator
Key challenges
External
Data
Sources
HDFS storage layer
Users
Sqoop
+
Flume
More varied data
sources with many
more access / retention
requirements
Processing Framework
(Map-Reduce)
HBase
Pig
Hive
Metastore
(HCatalog)
Oozie
Cloudera
Navigator
HiveQL Mahout
Key challenges
External
Data
Sources
HDFS storage layer
Users
Sqoop
+
Flume
Data accessed through
multiple entry points
Processing Framework
(Map-Reduce)
HBase
Pig
Hive
Metastore
(HCatalog)
Oozie
Cloudera
Navigator
HiveQL Mahout
Key challenges
External
Data
Sources
HDFS storage layer
Users
Sqoop
+
Flume
Processing Framework
(Map-Reduce)
HBase
Pig
Hive
Metastore
(HCatalog)
Oozie
Cloudera
Navigator
Lots of new
consumers of the
data
HiveQL Mahout
Key challenges
External
Data
Sources
HDFS storage layer
Users
Sqoop
+
Flume
Processing Framework
(Map-Reduce)
HBase
Pig
Hive
Metastore
(HCatalog)
Oozie
Cloudera
Navigator
One access control mechanism: files
HiveQL Mahout
Steps to data management
• Provide access at the right level
• Limit the processing interfaces
• Schemas and provenance provide control
• Enforce policy
1
3
2
4
Case study: Natero
• Cloud-based analytics service
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
• Single shared Hadoop eco-system
– Need customer-level isolation and user-level access controls
• Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn’t have to think about policy
• Source-driven policy management
Natero application stack
External
Data
Sources
HDFS storage layer
Processing Framework
(Map-Reduce)
Users
HBase
Sqoop
+
Flume
Pig
Access-aware workflow compiler
Schema
Extraction
Policy
and
Metadata
Manager
Provenance-aware scheduler
HiveQL Mahout
1
3
2
4
Natero execution example
Job
Sources
Job
Compiler
Metadata
Manager
Scheduler
• Fine-grain
access control
• Auditing
• Enforceable policy
• Easy for users
Natero
UI
The right level of abstraction
• Our abstraction comes with trade-offs
– More control, compliance
– No more raw Map-Reduce
• Possible to integrate with Pig/Hive
• What’s the right level of abstraction for you?
– Kinds of execution
Hadoop projects to watch
• HCatalog
– Data discovery / schema management / access
• Falcon
– Lifecycle management / workflow execution
• Knox
– Centralized access control
• Navigator
– Auditing / access management
Lessons learned
• If you want control over your data, you also
need control over data processing
• File-based access control is not enough
• Metadata is crucial
• Users aren’t motivated by policy
– Policy shouldn’t get in the way of use
– But you might get IT to reason about the sources

Más contenido relacionado

Más de DataWorks Summit

Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesDataWorks Summit
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentDataWorks Summit
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteDataWorks Summit
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraDataWorks Summit
 

Más de DataWorks Summit (20)

Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart Cities
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake Environment
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science Institute
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 

Último

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Enabling Data Management in a Big Data World

  • 1. Enabling data management in a big data world Craig Soules Garth Goodson Tanya Shastri
  • 2. The problem with data management • Hadoop is a collection of tools – Not tightly integrated – Everyone’s stack looks a little different – Everything falls back to files
  • 3. Agenda • Traditional data management • Hadoop’s eco-system • Natero’s approach to data management
  • 4. What is data management? • What do you have? – What data sets exist? – Where are they stored? – What properties do they have? • Are you doing the right thing with it? – Who can access data? – Who has accessed data? – What did they do with it? – What rules apply to this data?
  • 6. Key lessons of traditional systems • Data requires the right abstraction – Schemas have value – Tables are easy to reason about • Referenced by name, not location • Narrow interface – SQL defines the data sources and the processing • But not where and how the data is kept!
  • 7. Hadoop eco-system External Data Sources HDFS storage layer Processing Framework (Map-Reduce) Users HBase Sqoop + Flume Pig HiveQL Mahout Hive Metastore (HCatalog) Oozie Cloudera Navigator
  • 8. Key challenges External Data Sources HDFS storage layer Users Sqoop + Flume More varied data sources with many more access / retention requirements Processing Framework (Map-Reduce) HBase Pig Hive Metastore (HCatalog) Oozie Cloudera Navigator HiveQL Mahout
  • 9. Key challenges External Data Sources HDFS storage layer Users Sqoop + Flume Data accessed through multiple entry points Processing Framework (Map-Reduce) HBase Pig Hive Metastore (HCatalog) Oozie Cloudera Navigator HiveQL Mahout
  • 10. Key challenges External Data Sources HDFS storage layer Users Sqoop + Flume Processing Framework (Map-Reduce) HBase Pig Hive Metastore (HCatalog) Oozie Cloudera Navigator Lots of new consumers of the data HiveQL Mahout
  • 11. Key challenges External Data Sources HDFS storage layer Users Sqoop + Flume Processing Framework (Map-Reduce) HBase Pig Hive Metastore (HCatalog) Oozie Cloudera Navigator One access control mechanism: files HiveQL Mahout
  • 12. Steps to data management • Provide access at the right level • Limit the processing interfaces • Schemas and provenance provide control • Enforce policy 1 3 2 4
  • 13. Case study: Natero • Cloud-based analytics service – Enable business users to take advantage of big data – UI-driven workflow creation and automation • Single shared Hadoop eco-system – Need customer-level isolation and user-level access controls • Goals: – Provide the appropriate level of abstraction for our users – Finer granularity of access control – Enable policy enforcement – Users shouldn’t have to think about policy • Source-driven policy management
  • 14. Natero application stack External Data Sources HDFS storage layer Processing Framework (Map-Reduce) Users HBase Sqoop + Flume Pig Access-aware workflow compiler Schema Extraction Policy and Metadata Manager Provenance-aware scheduler HiveQL Mahout 1 3 2 4
  • 15. Natero execution example Job Sources Job Compiler Metadata Manager Scheduler • Fine-grain access control • Auditing • Enforceable policy • Easy for users Natero UI
  • 16. The right level of abstraction • Our abstraction comes with trade-offs – More control, compliance – No more raw Map-Reduce • Possible to integrate with Pig/Hive • What’s the right level of abstraction for you? – Kinds of execution
  • 17. Hadoop projects to watch • HCatalog – Data discovery / schema management / access • Falcon – Lifecycle management / workflow execution • Knox – Centralized access control • Navigator – Auditing / access management
  • 18. Lessons learned • If you want control over your data, you also need control over data processing • File-based access control is not enough • Metadata is crucial • Users aren’t motivated by policy – Policy shouldn’t get in the way of use – But you might get IT to reason about the sources

Notas del editor

  1. Data is abstracted to tablesSQL is a narrow interface that describes processingTaken a while for traditional systems to get hereStructured data world has evolved into the enterprise… molded to fit its needs
  2. Pig, HiveQL, Mahout, MR: different processing interfacesOozie: Workflow dependency management and automationCloudera Navigator: centralized access control and auditingHive Metastore / Hcatalog: centralized data access and schema management
  3. Pig, HiveQL, Mahout, MR: different processing interfacesOozie: Workflow dependency management and automationCloudera Navigator: centralized access control and auditingHive Metastore / Hcatalog: centralized data access and schema management
  4. Pig, HiveQL, Mahout, MR: different processing interfacesOozie: Workflow dependency management and automationCloudera Navigator: centralized access control and auditingHive Metastore / Hcatalog: centralized data access and schema management
  5. Don’t need to be so restrictive on the use, because it scales out… the DW might get over-provisioned.More people doing different kinds of things… more shift towards exploration.People are using Hadoop as an ETL into something else, but what policies should be in place for that data when it goes into the DW?
  6. MAKE THIS SLIDE ABOUT THE FACT THAT THERE’S ONE ACCESS CONTROL MECHANISMS: FILES. IF YOU HAVE ACCESS TO IT, THEN YOU DO. CONTRAST THAT TO THE DW WORLD WHERE IF YOU HAVE ACCESS TO THE TABLE, YOU STILL DON’T HAVE ACCESS TO THE STORAGE. THERE’S AN ABSTRACTION THAT FORCES EVERYTHING TO PASS THROUGH THE PROCESSING LAYER.These tools live within an open eco-system, which prevents them from being complete solutions (at least today). And the shared access control mechanisms mean that control beyond file-read access is difficult / impossible.Workflows make the data very stepped.
  7. Step 1: Integrate processing and storage Prevents direct access to storage… add access has to go through the data processing toolsWithout doing this you can’t: effectively track data as it moves through the system, enable fine-grained access control or source-based access controlExample: Oracle + NetApp w/ single oracle userStep 2: Limit the interfacesThe many entry-points to Hadoop increase the IT complexity of data management, and in some cases completely disable step 1. Reducing the entry points provides a way to control access and properly track behavior.Step 3: Metadata collection and trackingUnderstanding what is in your data enables the use of fine-grained access control, prevents the incorrect declassification of dataProvenance is crucial to effective policy enforcementExample: table schemas, logging: column-level access controlStep 4: Policy enforcementWe apply policy to the sources and allow those policies to cascade through the provenance to the result data. There’s an enormous body of work on this, but currently there’s nothing in Hadoop enables this. (Handful of companies) (e.g., Cloudera) is working on some stuff, but it’s still under wraps.
  8. Pig, HiveQL, Mahout, MR: different processing interfacesOozie: Workflow dependency management and automationCloudera Navigator: centralized access control and auditingHive Metastore / Hcatalog: centralized data access and schema managementWatch what’s said to make it clear that the devs still have control over who sees what and what runs, its more about helping them enforce it.Put the numbers from the steps onto the diagram