Innovation in the Data Warehouse
Kit Menke, Software Architect
StampedeCon 2016
July 27, 2016
Agenda
▪ Use Case
▪ Architectures
▪ Decision Points
Enterprise Holdings, Inc.
▪ Our Business
• 9 thousand locations
• 80 countries
• 93 thousand employees
• 1.7 million vehicles
▪ Data Warehouse
• Near capacity: more than 75 of 80 terabytes used
• Streaming and batch data feeds from over 50 internal systems &
external sources
• 100+ databases and 22+ thousand tables
• Around 1 billion queries executed per month
• Over 45,000 reporting users with 5+ million report executions
every month.
• Statistical Modeling & Advanced Analytics - 40+ Projects
Implemented for Predictive & Diagnostic Analytics
Data Warehouse - Present
Data Warehouse Growth
Challenges – Current Platform
▪ System Capacity Constraints
• Overall Current System Utilization is High
• Space & CPU Constraints
• Most of these challenges can be overcome by adding
more Teradata capacity or doing augmentation
▪ Use cases that are not a good fit for the Teradata EDW
• Unstructured data
• Source structures changing frequently
• Data for exploration, discovery, & analytics
• Staging, transient, & history data
• These challenges can be overcome by augmentation
▪ Bottom line: Improved agility & greater value
Augmentation Recommendation: Hadoop
▪ Leverage Hadoop to complement Teradata
EDW
• Hybrid Approach
▪ The Hortonworks distribution of Hadoop
• Compatibility/integration with Teradata EDW to
achieve high degree of interoperability
▪ Intent is not to have a centralized Hadoop
service
• EDW Augmentation Only
Data Warehouse - Future
Architectures
▪ Data warehouse augmentation contains
streaming and batch use cases
▪ Three Big Data architectures to explore:
1. Batch
2. Lambda
3. Kappa
Batch
▪ Land data into Hadoop first
▪ ETL in Hadoop to build reporting tables and
publish to Teradata
▪ Archive old data from Teradata DB
▪ Data available for analysis in Hive
▪ Great for semi-structured data files
▪ But… too slow for streaming data
Lambda
▪ Attempts to combine batch and streaming
to get benefits from both
▪ Batch layer is comprehensive and accurate
▪ Streaming layer is fast but might only be
able to keep recent data
▪ Potentially have to maintain two codebases
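The Lambda split above can be illustrated with a minimal Python sketch. It is not the implementation from the talk: the event shape, the `location` field, and the function names are all hypothetical, and in-memory counters stand in for real batch and streaming stores. The sketch shows a query merging a complete batch view with a recent-only speed view, and why the same logic tends to be written twice.

```python
from collections import Counter

def build_batch_view(master_events):
    """Batch layer: recompute a complete view from all stored events."""
    return Counter(event["location"] for event in master_events)

def update_speed_view(speed_view, event):
    """Speed layer: incrementally update a view of recent events only."""
    speed_view[event["location"]] += 1

def query(batch_view, speed_view, location):
    """Serving side: merge the batch result with the recent speed result."""
    return batch_view[location] + speed_view[location]

# Events already covered by the last batch run (hypothetical data).
historical = [{"location": "STL"}, {"location": "STL"}, {"location": "ORD"}]
batch_view = build_batch_view(historical)

# Events that arrived after the last batch run.
speed_view = Counter()
for event in [{"location": "STL"}, {"location": "ORD"}]:
    update_speed_view(speed_view, event)

print(query(batch_view, speed_view, "STL"))
```

Note that `build_batch_view` and `update_speed_view` both encode the same counting rule, which is the "two codebases" cost in miniature.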
Kappa
▪ Everything is a stream (no batch!)
▪ Depends largely on your log data store,
usually Kafka
▪ All raw data is stored in Kafka
▪ Much simpler architecture than lambda
• New version? Redeploy the app, reprocess from the
start of the log, and generate a new output table
• Once complete, point app to new output table
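The reprocessing flow above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a plain list stands in for the Kafka log, a dictionary stands in for the output tables, and the event fields and `process_v1`/`process_v2` functions are hypothetical.

```python
# Append-only list standing in for a Kafka topic (assumption for this sketch).
log = [
    {"rental_id": 1, "miles": 120},
    {"rental_id": 2, "miles": 45},
    {"rental_id": 3, "miles": 300},
]

def process_v1(event):
    """Original version of the stream processing logic."""
    return {"rental_id": event["rental_id"], "miles": event["miles"]}

def process_v2(event):
    """New version: same input, adds a derived kilometres field."""
    row = process_v1(event)
    row["km"] = round(event["miles"] * 1.609344, 1)
    return row

def reprocess(events, process, start_offset=0):
    """Replay the log from start_offset and build a fresh output table."""
    return [process(event) for event in events[start_offset:]]

# Version 1 is live and readers are pointed at its output table.
tables = {"output_v1": reprocess(log, process_v1)}
serving = "output_v1"

# Deploy version 2: replay from the start into a new table while
# the old table keeps serving reads.
tables["output_v2"] = reprocess(log, process_v2)

# Once the replay is complete, point readers at the new table.
serving = "output_v2"
print(tables[serving][0])
```

There is only one processing path here, which is the simplification Kappa offers over Lambda; the trade-off is that the log must retain enough raw data to replay.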
Choosing an Architecture
▪ Batch – process data in batches
• All data processed in batches to create an
output
▪ Lambda – split streaming data into batch
and real-time
• Stream processing for the data you need fast
and the rest is batch processed
▪ Kappa – everything is a stream
• All data is processed as a stream even when it
needs to be reprocessed
Implementing an Architecture
▪ Requirements for the use case drive
architecture
▪ Walk through decision points
1. Cloud or on premises
2. Physical or virtual machines
3. Cluster workload
▪ Plus others!
Cloud vs on premises
▪ Scalability
• Much easier to scale a Cloud solution
• Physical hardware requires an infrastructure team to manage
▪ Data source location (data gravity) / integration points
• Cluster should be as close as possible to your data source
• Cloud is good option for internet data sources
▪ Cloud offerings
• Hadoop: Azure HDInsight, Amazon EMR, Google Cloud
• Integration with other PaaS services
▪ Network
• Bandwidth to/from cloud implementation
Physical vs virtual
▪ Performance
• Physical hardware will perform better; Hadoop is
designed with physical hardware in mind
▪ Maintenance
• No hardware to maintain for virtual servers
▪ Time to market
• Virtual machines much faster to provision
• For physical hardware, if the infrastructure team is a
roadblock, then an appliance is a good option instead of
commodity hardware
▪ Development and test environments make more
sense to virtualize
Workload
▪ Streaming
• Running 24/7
• Need dedicated resources
▪ Batch
• Scheduled
• Periods of high utilization (scalability)
▪ Multi-Tenancy
• Blended workloads
• YARN (queues, node labels)
• Think about isolating nodes for real-time
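One way to reason about a blended workload is to plan queue shares before configuring them. The Python sketch below is only a planning aid, not YARN configuration: the queue names and percentages are hypothetical, and it assumes the Capacity Scheduler convention that the percentage capacities of sibling queues add up to 100.

```python
def validate_queue_capacities(queues):
    """Return the plan unchanged if sibling capacities total 100 percent."""
    total = sum(queues.values())
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    return queues

# Hypothetical split: streaming gets a guaranteed share because it runs 24/7,
# scheduled batch jobs and ad hoc work share the remainder.
plan = validate_queue_capacities(
    {"streaming": 40.0, "batch": 50.0, "adhoc": 10.0}
)

for name, percent in plan.items():
    print(f"{name}: {percent}% of cluster resources")
```

Queue shares alone do not isolate nodes; node labels would be the mechanism for pinning real-time work to dedicated machines.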
Other considerations
▪ Disaster recovery
• Data is locally redundant
• Backups not usually required unless you need geo-redundancy
▪ Security - Many different things to secure!
• Kerberos for user, service, and host authentication
• Authorization: Apache Ranger (Hortonworks) or Apache Sentry
(Cloudera) or MapR Control System
• Network isolation for Hadoop services
• Data at rest (HDFS encryption)
▪ Hadoop Distribution - Race to include the most Apache projects
• Top 3: Hortonworks, Cloudera, MapR
• Big companies with Hadoop offering:
– Teradata Hadoop aka TDH (Hortonworks, Cloudera, MapR)
– Oracle Big Data Appliance (Cloudera)
Spectrum of Options
▪ Cloud PaaS
• No hardware or software to manage
• Amazon S3, Azure Data Lake
▪ Cloud
• Weird space between IaaS and PaaS
• Amazon EMR
• HDInsight is more PaaS
▪ Cloud IaaS
• All virtual, no hardware to manage
• You manage all software
▪ Third party hosted
• Rackspace
• Software managed by you
▪ Appliance
• Infrastructure handled for you
• Dell, HP, Cisco, Teradata, Oracle
• Software (varies depending on vendor)
▪ Commodity
• DIY
Lessons Learned
▪ Workload isolation is hard
• Multi-tenancy is possible
• Takes work to make sure batch jobs don’t impact
the real-time streaming processes
▪ Things we like: Hive, HBase
▪ Things we don’t like: Solr, debugging
▪ Debugging / development is hard
• Lots of moving pieces
• Logs spread out across many machines
• Development environments require a lot of software
• Distributed systems just work differently
Questions?
▪ Hortonworks Community
• https://community.hortonworks.com/answers/index.html
▪ Kit Menke
• @kitmenke on Twitter
Resources
▪ Lambda Architecture
• http://lambda-architecture.net/
▪ Kappa Architecture
• http://kappa-architecture.com/
▪ Kappa Architecture - Our Experience by ASPgems
• http://events.linuxfoundation.org/sites/events/files/slides/ASPgems%20-%20Kappa%20Architecture.pdf
▪ Apache Hadoop YARN – Multi-Tenancy, Capacity
Scheduler & Preemption - StampedeCon 2015
• http://www.slideshare.net/StampedeCon/apache-hadoop-yarn-multitenancy-capacity-scheduler-preemption-stampedecon-2015

Más contenido relacionado

La actualidad más candente

Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
DataWorks Summit
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
Capgemini
 

La actualidad más candente (19)

Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
 
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
The EDW Ecosystem
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data Discovery
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma Presentation
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Data-In-Motion Unleashed
Data-In-Motion UnleashedData-In-Motion Unleashed
Data-In-Motion Unleashed
 

Destacado

The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
StampedeCon
 

Destacado (20)

Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Intuit Analytics Cloud 101
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101
 
Driving Innovation with Open Data
Driving Innovation with Open DataDriving Innovation with Open Data
Driving Innovation with Open Data
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
 
Floods of Twitter Data - StampedeCon 2016
Floods of Twitter Data - StampedeCon 2016Floods of Twitter Data - StampedeCon 2016
Floods of Twitter Data - StampedeCon 2016
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
 
Visualizing Big Data – The Fundamentals
Visualizing Big Data – The FundamentalsVisualizing Big Data – The Fundamentals
Visualizing Big Data – The Fundamentals
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid Cloud
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Insurtech presentation
Insurtech presentationInsurtech presentation
Insurtech presentation
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Voldemort : Prototype to Production
Voldemort : Prototype to ProductionVoldemort : Prototype to Production
Voldemort : Prototype to Production
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Interplay of Big Data and IoT - StampedeCon 2016
Interplay of Big Data and IoT - StampedeCon 2016Interplay of Big Data and IoT - StampedeCon 2016
Interplay of Big Data and IoT - StampedeCon 2016
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016
 

Similar a Innovation in the Data Warehouse - StampedeCon 2016

Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 

Similar a Innovation in the Data Warehouse - StampedeCon 2016 (20)

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Piranha vs. mammoth predator appliances that chew up big data
Piranha vs. mammoth   predator appliances that chew up big dataPiranha vs. mammoth   predator appliances that chew up big data
Piranha vs. mammoth predator appliances that chew up big data
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 

Más de StampedeCon

Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
StampedeCon
 

Más de StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Resource Management in Impala - StampedeCon 2016
Resource Management in Impala - StampedeCon 2016Resource Management in Impala - StampedeCon 2016
Resource Management in Impala - StampedeCon 2016
 
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Innovation in the Data Warehouse - StampedeCon 2016

  • 1. Innovation in the Data Warehouse Kit Menke, Software Architect StampedeCon 2016 July 27, 2016
  • 2. Agenda ▪ Use Case ▪ Architectures ▪ Decision Points
  • 3. Enterprise Holdings, Inc. ▪ Our Business • 9 thousand locations • 80 countries • 93 thousand employees • 1.7 million vehicles ▪ Data Warehouse • Near capacity: Used about 75+ of 80 Terabytes • Streaming and batch data feeds from over 50 internal systems & external sources • 100+ databases and 22+ thousand tables • Around 1 billion queries executed per month • Over 45,000 reporting users with 5+ million report executions every month. • Statistical Modeling & Advanced Analytics - 40+ Projects Implemented for Predictive & Diagnostic Analytics
  • 6. Challenges – Current Platform ▪ System Capacity Constraints • Overall Current System Utilization is High • Space & CPU Constraints • Most of these challenges can be overcome by adding more Teradata capacity or doing augmentation ▪ Use Cases not good fit for Teradata EDW • Unstructured data • Source structures changing frequently • Data for exploration, discovery, & analytics • Staging, transient, & history data • These challenges can be overcome by augmentation ▪ Bottom-line: Improved agility & greater value
  • 7. Augmentation Recommendation: Hadoop ▪ Leverage Hadoop to complement Teradata EDW • Hybrid Approach ▪ The Hortonworks distribution of Hadoop • Compatibility/integration with Teradata EDW to achieve high degree of interoperability ▪ Intent is not to have a centralized Hadoop service • EDW Augmentation Only 7
  • 9. Architectures ▪ Data warehouse augmentation contains streaming and batch use cases ▪ Three Big Data architectures to explore: 1. Batch 2. Lambda 3. Kappa
  • 10. Batch
  • 11. Batch ▪ Land data into Hadoop first ▪ ETL in Hadoop to build reporting tables and publish to Teradata ▪ Archive old data from Teradata DB ▪ Data available for analysis in Hive ▪ Great for semi-structured data files ▪ But… too slow for streaming data
  • 13. Lambda ▪ Attempts to combine batch and streaming to get benefits from both ▪ Batch layer is comprehensive and accurate ▪ Streaming layer is fast but might only be able to keep recent data ▪ Potentially have to maintain two codebases
  • 14. Kappa
  • 15. Kappa ▪ Everything is a stream (no batch!) ▪ Depends largely on your log data store (usually Kafka) ▪ All raw data is stored in Kafka ▪ Much simpler architecture than Lambda • New version? Re-deploy the app, reprocess from the start of the log, and generate a new output table • Once complete, point the app to the new output table
  • 16. Choosing an Architecture ▪ Batch – process data in batches • All data processed in batches to create an output ▪ Lambda – split streaming data into batch and real-time • Stream processing for the data you need fast and the rest is batch processed ▪ Kappa – everything is a stream • All data is processed as a stream even when it needs to be reprocessed
  • 17. Implementing an Architecture ▪ Requirements for the use case drive the architecture ▪ Walk through decision points 1. Cloud or on premises 2. Physical or virtual machines 3. Cluster workload ▪ Plus others!
  • 18. Cloud vs on premises ▪ Scalability • Much easier to scale a cloud solution • Physical hardware requires an infrastructure team to manage ▪ Data source location (data gravity) / integration points • Cluster should be as close as possible to your data source • Cloud is a good option for internet data sources ▪ Cloud offerings • Hadoop: Azure HDInsight, Amazon EMR, Google Cloud • Integration with other PaaS services ▪ Network • Bandwidth to/from cloud implementation
  • 19. Physical vs virtual ▪ Performance • Physical hardware will perform better; Hadoop is designed with physical hardware in mind ▪ Maintenance • No hardware to maintain for virtual servers ▪ Time to market • Virtual machines are much faster to provision • For physical hardware, if the infrastructure team is a roadblock, an appliance is a good option instead of commodity hardware ▪ Development and test environments make more sense to virtualize
  • 20. Workload ▪ Streaming • Running 24/7 • Need dedicated resources ▪ Batch • Scheduled • Periods of high utilization (scalability) ▪ Multi-Tenancy • Blended workloads • YARN (queues, node labels) • Think about isolating nodes for real-time
  • 21. Other considerations ▪ Disaster recovery • Data is locally redundant • Backups not usually required unless you need geo-redundancy ▪ Security - Many different things to secure! • Kerberos for user, service, and host authentication • Authorization: Apache Ranger (Hortonworks) or Apache Sentry (Cloudera) or MapR Control System • Network isolation for Hadoop services • Data at rest (HDFS encryption) ▪ Hadoop Distribution - Race to include the most Apache projects • Top 3: Hortonworks, Cloudera, MapR • Big companies with Hadoop offerings: – Teradata Hadoop aka TDH (Hortonworks, Cloudera, MapR) – Oracle Big Data Appliance (Cloudera)
  • 22. Spectrum of Options ▪ Cloud PaaS • No hardware or software to manage • Amazon S3, Azure Data Lake ▪ Cloud • Weird space between IaaS and PaaS • Amazon EMR • HDInsight is more PaaS ▪ Cloud IaaS • All virtual, no hardware to manage • You manage all software ▪ Third party hosted • Rackspace • Software managed by you ▪ Appliance • Infrastructure handled for you • Dell, HP, Cisco, Teradata, Oracle • Software (varies depending on vendor) ▪ Commodity • DIY
  • 23. Lessons Learned ▪ Workload isolation is hard • Multi-tenancy is possible • Takes work to make sure batch jobs don’t impact the real-time streaming processes ▪ Things we like: Hive, HBase ▪ Things we don’t like: Solr, debugging ▪ Debugging / development is hard • Lots of moving pieces • Logs spread out across many machines • Development environments require a lot of software • Distributed systems just work differently
  • 24. Questions? ▪ Hortonworks Community • https://community.hortonworks.com/answers/index.html ▪ Kit Menke • @kitmenke on Twitter
  • 25. Resources ▪ Lambda Architecture • http://lambda-architecture.net/ ▪ Kappa Architecture • http://kappa-architecture.com/ ▪ Kappa Architecture - Our Experience by ASPgems • http://events.linuxfoundation.org/sites/events/files/slides/ASPgems%20-%20Kappa%20Architecture.pdf ▪ Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - StampedeCon 2015 • http://www.slideshare.net/StampedeCon/apache-hadoop-yarn-multitenancy-capacity-scheduler-preemption-stampedecon-2015
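
The Kappa reprocessing flow from slide 15 (re-deploy, replay the log from the start into a new output table, then switch over) can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the talk: the in-memory `RAW_LOG` list stands in for a Kafka topic, and the event fields, job functions, and table names are all hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Table = Dict[str, int]

# Hypothetical stand-in for a Kafka topic: an append-only log of raw events.
RAW_LOG: List[Event] = [
    {"key": "a", "value": 3},
    {"key": "a", "value": 2},
    {"key": "b", "value": 5},
]


def job_v1(table: Table, event: Event) -> None:
    """Version 1 of the streaming job: sum of values per key."""
    key = str(event["key"])
    table[key] = table.get(key, 0) + int(event["value"])


def job_v2(table: Table, event: Event) -> None:
    """Version 2 (changed logic): number of events per key."""
    key = str(event["key"])
    table[key] = table.get(key, 0) + 1


def reprocess(log: List[Event], job: Callable[[Table, Event], None]) -> Table:
    """Replay the whole log from offset 0 into a brand-new output table."""
    table: Table = {}
    for event in log:
        job(table, event)
    return table


# Output tables by name, plus a pointer to the one that readers query.
tables: Dict[str, Table] = {"output_v1": reprocess(RAW_LOG, job_v1)}
live_table = "output_v1"

# New version of the job: build a separate table while v1 keeps serving...
tables["output_v2"] = reprocess(RAW_LOG, job_v2)
# ...and only once the replay is complete, point readers at the new table.
live_table = "output_v2"

print(live_table, tables[live_table])
```

The old table stays untouched until the switch, so readers never see a half-built result. This only works if the log retains all raw data, as the slide notes.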

Editor's notes

  1. Explain our use case: expanding reporting windows and shrinking ETL windows