A company of Daimler AG
ARE DATA LAKES THE NEW CORE DWHS?
ANDREAS BUCKENHOFER, DAIMLER TSS
ORACLE DATA VISION - NEUSS 2017
DOAG BIG DATA, REPORTING, GEODATA DAYS - KASSEL 2017
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB Professional
andreas.buckenhofer@daimler.com
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
DAIMLER TSS. IT EXCELLENCE: COMPREHENSIVE, INNOVATIVE, CLOSE.
We're a specialist and strategic business partner for innovative IT Solutions within Daimler –
not just another supplier!
As a 100% subsidiary of Daimler, we live the culture of excellence and aspire to take an
innovative and technological lead.
With our outstanding technological and methodical know-how we are a competent provider of
services that help those who benefit from them to stand out from the competition. When it
comes to demanding IT questions we create impetus, especially in the core fields car IT and
mobility, information security, analytics, shared services and Digital Customer Experience.
Are Data Lakes the new Core DWHs? Daimler TSS GmbH
TSS 2020: ALWAYS ON THE MOVE.
LOCATIONS
Daimler TSS China
Hub Beijing
6 Employees
Daimler TSS Malaysia
Hub Kuala Lumpur
38 Employees
Daimler TSS India
Hub Bangalore
16 Employees
Daimler TSS Germany
More than 1000 Employees
Ulm (Headquarters)
Stuttgart Area
Böblingen, Echterdingen,
Leinfelden, Möhringen
Berlin
Karlsruhe
AGENDA
1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary
• Software is becoming more and more important
• 100 million lines of code
• Physical products
• are significantly enhanced with digital service capabilities, e.g. the value of the car comes increasingly from digital assets
• become digital services, e.g. car2go
• IoT, robotics, etc.
DIGITIZATION – DATA AS AN ASSET FOR ANALYTICAL
DECISIONS
Source image: https://www.linkedin.com/pulse/20140626152045-3625632-car-software-100m-lines-of-code-and-counting
Agility
• Is the Organization ready? IT (Dev + Ops) and Business
Flexibility
• Data Modeling under pressure, model as you go
• New data formats coming from logs, sensors, etc.
Performance
• Right Time
• Scale to high volumes
• Integrate data arriving at high speed
DWH AS INTEGRATION SYSTEM FOR DIGITAL ASSETS: SOME OF TODAY’S MAIN CHALLENGES
IS THE DATA WAREHOUSE DEAD? AND ETL, TOO?
Sources: https://www.linkedin.com/groups/45685/45685-6224210695295168512?trk=hp-feed-group-discussion&_mSplash=1
https://speakerdeck.com/nehanarkhede/etl-is-dead-long-live-streams
https://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx
AGENDA
1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary
REFERENCE DATA WAREHOUSE ARCHITECTURE
[Diagram: internal and external data sources (OLTP systems) feed the Data Warehouse, labelled Backend and Frontend. Layers: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Output Layer, Reporting Layer). Cross-cutting components: Metadata Management, Security, DWH Manager. Annotation: subject-oriented, integrated, time-variant, non-volatile.]
Are Data Lakes the new Core DWHs?Daimler TSS 12
Data Lake on Hadoop
Data Swamp
Data Reservoir
Landing Zone
Data Library
Data Repository
Data Archive
Data Lake on Spark
Data Lake 3.0
DATA LAKE REFERENCE ARCHITECTURE
DATA LAKE OVERALL ARCHITECTURE VS DATA LAKE LAYER
[Diagram: layers Landing Zone, Data Lake, Data Reservoir / Presentation; cross-cutting functions Data Governance, Metadata Management, Data Archival, Data Security.]
[Diagram, continued: the same reference architecture placed between two firewalls and annotated with Sources, Sqoop, Kafka, Knox, REST API, ODBC/JDBC, and RESTful client.]
• Data Lake: architecture, concept
• Hadoop, Spark, Elastic Stack: tools (that can be used to implement a Lake)
DATA LAKE VS HADOOP
• Data has a structure: schema-less does not exist
• You apply
• schema-on-read
e.g. copy files (csv, json, html, …) into HDFS
• schema-on-write
e.g. create table on data files in HDFS
HOW TO STRUCTURE THE DATA LAKE?
SCHEMA-LESS REVOLUTION?
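The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions: the file content, the column names, and the `SCHEMA` mapping are hypothetical and not taken from the talk. With schema-on-read the raw text is stored untouched and typed only when it is read, so the reader discovers defective rows; with schema-on-write the same check runs before anything is stored.

```python
import csv
import io

# Hypothetical raw file content, stored in the lake exactly as delivered.
RAW_CSV = (
    "vin,ts,capacity\n"
    "WDB123,2017-03-01T08:00:00,71.5\n"
    "WDB456,2017-03-01T09:00:00,n/a\n"
)

# Hypothetical schema: column name -> parser applied to the raw string.
SCHEMA = {"vin": str, "ts": str, "capacity": float}


def read_with_schema(raw_text, schema):
    """Schema-on-read: parse and type the raw text only when it is read."""
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            good.append({col: parse(row[col]) for col, parse in schema.items()})
        except (KeyError, ValueError):
            bad.append(row)  # the reader, not the writer, discovers the defect
    return good, bad


def write_with_schema(rows, schema):
    """Schema-on-write: typing fails (raises) before a defective row is stored."""
    return [{col: parse(row[col]) for col, parse in schema.items()} for row in rows]


good, bad = read_with_schema(RAW_CSV, SCHEMA)
```

Every consumer of the raw file has to repeat the work done in `read_with_schema`, which is the cost the "for whom?" question on the schema-on-read slide points at.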
Flexibility
• For whom? Writing the data vs reading the data
Simplicity
• For whom? Writing the data vs reading the data
• Human mistakes while trying to read the data
Agility / Model as you go
• Just copy files into the directory
SCHEMA-ON-READ
LAMBDA ARCHITECTURE
AN EARLY COMPREHENSIVE BIG DATA ARCHITECTURE
Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015
• The complexity of the Lambda architecture is debatable
• More interesting is the author’s view on data
• Rawness
Store the data as it is. No transformations.
• Immutability
Don’t update or delete data, just add more.
• Graph-like schema recommended
"Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs"
(see page 103, Nathan Marz, "Big Data: Principles and best practices of scalable realtime data systems", Manning Publications)
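The rawness and immutability principles can be illustrated with a small append-only store. This is a minimal sketch, not the book's implementation: the `Fact` tuple, the `FactLog` class, and the sample values are assumptions made for illustration. Nothing is updated or deleted; the current state is derived from the facts at read time.

```python
from collections import namedtuple

# Hypothetical fact shape: one timestamped observation about one attribute.
Fact = namedtuple("Fact", ["entity", "attribute", "value", "ts"])


class FactLog:
    """Append-only store: facts are added, never updated or deleted."""

    def __init__(self):
        self._facts = []

    def append(self, entity, attribute, value, ts):
        self._facts.append(Fact(entity, attribute, value, ts))

    def history(self, entity, attribute):
        """All facts for one attribute, oldest first (ISO timestamps sort as text)."""
        matches = [f for f in self._facts
                   if f.entity == entity and f.attribute == attribute]
        return sorted(matches, key=lambda f: f.ts)

    def current(self, entity, attribute):
        """The latest value is derived on read, not maintained by updates."""
        facts = self.history(entity, attribute)
        return facts[-1].value if facts else None


log = FactLog()
log.append("robot-7", "firmware", "1.0", "2017-01-10T00:00:00")
log.append("robot-7", "firmware", "1.1", "2017-03-02T00:00:00")
```

Because older facts remain in place, a wrong derivation can be corrected by recomputing from the log rather than by repairing overwritten data.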
Just dumping data into the Lake?
• General Data Protection Regulation, e.g. Privacy by Design
• The vehicle identifier (VIN) is already sensitive data that needs to be protected (anonymized) depending on usage
• Earmarked use of data
Schema-on-read: How do you protect data assets if you are not
aware that the data exists or where it exists?
STRUCTURING THE DATA LAKE
DATA SECURITY
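One common way to protect an identifier such as the VIN before data is made available for analysis is keyed hashing. The sketch below is an assumption, not the procedure used at Daimler: the function name, the key, and the sample VIN are hypothetical. Note that a keyed hash is pseudonymization rather than full anonymization, since whoever holds the key can re-link records.

```python
import hashlib
import hmac


def pseudonymize_vin(vin: str, secret_key: bytes) -> str:
    """Return a stable keyed hash (HMAC-SHA256, hex) in place of the VIN."""
    normalized = vin.strip().upper()  # same vehicle -> same token
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration; a real key would come from a secret store.
KEY = b"demo-key-not-for-production"

token_a = pseudonymize_vin("WDB1234567890ABCD", KEY)
token_b = pseudonymize_vin(" wdb1234567890abcd ", KEY)
```

The same vehicle always maps to the same token, so joins across data sets still work, while the token itself does not reveal the VIN. Applying such a step requires knowing which files and fields contain a VIN, which is the point of the question about schema-on-read above.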
DATA LAKE REFERENCE ARCHITECTURE
[Diagram: Landing Zone, Data Lake, and Data Presentation layers with Data Governance, Metadata Management, Data Archival, and Data Security. Flow labels: load, structure, transform, access; archive (three times). Annotations: raw data; temporary storage; immutable, modeled data; tool neutral; structured data for fast access.]
Distinguish Data Lake as overall concept vs Data Lake as a layer
• Landing Zone
• Source data programmatically loaded
• Data is partitioned for processing
• Governance includes catalog and ILM (Security, Retention)
• Data Lake
• Lightly integrated by Keys
• Data accessible via SQL-on-Hadoop or using SerDes on raw data
• Data is partitioned for access
• Governance includes catalog, ILM, lightweight model
DATA LAKE HAS LAYERS (1)
DATA LAKE AS CONCEPT VS DATA LAKE AS LAYER
• Presentation Zone
• Data is structured and partitioned/tuned for data access
• Full Governance including e.g. catalog, ILM, model
• Known schema including metadata about tables and columns
• Lineage
• Documented quality
DATA LAKE HAS LAYERS (2)
GOVERNANCE BY DAIMLER AG / COE
E.G. SAMPLE HDFS LAYOUT
/
Landing_zone: scripts, data, Source_system
Data_archive: scripts, data, Source_system
Data_lake: scripts, data, Source_system_object
Data_science_results: model, data
Data_reservoir: scripts, data, Use_case
Data_science_sandbox: scripts, data
AGENDA
1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary
USE CASES
WHAT IS THE BUSINESS PROBLEM TO SOLVE?
Source: http://www.azquotes.com/
USE CASE: ANALYSIS BATTERY AGING
[Chart labels: max capacity, current capacity]
• CSV data ingested into HDFS, Hive tables on files
• Identify breaks (“> 8h”) and compute current drain
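The break detection described above can be sketched as follows. This is an illustration under assumptions: the readings are hypothetical `(timestamp, capacity)` pairs, and interpreting "drain" as the drop in capacity across a break is a reading of the slide, not something the talk specifies. In the described setup the data would come from Hive tables rather than an in-memory list.

```python
from datetime import datetime, timedelta


def find_breaks(readings, min_break=timedelta(hours=8)):
    """Return (start, end, drain) for every gap longer than min_break.

    readings: list of (timestamp, capacity) tuples sorted by timestamp.
    drain: capacity before the break minus capacity after it (assumed meaning).
    """
    breaks = []
    for (t0, c0), (t1, c1) in zip(readings, readings[1:]):
        if t1 - t0 > min_break:
            breaks.append((t0, t1, c0 - c1))
    return breaks


# Hypothetical sample readings for one vehicle.
readings = [
    (datetime(2017, 3, 1, 8), 80.0),
    (datetime(2017, 3, 1, 9), 79.5),   # 1 h gap: not a break
    (datetime(2017, 3, 1, 18), 79.0),  # 9 h gap: break
    (datetime(2017, 3, 2, 7), 78.0),   # 13 h gap: break
]
breaks = find_breaks(readings)
```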
• Sensor data format change without notice
• Sensors get regularly updated with new versions
• Names of metrics may change
• Sensors with various versions in the field
• Sensors from different suppliers
• Often many fields (far more than 100), increasing with new sensor versions
• Easy storing of data in HDFS and applying schema later
• Data from Robots, vehicles, …
STRUCTURING THE DATA LAKE
NEW DATA SOURCES – SENSOR DATA
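Changing metric names across sensor versions and suppliers are typically handled with a mapping onto canonical names when the schema is applied. The following is a minimal sketch; the alias table and field names are hypothetical and not taken from the talk.

```python
# Hypothetical aliases: names used by different sensor versions and suppliers,
# mapped onto one canonical name each.
ALIASES = {
    "temp": "temperature_c",
    "temperature": "temperature_c",
    "tmp_c": "temperature_c",
    "volt": "voltage_v",
    "voltage": "voltage_v",
}


def normalize(record):
    """Split a raw sensor record into canonical fields and unknown fields."""
    known, unknown = {}, {}
    for name, value in record.items():
        canonical = ALIASES.get(name.lower())
        if canonical is None:
            unknown[name] = value  # new or renamed metric: needs a data engineer
        else:
            known[canonical] = value
    return known, unknown


known, unknown = normalize({"Temp": 21.5, "volt": 12.1, "new_metric": 3})
```

Keeping the unknown fields instead of dropping them makes a format change visible as soon as it arrives, rather than when an analysis silently loses a column.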
• Sensor data formats change without notice
• Time-consuming and error-prone data integration into the Data Lake
• Therefore, the data must be prepared for usage in the Data Reservoir: the task of a “Data Engineer”
STRUCTURING THE DATA LAKE
“SCHEMA-ON-READ”
[Diagram: the reference architecture (Landing Zone, Data Lake, Data Reservoir, with Data Governance, Metadata Management, Data Archival, Data Security) annotated with csv, sampling / filter, structure, Hive tables, R, and Python.]
USE CASE: OPTIMIZE CYCLE TIME FOR LIGHTWEIGHT ROBOTS
• JSON data from Orient NoSQL-DB ingested into HDFS, Hive tables on files
• Partly automate the diagnosis of anomalies (e.g. the identification of reasons for idle times)
USE CASE: BOM EXPLOSION
HADOOP COMPUTING POWER
• PLMXML files supplied by source systems
• Compute changes by comparing last BOM with current BOM
• Data Lake contains data across all tiers
• Data Reservoir contains “dedicated, secured” views for tiers
• Transfer changes to local relational DBs
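Comparing the last BOM with the current one amounts to a set difference on part numbers plus a comparison of the remaining entries. The sketch below assumes each BOM has already been flattened into a dictionary of part number to quantity; parsing the PLMXML files is not shown, and the part numbers are hypothetical.

```python
def diff_bom(last, current):
    """Compare two flattened BOMs (part number -> quantity).

    Returns three dicts: added parts, removed parts, and changed parts
    as part -> (old quantity, new quantity).
    """
    added = {p: q for p, q in current.items() if p not in last}
    removed = {p: q for p, q in last.items() if p not in current}
    changed = {p: (last[p], q) for p, q in current.items()
               if p in last and last[p] != q}
    return added, removed, changed


# Hypothetical flattened BOMs.
last_bom = {"A100": 4, "B200": 1, "C300": 2}
current_bom = {"A100": 4, "B200": 2, "D400": 1}
added, removed, changed = diff_bom(last_bom, current_bom)
```

Only the three result dictionaries would need to be transferred to the local relational databases, which keeps the transfer small compared with shipping the full BOM each time.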
• Several stakeholders, e.g. different (independent) truck units
• Dumping existing systems (or new data sources like logs) into the Data Lake
• Data is available fast, but
• Different data models
• No integration: if ETL is reduced to EL, then the T is performed by Data Scientists many times over
• Some lightweight data integration required: Data Vault
STRUCTURING THE DATA LAKE LAYER
EXISTING INTERNAL DATA FOR ANALYTICS
• Hub and Link tables: how to ensure uniqueness?
• No unique constraints or indexes like RDBMS
• Use View with distinct or group by on Hub or Link table
• Don’t create Hub or Link tables. Create a view with distinct or group by on the original persisted incoming files
• Use HBase NoSQL wide-column store for Hub, Link (+ Sat) and Phoenix for SQL access via Hive
• Hub and Link in RDBMS only
• Data Reservoir needs a different structure, or export data into a Data Mart in an RDBMS for faster access
STRUCTURING THE DATA LAKE LAYER
DATA VAULT CHALLENGES WITH HADOOP
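The "view with distinct or group by" option can be emulated in plain Python to show what such a view computes. This is a sketch under assumptions: the row layout, the `load_ts` field, and the use of a SHA-256 hash as hub key are illustrative choices, not details from the talk. On Hadoop the same logic would be expressed as a SQL view over the persisted files.

```python
import hashlib


def hub_key(business_key):
    """Deterministic surrogate key derived from the normalized business key."""
    return hashlib.sha256(business_key.encode("utf-8")).hexdigest()


def build_hub(staged_rows, key_column):
    """Emulate SELECT key, MIN(load_ts) ... GROUP BY key over incoming rows."""
    hub = {}
    for row in staged_rows:
        bk = row[key_column].strip().upper()
        hk = hub_key(bk)
        entry = hub.get(hk)
        if entry is None or row["load_ts"] < entry["load_ts"]:
            hub[hk] = {"hub_key": hk, "business_key": bk, "load_ts": row["load_ts"]}
    return sorted(hub.values(), key=lambda e: e["business_key"])


# Hypothetical staged rows containing duplicates of the same business key.
staged = [
    {"vin": "vin1", "load_ts": "2017-03-02"},
    {"vin": "VIN1 ", "load_ts": "2017-03-01"},
    {"vin": "VIN2", "load_ts": "2017-03-02"},
]
hub_rows = build_hub(staged, "vin")
```

Uniqueness is enforced by the derivation itself rather than by a constraint, which is why this option works without the unique indexes an RDBMS would provide. The price is that the deduplication is recomputed on every read unless the result is materialized.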
• Vision: One central Enterprise DWH
• Reality for many organizations: Many DWHs
• more flexible
• acquisition of companies. Merge of systems?
• units with different (innovation) speeds and different interests, e.g. trucks (Mercedes Benz LKW, Freightliner, Fuso, BharatBenz, Western Star, Fleetboard)
• legal requirements (e.g. data export)
• Vision: One central Data Lake
• Reality: ?
DATA LAKE IN ANALOGY TO AN ENTERPRISE DWH?
“The long-term vision was clear –
the data warehouse should not be confined physically to a single
database or machine” (09-MAR-2017)
BARRY DEVLIN – LOGICAL DATA WAREHOUSE
Source: https://upside.tdwi.org/articles/2017/03/09/making-the-most-of-a-logical-data-warehouse.aspx
Barry Devlin wrote the first published article describing a data warehouse
architecture in 1988 ( http://www.9sight.com/1988/02/art-ibmsj-ebis/ )
AGENDA
1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary
“Data modeling is the process of learning about the data, and regardless of technology,
this process must be performed for a successful application.”
• Learn about the data and promote collective data understanding
• Derive security classification and measures
• Design for performance
• Accelerate development
• Improve Software quality
• Reduce maintenance costs
• Generate code
• NoSQL Schema-on-read: understand model versions after years
WHY DATA MODELING?
Source quote: Steve Hoberman: Data Modeling for Mongo DB, Technics Publications 2014
DWH AND DATA LAKE
DWH on RDBMS: methods, concepts, techniques
• Slowly Changing Dimension
• ELT vs ETL
• 3-Layer vs 2-Layer
• Kimball Approach
• Inmon Definition
• Star Schema
• Data Vault
• Anchor Modeling
• etc.
Data Lake on Hadoop: tools, tools, tools
• Schema-on-Read
• Agility
• Parquet
• Hive
• HBase
• SQL-on-Hadoop
• Impala
• Oozie
• ZooKeeper
Many ETL problems are home-made, e.g.
• Inefficient: ETL vs ELT / row-based vs set-based
• Expensive: repetitive tasks should be accomplished with generators
NO DATA INTEGRATION - IS ETL DEAD?
DATA SCIENCE REQUIRES PROPER DATA ENGINEERING
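The point about generators can be illustrated with a small metadata-driven DDL generator. This is a sketch only: the function name, table name, columns, and location are hypothetical. The generated statement uses standard HiveQL syntax for an external table over delimited text files.

```python
def generate_hive_ddl(table, columns, location, delimiter=","):
    """Generate a Hive CREATE EXTERNAL TABLE statement from column metadata.

    columns: list of (column name, Hive type) tuples.
    """
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical metadata for one source object.
ddl = generate_hive_ddl(
    "landing.battery",
    [("vin", "STRING"), ("ts", "TIMESTAMP"), ("capacity", "DOUBLE")],
    "/landing_zone/data/battery",
)
```

With the column metadata kept in one place, adding a source table becomes a metadata entry rather than another hand-written script, which is the repetitive work the slide argues should be generated.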
Most people in AI forget that the hardest
part of building a new AI solution or
product is not the AI or algorithms— it’s
the data collection and labeling.
Source: https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2#.ywjvuca6z (Luke de Oliveira)
Data Lakes currently focus too much on tools instead of concepts and methods
• Tools come and go
• Flexibility / schema-on-read: integration is just postponed to the Data Reservoir or, in the worst case, even later to the end user
PoCs vs production-ready implementation
• Many tools, but still low-productivity tools (Oozie, etc.)
• Error handling is a coding nightmare across tools
Data Lakes and Core DWHs will coexist
• Another choice that makes sense for many use cases
• DWH: e.g. the Data Vault 2.0 architecture, which stores raw data and postpones data cleansing / harmonization in favor of lightweight data integration, follows similar ideas
IS THE CLASSICAL DWH DEAD?
ARE DATA LAKES THE NEW CORE DWHS?
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
tss@daimler.com / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU
GARTNER DATA LAKE ARCHITECTURE STYLES
Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/
• Inflow Lake: accommodates a collection of data ingested from many
different sources that are disconnected outside the lake but can be used
together by being colocated within a single place
• Outflow Lake: a landing area for freshly arrived data available for
immediate access or via streaming. It employs schema-on-read for the
downstream data interpretation and refinement.
• Data Science Lab: most suitable for data discovery and for developing
new advanced analytics models
GARTNER DATA LAKE ARCHITECTURE STYLES
Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/ and https://www.asug.com/news/gartner-separate-data-lakes-myths-from-facts-before-you-dive-in
Slide 12: Creative Commons Licence, Hernán Piñera, https://www.flickr.com/photos/hernanpc/7175577368/
Slide 47: Creative Commons Licence, James Loesch, https://www.flickr.com/photos/jal33/5182574275/
IMAGE ATTRIBUTION
DWH = inflexible development,
bad performance,
complex architecture with 3 layers
Failure to talk to business to obtain proper requirements
Ingestion of wrong data
Storage of data with errors
Business Keys (independent object) nested into document
Read performance
SCHEMA-ON-READ
OR WHY MODELING CAN STILL BE USEFUL
Are Data Lakes the new Core DWHs?Daimler TSS 47
SCHEMA-ON-READ
OR WHICH BUSINESS PROBLEMS ARE SOLVED
Topic | Schema-on-read | Remark
Data storage | Yes, flexible | Store data from various systems
Data integration | No | Integrate data from various systems. Has to be done during each access by each user
Data historization | Yes, auditable | Stamp data with timestamp
Information delivery | No | Turn data into valuable information. Has to be done during each access by each user
DATA MODELS IN THE DWH
Layer | Characteristics | Data Model
Staging Layer | Temporary storage; ingest of source data | Normally a 1:1 copy of the source table structure, usually without constraints and indexes
Core Warehouse Layer | Historization / bitemporal data; integration; tool-independent; non-redundant data storage | 3NF with historization; Head and Version modelling; Data Vault; Anchor modeling; dimensional model with historization (possible)
Data Mart Layer | Performance for end user queries required; tool-dependent; lots of joins necessary to answer complex questions | Flat structures, esp. dimensional model (ROLAP / MOLAP / HOLAP)
Understand business requirements
Understand problem space
Design solution space
Think ideas (incl. alternatives) through
WHY MODEL?
SQL is the universal language to access and manipulate data in an RDBMS
SQL is a language not only for DBAs or developers
SQL is standard for OLTP and OLAP, especially for BI tools
MAKE SQL GREAT AGAIN OR WHY SQL ON BIG DATA?
STRATA 2012 VS 2016
Source: http://www.cazena.com/blog/strata-word-cloud-2012-vs-2016-data-lakes-spark-real-time-and-other-trends
• Architecture with Atlas
• Supports the classical tools:
• Hive
• Sqoop
• HDFS?
• Schema-on-read?
ATLAS FOR METADATA MANAGEMENT
NO DATA INTEGRATION NECESSARY OR
WHO REALLY UNDERSTANDS DATA MODELS?
Source: Corr / Stagnitto: Agile Data Warehouse Design, DecisionOne Press, 2011, page 5
• 3NF is inefficient for query processing
• 3NF models are difficult to understand
• 3NF gets even more complicated with history added
• Many ways from person to order
"Expanding your modeling skills enables you to reduce documentation."
Scott Ambler
• Standard approach in Data Marts in DWH
• Not just for performance reasons
• Performance is also an issue on Hadoop-based systems, e.g. Hive, Spark
• Joins!
• But also due to understandability for end users
• Understandability is also an issue on Hadoop-based systems
DIMENSIONAL MODELING
A prime motivation for this evolution towards a more “database-like”
system was driven by the experiences of Google developers trying to build
on previous “key-value” storage systems. The prototypical example of such
a key-value system is Bigtable, which continues to see massive usage at
Google for a variety of applications. However, developers of many OLTP
applications found it difficult to build these applications without a
strong schema system, cross-row transactions, consistent replication and
a powerful query language.
Source: https://research.google.com/pubs/pub46103.html
IMPORTANCE OF STRONG SCHEMA @GOOGLE
HADOOP VS CLASSIC DWH
SQL APPROACH
Feature | Classic DWH | Hadoop
Tables | Yes | Yes
SQL language | Yes | Yes, SQL-on-Hadoop
Query optimizer | Yes | Yes
Indexes, PKs | Yes | No
Data “owner” | Proprietary RDBMS | Open data format; access by many engines like Spark, Hive; many open formats like Parquet, Avro
Metadata dictionary | User data + dictionary in RDBMS | User data and dictionary (“Hive Metastore”) separate
New data sources
• Sensors, logs, NoSQL, etc. as data sources
• Schema-on-read is useful because sensor data formats change frequently
Existing internal data
• Dump RDBMS exports into the Data Lake for data analytics
• Schema-on-read does not make any sense, as the data is already in a documented data model
STRUCTURING THE DATA LAKE

Más contenido relacionado

La actualidad más candente

Data warehouse design
Data warehouse designData warehouse design
Data warehouse designines beltaief
 
Steps To Build A Datawarehouse
Steps To Build A DatawarehouseSteps To Build A Datawarehouse
Steps To Build A DatawarehouseHendra Saputra
 
Data Warehouse Design on Cloud ,A Big Data approach Part_One
Data Warehouse Design on Cloud ,A Big Data approach Part_OneData Warehouse Design on Cloud ,A Big Data approach Part_One
Data Warehouse Design on Cloud ,A Big Data approach Part_OnePanchaleswar Nayak
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLVenu Anuganti
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
How To Buy Data Warehouse
How To Buy Data WarehouseHow To Buy Data Warehouse
How To Buy Data WarehouseEric Sun
 
From Traditional Data Warehouse To Real Time Data Warehouse
From Traditional Data Warehouse To Real Time Data WarehouseFrom Traditional Data Warehouse To Real Time Data Warehouse
From Traditional Data Warehouse To Real Time Data WarehouseOsama Hussein
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyMark Ginnebaugh
 
Scalable data pipeline
Scalable data pipelineScalable data pipeline
Scalable data pipelineGreenM
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Caserta
 
Making MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceMaking MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceCalpont
 
The Database Environment Chapter 13
The Database Environment Chapter 13The Database Environment Chapter 13
The Database Environment Chapter 13Jeanie Arnoco
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau ArchitectureVivek Mohan
 

La actualidad más candente (20)

Data warehouse design
Data warehouse designData warehouse design
Data warehouse design
 
Steps To Build A Datawarehouse
Steps To Build A DatawarehouseSteps To Build A Datawarehouse
Steps To Build A Datawarehouse
 
Data Warehouse Design on Cloud ,A Big Data approach Part_One
Data Warehouse Design on Cloud ,A Big Data approach Part_OneData Warehouse Design on Cloud ,A Big Data approach Part_One
Data Warehouse Design on Cloud ,A Big Data approach Part_One
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
 
080827 abramson inmon vs kimball
080827 abramson   inmon vs kimball080827 abramson   inmon vs kimball
080827 abramson inmon vs kimball
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
How To Buy Data Warehouse
How To Buy Data WarehouseHow To Buy Data Warehouse
How To Buy Data Warehouse
 
From Traditional Data Warehouse To Real Time Data Warehouse
From Traditional Data Warehouse To Real Time Data WarehouseFrom Traditional Data Warehouse To Real Time Data Warehouse
From Traditional Data Warehouse To Real Time Data Warehouse
 
SAS/Tableau integration
SAS/Tableau integrationSAS/Tableau integration
SAS/Tableau integration
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case Study
 
Scalable data pipeline
Scalable data pipelineScalable data pipeline
Scalable data pipeline
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
Making MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceMaking MySQL Great For Business Intelligence
Making MySQL Great For Business Intelligence
 
The Database Environment Chapter 13
The Database Environment Chapter 13The Database Environment Chapter 13
The Database Environment Chapter 13
 
Informatica doc
Informatica docInformatica doc
Informatica doc
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau Architecture
 
OLAP technology
OLAP technologyOLAP technology
OLAP technology
 
An Introduction To BI
An Introduction To BIAn Introduction To BI
An Introduction To BI
 
Datastage ppt
Datastage pptDatastage ppt
Datastage ppt
 
Oracle: DW Design
Oracle: DW DesignOracle: DW Design
Oracle: DW Design
 

Similar a Are Data Lakes the new Core DWHs?

Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos MilovanovicInstitute of Contemporary Sciences
 
Trivadis Azure Data Lake
Trivadis Azure Data LakeTrivadis Azure Data Lake
Trivadis Azure Data LakeTrivadis
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
AWS Data Lakes & Best Practices - GoDgtl
AWS Data Lakes & Best Practices - GoDgtlAWS Data Lakes & Best Practices - GoDgtl
AWS Data Lakes & Best Practices - GoDgtlMezzybatliwala
 
AWS Data Lakes and Best Practices
AWS Data Lakes and Best PracticesAWS Data Lakes and Best Practices
AWS Data Lakes and Best PracticesPeeterParkar
 
Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!Steve Keil
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 
Data Virtualization: An Essential Component of a Cloud Data Lake
Data Virtualization: An Essential Component of a Cloud Data LakeData Virtualization: An Essential Component of a Cloud Data Lake
Data Virtualization: An Essential Component of a Cloud Data LakeDenodo
 
DWH: stop wasting time!
DWH: stop wasting time!DWH: stop wasting time!
DWH: stop wasting time!Sadas
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL TechnologiesAmit Singh
 
IT + Line of Business - Driving Faster, Deeper Insights Together
IT + Line of Business - Driving Faster, Deeper Insights TogetherIT + Line of Business - Driving Faster, Deeper Insights Together
IT + Line of Business - Driving Faster, Deeper Insights TogetherDATAVERSITY
 
Data Engineering
Data EngineeringData Engineering
Data Engineeringkiansahafi
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata Hortonworks
 
Prague data management meetup #30 2019-10-04
Prague data management meetup #30 2019-10-04Prague data management meetup #30 2019-10-04
Prague data management meetup #30 2019-10-04Martin Bém
 
Foundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information ArchitectureFoundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information ArchitectureInside Analysis
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantagePrecisely
 
La creación de una capa operacional con MongoDB
La creación de una capa operacional con MongoDBLa creación de una capa operacional con MongoDB
La creación de una capa operacional con MongoDBMongoDB
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraCaserta
 

Are Data Lakes the new Core DWHs?

  • 1. A company of Daimler AG ARE DATA LAKES THE NEW CORE DWHS? ANDREAS BUCKENHOFER, DAIMLER TSS ORACLE DATA VISION - NEUSS 2017 DOAG BIG DATA, REPORTING, GEODATA DAYS - KASSEL 2017
  • 3. DAIMLER TSS. IT EXCELLENCE: COMPREHENSIVE, INNOVATIVE, CLOSE. We're a specialist and strategic business partner for innovative IT solutions within Daimler, not just another supplier! As a 100% subsidiary of Daimler, we live the culture of excellence and aspire to take an innovative and technological lead. With our outstanding technological and methodical know-how we are a competent provider of services that help those who benefit from them to stand out from the competition. When it comes to demanding IT questions we create impetus, especially in the core fields car IT and mobility, information security, analytics, shared services and Digital Customer Experience. TSS 2020: ALWAYS ON THE MOVE.
  • 4. LOCATIONS
      • Daimler TSS Germany: more than 1000 employees; Ulm (headquarters), Stuttgart area (Böblingen, Echterdingen, Leinfelden, Möhringen), Berlin, Karlsruhe
      • Daimler TSS China: Hub Beijing, 6 employees
      • Daimler TSS Malaysia: Hub Kuala Lumpur, 38 employees
      • Daimler TSS India: Hub Bangalore, 16 employees
  • 5. AGENDA 1. Introduction/Motivation 2. From the classic DWH architecture to the Data Lake 3. Data Lake usage scenarios 4. Summary
  • 6. DIGITIZATION: DATA AS AN ASSET FOR ANALYTICAL DECISIONS
      • Software is becoming more and more important
          • 100 million lines of code
      • Physical products
          • are significantly enhanced with digital service capabilities, e.g. the value of the car comes increasingly from digital assets
          • become digital services, e.g. car2go
      • IoT, robotics, etc.
      Source image: https://www.linkedin.com/pulse/20140626152045-3625632-car-software-100m-lines-of-code-and-counting
  • 7. DWH AS INTEGRATION SYSTEM FOR DIGITAL ASSETS: SOME OF TODAY'S MAIN CHALLENGES
      • Agility
          • Is the organization ready? IT (Dev + Ops) and business
      • Flexibility
          • Data modeling under pressure, model as you go
          • New data formats coming from logs, sensors, etc.
      • Performance
          • Right time
          • Scale to high volumes
          • Integrate data arriving at high speed
  • 8. IS THE DATA WAREHOUSE DEAD? AND ETL, TOO?
      Sources:
      • https://www.linkedin.com/groups/45685/45685-6224210695295168512?trk=hp-feed-group-discussion&_mSplash=1
      • https://speakerdeck.com/nehanarkhede/etl-is-dead-long-live-streams
      • https://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx
  • 9. AGENDA 1. Introduction/Motivation 2. From the classic DWH architecture to the Data Lake 3. Data Lake usage scenarios 4. Summary
  • 10. REFERENCE DATA WAREHOUSE ARCHITECTURE
      • Data sources: internal data sources (OLTP) and external data sources
      • Data Warehouse layers, from backend to frontend: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Output Layer / Reporting Layer)
      • Cross-cutting components: Metadata Management, Security, DWH Manager
      • Data warehouse properties: subject-oriented, integrated, time-variant, non-volatile
  • 12. Data Lake on Hadoop, Data Swamp, Data Reservoir, Landing Zone, Data Library, Data Repository, Data Archive, Data Lake on Spark, Data Lake 3.0
  • 13. DATA LAKE REFERENCE ARCHITECTURE: DATA LAKE OVERALL ARCHITECTURE VS DATA LAKE LAYER
      • Layers: Landing Zone, Data Lake, Data Reservoir / Presentation
      • Cross-cutting: Data Governance, Metadata Management, Data Archival, Data Security
  • 14. DATA LAKE REFERENCE ARCHITECTURE
      • Layers: Landing Zone, Data Lake, Data Reservoir / Presentation
      • Cross-cutting: Data Governance, Metadata Management, Data Archival, Data Security
      • Further components shown: Sources, Firewall (twice), Sqoop, Kafka, Knox, REST API, ODBC/JDBC, RESTful client
  • 15. DATA LAKE VS HADOOP
      • Data Lake: architecture, concept
      • Hadoop, Spark, Elastic Stack: tools (that can be used to implement a lake)
  • 16. HOW TO STRUCTURE THE DATA LAKE? SCHEMA-LESS REVOLUTION?
      • Data has a structure: schema-less does not exist
      • You apply
          • schema-on-read, e.g. copy files (csv, json, html, …) into HDFS
          • schema-on-write, e.g. create table on data files in HDFS
  • 17. SCHEMA-ON-READ
      • Flexibility
          • For whom? Writing the data vs reading the data
      • Simplicity
          • For whom? Writing the data vs reading the data
          • Human mistakes while trying to read the data
      • Agility / model as you go
          • Just copy files into the directory
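The trade-off on slides 16 and 17 can be made concrete with a minimal Python sketch. It assumes a hypothetical CSV file that was copied into the lake unchanged, and a hypothetical reader-side schema (`SCHEMA`). The column names and the helper `read_with_schema` are illustrative and not taken from the talk. Writing is trivial, but every reader has to apply the schema and handle bad rows itself.

```python
import csv
import io

# Raw file content as it might land in the lake: copied unchanged, nothing is checked on write.
RAW_CSV = (
    "vehicle_id,ts,capacity\n"
    "V1,2017-01-01T08:00:00,91.5\n"
    "V2,2017-01-01T09:00:00,n/a\n"
)

# Hypothetical reader-side schema: column name -> parser function.
SCHEMA = {"vehicle_id": str, "ts": str, "capacity": float}


def read_with_schema(raw, schema):
    """Apply the schema while reading; return (good_rows, rejected_rows)."""
    good, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw)):
        try:
            good.append({col: parse(row[col]) for col, parse in schema.items()})
        except (KeyError, ValueError) as exc:
            # Problems surface only now, at read time, for every reader.
            rejected.append((row, repr(exc)))
    return good, rejected


good, rejected = read_with_schema(RAW_CSV, SCHEMA)
print(len(good), "parsed,", len(rejected), "rejected")
```

With schema-on-write the second row would have been rejected once, at load time. Here every consumer has to repeat the check.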
  • 18. LAMBDA ARCHITECTURE: AN EARLY COMPREHENSIVE BIG DATA ARCHITECTURE
      • The complexity of the Lambda architecture is open to debate
      • More interesting is the authors' view on data
          • Rawness: store the data as it is. No transformations.
          • Immutability: don't update or delete data, just add more.
          • Graph-like schema recommended
      Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015
  • 19. LAMBDA ARCHITECTURE
      • The complexity of the Lambda architecture is open to debate
      • More interesting is the authors' view on data
          • Rawness: store the data as it is. No transformations.
          • Immutability: don't update or delete data, just add more.
          • Graph-like schema recommended
      "Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs" (see page 103, Nathan Marz, "Big Data: Principles and best practices of scalable realtime data systems", Manning Publications)
      Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015
  • 20. STRUCTURING THE DATA LAKE: DATA SECURITY
      • Just dumping data into the lake?
          • General Data Protection Regulation, e.g. Privacy by Design
          • The vehicle identifier (VIN) is already sensitive data that needs to be protected (anonymized) depending on usage
          • Earmarked use of data
      • Schema-on-read: how do you protect data assets if you are not aware that the data exists or where it exists?
  • 21. DATA LAKE REFERENCE ARCHITECTURE
      • Layers: Landing Zone, Data Lake, Data Presentation
      • Cross-cutting: Data Governance, Metadata Management, Data Archival, Data Security
      • Flow labels: load, structure, transform, access; archive (three times)
      • Annotations: temporary storage, raw data; immutable, modeled data, tool neutral; structured data for fast access
  • 22. DATA LAKE HAS LAYERS (1): DATA LAKE AS CONCEPT VS DATA LAKE AS LAYER
      Distinguish Data Lake as overall concept vs Data Lake as a layer
      • Landing Zone
          • Source data programmatically loaded
          • Data is partitioned for processing
          • Governance includes catalog and ILM (security, retention)
      • Data Lake
          • Lightly integrated by keys
          • Data accessible via SQL-on-Hadoop or using SerDes on raw data
          • Data is partitioned for access
          • Governance includes catalog, ILM, lightweight model
  • 23. DATA LAKE HAS LAYERS (2)
      • Presentation Zone
          • Data is structured and partitioned/tuned for data access
          • Full governance including e.g. catalog, ILM, model
          • Known schema including metadata about tables and columns
          • Lineage
          • Documented quality
  • 24. GOVERNANCE BY DAIMLER AG / COE, E.G. SAMPLE HDFS LAYOUT
      Directory labels as extracted from the diagram: / scripts data Source_system Landing_zone scripts data Source_system Data_archive scripts data Source_system_object Data_lake model data Data_science_results scripts data Use_case Data_reservoir scripts data Data_science_sandbox
  • 25. AGENDA 1. Introduction/Motivation 2. From the classic DWH architecture to the Data Lake 3. Data Lake usage scenarios 4. Summary
  • 26. USE CASES: WHAT IS THE BUSINESS PROBLEM TO SOLVE?
      Source: http://www.azquotes.com/
  • 27. USE CASE: ANALYSIS BATTERY AGING
      • Chart labels: max capacity, current capacity
      • CSV data ingested into HDFS, Hive tables on files
      • Identify breaks ("> 8h") and compute current drain
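The slide gives only the outline of the analysis, so the following Python sketch rests on stated assumptions: readings are (timestamp, capacity) pairs sorted by time, a "break" is a gap of more than 8 hours between two consecutive readings, and "drain" is taken here to mean the capacity lost per hour during such a break. The sample values and the function name `breaks_with_drain` are hypothetical.

```python
from datetime import datetime, timedelta

MIN_BREAK = timedelta(hours=8)  # threshold taken from the slide ("> 8h")

# Hypothetical readings: (timestamp, battery capacity), sorted by time.
readings = [
    (datetime(2017, 3, 1, 18, 0), 70.0),
    (datetime(2017, 3, 1, 18, 5), 69.9),
    (datetime(2017, 3, 2, 7, 5), 69.2),   # 13 h after the previous reading
    (datetime(2017, 3, 2, 7, 10), 69.1),
]


def breaks_with_drain(rows, min_break=MIN_BREAK):
    """Find gaps longer than min_break and the capacity lost per hour in each gap."""
    result = []
    for (t0, c0), (t1, c1) in zip(rows, rows[1:]):
        gap = t1 - t0
        if gap > min_break:
            hours = gap.total_seconds() / 3600
            result.append(
                {"start": t0, "end": t1, "hours": hours, "drain_per_hour": (c0 - c1) / hours}
            )
    return result


for brk in breaks_with_drain(readings):
    print(brk["start"], "->", brk["end"], round(brk["drain_per_hour"], 4))
```

In the setup described on the slide the same logic would more likely run as a window function over the Hive tables. The plain-Python version only illustrates the computation.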
  • 28. STRUCTURING THE DATA LAKE: NEW DATA SOURCES, SENSOR DATA
      • Sensor data format changes without notice
      • Sensors get regularly updated with new versions
      • Names of metrics may change
      • Sensors with various versions in the field
      • Sensors from different suppliers
      • Often many fields (>>100), increasing with new sensor versions
      • Easy storing of data in HDFS and applying schema later
      • Data from robots, vehicles, …
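One way to handle metric names that change between sensor versions is a per-version mapping onto canonical names, applied when the raw data is read. The sketch below is only an illustration: the version strings, field names and the helper `normalize` are hypothetical, and it assumes each record carries a version field.

```python
# Hypothetical mapping: sensor version -> {raw field name: canonical field name}.
FIELD_MAPS = {
    "1.0": {"temp": "temperature_c", "volt": "voltage_v"},
    "2.0": {"temperature": "temperature_c", "voltage": "voltage_v"},
}


def normalize(record):
    """Rename known fields to canonical names; report fields the mapping does not know."""
    version = record.get("version")
    mapping = FIELD_MAPS.get(version)
    if mapping is None:
        raise ValueError(f"unknown sensor version: {version!r}")
    known = {canonical: record[raw] for raw, canonical in mapping.items() if raw in record}
    unknown = sorted(k for k in record if k not in mapping and k != "version")
    return known, unknown


print(normalize({"version": "1.0", "temp": 21.5, "volt": 3.7}))
print(normalize({"version": "2.0", "temperature": 22.0, "voltage": 3.6, "humidity": 40}))
```

Reporting unknown fields instead of silently dropping them makes a format change visible as soon as a new sensor version appears in the field.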
  • 29. STRUCTURING THE DATA LAKE: "SCHEMA-ON-READ"
      • Sensor data format changes without notice
      • Time-consuming and error-prone data integration into the Data Lake
      • Therefore preparation of data for usage in the Data Reservoir is required: "Data Engineer"
      • Architecture labels: Landing Zone, Data Lake, Data Reservoir; Data Governance, Metadata Management, Data Archival, Data Security
      • Pipeline labels: csv, sampling / filter, Hive tables, Hive tables, structure, R, Python
  • 30. USE CASE: OPTIMIZE CYCLE TIME FOR LIGHTWEIGHT ROBOTS
      • JSON data from Orient NoSQL-DB ingested into HDFS, Hive tables on files
      • Partly automate the diagnosis of anomalies (e.g. the identification of reasons for idle times)
  • 31. USE CASE: BOM EXPLOSION, HADOOP COMPUTING POWER
  • 32. USE CASE: BOM EXPLOSION, HADOOP COMPUTING POWER
      • PLMXML files supplied by source systems
      • Compute changes by comparing last BOM with current BOM
      • Data Lake contains data across all tiers
      • Data Reservoir contains "dedicated, secured" views for tiers
      • Transfer changes to local relational DBs
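The comparison of the last BOM with the current BOM can be sketched in Python as a set comparison over part numbers. This is an illustration under assumptions: each BOM snapshot is already parsed into a dictionary of part number to quantity (the PLMXML parsing is not shown), and the part numbers and the function `bom_changes` are hypothetical.

```python
def bom_changes(last, current):
    """Compare two BOM snapshots (part number -> quantity); return added, removed, changed."""
    added = {p: current[p] for p in current.keys() - last.keys()}
    removed = {p: last[p] for p in last.keys() - current.keys()}
    changed = {
        p: (last[p], current[p])
        for p in last.keys() & current.keys()
        if last[p] != current[p]
    }
    return added, removed, changed


# Hypothetical snapshots of the same assembly on two consecutive loads.
last_bom = {"A100": 4, "B200": 1, "C300": 2}
current_bom = {"A100": 4, "B200": 2, "D400": 1}

added, removed, changed = bom_changes(last_bom, current_bom)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Only the three result sets would then need to be transferred to the local relational databases, rather than the full BOM.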
  • 33. STRUCTURING THE DATA LAKE LAYER: EXISTING INTERNAL DATA FOR ANALYTICS
      • Several stakeholders, e.g. different (independent) truck units
      • Dumping existing systems (or new data sources like logs) into the Data Lake
      • Data is available fast, but
          • Different data models
          • No integration: if ETL is reduced to EL, then T is performed by data scientists many times
      • Some lightweight data integration required: Data Vault
  • 34. STRUCTURING THE DATA LAKE LAYER: DATA VAULT CHALLENGES WITH HADOOP
      • Hub and Link tables: how to ensure uniqueness?
          • No unique constraints or indexes like in an RDBMS
          • Use a view with DISTINCT or GROUP BY on the Hub or Link table
          • Don't create a Hub or Link table; create a view with DISTINCT or GROUP BY on the original persisted incoming files
          • Use HBase NoSQL wide-column store for Hub, Link (+ Sat) and Phoenix for SQL access via Hive
          • Hub and Link in RDBMS only
      • Data Reservoir needs a different structure, or export data into a Data Mart in an RDBMS for faster access
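The "view with DISTINCT or GROUP BY on the original persisted incoming files" option can be illustrated with a small sketch. SQLite (via Python's standard `sqlite3` module) stands in for Hive here purely so the example is runnable, and the table `stg_vehicle`, the view `hub_vehicle` and their columns are hypothetical. The idea carries over: the Hub is never materialized, and uniqueness of the business key comes from the view definition rather than from a constraint.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Stand-in for the persisted incoming files: the same business key arrives repeatedly.
    CREATE TABLE stg_vehicle (vehicle_key TEXT, load_ts TEXT, color TEXT);
    INSERT INTO stg_vehicle VALUES
      ('V1', '2017-01-01', 'red'),
      ('V1', '2017-01-02', 'blue'),
      ('V2', '2017-01-02', 'black');

    -- Hub as a view: one row per business key, with the first load timestamp.
    CREATE VIEW hub_vehicle AS
      SELECT vehicle_key, MIN(load_ts) AS first_load_ts
      FROM stg_vehicle
      GROUP BY vehicle_key;
    """
)

hub_rows = con.execute(
    "SELECT vehicle_key, first_load_ts FROM hub_vehicle ORDER BY vehicle_key"
).fetchall()
print(hub_rows)
```

The price is that the deduplication runs on every query, which is one reason the slide suggests a different structure or an RDBMS Data Mart for fast access.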
  • 35. DATA LAKE IN ANALOGY TO AN ENTERPRISE DWH?
      • Vision: one central Enterprise DWH
      • Reality for many organizations: many DWHs
          • more flexible
          • acquisition of companies. Merge of systems?
          • units with different (innovation) speeds and different interests, e.g. trucks (Mercedes Benz LKW, Freightliner, Fuso, BharatBenz, Western Star, Fleetboard)
          • legal requirements (e.g. data export)
      • Vision: one central Data Lake
      • Reality: ?
  • 36. BARRY DEVLIN: LOGICAL DATA WAREHOUSE
      "The long-term vision was clear – the data warehouse should not be confined physically to a single database or machine" (09-MAR-2017)
      Source: https://upside.tdwi.org/articles/2017/03/09/making-the-most-of-a-logical-data-warehouse.aspx
      Barry Devlin wrote the first published article describing a data warehouse architecture in 1988 ( http://www.9sight.com/1988/02/art-ibmsj-ebis/ )
  • 37. AGENDA 1. Introduction/Motivation 2. From the classic DWH architecture to the Data Lake 3. Data Lake usage scenarios 4. Summary
  • 38. WHY DATA MODELING?
      "Data modeling is the process of learning about the data, and regardless of technology, this process must be performed for a successful application."
      • Learn about the data and promote collective data understanding
      • Derive security classification and measures
      • Design for performance
      • Accelerate development
      • Improve software quality
      • Reduce maintenance costs
      • Generate code
      • NoSQL schema-on-read: understand model versions after years
      Source quote: Steve Hoberman: Data Modeling for Mongo DB, Technics Publications 2014
  • 39. DWH AND DATA LAKE
      • DWH on RDBMS (methods, concepts, techniques): Slowly Changing Dimension, ELT vs ETL, 3-Layer vs 2-Layer, Kimball Approach, Inmon Definition, Star Schema, Data Vault, Anchor Modeling, etc.
      • Data Lake on Hadoop (tools, tools, tools): Schema-on-Read, Agility, Parquet, Hive, HBase, SQL-on-Hadoop, Impala, Oozie, ZooKeeper
  • 40. NO DATA INTEGRATION: IS ETL DEAD? DATA SCIENCE REQUIRES PROPER DATA ENGINEERING
      • Many ETL problems are home-made, e.g.
          • Inefficient: ETL vs ELT / row-based vs set-based
          • Expensive: repetitive tasks should be accomplished with generators
      "Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms; it's the data collection and labeling."
      Source: https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2#.ywjvuca6z (Luke de Oliveira)
  • 41. IS THE CLASSICAL DWH DEAD? ARE DATA LAKES THE NEW CORE DWHS?
      • Data Lakes currently focus too much on tools instead of concepts and methods
          • Tools come and go
          • Flexibility / schema-on-read: integration is just postponed to the Data Reservoir or, in the worst case, even later to the end user
      • PoCs vs production-ready implementation
          • Many tools, but still low-productivity tools (Oozie, etc.)
          • Error handling is a coding nightmare across tools
      • Data Lakes and Core DWHs will coexist
          • Another choice that makes sense for many use cases
          • DWH: e.g. the Data Vault 2.0 architecture, which stores raw data and postpones data cleansing / harmonization for lightweight data integration, has similar ideas
  • 42. THANK YOU
      Daimler TSS GmbH, Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
      tss@daimler.com / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
      Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
  • 43. GARTNER DATA LAKE ARCHITECTURE STYLES
      Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/
  • 44. GARTNER DATA LAKE ARCHITECTURE STYLES
      • Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place
      • Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement.
      • Data Science Lab: most suitable for data discovery and for developing new advanced analytics models
      Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/ and https://www.asug.com/news/gartner-separate-data-lakes-myths-from-facts-before-you-dive-in
  • 45. IMAGE ATTRIBUTION • Slide 12: Creative Commons Licence, Hernán Piñera, https://www.flickr.com/photos/hernanpc/7175577368/in/photolist-bW5Hab-JF9HNW-a2LHAF-pwWNjx-oC1Jq8-noeV4d-oLsHUa-gUjhFx-qNB2Sw-jKLDCR-DB3B8-pRUpx2-crB6A7-nTUuNp-cXdPgN-bX7mA4-7oHeKJ-arQCtK-njdhWh-nSadX3-dykooG-sjSZHV-eq69Ux-oW44NF-i2eUbE-5AyaGL-QkmoFh-nU7KcU-QEG6Nf-oziZ4t-oUbQi4-e2NWAT-i3Yna1-eJchKZ-pGC8eC-GDux8r-5FQt95-cWdzfh-ciwtqL-jQg8BL-4X83Uc-nBZXBA-nogVER-oekb6A-9F7w4M-jKPnYQ-bAGrjd-qNB4Hq-8gJRqp-ahC2fg • Slide 47: Creative Commons Licence, James Loesch, https://www.flickr.com/photos/jal33/5182574275/in/photolist-8TY3LT-7M8Fb9-4jWYv1-hrdbHV-4jSWSn-6cHmvc-m4NnDV-s9Efoy-ccFCcW-5t3Csw-8R87fq-mT6WNq-89mMuL-pzzDjq-2iq7ti-bBA7PT-rjPdnX-buU2V9-aottwt-4zHTZv-mT6gA6-5hLzzx-9aWGiZ-s9DJRY-jwfgr3-7WZA75-bVmho1-bXkF7U-9aWGba-3mJSwv-sa4Esa-4jWZaA-aottqr-8bj7rS-5NiZbm-oowJXV-3vp25c-5t3EkQ-NnLMaJ-naLPJm-m78nWk-nqnUYk-mT7Wso-o54T1J-bVmgA9-emeyU1-5hQFV5-akhQQL-naLDim-pPeh93
  • 46. DWH = inflexible development, bad performance, complex architecture with 3 layers
  • 47. SCHEMA-ON-READ OR WHY MODELING CAN STILL BE USEFUL • Failure to talk to business to obtain proper requirements • Ingestion of wrong data • Storage of data with errors • Business Keys (independent object) nested into document • Read performance
  • 48. SCHEMA-ON-READ OR WHICH BUSINESS PROBLEMS ARE SOLVED (task: solved by schema-on-read?) • Data storage (store data from various systems): yes, flexible • Data integration (integrate data from various systems): no; has to be done during each access by each user • Data historization (stamp data with timestamp): yes, auditable • Information delivery (turn data into valuable information): no; has to be done during each access by each user
  • 49. DATA MODELS IN THE DWH (layer: characteristics; data model) • Staging Layer: temporary storage, ingest of source data; data model: normally a 1:1 copy of the source table structure, usually without constraints and indexes • Core Warehouse Layer: historization / bitemporal data, integration, tool-independent, non-redundant data storage; data model: 3NF with historization, Head and Version modelling, Data Vault, Anchor modeling, Dimensional model with historization (possible) • Data Mart Layer: performance for end user queries required, tool-dependent, lots of joins necessary to answer complex questions; data model: flat structures, esp. Dimensional model (ROLAP / MOLAP / HOLAP)
  • 50. WHY MODEL? • Understand business requirements • Understand the problem space • Design the solution space • Think ideas (incl. alternatives) through
  • 51. MAKE SQL GREAT AGAIN OR WHY SQL ON BIG DATA? • SQL is the universal language to access and manipulate data in an RDBMS • SQL is a language not only for DBAs or developers • SQL is the standard for OLTP and OLAP, especially for BI tools
  • 52. STRATA 2012 VS 2016 Source: http://www.cazena.com/blog/strata-word-cloud-2012-vs-2016-data-lakes-spark-real-time-and-other-trends
  • 53. ATLAS FOR METADATA MANAGEMENT • Architecture with Atlas • Supports the classical tools: • Hive • Sqoop • HDFS? • Schema-on-read?
  • 54. NO DATA INTEGRATION NECESSARY OR WHO REALLY UNDERSTANDS DATA MODELS? • 3NF is inefficient for query processing • 3NF models are difficult to understand • 3NF gets even more complicated with history added • Many ways to get from person to order. Source: Corr / Stagnitto: Agile Data Warehouse Design, DecisionOne Press, 2011, page 5
  • 55. WHY DATA MODELING? "Data modeling is the process of learning about the data, and regardless of technology, this process must be performed for a successful application." (Source: Steve Hoberman: Data Modeling for MongoDB, Technics Publications 2014) • Learn about the data and promote collective data understanding • Derive security classification and measures • Design for performance • Accelerate development • Improve software quality • Reduce maintenance costs • Generate code • NoSQL / schema-on-read: understand model versions even after years. "Expanding your modeling skills enables you to reduce documentation." (Scott Ambler)
  • 56. DIMENSIONAL MODELING • Standard approach in Data Marts in the DWH • Not just for performance reasons • Performance is also an issue on Hadoop-based systems, e.g. Hive, Spark • Joins! • But also due to understandability for end users • Understandability is also an issue on Hadoop-based systems
  • 57. IMPORTANCE OF STRONG SCHEMA @GOOGLE: "A prime motivation for this evolution towards a more 'database-like' system was driven by the experiences of Google developers trying to build on previous 'key-value' storage systems. The prototypical example of such a key-value system is Bigtable, which continues to see massive usage at Google for a variety of applications. However, developers of many OLTP applications found it difficult to build these applications without a strong schema system, cross-row transactions, consistent replication and a powerful query language." Source: https://research.google.com/pubs/pub46103.html
  • 58. HADOOP VS CLASSIC DWH SQL APPROACH (Classic DWH / Hadoop) • Tables: yes / yes • SQL language: yes / yes, SQL-on-Hadoop • Query optimizer: yes / yes • Indexes, PKs: yes / no • Data "owner": proprietary RDBMS / open data format, access by many engines like Spark and Hive, many open formats like Parquet and Avro • Metadata dictionary: user data + dictionary in RDBMS / user data and dictionary ("Hive Metastore") separate
  • 59. STRUCTURING THE DATA LAKE. New data sources: • Sensors, logs, NoSQL, etc. as data sources • Schema-on-read is useful because sensor data formats change frequently. Existing internal data: • Dump RDBMS exports into the Data Lake for data analytics • Schema-on-read does not make any sense here, as the data is already in a documented data model
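
The schema-on-read point for sensor data on the last slide can be illustrated with a minimal sketch. It is not taken from the talk: the event layout, the field names (`temp`, `temperature`, `unit`) and the helper `read_temperature_celsius` are hypothetical, chosen only to show a format change being absorbed at read time instead of at load time.

```python
import json

# Hypothetical raw sensor events, stored unchanged as they arrived.
# The second event uses a newer format: the field was renamed and a unit added.
RAW_EVENTS = [
    '{"sensor_id": "s1", "ts": "2017-03-01T10:00:00", "temp": 21.5}',
    '{"sensor_id": "s1", "ts": "2017-03-02T10:00:00", "temperature": 70.7, "unit": "F"}',
]


def read_temperature_celsius(raw_line):
    """Interpret one raw event at read time and map it to a single shape."""
    event = json.loads(raw_line)
    if "temp" in event:
        # Assumed format version 1: value is already in Celsius.
        celsius = event["temp"]
    else:
        # Assumed format version 2: explicit unit, convert Fahrenheit if needed.
        value = event["temperature"]
        celsius = (value - 32) * 5 / 9 if event.get("unit") == "F" else value
    return {
        "sensor_id": event["sensor_id"],
        "ts": event["ts"],
        "celsius": round(celsius, 1),
    }


# Every consumer has to apply this interpretation step on each access.
readings = [read_temperature_celsius(line) for line in RAW_EVENTS]
```

Ingestion stays trivial because the raw lines are stored as-is, but the knowledge about both format versions lives in the reader. That is the trade-off noted on slide 48: with schema-on-read, integration has to be done during each access by each user.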