Watch full webinar here: https://bit.ly/3kr0oq4
So you’re building a data lake to solve your big data challenges. A data lake will allow you to keep all of your raw, detailed data in a single, consolidated repository; therefore, your problem is solved. Or is it? Is it really that easy?
Data lakes have their use and purpose, and we’re not here to argue that. However, data lakes on their own are constrained by factors such as duplication of data and therefore higher costs, governance limitations, and the risk of becoming another data silo.
With the addition of data virtualization, a physical data lake can become a virtual or logical data lake through an abstraction layer. Data virtualization can facilitate and expedite access to and exploration of critical data in a cost-effective manner, helping you derive a greater return on your data lake investment.
You might still not be convinced. Give us an opportunity and join us as we try to bust this myth!
Watch this webinar as we explore the promises of a data lake as well as its downfalls to draw a final conclusion.
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (ASEAN)
WEBINAR SERIES
I’m Building a Data Lake, So I Don’t Need Data Virtualization
Paul Moxon
SVP Data Architectures & Chief Evangelist
Denodo
23 February 2021
A Bit of History – Etymology of “Data Lake”
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/ (with my emphasis)
Pentaho’s CTO James Dixon is credited with coining the term "data lake". He described it in his blog in 2010:

"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
Data Lakes Become Data Science Playgrounds…

The early data scientists saw Hadoop as their personal supercomputer. Hadoop-based Data Lakes helped democratize access to state-of-the-art supercomputing with off-the-shelf HW (and later the cloud).

The industry push for BI made Hadoop-based solutions the standard for bringing modern analytics to any corporation.
Changing the Data Lake Goals
“The popular view is that a data lake will be the one destination for all the data in their enterprise and the optimal platform for all their analytics.”
Nick Heudecker, Gartner
“…Data lakes lack semantic consistency and governed metadata. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls.”
Data Lakes Reference Architecture

Data Sources: Any internal or external data source that should be copied into the Data Lake.

Data Ingestion: Physical or virtual services to ingest and integrate data rapidly across a variety of sources and data types through a common ‘ingest’ layer.

Data Landing: Centralized location to land new data entering the ecosystem, separated via logical partitions based on source, data type, characteristics, and governance requirements.

Raw Zone: Original data received from the originating system, plus tagging and typing to aid understanding.

Selection & Provisioning: Services to select and integrate data objects, including provisioning and prep of data ingested in the Raw Zone and/or accessed via the Trusted and Consumption Zones.

Trusted Zone: Data is enhanced with business rules and identifiers added to enable integration.

Standardization: Services to consolidate, enrich, profile and steward datasets and metadata for on-going consumption.

Refined Zone: Data is conformed to specific uses as ‘fit for purpose’ data sets supporting common models and standards.

Exploratory Zone: Provides a flexible and intuitive way for consumers (data stewards, data engineers, and data scientists) to research and manage data.

Data Delivery Services: Services to connect and deliver data, metadata, and insights to consumers for specific use cases.

Search and Browse: Search and browse data sets, explore relationships, sample queries and export results.

Data Governance and Catalog: Governance and cataloging of business and technical data assets (stewardship, curation, profiling, quality).

Data and Operations Management: Provides a broad set of services across the ecosystem to enable security, auditing, scheduling, version management, policies, etc.

Consumers:
• Data Marketplace – User-friendly, SSO-enabled, multi-tenant front end surfacing the data lifecycle services supported by the Data Lake
• BI/Visualization – “Bring Your Own Tool” reporting and visualization capabilities integrated into the Data Lake
• Analytics Workbench – Self-service analytics workbench
• System/App/Device – Non-user consumers of data assets
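The zone progression above (Raw → Trusted → Refined) can be sketched as a minimal pipeline. This is an illustrative sketch only; the record fields and transformation rules are invented assumptions, not a specific product’s API:

```python
# Minimal sketch of the Raw -> Trusted -> Refined zone progression.
# Field names and business rules are illustrative assumptions.

raw_zone = [
    {"cust_id": " 101", "country": "de", "revenue": "1200.50"},
    {"cust_id": "102 ", "country": "SG", "revenue": "980.00"},
]

def to_trusted(record):
    """Apply business rules: trim identifiers, standardize codes and types."""
    return {
        "cust_id": record["cust_id"].strip(),
        "country": record["country"].upper(),
        "revenue": float(record["revenue"]),
    }

def to_refined(records):
    """Conform to a 'fit for purpose' data set: total revenue by country."""
    totals = {}
    for r in records:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["revenue"]
    return totals

trusted_zone = [to_trusted(r) for r in raw_zone]
refined_zone = to_refined(trusted_zone)
print(refined_zone)  # {'DE': 1200.5, 'SG': 980.0}
```

In a real lake each zone would be persistent storage (e.g., object-store partitions) rather than in-memory lists, but the shape of the flow is the same.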
Real World Data Lake Example – Using AWS

Data flow: Data Sources → Data Ingestion → Raw Data Zone → Transformation → Trusted Zone → Transformation → Refined Zone → Data Consumers

Supporting capabilities: Data Catalog and Search – Asset Registry; Workflow Orchestration, DevOps and CI/CD; Networking, Infrastructure & Security
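On AWS, zones like these are commonly laid out as object-store key prefixes. The sketch below shows one plausible partitioned key scheme; the prefix layout and names are assumptions for illustration, not an AWS-prescribed convention:

```python
# Sketch of mapping data lake zones to S3-style key prefixes.
# The layout (zone/source/dataset/dt=...) is an assumption for illustration.
from datetime import date

ZONES = ("raw", "trusted", "refined")

def zone_key(zone: str, source: str, dataset: str, day: date) -> str:
    """Build a partitioned object key for a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{dataset}/dt={day.isoformat()}/part-0000.parquet"

print(zone_key("raw", "crm", "customers", date(2021, 2, 23)))
# raw/crm/customers/dt=2021-02-23/part-0000.parquet
```

Date-based partitions like `dt=...` keep each transformation step (raw → trusted → refined) rerunnable for a single day without touching the rest of the zone.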
Data Virtualization as the Data Lake ‘Delivery Layer’
1. As the Data Delivery Services layer
2. In the Refined Zone layer
3. As the self-service Data Catalog
4. As part of the Exploratory Zone
Data Virtualization as the ‘Data Delivery Services’ Layer
Data Virtualization
• Delivery Services must support multiple data delivery styles and protocols:
  • Real-time and batch
  • Request/response and reactive (event-driven)
  • Ad-hoc queries and APIs
• The Data Lake needs a delivery layer, and Data Virtualization fits this requirement
• Enables access to Data Lake and non-Data Lake sources through a single, unified access layer
• Data Virtualization provides a data catalog for searching, finding, and understanding data available in the Data Lake
• Provides security and governance capabilities for the Data Lake
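The “single, unified access layer” idea can be illustrated with a toy federation in plain Python: two sources stay where they are, and a virtual view joins them only at query time. The source names and schemas below are invented for this sketch; a real data virtualization platform would do this with SQL over JDBC/ODBC connectors:

```python
# Toy illustration of data virtualization: a virtual view joins two
# independent "sources" at query time, without copying either one.
# Source names and schemas are invented for this sketch.

lake_orders = [  # e.g., order data living in the data lake
    {"order_id": 1, "cust_id": 101, "amount": 250.0},
    {"order_id": 2, "cust_id": 102, "amount": 75.0},
]

crm_customers = [  # e.g., a non-lake operational system
    {"cust_id": 101, "name": "Acme GmbH"},
    {"cust_id": 102, "name": "Globex Pte Ltd"},
]

def virtual_order_view():
    """Resolve the join lazily, leaving both sources in place."""
    names = {c["cust_id"]: c["name"] for c in crm_customers}
    for order in lake_orders:
        yield {**order, "customer": names.get(order["cust_id"])}

result = list(virtual_order_view())
print(result[0])
# {'order_id': 1, 'cust_id': 101, 'amount': 250.0, 'customer': 'Acme GmbH'}
```

The point of the sketch: consumers query one view, while the lake and the non-lake system each remain the system of record for their own data.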
Customer Example - FESTO
• Founded 1925
• Annual revenues (FY 2018): €3.2 B
• Over 21,000 employees
• Headquarters in Germany
• World's leading supplier of automation technology and technical education

BUSINESS NEED
• Optimize operational efficiency, automate manufacturing processes, and deliver on-demand services to business consumers
• Find smarter ways to aggregate and analyze data
• An agile solution that enables the monetization of customer-facing data products
• Free business users from IT reliance to become self-sufficient with reporting and analysis

THE CHALLENGE:
Find an agile way to integrate data from existing silos, including an analytical data lake, machine data in an IoT data lake, and traditional databases and a data warehouse, that reduces business users' dependence on IT and provides quick turnaround and flexibility.
Customer - FESTO
SOLUTION:
• Festo developed a Big Data Analytics Framework to provide a data marketplace that better supports the business
• Uses the Denodo Platform to integrate data in real time from numerous on-prem and cloud systems, including a cloud-based IoT Data Lake for machine data
• A unified layer for consistent data access and governance across different data silos
Questions to Ask About Your Data Lake…
1. Is all of your data going to be in the Data Lake?
2. Can you copy all of the data into the Data Lake?
3. Do you truly only have one Data Lake? Or will there be Data Lakes in different BUs or geographies?
4. How do you apply security and governance on the data?
5. How do you deliver ‘fit for purpose’ data sets for all users?
6. Or is the data only for highly technical users (e.g. data scientists)?
Key Takeaways
1. Large data lake projects are complex environments that will benefit from a virtual ‘consumption’ layer.
2. In most cases, not all of the data is going to be in the data lake, so data lake data will need integrating with non-lake data.
3. Data virtualization provides a data delivery layer that simplifies and accelerates data lake access.
4. It provides the governance, management, and security capabilities required for a successful data lake implementation.