This document discusses how a big data fabric can enable machine learning and artificial intelligence by providing a flexible and agile way for users to access and analyze large amounts of data from various sources. It explains that a big data fabric, powered by data virtualization, allows organizations to build a modern data ecosystem that provides governed access to both structured and unstructured data stored in different systems. This helps users develop new production analytics and insights. The document also provides an example of how Logitech used a big data fabric and data virtualization to improve their customer analytics.
The Big Data Fabric as an Enabler for Machine Learning & AI
1. The Big Data Fabric as an Enabler for ML & AI
Thomas Niewel
Technical Sales Director DACH, Denodo
Daniel Rapp
Marketing Manager Central Europe, Denodo
2. Denodo – The Leader in Data Virtualization
DENODO OFFICES, CUSTOMERS, PARTNERS
Headquartered in Palo Alto, CA, with a global presence throughout North America, EMEA, APAC, and Latin America.
LEADERSHIP
▪ Longest continuous focus on data virtualization – since 1999
▪ Leader in the 2020 Forrester Wave – Enterprise Data Fabric
▪ Challenger in the 2019 Gartner Magic Quadrant for Data Integration Tools
▪ Winner of numerous awards
CUSTOMERS
800+ customers, including many F500 and G2000 companies across every major industry, have gained significant business agility and ROI.
FINANCIALS
Backed by a $4B+ private equity firm. 50%+ annual growth; profitable.
5. The Promise of the Data Lake
The promise of a Data Lake is a place where you can store data in its raw form, unencumbered by validation, mastering, or quality processes, allowing consumers to choose which data is of value to them, with a quick time to market.
6. "…Data lakes lack semantic consistency and governed metadata. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls."
8. Data Lakes and Big Data Fabric
Data Lake (System of Insight) – Information Management and Governance
1. Multiple repositories, organized based on source and usage; hosted on the appropriate data platform for the workload
2. Catalog of data, ownership, meaning and permitted usage
3. No direct access to repositories
4. Curation of all data to define meaning and classifications
5. Business-led information governance and management
6. Active monitoring and management of data
7. Data-centric security
8. Access to raw data to develop new production analytics
9. Moderated, view-based self-service access to data and analytics for Line of Business
11. Data Virtualization Reference Architecture
(Figure: layered reference architecture, with the Data Virtualization layer in the middle)
• Data Insight (consumption): Data Mining, Dashboards, Data Discovery and Self-Service, Reporting
• Data Virtualization: Data Access, Security, Governance and Metadata Management, Data Services
• Data Ingestion: Real Time/Streaming, Batch, CDC, Metadata Enrichment, Search and Index
• Storage and Compute: RDBMS, Big Data Lakes, NoSQL, IMDG
• Structured sources: Data Warehouse, RDBMS, Excel, Flat Files, XML
• Unstructured sources: Email, Sensors (IIoT), Social Media, RFID, Wearables
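As a minimal sketch of the idea behind the virtualization layer in this architecture – a single access point that hides where data physically lives and enforces security in one place – consider the following. All names (`VirtualLayer`, `Source`, the view and role names) are hypothetical illustrations, not the Denodo API, and the "sources" are in-memory stand-ins for real systems:

```python
# Sketch of a data-virtualization access layer (hypothetical names,
# in-memory stand-ins for real sources such as an EDW or a data lake).

class Source:
    """A physical source: here just a named, in-memory list of dict rows."""
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def scan(self):
        return list(self.rows)

class VirtualLayer:
    """Maps logical view names to physical sources, with per-view access control."""
    def __init__(self):
        self.views = {}  # view name -> (source, allowed roles)

    def register(self, view, source, allowed_roles):
        self.views[view] = (source, set(allowed_roles))

    def query(self, view, role):
        source, allowed = self.views[view]
        if role not in allowed:  # security enforced once, not per source
            raise PermissionError(f"role {role!r} may not read {view!r}")
        return source.scan()

# Usage: consumers query view names; the backend is irrelevant to them.
layer = VirtualLayer()
layer.register("sales", Source("edw", [{"amount": 10}, {"amount": 5}]), ["analyst"])
layer.register("clicks", Source("lake", [{"url": "/home"}]), ["analyst", "ml"])

rows = layer.query("sales", role="analyst")
```

The point of the sketch is only the indirection: consumers bind to the logical view, so sources can move (e.g. from an on-premises warehouse to S3) without changing consumer code.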
12. Customer Case Study – Logitech
• Logitech needed a more agile (and cost-effective) analytics platform for customer analytics
• Moved analytics to AWS – utilized many different storage and compute capabilities on AWS
  • Redshift and Snowflake for the EDW, RDS for relational data, S3 for data ingestion
  • Spark, EMR, and NLP for analytics
• Used Data Virtualization as the data access layer
  • Users could access data irrespective of the storage or processing capability behind it
• The Data Virtualization layer provided data security and access permissions
  • Abstracted security away from the individual data sources
• Data governance was also provided by the Data Virtualization layer
  • Data lineage, metadata management, data catalog and discovery, etc.
17. What is the optimizer doing?

SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
  ON c.id = s.customer_id
GROUP BY c.state

Customer and Sales are in different sources (Sales ~300 M rows, Customer ~2 M rows, ~50 states in the result). What is the best execution plan?
• Option 1 – Naïve strategy: ship both tables to the virtualization layer, join, then group by state
• Option 2 – Temporary data movement: create a temp table for Customer in the Sales source, then join and group there
• Option 3 – Partial aggregation pushdown: push a partial GROUP BY customer_id down to the Sales source, join the much smaller partial result with Customer, then finish the GROUP BY state
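The partial-aggregation-pushdown idea can be sketched in a few lines. This is an illustrative, in-memory version of the rewrite, not Denodo's implementation: AVG decomposes into (sum, count), so the Sales source can return one partial row per customer instead of every sales row, and the final average by state is computed from those partials:

```python
# Sketch of partial aggregation pushdown for
#   SELECT c.state, AVG(s.amount) FROM customer c JOIN sales s
#   ON c.id = s.customer_id GROUP BY c.state
# Tiny in-memory data stands in for the 300 M-row Sales source.

from collections import defaultdict

sales = [  # (customer_id, amount)
    (1, 100.0), (1, 200.0), (2, 50.0), (3, 10.0), (3, 30.0),
]
customers = {1: "CA", 2: "NY", 3: "CA"}  # customer_id -> state

# Step 1 (pushed to the Sales source): partial GROUP BY customer_id,
# keeping (sum, count) so AVG can be finished later.
partial = defaultdict(lambda: [0.0, 0])
for cid, amount in sales:
    partial[cid][0] += amount
    partial[cid][1] += 1

# Step 2 (in the virtualization layer): join the small partial result
# with Customer and re-aggregate by state.
by_state = defaultdict(lambda: [0.0, 0])
for cid, (total, count) in partial.items():
    by_state[customers[cid]][0] += total
    by_state[customers[cid]][1] += count

avg_by_state = {state: total / count for state, (total, count) in by_state.items()}
# avg_by_state == {"CA": 85.0, "NY": 50.0}
```

The key property is that (sum, count) pairs combine associatively, which is why the aggregate can be split across sources; a plain AVG per customer could not be re-averaged correctly.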
18. Performance & Scalability in Big Data Ecosystems
• Cost-based optimization
• Query rewrite & operations pushdown
• Partition pruning (join, union)
• Automatic data movement
• Partial and full data caching
• Work-offloading to MPP systems
• Resource management
• … and many more …
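Cost-based optimization, the first item above, ultimately comes down to estimating a cost for each candidate plan and picking the cheapest. A minimal sketch, using rows-moved-over-the-network as the cost metric and the illustrative row counts from the optimizer example (the exact costing formulas are assumptions, not Denodo's model):

```python
# Sketch of cost-based plan selection: estimate rows shipped between
# sources per strategy, then pick the minimum. Numbers are illustrative
# (Sales ~300 M rows, Customer ~2 M rows, ~50 result rows).

plans = {
    # ship both tables whole to the virtualization layer
    "naive strategy": 300_000_000 + 2_000_000,
    # ship Customer into the Sales source, ship ~50 result rows back
    "temporary data movement": 2_000_000 + 50,
    # ship one partial (sum, count) row per customer from the Sales source
    "partial aggregation pushdown": 2_000_000,
}

best = min(plans, key=plans.get)
# best == "partial aggregation pushdown"
```

A real optimizer would weigh more than network transfer (source compute capacity, indexes, memory, statistics freshness), but the selection step has this shape.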
19. Summary
• Big Data initiatives can add deep analytical insight to help organizations find new opportunities or provide more efficient services
• When defining your Big Data initiative, think about how users will find and access the data
• A Big Data Fabric is key to making this easy – and to making the initiative successful
• Data Virtualization is the right technology for an agile and flexible Big Data Fabric