SlideShare una empresa de Scribd logo
1 de 20
Descargar para leer sin conexión
Architecture Data Lake
Wael GHAZOUANI
Paris June 6th 2020
Copyright © GHAZOUANI Wael 2020
Introduction
•What is Big Data?
What Data Lake ?
•What is Hadoop?
•What is Hadoop Data Lake?
•When to use Hadoop?
•Getting Started with Big Data
Copyright © GHAZOUANI Wael 2020
Introduction
• To store a large amount of data in a
repository that will be the main, if
not the only, source of information
for a number of applications. In Big
Data lingo, such a repository is called
a Data Lake
Copyright © GHAZOUANI Wael 2020
What is Big Data?
• •A collection of data sets so
large and complex that it becomes
difficult to process using on-hand
database management tools or
traditional data processing
applications
• •Due to its technical nature, the
same challenges arise in Analytics
at much lower volumes than what
is traditionally considered Big
Data.
Copyright © GHAZOUANI Wael 2020
Data Lake
• This lake of data will contain very varied types of data: log files,
images, binary files, etc. the data will be read by a large number of
varied applications. For these applications, data will be the only
source of truth. Indeed, as this data will be large, there will be no
question of duplicating it. The data contained in the data lake is
therefore precious. For this reason, the data must never be modified
or deleted. The data lake will contain datasets which will serve as a
reference point
Copyright © GHAZOUANI Wael 2020
Copyright © GHAZOUANI Wael 2020
Why Now? Open APIs, Platforms, Ecosystems
Copyright © GHAZOUANI Wael 2020
What is Hadoop?
•Apache Hadoop is an open source software platform for distributed
storage and distributed processing of very large data sets on computer
clusters built from commodity hardware.
Hadoop services provide for data storage, data processing, data access,
data governance, security, and operations
Copyright © GHAZOUANI Wael 2020
Why Hadoop?
•Scalability issue when running jobs processing terabytes of data
•Could span dozen days just to read that amount of data on 1 computer
•Need lots of cheap computers
•To Fix speed problem
•But lead to reliability problems
•In large clusters, computers fail every day
•Cluster size is not fixed
•Need common infrastructure
•Must be efficient and reliable
Copyright © GHAZOUANI Wael 2020
Hadoop Platform
Copyright © GHAZOUANI Wael 2020
Copyright © GHAZOUANI Wael 2020
•Unified distributed fault-tolerant and scalable storage
•Cost Effective.
•Hadoop can be 10 to 100 times less expensive to deploy than traditional
data warehouse.
•Just-in-time ad’hoc dataset creation.
•Extract and place data into an Hadoop-based repository without first
transforming the data (as you would do for a traditional EDW).
•All jobs are created in an ad hoc manner, with little or no required
preparation work.
•Labor intensive processes of cleaning up data and developing schema are
deferred until a clear business need is identified.
•Data governance for structured and unstructured “raw” data
Hadoop to the Rescue to offer
Copyright © GHAZOUANI Wael 2020
Copyright © GHAZOUANI Wael 2020
Copyright © GHAZOUANI Wael 2020
Hadoop Data Lake vs. Enterprise
Data Warehouse
Copyright © GHAZOUANI Wael 2020
When to Use Hadoop?
Copyright © GHAZOUANI Wael 2020
Copyright © GHAZOUANI Wael 2020
How to
Implement
Hadoop?
McKinsey Seven
Steps Approach
Copyright © GHAZOUANI Wael 2020
How to Implement Hadoop?
Dataiku Method supported by a tool
http://www.dataiku.com/dss/
Copyright © GHAZOUANI Wael 2020
Copyright © GHAZOUANI Wael 2020
https://www.linkedin.com/in/ghazouani-wael-251270115/
Waelghazouani.soltani@gmail.com

Más contenido relacionado

La actualidad más candente

Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
pcherukumalla
 

La actualidad más candente (20)

Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricUsing a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Traditional data warehouse vs data lake
Traditional data warehouse vs data lakeTraditional data warehouse vs data lake
Traditional data warehouse vs data lake
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 

Similar a Data lake

INTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOPINTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOP
Krishna Sujeer
 

Similar a Data lake (20)

Big Data Telecom
Big Data TelecomBig Data Telecom
Big Data Telecom
 
Big data edel
Big data edelBig data edel
Big data edel
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
 
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
INTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOPINTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOP
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
 
Making the Case for Hadoop in a Large Enterprise-British Airways
Making the Case for Hadoop in a Large Enterprise-British AirwaysMaking the Case for Hadoop in a Large Enterprise-British Airways
Making the Case for Hadoop in a Large Enterprise-British Airways
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Big Data
Big DataBig Data
Big Data
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big Data
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 

Último

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Último (20)

Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 

Data lake

  • 1. Architecture Data Lake Wael GHAZOUANI Paris June 6th 2020 Copyright © GHAZOUANI Wael 2020
  • 2. Introduction •What is Big Data? What Data Lake ? •What is Hadoop? •What is Hadoop Data Lake? •When to use Hadoop? •Getting Started with Big Data Copyright © GHAZOUANI Wael 2020
  • 3. Introduction • To store a large amount of data in a repository that will be the main, if not the only, source of information for a number of applications. In Big Data lingo, such a repository is called a Data Lake Copyright © GHAZOUANI Wael 2020
  • 4. What is Big Data? • •A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications • •Due to its technical nature, the same challenges arise in Analytics at much lower volumes than what is traditionally considered Big Data. Copyright © GHAZOUANI Wael 2020
  • 5. Data Lake • This lake of data will contain very varied types of data: log files, images, binary files, etc. the data will be read by a large number of varied applications. For these applications, data will be the only source of truth. Indeed, as this data will be large, there will be no question of duplicating it. The data contained in the data lake is therefore precious. For this reason, the data must never be modified or deleted. The data lake will contain datasets which will serve as a reference point Copyright © GHAZOUANI Wael 2020
  • 7. Why Now? Open APIs, Platforms, Ecosystems Copyright © GHAZOUANI Wael 2020
  • 8. What is Hadoop? •Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations Copyright © GHAZOUANI Wael 2020
  • 9. Why Hadoop? •Scalability issue when running jobs processing terabytes of data •Could span dozen days just to read that amount of data on 1 computer •Need lots of cheap computers •To Fix speed problem •But lead to reliability problems •In large clusters, computers fail every day •Cluster size is not fixed •Need common infrastructure •Must be efficient and reliable Copyright © GHAZOUANI Wael 2020
  • 10. Hadoop Platform Copyright © GHAZOUANI Wael 2020
  • 12. •Unified distributed fault-tolerant and scalable storage •Cost Effective. •Hadoop can be 10 to 100 times less expensive to deploy than traditional data warehouse. •Just-in-time ad’hoc dataset creation. •Extract and place data into an Hadoop-based repository without first transforming the data (as you would do for a traditional EDW). •All jobs are created in an ad hoc manner, with little or no required preparation work. •Labor intensive processes of cleaning up data and developing schema are deferred until a clear business need is identified. •Data governance for structured and unstructured “raw” data Hadoop to the Rescue to offer Copyright © GHAZOUANI Wael 2020
  • 14. Copyright © GHAZOUANI Wael 2020 Hadoop Data Lake vs. Enterprise Data Warehouse
  • 15. Copyright © GHAZOUANI Wael 2020 When to Use Hadoop?
  • 17. Copyright © GHAZOUANI Wael 2020 How to Implement Hadoop? McKinsey Seven Steps Approach
  • 18. Copyright © GHAZOUANI Wael 2020 How to Implement Hadoop? Dataiku Method supported by a tool http://www.dataiku.com/dss/
  • 20. Copyright © GHAZOUANI Wael 2020 https://www.linkedin.com/in/ghazouani-wael-251270115/ Waelghazouani.soltani@gmail.com