Architecting a Datalake
Laurent Léturgez – Sep 2019
Big Data Meetup - Lille
Whoami
• Database and BigData Architect (Hadoop, Data Science and other cool topics)
• Former Developer and Consultant
• Owner@Premiseo: Data Management on Premises and in the Cloud
• Blogger since 2004
• http://laurent-leturgez.com
• Twitter : @lleturgez
What’s on the menu ?
• What is a Datalake ?
• Keys to architect a Datalake
• Design, Security
• Data movement, Data Processing
• Discovery
• Solutions available
• Example
• Datalake Implementation driven by IoT
What is a Datalake ?
• Repository of data stored in natural format
• Single Store of Enterprise data
• Raw Data
• Transformed Data : Reports, DataViz, Results (AI, ML …)
• Data Structure:
• Structured Data : Row, Columns, Relational Data
• Semi Structured Data: CSV, XML, JSON, log files
• Unstructured Data: Mails, Documents, Binaries (Images, Videos)
What is a Datalake ?
• Features
• Data are usually integrated unprocessed
• Processed data can be kept in the Datalake
• Data are kept … ready to be transformed
• Data are saved as long as possible
• A Datalake is
• Organized
• Managed
What is a Datalake ?
• A Datalake is not a datawarehouse
Source: martinfowler.com
Keys to architect a Datalake
• A well-thought-out design
• Vital for
• Success
• Discovery efficiency
• ETL development effort
• Coupled with Security and business process
Keys to architect a Datalake
• A well-thought-out design … example
• Operational Areas
• Raw Area
• Data landing zone in native raw format
• Data are kept indefinitely in this area
• Data Tagging
• Folder Structure organized by Source, Dataset, Date etc.
• Staging Area
• Data Preparation Area : Decompression, cleansing, aggregation
• Data Quality Management is usually done here
• Hub Area
• Trusted layer of data
• Data is ready for analytics, organized functionally
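The folder structure described for the operational areas can be sketched as a path convention. This is only an illustration: the zone names (`raw`, `staging`, `hub`), the `source/dataset/yyyy/mm/dd` order and the `landing_path` helper are assumptions, since the slides name the areas but do not prescribe an exact layout.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zone names; the talk only names the areas (Raw, Staging, Hub).
ZONES = ("raw", "staging", "hub")


def landing_path(zone: str, source: str, dataset: str, day: date, filename: str) -> PurePosixPath:
    """Build a folder path organized by zone, source, dataset and date."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return PurePosixPath(
        zone, source, dataset,
        f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}",
        filename,
    )


example = landing_path("raw", "erp", "purchases", date(2019, 9, 12), "purchases.csv")
print(example)  # raw/erp/purchases/2019/09/12/purchases.csv
```

Keeping the date as separate folders makes it cheap to list or expire one day of raw data without scanning the whole source.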
Keys to architect a Datalake
• A well-thought-out design … example
• (Extra) Supported Area
• Master Data Area
• Customer, Products, Financial Data
• Used by Analytics
• Exploratory Area
• Playground for Data Scientists and Analysts
• Temporary Area
• Testing Data decompression
• Single point of data storage before a move across the network
Keys to architect a Datalake
• Security
• Data Access Control
• By User
• By Application
• ETL software
• Analytics
• …
• By Operational zone
• By Source
Key Point: IAM Integration
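Access control by user or application, by operational zone and by source can be pictured as a list of grants checked on every request. The sketch below is a toy model under stated assumptions: the `Grant` structure, the principal names and the `is_allowed` helper are hypothetical, and a real deployment would delegate this to the platform's IAM service rather than to in-process code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    principal: str      # user or application name
    zone: str           # operational zone, e.g. "raw" or "hub"
    source: str         # data source, or "*" for any source
    actions: frozenset  # allowed actions, e.g. {"read", "write"}


# Hypothetical grants, for illustration only.
GRANTS = [
    Grant("etl-tool", "raw", "*", frozenset({"read", "write"})),
    Grant("etl-tool", "hub", "*", frozenset({"write"})),
    Grant("analyst-alice", "hub", "erp", frozenset({"read"})),
]


def is_allowed(principal, zone, source, action, grants=GRANTS):
    """Return True if any grant covers this principal, zone, source and action."""
    return any(
        g.principal == principal
        and g.zone == zone
        and g.source in ("*", source)
        and action in g.actions
        for g in grants
    )


print(is_allowed("analyst-alice", "hub", "erp", "read"))  # True
print(is_allowed("analyst-alice", "raw", "erp", "read"))  # False
```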
Keys to architect a Datalake
• Security
• Data Security
• Data Lake Management (Role Control)
• Data Resilience
• Disaster recovery
• Backup / Restore
• SLA: Availability, RTO, RPO
• Data Encryption
• At rest
• In transit
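The RPO (recovery point objective) in the SLA bullet is the maximum amount of recent data the business accepts to lose. As a rough sketch, assuming data can only be restored from discrete backups, the worst-case loss is the largest gap between two consecutive backups; the helper names below are hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backups, now):
    """Largest gap between consecutive backups, including the gap up to `now`."""
    if not backups:
        raise ValueError("at least one backup is required")
    points = sorted(backups) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backups, now, rpo):
    """True when the worst-case loss stays within the agreed RPO."""
    return worst_case_data_loss(backups, now) <= rpo


backups = [datetime(2019, 9, 1, 0), datetime(2019, 9, 1, 6), datetime(2019, 9, 1, 18)]
now = datetime(2019, 9, 1, 20)
print(worst_case_data_loss(backups, now))            # 12:00:00
print(meets_rpo(backups, now, timedelta(hours=8)))   # False
```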
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as the central point for
• Data Ingestion
• Data Processing
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as the central point for
• Data Ingestion
• Tools / ETL
• Metadata strategy should be in place (Data Catalog for tagging)
• Data Format
• Naming convention for files/directories: ingestion date, format, source etc.
• Batch or real time
• Many small files or few big files
• Data Partitioning → Maximum query and processing performance
• Cloud or OnPrem ?
• Network issues, hybrid Cloud considerations
• Data Processing
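To illustrate why partitioning and naming conventions help query performance, here is a small sketch of partition pruning. It assumes `key=value` folder names (a convention popularized by Hive, not something the slides mandate); the file list and both helpers are hypothetical. The point is that files can be skipped from their path alone, without opening them.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Extract key=value folder names from a path into a dict."""
    folders = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(f.split("=", 1) for f in folders if "=" in f)


def prune(paths, **wanted):
    """Keep only files whose partition folders match every requested value."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing.
FILES = [
    "hub/telemetry/source=truck/date=2019-09-11/part-0.parquet",
    "hub/telemetry/source=truck/date=2019-09-12/part-0.parquet",
    "hub/telemetry/source=erp/date=2019-09-12/part-0.parquet",
]

print(prune(FILES, source="truck", date="2019-09-12"))
```

Only the second file survives the filter, so a query restricted to one source and one day reads a third of the data in this toy listing.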
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as the central point for
• Data Ingestion
• Data Processing
• Tools
• Hadoop (on Prem / Cloud)
• Legacy database systems (SQL Server PolyBase, Oracle Connector for Hadoop, Amazon Redshift Spectrum / Athena etc.)
• Analytics, DataViz and ML
• Databricks, Power BI, SAS, Qlik etc.
• Data Colocation
• Data Format
• Compressed / Uncompressed
• Column oriented
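The "column oriented" and "compressed" bullets can be illustrated with a toy layout: values are pivoted from rows into columns, and each column is compressed on its own, so a query that needs one column decompresses only that column. This sketch uses JSON and `zlib` purely for illustration; real column formats such as Parquet or ORC are far more elaborate, and the helper names are hypothetical.

```python
import json
import zlib


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    names = rows[0].keys()
    return {name: [row[name] for row in rows] for name in names}


def write_columnar(rows):
    """Compress each column independently."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in to_columns(rows).items()
    }


def read_column(store, name):
    """Decompress and decode a single column, leaving the others untouched."""
    return json.loads(zlib.decompress(store[name]).decode("utf-8"))


rows = [
    {"truck": "T1", "temperature_c": 81.5},
    {"truck": "T2", "temperature_c": 79.0},
]
store = write_columnar(rows)
print(read_column(store, "temperature_c"))  # [81.5, 79.0]
```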
Keys to architect a Datalake
• Orchestration
• Cloud Automation or Job Automation ?
• Batch or real time
• Batch automation
• Monitoring
• Data volume
• Real Time (Usually used for IoT)
• How is the pipeline built ?
• Event based or not ?
• Monitoring
Keys to architect a Datalake
• Discovery
• Tagging and Metadata management : Similar … but different
• MetaData management :
• Data about data : creation and modification date, source, format etc.
• Traditional metadata: source, connection string, data type, length, versions etc.
• Modern metadata: included in files (Avro, for example) or in a database
• Advanced metadata: automated processing of metadata
• Tagging
• Set of tags used to understand and describe datasets in the datalake
• Usually stored in a catalog, in a key-value (KV) database, or encoded in naming conventions
• Key points: When was the data tagged ? Who owns the tagging system ?
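A tag catalog kept in a key-value store can be sketched as a dictionary keyed by dataset path. The `TagCatalog` class, its fields and the example paths are all hypothetical; the entry records who tagged the dataset and when, which are the two key points raised above.

```python
from datetime import datetime, timezone


class TagCatalog:
    """In-memory stand-in for a key-value tag store (dataset path -> entry)."""

    def __init__(self):
        self._entries = {}

    def tag(self, dataset, tags, tagged_by, tagged_at=None):
        """Record the tags of a dataset, plus who tagged it and when."""
        self._entries[dataset] = {
            "tags": set(tags),
            "tagged_by": tagged_by,
            "tagged_at": tagged_at or datetime.now(timezone.utc),
        }

    def find(self, *tags):
        """Return the datasets carrying every requested tag, sorted by path."""
        wanted = set(tags)
        return sorted(d for d, e in self._entries.items() if wanted <= e["tags"])

    def entry(self, dataset):
        return self._entries[dataset]


catalog = TagCatalog()
catalog.tag("raw/trucks/telemetry", {"iot", "sensor"}, tagged_by="data-office")
catalog.tag("raw/erp/purchases", {"erp", "finance"}, tagged_by="data-office")
print(catalog.find("iot"))  # ['raw/trucks/telemetry']
```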
Solutions available
• Solutions available
• On Prem :
• Hadoop / HDFS
• Cloud
• AWS : S3 Buckets
• Azure : Azure Datalake Store Gen1/Gen2, Storage Accounts
• GCP: Google Cloud Storage
• Oracle Cloud Infrastructure: Object Storage
Implementation
• Example : Solution
• Customer : Industry, truck maker
• Project : Parts failure prediction
• Sensors are embedded in trucks
• Data collection for parts health
• Data are integrated in real time into the Datalake
• Legacy data are integrated into the datalake (batch mode)
• Parts related data (mostly coming from ERPs) : Serial number, provider, purchases etc.
• Predictive algorithms are designed so that parts can be replaced before they break
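The use case combines two flows: sensor readings arriving in real time and parts reference data loaded in batch from the ERPs. A minimal sketch of the join between them is shown below, matched on serial number. The field names, sample values and the `enrich` helper are invented for illustration; the slides do not describe the actual schema or the predictive model, which is not attempted here.

```python
# Hypothetical ERP extract keyed by part serial number (loaded in batch mode).
PARTS = {
    "SN-001": {"provider": "Acme", "part": "brake pad"},
    "SN-002": {"provider": "Globex", "part": "fuel pump"},
}


def enrich(reading, parts=PARTS):
    """Attach ERP attributes to one sensor reading, matched on serial number."""
    reference = parts.get(reading["serial_number"], {})
    return {
        **reading,
        "provider": reference.get("provider"),  # None when the part is unknown
        "part": reference.get("part"),
    }


# Hypothetical sensor readings (arriving in real time).
readings = [
    {"serial_number": "SN-001", "temperature_c": 81.5},
    {"serial_number": "SN-999", "temperature_c": 40.0},  # not in the ERP extract
]
enriched = [enrich(r) for r in readings]
print(enriched[0]["provider"])  # Acme
```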
Implementation
• Example: Solution
• Azure Datalake Store / Storage Accounts closely integrated with MS SQL Databases
• Why not on Prem ?
• Infrastructure costs
• Data volume hard to predict
• Hadoop management
Implementation
• Example: Solution
• Why Azure ?
• Long-time Microsoft customer
• Many services already used (Legacy databases: MS SQL DWH, Power BI etc.)
• Active Directory Integration: Security, ACL and
• Batch Integration by Talend
• Real Time Integration by Azure Products (IoT Hub + Azure Functions)
• Close integration with Databricks for Analytics and Data Processing
Conclusion
• Datalakes are now central components for enterprises
• Without …
• Organized Data
• Managed Data (Security, design etc.)
• High volume of Data
• No powerful AI or ML algorithms
• No powerful Analytic processes
Questions ?

