2. Whoami
• Database and Big Data Architect (Hadoop, Data Science and other cool topics)
• Former Developer and Consultant
• Owner@Premiseo: Data Management on Premises and in the
Cloud
• Blogger since 2004
• http://laurent-leturgez.com
• Twitter: @lleturgez
3. What’s on the menu?
• What is a Datalake?
• Keys to architect a Datalake
• Design, Security
• Data movement, Data Processing
• Discovery
• Solutions available
• Example
• Datalake Implementation driven by IoT
4. What is a Datalake?
• Repository of data stored in natural format
• Single Store of Enterprise data
• Raw Data
• Transformed Data: Reports, DataViz, Results (AI, ML …)
• Data Structure:
• Structured Data: Rows, Columns, Relational Data
• Semi-Structured Data: CSV, XML, JSON, log files
• Unstructured Data: Mails, Documents, Binaries (Images, Videos)
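The three structure classes above differ in how much a program can rely on a schema when reading them. A minimal sketch (the sample records are illustrative, not from the deck):

```python
import csv, io, json

# Semi-structured: CSV — columns exist, but no enforced schema or types
csv_text = "id,name\n1,Alice\n2,Bob\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON — nested, self-describing fields
record = json.loads('{"id": 1, "tags": ["sensor", "truck"]}')

# Unstructured: an opaque binary payload (e.g. an image) — no fields at all,
# only whatever metadata you attach externally
blob = b"\x89PNG..."

print(rows[0]["name"])   # Alice
print(record["tags"])    # ['sensor', 'truck']
```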
5. What is a Datalake?
• Features
• Data are usually integrated unprocessed
• Processed data can be kept in the Datalake
• Data are kept … ready to be transformed
• Data are saved as long as possible
• A Datalake is
• Organized
• Managed
6. What is a Datalake?
• A Datalake is not a data warehouse
Source: martinfowler.com
7. Keys to architect a Datalake
• A well-thought-out design
• Vital for
• Success
• Discovery efficiency
• ETL development effort
• Coupled with security and business processes
8. Keys to architect a Datalake
• A well-thought-out design … example
• Operational Areas
• Raw Area
• Data landing zone in native raw format
• Data are kept indefinitely in this area
• Data Tagging
• Folder Structure organized by Source, Dataset, Date etc.
• Staging Area
• Data Preparation Area: decompression, cleansing, aggregation
• Data Quality Management is usually performed here
• Hub Area
• Trusted layer of data
• Data is ready for analytics, organized functionally
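The folder structure by source, dataset and date mentioned for the raw area can be sketched as a simple path convention (the `/raw/<source>/<dataset>/<yyyy>/<mm>/<dd>/` layout is one illustrative choice, not a standard):

```python
from datetime import date

def raw_area_path(source: str, dataset: str, ingest_date: date) -> str:
    # Raw-area layout: /raw/<source>/<dataset>/<yyyy>/<mm>/<dd>/
    # Date partitioning keeps landed files easy to locate and to expire.
    return f"/raw/{source}/{dataset}/{ingest_date:%Y/%m/%d}/"

print(raw_area_path("erp", "parts", date(2019, 5, 17)))
# /raw/erp/parts/2019/05/17/
```

Staging and hub areas would use the same convention under `/staging/` and `/hub/` prefixes, so a dataset’s lineage is visible from its path alone.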
9. Keys to architect a Datalake
• A well-thought-out design … example
• (Extra) Supporting Areas
• Master Data Area
• Customer, Products, Financial Data
• Used by Analytics
• Exploratory Area
• Playground for Data Scientists and Analysts
• Temporary Area
• Testing data decompression
• Single point of data storage before moving data across the network
10. Keys to architect a Datalake
• Security
• Data Access Control
• By User
• By Application
• ETL software
• Analytics
• …
• By Operational zone
• By Source
Key Point: IAM Integration
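Access control by operational zone can be sketched as a minimal role-to-zone mapping (the role names and the zone list below are illustrative; in practice this sits in the IAM system, not in application code):

```python
# Hypothetical zone-level ACL: which roles may read each operational area
ZONE_ACL = {
    "raw":     {"etl"},                              # landing zone: ETL only
    "staging": {"etl", "data-quality"},              # preparation in progress
    "hub":     {"etl", "analytics", "data-science"}, # trusted, consumable data
}

def can_read(principal_roles: set, zone: str) -> bool:
    # A principal may read a zone if any of its roles is granted on it
    return bool(principal_roles & ZONE_ACL.get(zone, set()))

print(can_read({"analytics"}, "hub"))  # True
print(can_read({"analytics"}, "raw"))  # False
```

The same table could be keyed by source instead of (or in addition to) zone, matching the per-source access control on the slide.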
11. Keys to architect a Datalake
• Security
• Data Security
• Data Lake Management (Role Control)
• Data Resilience
• Disaster recovery
• Backup / Restore
• SLA: Availability, RTO, RPO
• Data Encryption
• At rest
• In transit
12. Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as central point for
• Data Ingestion
• Data Processing
13. Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as central point
• Data Ingestion
• Tools / ETL
• Metadata strategy should be in place (Data Catalog for tagging)
• Data Format
• Naming convention for files/directories: ingestion date, format, source etc.
• Batch or real time
• Many small files or few big files
• Data Partitioning for maximum query and processing performance
• Cloud or on-prem?
• Network issues, hybrid Cloud considerations
• Data Processing
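The “many small files or few big files” question usually resolves in favor of fewer, larger files, since each file carries listing and metadata overhead. A minimal compaction sketch (file names and the JSON-lines format are illustrative):

```python
import json, pathlib, tempfile

def compact(small_files, out_path):
    """Merge many small JSON-lines files into one larger file
    (fewer objects means less listing/metadata overhead at query time)."""
    with open(out_path, "w") as out:
        for f in small_files:
            out.write(pathlib.Path(f).read_text())

# Demo: three tiny per-event files compacted into one
tmp = pathlib.Path(tempfile.mkdtemp())
parts = []
for i in range(3):
    p = tmp / f"part-{i}.jsonl"
    p.write_text(json.dumps({"event": i}) + "\n")
    parts.append(p)
compact(parts, tmp / "compacted.jsonl")

print(sum(1 for _ in open(tmp / "compacted.jsonl")))  # 3
```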
14. Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as central point
• Data Ingestion
• Data Processing
• Tools
• Hadoop (on Prem / Cloud)
• Legacy database systems (SQL Server PolyBase, Oracle Connector for Hadoop, AWS Redshift Spectrum / Athena, etc.)
• Analytics, DataViz and ML
• Databricks, Power BI, SAS, Qlik, etc.
• Data Colocation
• Data Format
• Compressed / Uncompressed
• Column oriented
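Why column orientation matters for compression can be shown without any columnar library: laying similar values next to each other gives the compressor more repetition to exploit. A toy comparison (the sensor records are fabricated for illustration):

```python
import gzip, json

# Hypothetical sensor records: 1000 rows with mostly repetitive values
rows = [{"part_id": i, "temp": 20.0, "status": "OK"} for i in range(1000)]

# Row-oriented layout: one JSON record per line, keys repeated every row
row_bytes = "\n".join(json.dumps(r) for r in rows).encode()

# Column-oriented layout: one list per field — similar values sit together,
# which is what real columnar formats (Parquet, ORC) exploit
cols = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes = json.dumps(cols).encode()

print("row layout:", len(row_bytes), "->", len(gzip.compress(row_bytes)))
print("col layout:", len(col_bytes), "->", len(gzip.compress(col_bytes)))
```

This is only a gzip-over-JSON toy; real columnar formats add typed encodings, per-column statistics, and the ability to read a single column without touching the rest.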
15. Keys to architect a Datalake
• Orchestration
• Cloud Automation or Job Automation?
• Batch or real time
• Batch automation
• Monitoring
• Data volume
• Real Time (Usually used for IoT)
• How is the pipeline built?
• Event-based or not?
• Monitoring
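An event-based real-time pipeline, as used for IoT, can be sketched as a per-message handler (the handler name, payload fields, and path layout below are hypothetical; in a cloud setup this would be a function triggered by the IoT hub, with the final write going to object storage):

```python
import json
from datetime import datetime, timezone

def on_sensor_event(event: dict) -> str:
    """Hypothetical handler invoked once per incoming message:
    validate, timestamp, and route the record to the raw area."""
    assert "device_id" in event and "reading" in event  # minimal validation
    ts = datetime.now(timezone.utc)
    path = f"/raw/iot/sensors/{ts:%Y/%m/%d}/{event['device_id']}.jsonl"
    line = json.dumps({**event, "ingested_at": ts.isoformat()})
    # In production: append `line` to the object at `path`; here we just
    # return the routing decision so it can be monitored/tested.
    return path

print(on_sensor_event({"device_id": "truck-042", "reading": 87.5}))
```

Monitoring then reduces to counting invocations, failures, and lag per device or per path prefix.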
16. Keys to architect a Datalake
• Discovery
• Tagging and Metadata Management: similar … but different
• Metadata management:
• Data about data: creation and modification date, source, format, etc.
• Traditional metadata: source, connection string, data type, length, versions, etc.
• Modern metadata: embedded in files (Avro, for example) or stored in a database
• Advanced metadata: automated processing of metadata
• Tagging
• A set of tags used to understand/describe datasets in the datalake
• Usually stored in a catalog, a key-value database, or encoded in naming conventions
• Key points: when was the data tagged? Who owns the tagging system?
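A tag catalog stored as a key-value structure can be sketched in a few lines (the field names and the in-memory dict are illustrative; a real deployment would use a KV store or a managed data catalog):

```python
from datetime import date

# Minimal in-memory tag catalog, keyed by dataset path
catalog = {}

def tag_dataset(path, *, source, owner, tagged_on, **extra):
    # Record who owns the tags and when the data was tagged —
    # the two governance questions called out on the slide
    catalog[path] = {"source": source, "owner": owner,
                     "tagged_on": tagged_on.isoformat(), **extra}

tag_dataset("/raw/erp/parts/2019/05/17/",
            source="erp", owner="data-eng",
            tagged_on=date(2019, 5, 17), format="csv")

print(catalog["/raw/erp/parts/2019/05/17/"]["owner"])  # data-eng
```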
17. Solutions available
• On-prem:
• Hadoop / HDFS
• Cloud
• AWS : S3 Buckets
• Azure : Azure Datalake Store Gen1/Gen2, Storage Accounts
• GCP: Google Cloud Storage
• Oracle Cloud Infrastructure: Object Storage
18. Implementation
• Example : Solution
• Customer: industrial truck manufacturer
• Project: parts-failure prediction
• Sensors are embedded in trucks
• Data collection for parts health
• Data is integrated in real time into the Datalake
• Legacy data is integrated into the datalake (batch mode)
• Parts-related data (mostly coming from ERPs): serial numbers, suppliers, purchases, etc.
• Predictive algorithms are designed to replace parts before they break
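The part-replacement decision can be illustrated with a toy scoring function (entirely fabricated thresholds and weights; the real project would train a model on the sensor history accumulated in the datalake):

```python
def failure_risk(vibration_g: float, temp_c: float,
                 hours_since_service: float) -> float:
    """Toy risk score in [0, 1] — illustrative only, not a trained model."""
    score = 0.0
    if vibration_g > 1.5:          # abnormal vibration
        score += 0.4
    if temp_c > 95:                # overheating
        score += 0.3
    # wear grows with hours since last service, capped at full weight
    score += min(hours_since_service / 10_000, 1.0) * 0.3
    return score

# Flag parts whose score crosses a replacement threshold
print(failure_risk(1.8, 102, 6000) > 0.5)  # True
```

The point of the architecture is that raw sensor streams, staged aggregates, and the trusted hub data all feed such a model from one place.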
19. Implementation
• Example: Solution
• Azure Datalake Store / Storage Accounts closely integrated with MS SQL
Databases
• Why not on-prem?
• Infrastructure costs
• Uncertain data-volume predictions
• Hadoop management
20. Implementation
• Example: Solution
• Why Azure ?
• Long-time Microsoft customer
• Many services already in use (legacy databases, MS SQL DWH, Power BI, etc.)
• Active Directory integration: Security, ACLs, …
• Batch Integration by Talend
• Real-time integration by Azure products (IoT Hub + Azure Functions)
• Close integration with Databricks for Analytics and Data Processing
21. Conclusion
• Datalakes are now central components for enterprises
• Without …
• Organized Data
• Managed Data (Security, design etc.)
• High volume of Data
• No powerful AI or ML algorithms
• No powerful Analytic processes