SlideShare una empresa de Scribd logo
1 de 74
Descargar para leer sin conexión
Specialties / Focus Areas / Passions:
• Performance Tuning & Troubleshooting
• Very Large Databases
• SQL Server Storage Engine
• HA/DR
• Cloud
@sqlbob
bob@bobpusateri.com
bobpusateri.com
bobpusateri
• What Is Azure Data Lake?
• Data Lake Storage
• Data Processing
• Design Considerations
• Security
• DBA Topics
• Cost
• Demos
About Your
Customers
About Your
Business
YOUR
BUSINESS
Valuable
To
You
Valuable To
Others
THE NEW
CURRENCY
https://tinyurl.com/yywejwmy
 A scalable platform for housing all types of data for analysis
 A repository for analyzing many sources of data in its native format
 TL;DR an extremely scalable storage system with helpful tools
 A tool that integrates with many services and sources
 For both input and output
Data Lake
 Performant – enables linear and infinite scale
 Scales on demand
 Supports varying access patterns
 Securable
 RBAC – Groups and Principals
 POSIX ACLs
 Manageable
 Scriptable APIs
 (Preferably REST APIs)
 “Bring your own engine”
 Hadoop
 Apache Spark
 Power BI
 SQL Data Warehouse
 Azure Data Lake Analytics
 etc.
 There is no compute here
 A great data tool to have in your toolbox
 May be doing this work already with SSIS & Archiving
 How much time do you spend writing packages?
OLTP
 Focus
 Operational Transactions
 Writes
 Scope
 Single database system
 Sample Objectives:
 Recall a customer’s orders
 Generate a report
Data Warehouse
 Focus
 Informational/Analytical
 Reads
 Scope
 Integrate multiple systems
 Sample Objectives:
 Identify poorest-selling products
 Compare margin per customer
 Volume
 Growing exponentially every year!
 Variety
 More than what fits neatly into a relational DB
 Structured data being augmented by unstructured data
 Velocity
 Frequency it needs to be processed. Is it a streaming source?
 Veracity
 Is the data trustworthy?
 Value
 Are we creating value from the above 4 things?
Structured: Relational Unstructured: Binary or plain text
Semi-structured: Non-relational
 WASB = Windows Azure Storage Blob
 Optimized for Unstructured Data
 Video, Images, Files
 Flat Structure
 No folder tree*
 Lowest Cost (Cheaper than ADLS)
 Tiered
 Can specify redundancy/performance/cost
 LRS – Locally Redundant Storage
 Multiple copies in single data center
 ZRS – Zone-Redundant Storage
 Multiple data centers within a zone
 GRS – Geo-Redundant Storage
 Replicates data to a data center in another region
 RA-GRS – Read-access Geo-Redundant Storage
 GRS, but can read from that remote region as well
 Hot
 Data that is accessed frequently
 Highest storage cost. Lowest access cost.
 Cool
 For less-frequently-accessed data. Minimum 30 day storage.
 Lower storage cost. Higher access cost.
 Archive
 Lowest storage cost. Highest access cost.
 Slow to retrieve (hours). Minimum 180 day storage.
 Comparable to Amazon Glacier
 ADLS = Azure Data Lake Storage
 Interopterable
 Works with all major MPP platforms
 Security
 Posix ACLs & AAD integration
 Performance
 Optimized for analytics workloads
 No file size limit
 Folder Structure
 A real hierarchical structure
 ABFS = Azure Blob File System
 AKA “ADLS Gen 2”
 Integration
 Supports most* MPP & authentication schemes
 Hierarchical Namespace
 Folder Structure & Atomic File Operations
 File System Tiering
 30% price premium over ADLS
Object Storage
(Azure Blob Storage)
• Flat storage space
• Data represented as blobs
• Highly scalable
• Cost-Effective
File System Storage
Azure Data Lake Storage Gen 1
• Hierarchical storage space
• Data is blocks in a file system
• Security at folder/file levels
• Atomic & metadata operations
Azure Data Lake Storage Gen 2
• Get the features of file system storage
• Get the cost/scale features of blob storage
or
 Text Files!
 Human readable
 Standardized
 Well supported
 .txt, .csv, etc.
 JSON
 Human readable
 Supports complex nesting
 Files can be large without compression
 Two-part file format
 Binary portion for data
 JSON-based schema
 Supports schema evolution
 Requires a custom viewer utility
Parquet
 From Cloudera/Twitter
Orc
 From Hortonworks/Facebook
 Both support compression and predicate pushdown
 Writes are slower in both due to columnar format
 Generally better for analytical workloads
 Test your workload!
 May be worth the effort if
 Querying data multiple times
 Need predicate pushdown or column elimination
 Can reduce/eliminate encoding issues with delimited formats
 Folders!
 Organization is key to success (with ADL and anything else..)
 Common Zones
 archive, raw, reference, curated, temp, stage
 Other Zones
 environment_dev, test, prod
 Sub-zones
 Subsystems, users
 Data here is in its original state from source system
 Sorted by source system and schema
 Immutable – the key rule for this zone
 Data is always appended here
 Can be partitioned by date
 Data here is in a transformed state
 Similar to DW / Datamart
 End users can access here
 Analytics is performed on this data
 Data is governed/blessed
 This zone is flexible based on usage & requirements
 Data Science Sandbox
 Datamarts
 Outbound data for integration, Etc.
 Helpful when incoming data is mutable
 Full/Incremental loads arrive constantly
 Data is batched together here for analytical purposes
 Result can be more efficient for MPP
 Data can be compared against raw to calculate deltas
 Files can be archived/vacuumed/deleted per needs
 Data outside of analytical windows
 May need to be kept for compliance purposes
 Can restrict access to this via ACLs and ACEs
 Generally the last stop before removal or tiering out
 ABFS supports tiering within containers
 Move to slower/cheaper storage and pay less!
 ADF is great at this
 Lots of tools you can orchestrate with it
 Typically you don’t move though, you transform & create new
 Databricks, Oozie
 MPP Engines
 AzCopy
U-SQL
 Cloud-based analytics platform based on Spark
Image: Microsoft
 Cloud-based EDW that uses MPP
 Formerly SQL PDW
 Uses relational tables / columnar storage
 The native language of Azure Data Lake Analytics (ADLA)
 A combination of SQL and C#
 Can be executed in the cloud, or locally
@dataset =
EXTRACT UserID INT,
StartDT DATETIME,
URL STRING
FROM @”/datafile.tsv”
USING Extractors.Tsv();
OUTPUT @dataset
TO @”/datafile_output.tsv”
USING Outputters.Tsv();
 An ETL/Query tool that runs on top of Hadoop
 Has its own query language – HiveQL or HQL
 Very similar to SQL
SELECT plate_number, COUNT(*)
FROM Tickets;
 Use Source and Date/Time into name
 Make Consistent
 This makes things easier, especially for self-service
 Example: ParkingTickets_20171031_2359.csv
 This helps make data “splitable”
 1 Table per file becomes inefficient as size increases
 Harder* have multiple nodes working against a single file
 Split file up by date within zone structure
 Thousands of files in same directory isn’t great either
 Example: raw/ParkingTickets/2017/10/ParkingTickets_20171031_2359.csv
 Avoid creating empty files/folders
 Avoid mixing types within folder hierarchy
raw
ParkingTickets
2017
10
ParkingTickets_20171031_2359.csv
 Some file formats are splitable
 MPP engine can split the file into segments for performance benefits
 Compression can render certain formats unsplitable
 This varies by MPP engine, check documentation
 If file format is unsplitable, you need to manage partitions manually
 Balancing file sizes may matter here too for best performance
 (Think of bad SQL Server statistics & CXPACKET waits)
 Compression can make formats unsplitable
 Some columnar formats have internal compression already
 Sometimes you can layer these with GZIP et. al.
 TEST
 Compression may not improve performance in all use cases
 No Free Lunch!
Storage Space
Memory
I/O
Network CPU
 Data streams in over time, is batched, and gets processed
Data
Sources
Orchestration
Analytics
&
Reporting
Real-Time
Message
Ingestion
Stream
Processing
Data Storage
• Blob Storage
• Data Lake
• SQL Database
• Cosmos DB
Batch
Processing
• U-SQL
• Hive
• Pig
• Spark
Analytical
Data Store
• SQL DW
• Spark SQL
• HBase
• Hive
• Data Factory
• HDInsight
 Data streams in and is processed in near real-time
Data
Sources
Orchestration
Analytics
&
Reporting
Ingestion
• Event Hub
• IoT Hub
• Kafka
Stream
Processing
• Stream Analytics
• Storm
• Spark Streaming
Data Storage Batch
Processing
Analytical
Data Store
• SQL DW
• Spark SQL
• HBase
• Hive
 Multiple file types within a folder
 Poor Organization
 No standard naming
 No governance
 No data catalog
 Makes self-service very difficult
 Lots of small files can
 Add overhead
 Slow down MPP engines
 Optimal file size varies per compute engine
 Check your engine’s documentation
 By themselves, data lakes don’t provide integrated views across org
 Hard to guarantee quality of data in a data lake
 Data lake may become a dumping ground for unused data
 Lack of descriptive metadata can make data hard to consume
 Lack of governance can lead to problems of varying types
 Data lakes are not the best way to integrate all types of data
 Relational Data
 Data lakes are for more than just IoT data!
 Can access a data lake with a relational database too
 Polybase
 Big Data Clusters
 Authentication through Azure Active Directory
 Roles for account management
 POSIX ACL for data access
 Unix-style read/write/execute permissions
 ACLs can be enabled on folders or individual files
 Permissions cannot be inherited from a parent object
 Security groups also exist, can assign users to groups & grant that way
 Can enable a firewall
 Only IP addresses in a defined range can access ADL
 Encryption
 Can enable encryption before data is written on disk
 TLS 1.2 for data in motion
 Builds on security available in ADL Gen 1
 RBAC has been extended into container resources
 Shared Key and Shared Access Signature Authentication
 Shared Keys grant super user access
 SAS tokens include permissions
 Inheritance (limited)
 Objects can inherit permissions from parent objects if default permissions
were set on parent objects ahead of time
 Not a great story for backups at the moment
 Use AzCopy to copy to a blob storage account
 Can manage redundancy within WASB/ABFS
 Azure Data Lake itself provides no indexes
 U-SQL does have an CREATE INDEX statement
 Only clustered indexes
 Essentially re-orders the dataset
 Some file formats support indexing (Columnar)
 Generally that’s what your DW platform is for
 Varies based on the tool you are using
 Azure Portal
 Azure Storage Explorer
 Visual Studio
 Offload archived/unused data from DW/OLTP systems into data lake
 Use data lake as a staging area for DW loads
 Have net-new projects ingest data through a data lake
• You Pay For:
 Storage ($0.039 per GB/month)
 Operations
 Reads ($0.05 per 10,000)
 Writes ($0.004 per 10,000)
 Outbound data transfers (varies by region)
 North Central US: $0.087 per GB
• Additional storage savings with monthly commitments
• Check Azure Portal for most current pricing info
• https://azure.microsoft.com/en-us/pricing/details/data-lake-storage-gen1/
• Prices very based on flat/hierarchical structure & storage tier
 Storage ($0.0208/0.0152/0.002 per GB/month)
 Operations
 Reads ($0.0052/0.013/6.50 per 10,000)
 Writes ($0.065/0.13/0.13 per 10,000)
 Outbound data transfers (varies by region)
 North Central US: $0.087 per GB
• Check Azure Portal for most current pricing info
• https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/
• Run U-SQL locally on your machine for free
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-
analytics-data-lake-tools-local-run
• Chicago Parking Tickets Data:
https://www.bobpusateri.com/archive/2018/09/new-data-set-chicago-parking-tickets/
• Melissa Coates has lots of good blog posts: https://www.sqlchick.com/entries/tag/Azure+Data+Lake
• Run U-SQL Locally:
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-data-lake-tools-local-run
• U-SQL Tutorial: https://saveenr.gitbooks.io/usql-tutorial/content/
• Querying Data in Azure Data Lake with Power BI:
https://www.sqlchick.com/entries/2018/5/6/querying-data-in-azure-data-lake-store-with-power-bi
• 10 Things to know about Azure Data Lake Storage Gen2:
https://www.blue-granite.com/blog/10-things-to-know-about-azure-data-lake-storage-gen2
@sqlbob
bob@bobpusateri.com
bobpusateri.com
bobpusateri

Más contenido relacionado

La actualidad más candente

Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Michael Rys
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
RDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceRDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceChristopher Foot
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeMSAdvAnalytics
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure Antonios Chatzipavlis
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenMS Cloud Summit
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure DatabricksDustin Vannoy
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Microsoft Tech Community
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsEduardo Castro
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...Microsoft Tech Community
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveIlyas F ☁☁☁
 
Azure data factory
Azure data factoryAzure data factory
Azure data factoryBizTalk360
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksAlberto Diaz Martin
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data FactoryHARIHARAN R
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Microsoft Tech Community
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016James Serra
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionDenys Chamberland
 

La actualidad más candente (20)

Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
RDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceRDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business Intelligence
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep Dive
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data Factory
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in Action
 

Similar a Dipping Your Toes: Azure Data Lake for DBAs

Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Rukmani Gopalan
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Consolidating File Servers into the Cloud
Consolidating File Servers into the CloudConsolidating File Servers into the Cloud
Consolidating File Servers into the CloudBuurst
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
05 No SQL Sudarshan.ppt
05 No SQL Sudarshan.ppt05 No SQL Sudarshan.ppt
05 No SQL Sudarshan.pptAnandKonj1
 
No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'
No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'
No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'sankarapu posibabu
 
No SQL Databases.ppt
No SQL Databases.pptNo SQL Databases.ppt
No SQL Databases.pptssuser8c8fc1
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesCCG
 
Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesAndrew Kandels
 
Data storage in cloud computing
Data storage in cloud computingData storage in cloud computing
Data storage in cloud computingjamunaashok
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
What is Object storage ?
What is Object storage ?What is Object storage ?
What is Object storage ?Nabil Kassi
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
ParaScale Cloud Storage Customer overview presentation
ParaScale Cloud Storage Customer overview presentationParaScale Cloud Storage Customer overview presentation
ParaScale Cloud Storage Customer overview presentationParaScale Marketing
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Establishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan LawEstablishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan LawFlamer
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS Amazon Web Services
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfpbonillo1
 
FOWA Scaling The Lamp Stack Workshop
FOWA Scaling The Lamp Stack WorkshopFOWA Scaling The Lamp Stack Workshop
FOWA Scaling The Lamp Stack Workshopdlieberman
 

Similar a Dipping Your Toes: Azure Data Lake for DBAs (20)

Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Consolidating File Servers into the Cloud
Consolidating File Servers into the CloudConsolidating File Servers into the Cloud
Consolidating File Servers into the Cloud
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
05 No SQL Sudarshan.ppt
05 No SQL Sudarshan.ppt05 No SQL Sudarshan.ppt
05 No SQL Sudarshan.ppt
 
No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'
No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'
No SQL Databases sdfghjkl;sdfghjkl;sdfghjkl;'
 
No SQL Databases.ppt
No SQL Databases.pptNo SQL Databases.ppt
No SQL Databases.ppt
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data Services
 
Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational Databases
 
Data storage in cloud computing
Data storage in cloud computingData storage in cloud computing
Data storage in cloud computing
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
What is Object storage ?
What is Object storage ?What is Object storage ?
What is Object storage ?
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
ParaScale Cloud Storage Customer overview presentation
ParaScale Cloud Storage Customer overview presentationParaScale Cloud Storage Customer overview presentation
ParaScale Cloud Storage Customer overview presentation
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Establishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan LawEstablishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan Law
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
FOWA Scaling The Lamp Stack Workshop
FOWA Scaling The Lamp Stack WorkshopFOWA Scaling The Lamp Stack Workshop
FOWA Scaling The Lamp Stack Workshop
 

Más de Bob Pusateri

Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...Bob Pusateri
 
Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...
Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...
Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...Bob Pusateri
 
Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...
Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...
Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...Bob Pusateri
 
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...Bob Pusateri
 
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)Bob Pusateri
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...Bob Pusateri
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...Bob Pusateri
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...Bob Pusateri
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...Bob Pusateri
 

Más de Bob Pusateri (9)

Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (New England SQ...
 
Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...
Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...
Select Stars: A SQL DBA's Introduction to Azure Cosmos DB (SQL Saturday Orego...
 
Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...
Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...
Supercharging Backups and Restores (For Fun and Profit!) (SQL Saturday Boston...
 
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
 
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASSDC User Gr...
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (PASS DBA Virtu...
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (Chicago Suburb...
 
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...
Locks, Blocks, and Snapshots: Maximizing Database Concurrency (SQL Saturday M...
 

Último

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Último (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Dipping Your Toes: Azure Data Lake for DBAs

  • 1.
  • 2. Specialties / Focus Areas / Passions: • Performance Tuning & Troubleshooting • Very Large Databases • SQL Server Storage Engine • HA/DR • Cloud @sqlbob bob@bobpusateri.com bobpusateri.com bobpusateri
  • 3.
  • 4. • What Is Azure Data Lake? • Data Lake Storage • Data Processing • Design Considerations • Security • DBA Topics • Cost • Demos
  • 5.
  • 7.
  • 9.
  • 10.
  • 11.  A scalable platform for housing all types of data for analysis  A repository for analyzing many sources of data in its native format  TL;DR an extremely scalable storage system with helpful tools
  • 12.  A tool that integrates with many services and sources  For both input and output Data Lake
  • 13.  Performant – enables linear and infinite scale  Scales on demand  Supports varying access patterns  Securable  RBAC – Groups and Principals  POSIX ACLs  Manageable  Scriptable APIs  (Preferably REST APIs)
  • 14.  “Bring your own engine”  Hadoop  Apache Spark  Power BI  SQL Data Warehouse  Azure Data Lake Analytics  etc.  There is no compute here
  • 15.  A great data tool to have in your toolbox  May be doing this work already with SSIS & Archiving  How much time do you spend writing packages?
  • 16. OLTP  Focus  Operational Transactions  Writes  Scope  Single database system  Sample Objectives:  Recall a customer’s orders  Generate a report Data Warehouse  Focus  Informational/Analytical  Reads  Scope  Integrate multiple systems  Sample Objectives:  Identify poorest-selling products  Compare margin per customer
  • 17.
  • 18.  Volume  Growing exponentially every year!  Variety  More than what fits neatly into a relational DB  Structured data being augmented by unstructured data  Velocity  Frequency it needs to be processed. Is it a streaming source?  Veracity  Is the data trustworthy?  Value  Are we creating value from the above 4 things?
  • 19. Structured: Relational Unstructured: Binary or plain text Semi-structured: Non-relational
  • 20.
  • 21.  WASB = Windows Azure Storage Blob  Optimized for Unstructured Data  Video, Images, Files  Flat Structure  No folder tree*  Lowest Cost (Cheaper than ADLS)  Tiered  Can specify redundancy/performance/cost
  • 22.  LRS – Locally Redundant Storage  Multiple copies in single data center  ZRS – Zone-Redundant Storage  Multiple data centers within a zone  GRS – Geo-Redundant Storage  Replicates data to a data center in another region  RA-GRS – Read-access Geo-Redundant Storage  GRS, but can read from that remote region as well
  • 23.  Hot  Data that is accessed frequently  Highest storage cost. Lowest access cost.  Cool  For less-frequently-accessed data. Minimum 30 day storage.  Lower storage cost. Higher access cost.  Archive  Lowest storage cost. Highest access cost.  Slow to retrieve (hours). Minimum 180 day storage.  Comparable to Amazon Glacier
  • 24.  ADLS = Azure Data Lake Storage  Interopterable  Works with all major MPP platforms  Security  Posix ACLs & AAD integration  Performance  Optimized for analytics workloads  No file size limit  Folder Structure  A real hierarchical structure
  • 25.  ABFS = Azure Blob File System  AKA “ADLS Gen 2”  Integration  Supports most* MPP & authentication schemes  Hierarchical Namespace  Folder Structure & Atomic File Operations  File System Tiering  30% price premium over ADLS
  • 26. Object Storage (Azure Blob Storage) • Flat storage space • Data represented as blobs • Highly scalable • Cost-Effective File System Storage Azure Data Lake Storage Gen 1 • Hierarchical storage space • Data is blocks in a file system • Security at folder/file levels • Atomic & metadata operations Azure Data Lake Storage Gen 2 • Get the features of file system storage • Get the cost/scale features of blob storage or
  • 27.  Text Files!  Human readable  Standardized  Well supported  .txt, .csv, etc.  JSON  Human readable  Supports complex nesting  Files can be large without compression
  • 28.  Two-part file format  Binary portion for data  JSON-based schema  Supports schema evolution  Requires a custom viewer utility
  • 29. Parquet  From Cloudera/Twitter Orc  From Hortonworks/Facebook  Both support compression and predicate pushdown  Writes are slower in both due to columnar format
  • 30.  Generally better for analytical workloads  Test your workload!  May be worth the effort if  Querying data multiple times  Need predicate pushdown or column elimination  Can reduce/eliminate encoding issues with delimited formats
  • 31.  Folders!  Organization is key to success (with ADL and anything else..)  Common Zones  archive, raw, reference, curated, temp, stage  Other Zones  environment_dev, test, prod  Sub-zones  Subsystems, users
  • 32.  Data here is in its original state from source system  Sorted by source system and schema  Immutable – the key rule for this zone  Data is always appended here  Can be partitioned by date
  • 33.  Data here is in a transformed state  Similar to DW / Datamart  End users can access here  Analytics is performed on this data  Data is governed/blessed  This zone is flexible based on usage & requirements  Data Science Sandbox  Datamarts  Outbound data for integration, Etc.
  • 34.  Helpful when incoming data is mutable  Full/Incremental loads arrive constantly  Data is batched together here for analytical purposes  Result can be more efficient for MPP  Data can be compared against raw to calculate deltas  Files can be archived/vacuumed/deleted per needs
  • 35.  Data outside of analytical windows  May need to be kept for compliance purposes  Can restrict access to this via ACLs and ACEs  Generally the last stop before removal or tiering out  ABFS supports tiering within containers  Move to slower/cheaper storage and pay less!
  • 36.  ADF is great at this  Lots of tools you can orchestrate with it  Typically you don’t move though, you transform & create new  Databricks, Oozie  MPP Engines  AzCopy
  • 37.
  • 38. U-SQL
  • 39.  Cloud-based analytics platform based on Spark Image: Microsoft
  • 40.  Cloud-based EDW that uses MPP  Formerly SQL PDW  Uses relational tables / columnar storage
  • 41.  The native language of Azure Data Lake Analytics (ADLA)  A combination of SQL and C#  Can be executed in the cloud, or locally
  • 42. @dataset = EXTRACT UserID INT, StartDT DATETIME, URL STRING FROM @”/datafile.tsv” USING Extractors.Tsv(); OUTPUT @dataset TO @”/datafile_output.tsv” USING Outputters.Tsv();
  • 43.  An ETL/Query tool that runs on top of Hadoop  Has its own query language – HiveQL or HQL  Very similar to SQL SELECT plate_number, COUNT(*) FROM Tickets;
  • 44.
  • 45.  Use Source and Date/Time into name  Make Consistent  This makes things easier, especially for self-service  Example: ParkingTickets_20171031_2359.csv
  • 46.  This helps make data “splitable”  1 Table per file becomes inefficient as size increases  Harder* have multiple nodes working against a single file  Split file up by date within zone structure  Thousands of files in same directory isn’t great either  Example: raw/ParkingTickets/2017/10/ParkingTickets_20171031_2359.csv  Avoid creating empty files/folders  Avoid mixing types within folder hierarchy
  • 48.  Some file formats are splitable  MPP engine can split the file into segments for performance benefits  Compression can render certain formats unsplitable  This varies by MPP engine, check documentation  If file format is unsplitable, you need to manage partitions manually  Balancing file sizes may matter here too for best performance  (Think of bad SQL Server statistics & CXPACKET waits)
  • 49.  Compression can make formats unsplitable  Some columnar formats have internal compression already  Sometimes you can layer these with GZIP et. al.  TEST  Compression may not improve performance in all use cases
  • 50.  No Free Lunch! Storage Space Memory I/O Network CPU
  • 51.  Data streams in over time, is batched, and gets processed Data Sources Orchestration Analytics & Reporting Real-Time Message Ingestion Stream Processing Data Storage • Blob Storage • Data Lake • SQL Database • Cosmos DB Batch Processing • U-SQL • Hive • Pig • Spark Analytical Data Store • SQL DW • Spark SQL • HBase • Hive • Data Factory • HDInsight
  • 52.  Data streams in and is processed in near real-time Data Sources Orchestration Analytics & Reporting Ingestion • Event Hub • IoT Hub • Kafka Stream Processing • Stream Analytics • Storm • Spark Streaming Data Storage Batch Processing Analytical Data Store • SQL DW • Spark SQL • HBase • Hive
  • 53.  Multiple file types within a folder  Poor Organization  No standard naming  No governance  No data catalog  Makes self-service very difficult
  • 54.  Lots of small files can  Add overhead  Slow down MPP engines  Optimal file size varies per compute engine  Check your engine’s documentation
  • 55.  By themselves, data lakes don’t provide integrated views across org  Hard to guarantee quality of data in a data lake  Data lake may become a dumping ground for unused data  Lack of descriptive metadata can make data hard to consume  Lack of governance can lead to problems of varying types  Data lakes are not the best way to integrate all types of data  Relational Data
  • 56.  Data lakes are for more than just IoT data!  Can access a data lake with a relational database too  Polybase  Big Data Clusters
  • 57.
  • 58.  Authentication through Azure Active Directory  Roles for account management  POSIX ACL for data access  Unix-style read/write/execute permissions  ACLs can be enabled on folders or individual files  Permissions cannot be inherited from a parent object  Security groups also exist, can assign users to groups & grant that way
  • 59.
  • 60.  Can enable a firewall  Only IP addresses in a defined range can access ADL  Encryption  Can enable encryption before data is written on disk  TLS 1.2 for data in motion
  • 61.  Builds on security available in ADL Gen 1  RBAC has been extended into container resources  Shared Key and Shared Access Signature Authentication  Shared Keys grant super user access  SAS tokens include permissions  Inheritance (limited)  Objects can inherit permissions from parent objects if default permissions were set on parent objects ahead of time
  • 62.
  • 63.
  • 64.  Not a great story for backups at the moment  Use AzCopy to copy to a blob storage account  Can manage redundancy within WASB/ABFS
  • 65.  Azure Data Lake itself provides no indexes  U-SQL does have an CREATE INDEX statement  Only clustered indexes  Essentially re-orders the dataset  Some file formats support indexing (Columnar)  Generally that’s what your DW platform is for
  • 66.  Varies based on the tool you are using  Azure Portal  Azure Storage Explorer  Visual Studio
  • 67.  Offload archived/unused data from DW/OLTP systems into data lake  Use data lake as a staging area for DW loads  Have net-new projects ingest data through a data lake
  • 68.
  • 69. • You Pay For:  Storage ($0.039 per GB/month)  Operations  Reads ($0.05 per 10,000)  Writes ($0.004 per 10,000)  Outbound data transfers (varies by region)  North Central US: $0.087 per GB • Additional storage savings with monthly commitments • Check Azure Portal for most current pricing info • https://azure.microsoft.com/en-us/pricing/details/data-lake-storage-gen1/
  • 70. • Prices very based on flat/hierarchical structure & storage tier  Storage ($0.0208/0.0152/0.002 per GB/month)  Operations  Reads ($0.0052/0.013/6.50 per 10,000)  Writes ($0.065/0.13/0.13 per 10,000)  Outbound data transfers (varies by region)  North Central US: $0.087 per GB • Check Azure Portal for most current pricing info • https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/
  • 71. • Run U-SQL locally on your machine for free • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake- analytics-data-lake-tools-local-run
  • 72.
  • 73. • Chicago Parking Tickets Data: https://www.bobpusateri.com/archive/2018/09/new-data-set-chicago-parking-tickets/ • Melissa Coates has lots of good blog posts: https://www.sqlchick.com/entries/tag/Azure+Data+Lake • Run U-SQL Locally: https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-data-lake-tools-local-run • U-SQL Tutorial: https://saveenr.gitbooks.io/usql-tutorial/content/ • Querying Data in Azure Data Lake with Power BI: https://www.sqlchick.com/entries/2018/5/6/querying-data-in-azure-data-lake-store-with-power-bi • 10 Things to know about Azure Data Lake Storage Gen2: https://www.blue-granite.com/blog/10-things-to-know-about-azure-data-lake-storage-gen2