SlideShare una empresa de Scribd logo
1 de 43
Ike Ellis, MVP, Crafting Bytes
Dive into the Azure
Data Lake
Please silence
cell phones
Please silence
cell phones
2
Free online webinar
events
Free 1-day local
training events
Local user groups
around the world
Online special
interest user groups
Business analytics
training
Free Online Resources
PASS Blog
White Papers
Session Recordings
Newsletter www.pass.org
Explore everything PASS has to offer
PASS Connector
BA Insights
Get involved
Session evaluations
Download the GuideBook App
and search: PASS Summit 2017
Follow the QR code link
displayed on session signage
throughout the conference
venue and in the program guide
Your feedback is important and valuable.
Go to passSummit.com
Submit by 5pm Friday, November 10th to win prizes. 3 Ways to Access:
Ike Ellis
Partner, Crafting Bytes
Microsoft MVP
Since 2011
Frequent PASS Summit Speaker
And speak often at SQL Saturdays + maker
of youtube videos – Ike’s tips!
Author of Developing Azure
Solutions
Now in 2nd Edition
/ikeellis @ike_ellis ellisike
Thank You
Learn more from Ike Ellis
ike@craftingbytes.co
m
@ike_ellis
What is a data lake as a data architecture
term?
“If you think of a datamart as a store of bottled water
– cleansed, packaged, and structured for easy
consumption – the data lake is a large body of water
in a more natural state.” – James Dixon – creator of
the term
Data Lake Attributes
Contains unstructured,
structured, semi-structured
and multi-structured data.
Data lakes have been
traditionally Azure Blob
Storage, Amazon S3, and
Hadoop HDFS.
Data is transformed in place
with no movement.
Complexities with Standard MPP
Typical HDFS installation
Typical Hadoop installation
Redshift
Azure Data Lake Store
Cloud based file system that is unlimited in size
Compatible with Hadoop/Spark
• Hive
• Pig
• Sqoop
Azure Data Lake Store
• Unlimited Storage – a single
petabyte file
• Tuned for batch jobs
• Use active directory for security
• Store all data in native format
• Uses ADL:// and a URI
• Exposes WebHDFS API
Import data
• Use the Azure Portal
• .NET SDK
• Data Factory
• DistCp (Hadoop
Distributed Copy)
• Apache Sqoop
Azure Data Lake Store vs Azure Blob
Storage
• ADLS is optimized for analytics
• Blob Storage holds tons of data inappropriate for analytics like VHDs
• ADLS has folders, Blob Storage has containers
• ADLS uses WebHDFS Rest API, Azure Blob Storage uses the Azure
SDK Rest APIs
• ADLS has no size limits
• ADLS has no geo-redundancy yet
• ADLS is more expensive
• Azure Blob Storage can be better for analytics if there are a lot of
small files
DEMO
Provisioning Azure Data Lake Store
Provisioning Azure Data Lake Analytics
Connecting the two
Azure Data Lake Analytics
Service that queries data for analysis
ADLA runs jobs for:
querying
cleaning
aggregating
transforming
ADLA uses U-SQL as a language
U-SQL Job Syntax
U-SQL combines SQL and C#
SQL is a declarative language
• Don’t need to write all the steps
• Just write what data you want and let the optimizer get the data for
you
• Multi-threading is done for you in the background
C# is a procedural language that is often a better fit
• Custom Functions
• Multi-threading is very difficult
U-SQL unifies both of these languages
• Highly customizable and extensible way to interact with data
U-SQL – Start with a simple script
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
OUTPUT @hitters
TO "/output/hitters.txt"
USING Outputters.Text();
19
Let’s run this script?
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballS tats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = select * from @hitters
where HomeRuns != null;
OUTPUT @bestHitters
TO "/output/hitters.txt"
USING Outputters.Text();
20
OK, here’s the good scripts
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT * FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0;
OUTPUT @bestHitters
TO "/output/besthitters.txt"
USING Outputters.Text();
21
We can do GROUP BY@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal
FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
OUTPUT @bestHitters
TO "/output/besthitters.txt"
USING Outputters.Text();
22
JOINS, TOO! Does this script work?@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
@bestHitterWithName = SELECT PlayerName, Team, HomeRuns
FROM @hitters h
JOIN @bestHitters bh
ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal
OUTPUT @bestHitterWithName
TO "/output/besthitterswithnames.txt" 23
Nope! Need AS for alias, ambiguious
column names, INNER for JOINS@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
@bestHitterWithName = SELECT h.PlayerName, h.Team, h.HomeRuns
FROM @hitters AS h
INNER JOIN @bestHitters AS bh
ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal;
OUTPUT @bestHitterWithName
TO "/output/besthitterswithnames.txt" 24
You can process every file in a folder like
this
@log = EXTRACT date string,
time string,
client_ip string,
username string,
server_ip string,
port int,
method string,
stem string,
query string,
status string,
server_bytes int,
client_bytes int,
time_taken int,
user_agent string,
referrer string
FROM "/iislogs/{*}.txt" USING Extractors.Text(' ', silent:true);
@dailysummary = SELECT date, COUNT(*) AS hits, SUM(server_bytes) AS bytes_sent, SUM(client_bytes) AS bytes_received
FROM @log
GROUP BY date;
25
U-SQL Tips
Case Sensitive
C# Data Type
• String instead of varchar
C# Comparison - ==, !=
COMPARE NULLS!
Typical U-SQL Job
1) Extract
2) Assign results to variable
3) Create a schema on read
4) Use schema when we process it
5) Filter data that we need
6) Apply the results to a variable
7) Output the variable someplace, like a CSV file
Visual Studio Code
28
VS Code Cool Features
• Generate Code Behind
• Format Code
• Submit Jobs
• Automatically takes you to the portal
29
Visual Studio
• Autocomplete
• Local Data Lake
for testing
• Show the
advanced tab
U-SQL Catalogs
CREATE EXTERNAL TABLE
Let’s you create a managed table that’s permanent
Can also create other metadata objects
• Views – Saved query
• Table Valued Function
• Stored Procedure
Demo U-SQL Catalog
Views
Functions
Procedures
U-SQL and C#
Use a code behind file
Or create an assembly and deploy it to use it all the time.
Deployment script
DEMO U-SQL Code Behind
DEMO U-SQL Custom Assemblies
1. Create a class library
2. Name project Data Utilities
3. Implement class
4. View the Cloud Explorer pane
5. In the Solution Explorer pane, right-click the DataUtilities project and
click Register Assembly.
6. select the webdata database.
7. In the Azure portal, on the blade for your Azure Data Lake Analytics
account, click Data Explorer; and then browse to the Assemblies
folder in your webdata database to verify that the assembly has been
registered.
8. Use in a query
U-SQL Job Architecture
Each job step is divided into vertices.
Each vertex represents a single piece of work
Each unit (the slider bar at the top) is functionally similar
to a physical node.
Each available unit is going to grab a vertex and work
through it until the task is completed.
Then it will move on to the next task.
Job Monitor
Job List
Replay Job
See Vertices
Can use the portal or Visual Studio
Designing for performance
Add ADLS data for Hive in HDInsight
• Connect ADLS with HDInsight
• Provide a security principal
• Query WebADFS with the ADL:// prefix
42
ALL DONE!
Ike Ellis
@ike_ellis
Crafting Bytes
We’re hiring a Data Experts!
Microsoft MVP
Chairperson of the San Diego TIG
Book co-author – Developing Azure Solutions
Upcoming course on Azure Data Lake
www.craftingbytes.com
www.ikeellis.com

Más contenido relacionado

La actualidad más candente

Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
 
U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)Michael Rys
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetCarl W. Handlin
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mark Kromer
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Rukmani Gopalan
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWSGary Stafford
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowDatabricks
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with linksChris Testa-O'Neill
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopDoiT International
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Mark Kromer
 
U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)Michael Rys
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in AzureMostafa
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azureMostafa
 

La actualidad más candente (20)

Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
 
U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache Superset
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with links
 
Data virtualization using polybase
Data virtualization using polybaseData virtualization using polybase
Data virtualization using polybase
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On Workshop
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
 
U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 

Similar a Dive Into Azure Data Lake - PASS 2017

Taming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQLTaming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQLMichael Rys
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Michael Rys
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Michael Rys
 
U-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersMichael Rys
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sqlŁukasz Grala
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossAndrew Flatters
 
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive QueryInteractive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive QueryAshish Thapliyal
 
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive QueryInteractive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive QueryMicrosoft Tech Community
 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsMicrosoft Tech Community
 
Getting started with titanium
Getting started with titaniumGetting started with titanium
Getting started with titaniumNaga Harish M
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Michael Rys
 
Getting started with Appcelerator Titanium
Getting started with Appcelerator TitaniumGetting started with Appcelerator Titanium
Getting started with Appcelerator TitaniumTechday7
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudKellyn Pot'Vin-Gorman
 
Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Derek Jacoby
 
HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBaseCon
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
 

Similar a Dive Into Azure Data Lake - PASS 2017 (20)

Taming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQLTaming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQL
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
U-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for Developers
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
 
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive QueryInteractive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
 
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive QueryInteractive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
 
Getting started with titanium
Getting started with titaniumGetting started with titanium
Getting started with titanium
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
 
Getting started with Appcelerator Titanium
Getting started with Appcelerator TitaniumGetting started with Appcelerator Titanium
Getting started with Appcelerator Titanium
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
 
Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Untangling - fall2017 - week 9
Untangling - fall2017 - week 9
 
HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Apache hive
Apache hiveApache hive
Apache hive
 

Más de Ike Ellis

Storytelling with Data with Power BI
Storytelling with Data with Power BIStorytelling with Data with Power BI
Storytelling with Data with Power BIIke Ellis
 
Storytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptxStorytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptxIke Ellis
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsIke Ellis
 
Migrate a successful transactional database to azure
Migrate a successful transactional database to azureMigrate a successful transactional database to azure
Migrate a successful transactional database to azureIke Ellis
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analyticsIke Ellis
 
Data modeling trends for Analytics
Data modeling trends for AnalyticsData modeling trends for Analytics
Data modeling trends for AnalyticsIke Ellis
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsIke Ellis
 
Power bi premium
Power bi premiumPower bi premium
Power bi premiumIke Ellis
 
Move a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudMove a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudIke Ellis
 
Pass 2018 introduction to dax
Pass 2018 introduction to daxPass 2018 introduction to dax
Pass 2018 introduction to daxIke Ellis
 
Pass the Power BI Exam
Pass the Power BI ExamPass the Power BI Exam
Pass the Power BI ExamIke Ellis
 
Slides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATESlides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATEIke Ellis
 
Introduction to DAX
Introduction to DAXIntroduction to DAX
Introduction to DAXIke Ellis
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL DevelopersIke Ellis
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL DevelopersIke Ellis
 
11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL Developers11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL DevelopersIke Ellis
 
SQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesSQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesIke Ellis
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBIke Ellis
 
Azure DocumentDB 101
Azure DocumentDB 101Azure DocumentDB 101
Azure DocumentDB 101Ike Ellis
 
Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014Ike Ellis
 

Más de Ike Ellis (20)

Storytelling with Data with Power BI
Storytelling with Data with Power BIStorytelling with Data with Power BI
Storytelling with Data with Power BI
 
Storytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptxStorytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptx
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for Analytics
 
Migrate a successful transactional database to azure
Migrate a successful transactional database to azureMigrate a successful transactional database to azure
Migrate a successful transactional database to azure
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
Data modeling trends for Analytics
Data modeling trends for AnalyticsData modeling trends for Analytics
Data modeling trends for Analytics
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applications
 
Power bi premium
Power bi premiumPower bi premium
Power bi premium
 
Move a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudMove a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloud
 
Pass 2018 introduction to dax
Pass 2018 introduction to daxPass 2018 introduction to dax
Pass 2018 introduction to dax
 
Pass the Power BI Exam
Pass the Power BI ExamPass the Power BI Exam
Pass the Power BI Exam
 
Slides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATESlides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATE
 
Introduction to DAX
Introduction to DAXIntroduction to DAX
Introduction to DAX
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL Developers
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL Developers
 
11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL Developers11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL Developers
 
SQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesSQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutes
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Azure DocumentDB 101
Azure DocumentDB 101Azure DocumentDB 101
Azure DocumentDB 101
 
Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014
 

Último

Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 

Último (20)

Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 

Dive Into Azure Data Lake - PASS 2017

  • 1. Ike Ellis, MVP, Crafting Bytes Dive into the Azure Data Lake
  • 2. Please silence cell phones Please silence cell phones 2
  • 3. Free online webinar events Free 1-day local training events Local user groups around the world Online special interest user groups Business analytics training Free Online Resources PASS Blog White Papers Session Recordings Newsletter www.pass.org Explore everything PASS has to offer PASS Connector BA Insights Get involved
  • 4. Session evaluations Download the GuideBook App and search: PASS Summit 2017 Follow the QR code link displayed on session signage throughout the conference venue and in the program guide Your feedback is important and valuable. Go to passSummit.com Submit by 5pm Friday, November 10th to win prizes. 3 Ways to Access:
  • 5. Ike Ellis Partner, Crafting Bytes Microsoft MVP Since 2011 Frequent PASS Summit Speaker And speak often at SQL Saturdays + maker of youtube videos – Ike’s tips! Author of Developing Azure Solutions Now in 2nd Edition /ikeellis @ike_ellis ellisike
  • 6. Thank You Learn more from Ike Ellis ike@craftingbytes.co m @ike_ellis
  • 7. What is a data lake as a data architecture term? “If you think of a datamart as a store of bottled water – cleansed, packaged, and structured for easy consumption – the data lake is a large body of water in a more natural state.” – James Dixon – creator of the term
  • 8. Data Lake Attributes Contains unstructured, structured, semi-structured and multi-structured data. Data lakes have been traditionally Azure Blob Storage, Amazon S3, and Hadoop HDFS. Data is transformed in place with no movement.
  • 9. Complexities with Standard MPP Typical HDFS installation Typical Hadoop installation Redshift
  • 10. Azure Data Lake Store Cloud based file system that is unlimited in size Compatible with Hadoop/Spark • Hive • Pig • Sqoop
  • 11.
  • 12. Azure Data Lake Store • Unlimited Storage – a single petabyte file • Tuned for batch jobs • Use active directory for security • Store all data in native format • Uses ADL:// and a URI • Exposes WebHDFS API
  • 13. Import data • Use the Azure Portal • .NET SDK • Data Factory • DistCp (Hadoop Distributed Copy) • Apache Sqoop
  • 14. Azure Data Lake Store vs Azure Blob Storage • ADLS is optimized for analytics • Blob Storage holds tons of data inappropriate for analytics like VHDs • ADLS has folders, Blob Storage has containers • ADLS uses WebHDFS Rest API, Azure Blob Storage uses the Azure SDK Rest APIs • ADLS has no size limits • ADLS has no geo-redundancy yet • ADLS is more expensive • Azure Blob Storage can be better for analytics if there are a lot of small files
  • 15. DEMO Provisioning Azure Data Lake Store Provisioning Azure Data Lake Analytics Connecting the two
  • 16. Azure Data Lake Analytics Service that queries data for analysis ADLA runs jobs for: querying cleaning aggregating transforming ADLA uses U-SQL as a language
  • 18. U-SQL combines SQL and C# SQL is a declarative language • Don’t need to write all the steps • Just write what data you want and let the optimizer get the data for you • Multi-threading is done for you in the background C# is a procedural language that is often a better fit • Custom Functions • Multi-threading is very difficult U-SQL unifies both of these languages • Highly customizable and extensible way to interact with data
  • 19. U-SQL – Start with a simple script @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); OUTPUT @hitters TO "/output/hitters.txt" USING Outputters.Text(); 19
  • 20. Let’s run this script? @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballS tats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = select * from @hitters where HomeRuns != null; OUTPUT @bestHitters TO "/output/hitters.txt" USING Outputters.Text(); 20
  • 21. OK, here’s the good scripts @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT * FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0; OUTPUT @bestHitters TO "/output/besthitters.txt" USING Outputters.Text(); 21
  • 22. We can do GROUP BY@hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; OUTPUT @bestHitters TO "/output/besthitters.txt" USING Outputters.Text(); 22
  • 23. JOINS, TOO! Does this script work?@hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; @bestHitterWithName = SELECT PlayerName, Team, HomeRuns FROM @hitters h JOIN @bestHitters bh ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal OUTPUT @bestHitterWithName TO "/output/besthitterswithnames.txt" 23
  • 24. Nope! Need AS for alias, ambiguious column names, INNER for JOINS@hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OneBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; @bestHitterWithName = SELECT h.PlayerName, h.Team, h.HomeRuns FROM @hitters AS h INNER JOIN @bestHitters AS bh ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal; OUTPUT @bestHitterWithName TO "/output/besthitterswithnames.txt" 24
  • 25. You can process every file in a folder like this @log = EXTRACT date string, time string, client_ip string, username string, server_ip string, port int, method string, stem string, query string, status string, server_bytes int, client_bytes int, time_taken int, user_agent string, referrer string FROM "/iislogs/{*}.txt" USING Extractors.Text(' ', silent:true); @dailysummary = SELECT date, COUNT(*) AS hits, SUM(server_bytes) AS bytes_sent, SUM(client_bytes) AS bytes_received FROM @log GROUP BY date; 25
  • 26. U-SQL Tips Case Sensitive C# Data Type • String instead of varchar C# Comparison - ==, != COMPARE NULLS!
  • 27. Typical U-SQL Job 1) Extract 2) Assign results to variable 3) Create a schema on read 4) Use schema when we process it 5) Filter data that we need 6) Apply the results to a variable 7) Output the variable someplace, like a CSV file
  • 29. VS Code Cool Features • Generate Code Behind • Format Code • Submit Jobs • Automatically takes you to the portal 29
  • 30. Visual Studio • Autocomplete • Local Data Lake for testing • Show the advanced tab
  • 31. U-SQL Catalogs CREATE EXTERNAL TABLE Let’s you create a managed table that’s permanent Can also create other metadata objects • Views – Saved query • Table Valued Function • Stored Procedure
  • 33. Views
  • 36. U-SQL and C# Use a code behind file Or create an assembly and deploy it to use it all the time. Deployment script
  • 37. DEMO U-SQL Code Behind
  • 38. DEMO U-SQL Custom Assemblies 1. Create a class library 2. Name project Data Utilities 3. Implement class 4. View the Cloud Explorer pane 5. In the Solution Explorer pane, right-click the DataUtilities project and click Register Assembly. 6. select the webdata database. 7. In the Azure portal, on the blade for your Azure Data Lake Analytics account, click Data Explorer; and then browse to the Assemblies folder in your webdata database to verify that the assembly has been registered. 8. Use in a query
  • 39. U-SQL Job Architecture Each job step is divided into vertices. Each vertex represents a single piece of work Each unit (the slider bar at the top) is functionally similar to a physical node. Each available unit is going to grab a vertex and work through it until the task is completed. Then it will move on to the next task.
  • 40. Job Monitor Job List Replay Job See Vertices Can use the portal or Visual Studio
  • 42. Add ADLS data for Hive in HDInsight • Connect ADLS with HDInsight • Provide a security principal • Query WebADFS with the ADL:// prefix 42
  • 43. ALL DONE! Ike Ellis @ike_ellis Crafting Bytes We’re hiring a Data Experts! Microsoft MVP Chairperson of the San Diego TIG Book co-author – Developing Azure Solutions Upcoming course on Azure Data Lake www.craftingbytes.com www.ikeellis.com

Notas del editor

  1. CAPITALIZE THE KEYWORDS!
  2. CREATE DATABASE BaseballStats; USE DATABASE BaseballStats; CREATE SCHEMA stats; CREATE TABLE stats.BestHomeRunHitterPerTeam ( Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OnBasePer decimal? , INDEX idx_Rank CLUSTERED(Rank ASC) DISTRIBUTED BY HASH(Rank)); @hitters = EXTRACT Rank int , PlayerName string , Age int , Team string , League string , Games int , HomeRuns int? , BattingAvg decimal? , OnBasePer decimal? FROM "/BaseballStats/Baseball.csv" USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1); @bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters WHERE HomeRuns != null AND HomeRuns != 0 GROUP BY Team; @bestHitterWithName = SELECT h.Rank, h.PlayerName, h.Age, h.Team, h.League, h.Games, h.HomeRuns, h.BattingAvg, h.OnBasePer FROM @hitters AS h INNER JOIN @bestHitters AS bh ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal; INSERT INTO stats.BestHomeRunHitterPerTeam SELECT * FROM @bestHitterWithName;
  3. @results = SELECT * FROM stats.BestHomeRunHitterPerTeam; OUTPUT @results TO "/output/BestHRHitters.csv" USING Outputters.Csv();
  4. using Microsoft.Analytics.Interfaces; using Microsoft.Analytics.Types.Sql; using System; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; namespace USQLApplication1 { public class IkeTools { public static string FindGreatTeam(string teamName) { string temp = teamName; if (teamName == "SDP") { temp = "The Great San Diego Padres!!!"; } return temp; } } } @results = SELECT hpt.PlayerName, hpt.HomeRuns, USQLApplication1.IkeTools.FindGreatTeam(hpt.Team) AS Team FROM stats.BestHomeRunHitterPerTeam AS hpt; OUTPUT @results TO "/output/BestHRHitters.csv" USING Outputters.Csv();
  5. // Auto-generated header code // Generated Code Behind Header USE DATABASE [BaseballStats]; REFERENCE ASSEMBLY IkeADLTools; @results = SELECT hpt.PlayerName, hpt.HomeRuns, IkeADLTools.IkeTools.FindGreatTeam(hpt.Team) AS Team FROM stats.BestHomeRunHitterPerTeam AS hpt; OUTPUT @results TO "/output/BestHRHitters.csv" USING Outputters.Csv();
  6. Add file to folder Create tables set hive.execution.engine=tez; DROP TABLE BaseBallStats; CREATE TABLE BaseBallStats ( Team string , HomeRuns string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'adl://adlaikeadls.azuredatalakestore.net/passdemo’; Query Table in Ambari SELECT * FROM BaseBallStats;