USQ Landdemos Azure Data Lake
1. Azure Data Lake Event
Big data solutions on Microsoft Azure using Azure Data Lake
Regensdorf, 02.03.2018
2. Agenda
Azure Data Lake Analytics
1. Welcome
(Willfried Färber – Trivadis)
2. History and current developments in Azure Data Lake
(Michael Rys - Microsoft)
3. Code session with one or two examples
(Marco Amhof - Trivadis)
4. Azure Data Lake Services: fast, flexible and at your fingertips
(Patrik Borosch – Microsoft)
5. Summary, outlook, Q&A
(Michael Rys - Microsoft)
6. Lunch
4. Big Data
Data that is too large or complex for analysis in
traditional relational databases
Typified by the “3 V’s”:
Volume – Huge amounts of data to process
Variety – A mixture of structured and unstructured data
Velocity – New data generated extremely frequently
Examples: web server click-streams, sensor and IoT processing, social media sentiment analysis
5. Big Data Processing
▪ Batch Processing: filter, cleanse, and shape data for analysis
▪ Predictive Analytics: apply statistical algorithms for classification, regression, clustering, and prediction
▪ Real-Time Processing: capture, filter, and aggregate streams of data for low-latency querying
7. Azure Data Lake (Store, HDInsight, Analytics)
8. U-SQL – A Language That Makes Big Data Processing Easy
Requirements and characteristics of Big Data analytics
▪ Process any type and any size of data
▪ BotNet attack patterns
▪ Security logs
▪ Extract features from images and videos (machine learning)
▪ The language enables you to work on any data
▪ Use custom code / algorithms to easily express your complex and often proprietary business algorithms
▪ User Defined Functions
▪ Custom Input- and Output Formats
▪ Scale efficiently to any size of data without having to focus on scale-out topologies, plumbing code, or the limitations of a specific distributed infrastructure
9. U-SQL Origins
▪ SCOPE – Microsoft’s internal Big Data language
▪ COSMOS – Microsoft’s internal Big Data analysis platform
▪ SQL and C# integration model
▪ Optimization and scaling model
▪ Runs hundreds of thousands of jobs daily
▪ Hive
▪ Complex data types (maps, arrays)
▪ Data format alignment for text files
▪ T-SQL/ANSI SQL
▪ Many of the SQL capabilities (windowing functions, metadata model, etc.)
10. U-SQL Features
▪ Operating over sets of files with patterns
▪ Using (partitioned) tables
▪ Federated queries against Azure SQL DB
▪ Encapsulating your U-SQL code with views, table-valued functions, and procedures
▪ SQL Windowing Functions
▪ Programming with C# User-defined Operators (custom extractors, processors)
▪ Complex Types (MAP, ARRAY)
▪ Using U-SQL in data processing pipelines
▪ U-SQL in a lambda architecture for IoT analytics
11. Query the data where it lives
▪ Avoid moving large amounts of data across the network between stores
▪ Single view of data irrespective of physical location
▪ Minimize data proliferation issues caused by maintaining multiple copies
▪ Single query language for all data
▪ Each data store maintains its own sovereignty
▪ Design choices based on the need
▪ Push SQL expressions to remote SQL sources
▪ Projections
▪ Filters
▪ Joins
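The pushdown idea in the last bullets can be sketched outside U-SQL. The Python example below is only an analogue, not the U-SQL federated query mechanism: an in-memory SQLite database stands in for a remote SQL source, and the table `orders` with its columns is hypothetical. It contrasts fetching every row and filtering locally with pushing the projection and the filter into the query that runs at the source, so that fewer rows and columns have to cross the network.

```python
import sqlite3

# Hypothetical stand-in for a remote SQL source (an in-memory SQLite database).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, note TEXT)")
remote.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "CH", 120.0, "a"),
        (2, "DE", 80.0, "b"),
        (3, "CH", 45.5, "c"),
        (4, "AT", 300.0, "d"),
    ],
)


def fetch_then_filter(conn, region):
    """No pushdown: every row and column is transferred, then filtered locally."""
    rows = conn.execute(
        "SELECT id, region, amount, note FROM orders ORDER BY id"
    ).fetchall()
    transferred = len(rows)
    result = [(r[0], r[2]) for r in rows if r[1] == region]
    return result, transferred


def pushed_down(conn, region):
    """Pushdown: the projection (id, amount) and the filter (region) run at the source."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE region = ? ORDER BY id", (region,)
    ).fetchall()
    return rows, len(rows)


local_result, local_transferred = fetch_then_filter(remote, "CH")
remote_result, remote_transferred = pushed_down(remote, "CH")
```

Both functions return the same rows; only the number of rows that leave the source differs (all four in the first case, two in the second).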
12. U-SQL = SQL + C#
▪ Unifies the ease of use of SQL with the expressive power of C#
▪ Get benefits of both…
▪ Unstructured and structured data processing
▪ Declarative SQL and custom imperative Code
▪ Local and remote Queries
13. U-SQL Language Overview
U-SQL Fundamentals
All the familiar SQL clauses
▪ SELECT | FROM | WHERE | GROUP BY | JOIN | OVER
▪ Operate on unstructured and structured data
▪ Relational metadata objects
▪ Federated queries against Azure SQL DB and Azure SQL DW
▪ SQL windowing functions
▪ EXCEPT / INTERSECT / UNION
.NET integration and extensibility
U-SQL expressions are full C# expressions
Reuse .NET code in your own assemblies
Use C# to define your own
▪ Types
▪ Functions
▪ Joins
▪ Aggregators
▪ I/O (Extractors, Outputters)
14. U-SQL extensibility
▪ Extend U-SQL with C# (.NET)
▪ Extensions require .NET assemblies to be registered with a database
15. Azure Data Lake and Azure SQL Data Warehouse
17. Why use U-SQL
U-SQL makes Big Data processing easy because it:
▪ Unifies the declarative nature of SQL with the imperative power of C#
▪ Unifies querying of structured, semi-structured, and unstructured data
▪ Unifies local and remote queries
▪ Supports distributed queries over all data
▪ Increases productivity and agility from day 1 for you
18. Azure Data Lake in Visual Studio
Prerequisites
▪ Visual Studio 2017 (with the "Data storage and processing" workload), Visual Studio 2015 Update 3, Visual Studio 2013 Update 4, or Visual Studio 2012. The Enterprise (Ultimate/Premium), Professional, and Community editions are supported; the Express edition is not.
▪ Microsoft Azure SDK for .NET version 2.7.1 or above. Install it using the Web Platform Installer.
▪ Data Lake Tools for Visual Studio. Once the tools are installed, a "Data Lake Analytics" node appears in Server Explorer under the "Azure" node (open Server Explorer by pressing Ctrl+Alt+S).
▪ Data Lake Analytics account and sample data. The Data Lake Tools do not support creating Data Lake Analytics accounts. Create an account using the Azure portal, Azure PowerShell, the .NET SDK, or the Azure CLI. For your convenience, a PowerShell script for creating a Data Lake Analytics service and uploading the source data file can be found in Appx-A PowerShell sample for preparing the tutorial.
Documentation
20. U-SQL Script
// Extract: apply a schema to the CSV file on read.
@t = EXTRACT date string
           , time string
           , author string
           , tweet string
     FROM "/input/MyTwitterHistory.csv"
     USING Extractors.Csv();

// Transform: count the tweets per author.
@res = SELECT author
            , COUNT(*) AS tweetcount
       FROM @t
       GROUP BY author;

// Output: write the result as CSV, ordered by tweet count.
OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();
This U-SQL script
▪ Extracts the source data file using Extractors.Csv()
▪ Transforms (aggregates) the data using SQL
▪ Creates a CSV file using Outputters.Csv()
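For readers without access to a Data Lake Analytics account, the three steps of the script above can be approximated in plain Python. This is a rough analogue under stated assumptions, not a reimplementation of U-SQL: the sample rows are invented, the function names `extract`, `tweet_counts`, and `output` are hypothetical, and the exact quoting behaviour of Outputters.Csv() is not reproduced.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for /input/MyTwitterHistory.csv (no header row).
SAMPLE = (
    "2018-03-01,09:15,alice,Hello Data Lake\n"
    "2018-03-01,09:20,bob,U-SQL looks familiar\n"
    '2018-03-02,10:05,alice,"Commas, inside quotes"\n'
)

COLUMNS = ["date", "time", "author", "tweet"]  # schema applied on read


def extract(text):
    """Roughly EXTRACT ... USING Extractors.Csv(): name the columns while reading."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, row)) for row in reader]


def tweet_counts(rows):
    """Roughly SELECT author, COUNT(*) AS tweetcount ... GROUP BY author."""
    counts = Counter(row["author"] for row in rows)
    # ORDER BY tweetcount DESC; ties are broken by author name for a stable result.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def output(result):
    """Roughly OUTPUT ... USING Outputters.Csv(): serialize the rowset as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(result)
    return buffer.getvalue()


rows = extract(SAMPLE)
result = tweet_counts(rows)
csv_text = output(result)
```

The schema lives in the reading step rather than in the file, which is the "schema on read" idea: the same file could be read again later with a different column list.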
21. Read the input, write it directly to output (just a simple copy)
Apply schema on read
From a file in a Data Lake
Easy delimited text handling
Write out
Rowset
22. Extract – Transform – Persist
▪ Retrieve data from stored locations in rowset format
▪ Stored locations can be files that will be schematized on read with EXTRACT expressions
▪ Stored locations can be U-SQL tables that are stored in a schematized format
▪ Or can be tables provided by other data sources such as an Azure SQL database
▪ Transform the rowset(s)
▪ Several transformations over the rowsets can be composed in a data flow format
▪ Store the transformed rowset data
▪ Store it in a file with an OUTPUT statement, or
▪ Store it in a U-SQL table with an INSERT statement
23. Run the U-SQL job
Submit Job
▪ (local) to run the script locally
▪ Data Lake Analytics account to run the script in the cloud
Solution Explorer
▪ Right-click Script.usql and click Submit Script
24. ADLA - Jobs
HDInsight (“Cluster Service”)
▪ Create a cluster of N nodes
▪ Pay as long as the cluster exists and is up and running
▪ Delete the cluster when done
Analytics (“Job Service”)
▪ Submit a job (a U-SQL script) and reserve N nodes of parallelism per job run
▪ Pay as long as the job is running (1 AU = CHF 1.807 / hour)
▪ Nodes go away when the job finishes
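The difference between the two billing models can be made concrete with a little arithmetic. The sketch below assumes that job cost is simply analytics units (AUs) multiplied by runtime in hours and by the rate quoted on the slide; the actual billing granularity and current prices are not covered here, and the cluster node rate is a made-up figure for illustration only.

```python
AU_RATE_CHF_PER_HOUR = 1.807  # rate quoted on the slide (2018)


def job_cost(analytics_units, runtime_minutes, rate=AU_RATE_CHF_PER_HOUR):
    """Job service: cost accrues only while the job runs (assumed linear in AU-hours)."""
    return analytics_units * (runtime_minutes / 60) * rate


def cluster_cost(nodes, hours_up, node_rate_chf_per_hour):
    """Cluster service: cost accrues as long as the cluster exists, busy or idle."""
    return nodes * hours_up * node_rate_chf_per_hour


# Ten AUs for a 30-minute job: 10 * 0.5 * 1.807 = 9.035 CHF.
example_job = job_cost(analytics_units=10, runtime_minutes=30)

# Hypothetical cluster: 10 nodes kept up for 8 hours at an assumed 0.50 CHF per node-hour.
example_cluster = cluster_cost(nodes=10, hours_up=8, node_rate_chf_per_hour=0.50)
```

Under these assumptions the job costs about CHF 9.04, while the cluster costs CHF 40 regardless of how much of the eight hours it spent working.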
25. Benefits of ADLA Job Service
▪ Pay for what you use
▪ Easier
▪ No need to fetch logs on the cluster
▪ No tuning needed of cluster
▪ Job History / Job Replay
▪ Performance analysis
▪ Vertex Debugging
▪ Built-in Job Monitoring
▪ Built-in Auditing
26. U-SQL Job Workflow
[Diagram: U-SQL job workflow, covering job submission and job execution. Components shown: Job Front End, Job Scheduler, Compiler Service, Job Queue, Job Manager, U-SQL Catalog, YARN, U-SQL Runtime (vertex execution).]
28. Demo oeVCH
EinwohnerCH.csv (source: BFS)
HaltestellenCH.csv (source: BAV)
Azure SQL DW
29. U-SQL = SQL + C#
▪ Unifies the ease of use of SQL with the expressive power of C#
▪ Get benefits of both…
▪ Unstructured and structured data processing
▪ Declarative SQL and custom imperative Code
▪ Local and remote Queries
30. ADLA – Cognitive Services
Imaging
– Detect faces
– Detect emotion
– Detect objects (tagging)
– OCR (optical character recognition)
Text
– Key Phrase Extraction
– Sentiment Analysis
31. Demo
images.csv
images_with_food.csv
tags_aggregated.csv