Killer Scenarios with Data Lake in Azure with U-SQL

Presentation from Microsoft Data Science Summit 2016
Presents 4 examples of custom U-SQL data processing: Overlapping Range Aggregation, JSON Processing, Image Processing and R with U-SQL

Published in: Data & Analytics

  1. U-SQL extensibility: extend U-SQL with C#/.NET
     • Built-in operators, functions, aggregates
     • C# expressions (in SELECT expressions)
     • User-defined aggregates (UDAGGs)
     • User-defined functions (UDFs)
     • User-defined operators (UDOs)
  2. What are UDOs? Custom operator extensions, scaled out by U-SQL
     • User-Defined Extractors
     • User-Defined Outputters
     • User-Defined Processors: take one row and produce one row; pass-through versus transforming
     • User-Defined Appliers: take one row and produce 0 to n rows; used with OUTER/CROSS APPLY
     • User-Defined Combiners: combine rowsets (like a user-defined join)
     • User-Defined Reducers: take n rows and produce m rows (normally m<n)
     • Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE
  3. UDO model
     • Marking UDOs
     • Parameterizing UDOs
     • UDO signature
     • UDO-specific processing pattern
     • Rowsets and their schemas in UDOs
     • Setting results: by position or by name

         using System;
         using System.Collections.Generic;
         using System.IO;
         using System.Text;
         using Microsoft.Analytics.Interfaces;

         [SqlUserDefinedExtractor]
         public class DriverExtractor : IExtractor
         {
             private byte[] _row_delim;
             private string _col_delim;
             private Encoding _encoding;

             // Define a non-default constructor since I want to pass in my own parameters
             public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
             {
                 _encoding = encoding == null ? Encoding.UTF8 : encoding;
                 _row_delim = _encoding.GetBytes(row_delim);
                 _col_delim = col_delim;
             } // DriverExtractor

             // Converting text to target schema
             private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
             {
                 var schema = outputrow.Schema;
                 if (schema[i].Type == typeof(int))
                 {
                     var tmp = Convert.ToInt32(c);
                     outputrow.Set(i, tmp);
                 }
                 ...
             } // OutputValueAtCol_I

             // Split the input into rows, then each row into columns
             public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
             {
                 foreach (var row in input.Split(_row_delim))
                 {
                     using (var s = new StreamReader(row, _encoding))
                     {
                         int i = 0;
                         foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                         {
                             OutputValueAtCol_I(c, i++, outputrow);
                         } // foreach
                     } // using
                     yield return outputrow.AsReadOnly();
                 } // foreach
             } // Extract
         } // class DriverExtractor
  4. How to specify UDOs?
     • Code behind
  5. How to specify UDOs?
     • C# Class Project for U-SQL
  6. How to specify UDOs?
     • Any .NET language is usable, although not first-class in tooling
     • Use U-SQL specific .NET DLLs
     • Compile DLL, upload to ADLS, register with script
  7. Managing Assemblies
     • Create assemblies
         CREATE ASSEMBLY db.assembly FROM @path;
         CREATE ASSEMBLY db.assembly FROM byte[];
       Can also include additional resource files.
     • Reference assemblies
         REFERENCE ASSEMBLY db.assembly;
       Referencing .NET Framework assemblies:
       • Always accessible system namespaces: U-SQL specific (e.g., for SQL.MAP), plus everything provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
       • Add all other .NET Framework assemblies with:
           REFERENCE SYSTEM ASSEMBLY [System.XML];
     • Enumerate assemblies
       • PowerShell command
       • U-SQL Studio Server Explorer
     • Drop assemblies
         DROP ASSEMBLY db.assembly;
     • Visual Studio makes registration easy!
  8. 'USING' csharp_namespace | Alias '=' csharp_namespace_or_class.
     Examples:

         DECLARE @input string = "somejsonfile.json";

         REFERENCE ASSEMBLY [Newtonsoft.Json];
         REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

         USING Microsoft.Analytics.Samples.Formats.Json;

         @data0 =
             EXTRACT IPAddresses string
             FROM @input
             USING new JsonExtractor("Devices[*]");

         USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

         @data1 =
             EXTRACT IPAddresses string
             FROM @input
             USING new json("Devices[*]");
  9. Overlapping Range Aggregation
     https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos

     Input ranges:

         Start Time   End Time   User Name
         5:00 AM      6:00 AM    ABC
         5:00 AM      6:00 AM    XYZ
         8:00 AM      9:00 AM    ABC
         8:00 AM      10:00 AM   ABC
         10:00 AM     2:00 PM    ABC
         7:00 AM      11:00 AM   ABC
         9:00 AM      11:00 AM   ABC
         11:00 AM     11:30 AM   ABC
         11:40 PM     11:59 PM   FOO
         11:50 PM     0:40 AM    FOO

     Combined ranges:

         Start Time   End Time   User Name
         5:00 AM      6:00 AM    ABC
         5:00 AM      6:00 AM    XYZ
         7:00 AM      2:00 PM    ABC
         11:40 PM     0:40 AM    FOO
  10. Overlapping Range Aggregation
      U-SQL:

          @r = REDUCE @in
               PRESORT begin
               ON user
               PRODUCE begin DateTime,
                       end DateTime,
                       user string
               READONLY user
               USING new ReduceSample.RangeReducer();
  11. Overlapping Range Aggregation
      Code behind:

          namespace ReduceSample
          {
              [SqlUserDefinedReducer(IsRecursive = true)]
              public class RangeReducer : IReducer
              {
                  public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
                  {
                      // Init aggregation values
                      int i = 0;
                      var begin = DateTime.MaxValue;
                      var end = DateTime.MinValue;

                      foreach (var row in input.Rows)
                      {
                          ...
                          begin = row.Get<DateTime>("begin");
                          end = row.Get<DateTime>("end");
                          ...
                          output.Set<DateTime>("begin", begin);
                          output.Set<DateTime>("end", end);
                          yield return output.AsReadOnly();
                          ...
                      } // foreach
                  } // Reduce
              } // class RangeReducer
          } // namespace ReduceSample
  12. JSON Processing
      How do I extract data from JSON documents?
      https://github.com/Azure/usql/tree/master/Examples/DataFormats
  13. JSON Processing
      • Architecture of the sample format assembly: Microsoft.Analytics.Samples.Formats, NewtonSoft.Json, System.Xml
      • Single JSON document per file: use JsonExtractor
      • Multiple JSON documents per file:
        • Do not allow CR/LF (row delimiter) in JSON
        • Use built-in Text Extractor to extract
        • Use JsonTuple to schematize (with CROSS APPLY)
      • Currently loads the full JSON document into memory; better to use JSONReader processing if docs are large
  14. JSON Processing

          @json =
              EXTRACT personid int,
                      name string,
                      addresses string
              FROM @input
              USING new Json.JsonExtractor("[*].person");

          @person =
              SELECT personid,
                     name,
                     Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
              FROM @json;

          @addresses =
              SELECT personid,
                     name,
                     Json.JsonFunctions.JsonTuple(address) AS address
              FROM @person
                   CROSS APPLY
                       EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

          @result =
              SELECT personid,
                     name,
                     address["addressid"] AS addressid,
                     address["street"] AS street,
                     address["postcode"] AS postcode,
                     address["city"] AS city
              FROM @addresses;
  15. Image Processing
      https://github.com/Azure/usql/tree/master/Examples/ImageApp

          Copyright   Camera Make   Camera Model   Thumbnail
          Michael     Canon         70D            (image)
          Michael     Samsung       S7             (image)
  16. Image Processing
      • Image processing assembly
        • Uses System.Drawing
        • Exposes Extractors, Outputter, Processor, User-defined Functions
      • Trade-offs
        • Column memory limits: Image Extractor vs Feature Extractor
        • Main memory pressures in vertex: UDFs vs Processor vs Extractor
  17. R Processing: KMeans Centroids
  18. Architecture: U-SQL Processing with R
      • KMeansRReducer
      • R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
      • R Runtime (R-bin.zip)
      • R Engine Manager Utility (RUtilities.dll)
      • Similar approaches can be used to deploy other runtimes: Python, JavaScript, JVM
      • No external access from UDOs
      • Future work:
        • More generic samples
        • More automatic experiences (no user wrappers/deploys)
  19. What are UDOs? Custom operator extensions written in .NET (C#), scaled out by U-SQL
  20. UDO Tips and Warnings
      • Tips when using UDOs:
        • READONLY clause to allow pushing predicates through UDOs
        • REQUIRED clause to allow column pruning through UDOs
        • PRESORT on REDUCE if you need global order
        • Hint cardinality if the optimizer chooses the wrong plan
      • Warnings and better alternatives:
        • Use SELECT with UDFs instead of PROCESS
        • Use user-defined aggregators instead of REDUCE
        • Learn to use windowing functions (OVER expression)
      • Good use cases for PROCESS/REDUCE/COMBINE:
        • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON doc for the data in the row where the columns are not known a priori.
        • Your UDF-based solution creates too much memory pressure and you can write more memory-efficient code in a UDO.
        • You need an ordered aggregator or need to produce more than 1 row per group.
  21. Resources
      • http://usql.io
      • http://blogs.msdn.microsoft.com/azuredatalake/
      • http://blogs.msdn.microsoft.com/mrys/
      • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
      • http://aka.ms/usql_reference
      • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
      • https://msdn.microsoft.com/en-us/magazine/mt614251
      • http://aka.ms/adlfeedback
      • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
      • http://stackoverflow.com/questions/tagged/u-sql
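
The reducer code on the Overlapping Range Aggregation slides elides the actual merge step. The following is a minimal sketch of that step, written in Python rather than C#/U-SQL so it runs stand-alone; it is not the deck's RangeReducer. The names `merge_ranges` and `at` are hypothetical, and the date 2016-01-01 is an assumption, since the slide lists only times of day. Sorting by user and then by begin time plays the role of `ON user` and `PRESORT begin`.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def merge_ranges(rows):
    """Merge overlapping (begin, end, user) ranges separately for each user."""
    merged = []
    # Group by user and order each group by begin time, mirroring
    # REDUCE ... PRESORT begin ON user in the U-SQL script.
    ordered = sorted(rows, key=itemgetter(2, 0))
    for user, group in groupby(ordered, key=itemgetter(2)):
        cur_begin = cur_end = None
        for begin, end, _ in group:
            if cur_begin is None:
                cur_begin, cur_end = begin, end
            elif begin <= cur_end:
                # Overlapping or touching: extend the current range.
                cur_end = max(cur_end, end)
            else:
                # Gap: emit the finished range and start a new one.
                merged.append((cur_begin, cur_end, user))
                cur_begin, cur_end = begin, end
        if cur_begin is not None:
            merged.append((cur_begin, cur_end, user))
    return merged


def at(hour, minute=0, day=1):
    # Hypothetical helper; the year, month and day are assumed.
    return datetime(2016, 1, day, hour, minute)


# The input rows from the Overlapping Range Aggregation slide.
rows = [
    (at(5), at(6), "ABC"),
    (at(5), at(6), "XYZ"),
    (at(8), at(9), "ABC"),
    (at(8), at(10), "ABC"),
    (at(10), at(14), "ABC"),
    (at(7), at(11), "ABC"),
    (at(9), at(11), "ABC"),
    (at(11), at(11, 30), "ABC"),
    (at(23, 40), at(23, 59), "FOO"),
    (at(23, 50), at(0, 40, day=2), "FOO"),
]

for begin, end, user in merge_ranges(rows):
    print(begin, end, user)
```

Because the ranges arrive sorted by begin time, a single pass with one "current" range per user is enough, which is the reason the U-SQL script asks for PRESORT.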

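
The JSON Processing script (slide 14) turns nested person documents into one row per address. As a rough illustration of that flattening outside U-SQL, here is a Python sketch using only the standard `json` module. The document shape (a top-level array of objects with a `person` key, whose `addresses` object holds an `address` array) is an assumption inferred from the path `[*].person` and the `["address"]` lookup; the function name `flatten_addresses` and the sample values are made up for the example.

```python
import json


def flatten_addresses(doc_text):
    """Return one flat row per (person, address) pair."""
    rows = []
    for entry in json.loads(doc_text):
        person = entry["person"]  # corresponds to "[*].person"
        # Corresponds to CROSS APPLY EXPLODE over the address array.
        for address in person["addresses"]["address"]:
            rows.append({
                "personid": person["personid"],
                "name": person["name"],
                "addressid": address["addressid"],
                "street": address["street"],
                "postcode": address["postcode"],
                "city": address["city"],
            })
    return rows


# Made-up sample document in the assumed shape.
sample = json.dumps([
    {"person": {
        "personid": 1,
        "name": "Ann",
        "addresses": {"address": [
            {"addressid": 10, "street": "1 Main St",
             "postcode": "12345", "city": "Springfield"},
            {"addressid": 11, "street": "2 Oak Ave",
             "postcode": "67890", "city": "Shelbyville"},
        ]},
    }},
])

for row in flatten_addresses(sample):
    print(row)
```

Like the sample JsonExtractor described on slide 13, this sketch loads the whole document into memory, so it shares the same limitation for large files.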