
Azure Data Lake and U-SQL


An introduction to Azure Data Lake and U-SQL, presented at the Seattle Scalability Meetup in January 2016. Demo code is available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis

Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.


Azure Data Lake and U-SQL

  1. SeaScale Meetup, Jan 2016: Azure Data Lake & U-SQL
     Michael Rys, @MikeDoesBigData
     http://www.azure.com/datalake
     {mrys, usql}@microsoft.com
  2. Azure Data Lake
     Analytics: HDInsight (“managed clusters”), Azure Data Lake Analytics
     Storage: Azure Data Lake Storage
  3. ADLA complements HDInsight
     Both target the same scenarios, tools, and customers.
     HDInsight: for developers familiar with the open-source stack (Java, Eclipse, Hive, etc.). Clusters offer customization, control, and flexibility in a managed Hadoop cluster.
     ADLA: enables customers to leverage existing experience with C#, SQL & PowerShell. Offers convenience, efficiency, automatic scale, and management in a “job service” form factor.
  4. Azure Data Lake (architecture diagram labels): WebHDFS, YARN, U-SQL, Analytics Service, HDInsight (managed Hadoop clusters), Analytics, Store
  5. Azure Data Lake Analytics Service
  6. Azure Data Lake Analytics
     All data
     Productivity from day one
     Easy and powerful data preparation
     Limitless scale
     Enterprise-grade
  7. Azure Data Lake Analytics Service
     A new distributed analytics service
     Built on Apache YARN
     Scales dynamically with the turn of a dial
     Pay by the query
     Supports Azure AD for access control, roles, and integration with on-prem identity systems
     Built with U-SQL to unify the benefits of SQL with the power of C#
     Processes data across Azure
  8. Work across all cloud data
     Azure Data Lake Analytics works with: Azure SQL DW, Azure SQL DB, Azure Storage Blobs, Azure Data Lake Store, SQL DB in an Azure VM
  9. Azure Data Lake U-SQL
  10. [Slide text not captured]
  11. Hard to work with anything other than structured data
      Difficult to extend with custom code
  12. User often has to care about scale and performance
      SQL is second-class, embedded within strings
      Often no code reuse/sharing across queries
  13. Get benefits of both!
      Makes it easy for you by unifying:
      • Unstructured and structured data processing
      • Declarative SQL and custom imperative code
      • Local and remote queries
      • Increase productivity and agility from Day 1 and at Day 100 for YOU!
  14. Extend U-SQL with C#/.NET
      Built-in operators, functions, aggregates
      C# expressions (in SELECT expressions)
      User-defined aggregates (UDAGGs)
      User-defined functions (UDFs)
      User-defined operators (UDOs)
  15. U-SQL Language Philosophy
      Declarative query and transformation language:
      • Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions
      • Optimizable, scalable
      Expression-flow programming style:
      • Easy-to-use functional lambda composition
      • Composable, globally optimizable
      Operates on unstructured & structured data:
      • Schema on read over files
      • Relational metadata objects (e.g. database, table)
      Extensible from the ground up:
      • Type system is based on C#
      • Expression language IS C#
      • User-defined functions (U-SQL and C#)
      • User-defined aggregators (C#)
      • User-defined operators (UDO) (C#)
      U-SQL provides the parallelization and scale-out framework for user code:
      • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
      Federated query across distributed data sources

      Example script from the slide:

          REFERENCE ASSEMBLY MyDB.MyAssembly;

          CREATE TABLE T( cid int, first_order DateTime, last_order DateTime, order_count int, order_amount float );

          @o = EXTRACT oid int, cid int, odate DateTime, amount float
               FROM "/input/orders.txt"
               USING Extractors.Csv();

          @c = EXTRACT cid int, name string, city string
               FROM "/input/customers.txt"
               USING Extractors.Csv();

          @j = SELECT c.cid, MIN(o.odate) AS firstorder, MAX(o.odate) AS lastorder,
                      COUNT(o.oid) AS ordercnt, AGG<MyAgg.MySum>(o.amount) AS totalamount
               FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
               WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10
               GROUP BY c.cid;

          OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

          INSERT INTO T SELECT * FROM @j;
  16. Intro blog entry: http://aka.ms/usql-intro
      Blog entry on UDFs: http://aka.ms/usql-udf
      U-SQL Reference Doc (beta): http://aka.ms/usql_reference
      U-SQL Community & Team site: http://usql.io/
      Videos: https://channel9.msdn.com/Series/AzureDataLake
  17. Additional Resources
      • Blogs and community page:
        • http://usql.io
        • https://blogs.msdn.microsoft.com/azuredatalake/
        • http://blogs.msdn.com/b/visualstudio/
        • http://azure.microsoft.com/en-us/blog/topics/big-data/
        • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
      • Documentation:
        • http://aka.ms/usql_reference
        • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
      • ADL forums and feedback:
        • http://aka.ms/adlfeedback
        • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
        • http://stackoverflow.com/questions/tagged/u-sql
  18. Unifies natively SQL’s declarativity and C#’s extensibility
      Unifies querying structured and unstructured data
      Unifies local and remote queries
      Increase productivity and agility from Day 1 forward for YOU!
      Sign up for an Azure Data Lake account, join the Public Preview at http://www.azure.com/datalake, and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
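
The script on slide 15 follows an extract, join, filter, group, aggregate flow. U-SQL itself only runs inside Azure Data Lake Analytics or its local tooling, so the following is a plain-Python sketch of the same dataflow, not U-SQL. The sample rows, the `extract` and `summarize` helpers, and the date format are assumptions made for illustration; the `MyNamespace.MyFunction` filter from the slide is left out because its behavior is not given, and customer ids are assumed unique so that grouping by `cid` yields one row per customer.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical stand-ins for /input/orders.txt and /input/customers.txt.
ORDERS_CSV = "1,10,2016-01-05,20.0\n2,10,2016-01-20,30.0\n3,11,2016-01-15,5.0\n"
CUSTOMERS_CSV = "10,Ann,New York\n11,Bob,Seattle\n12,Cy,Newark\n"


def parse_date(text):
    return datetime.strptime(text, "%Y-%m-%d")


def extract(text, schema):
    """Schema on read: the column names and types are applied while reading."""
    for row in csv.reader(io.StringIO(text)):
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


def summarize(customers, orders):
    """LEFT OUTER JOIN customers to orders, keep 'New*' cities, aggregate per cid."""
    orders_by_cid = defaultdict(list)
    for order in orders:
        orders_by_cid[order["cid"]].append(order)

    rows = []
    for customer in customers:
        if not customer["city"].startswith("New"):
            continue
        # An empty list plays the role of the unmatched side of the outer join.
        matched = orders_by_cid.get(customer["cid"], [])
        dates = [order["odate"] for order in matched]
        rows.append((
            customer["cid"],
            min(dates) if dates else None,
            max(dates) if dates else None,
            len(matched),
            sum((order["amount"] for order in matched), 0.0),
        ))
    return rows


orders = list(extract(ORDERS_CSV, [("oid", int), ("cid", int),
                                   ("odate", parse_date), ("amount", float)]))
customers = list(extract(CUSTOMERS_CSV, [("cid", int), ("name", str), ("city", str)]))
summary = summarize(customers, orders)
```

In U-SQL the optimizer decides how to parallelize these steps; the sketch runs them sequentially in memory and only illustrates the shape of the computation.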

Editor's notes

  • All data
    Unstructured, Semi structured, Structured
    Domain-specific user defined types using C#
    Queries over Data Lake and Azure Blobs
    Federated Queries over Operational and DW SQL stores removing the complexity of ETL

    Productive from day one
    Effortless scale and performance without need to manually tune/configure
    Best developer experience throughout development lifecycle for both novices and experts
    Leverage your existing skills with SQL and .NET

    Easy and powerful data preparation
    Easy to use built-in connectors for common data formats
    Simple and rich extensibility model for adding customer-specific data transformations, both existing and new

    Limitless scale
    Scales on demand with no change to code
    Automatically parallelizes SQL and custom code
    Designed to process petabytes of data

    Enterprise grade
    Managing, securing, sharing, and discovery of familiar data and code objects (tables, functions etc.)
    Role based authorization of Catalogs and storage accounts using AAD security
    Auditing of catalog objects (databases, tables etc.)
  • A new distributed analytics service
    Built on Apache YARN
    Dynamically scales
    Handles jobs of any scale instantly by simply setting the dial for how much power you need.
    You only pay for the cost of the query
    Supports Azure Active Directory for Access Control, Roles, Integration with on-premises identity systems
    It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
    U-SQL’s scalable runtime processes data across multiple Azure data sources
  • ADLA lets you compute on data anywhere and join data from multiple cloud sources.
  • Hard to operate on unstructured data: even Hive requires metadata to be created before it can operate on unstructured data. Adding custom Java functions, aggregators, and SerDes involves many steps, often requires access to the server’s head node, and differs by type of operation, so it takes many tools and steps.

    Some examples:

    Hive UDAgg
    Code and compile .java into .jar:
    Extend the AbstractGenericUDAFResolver class: does type checking, argument checking, and overloading
    Extend the GenericUDAFEvaluator class: implements the logic in 8 methods
    Deploy:
    Deploy the jar into the class path on the server
    Edit FunctionRegistry.java to register the function as a built-in
    Update the content of SHOW FUNCTIONS with ant

    Hive UDF (as of v0.13)
    Code
    Load the JAR onto the head node or at a URI
    CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using its functions)
  • Spark supports custom “inputters and outputters” for defining custom RDDs
    No UDAGGs
    Simple integration of UDFs, but only for the duration of the program. No reuse/sharing.

    Cloud Dataflow? The user has to care about scale and performance.

    Spark UDAgg
    Not yet supported (SPARK-3947)

    Spark UDF
    Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
    For SQL usage, register the table: customerTable.registerTempTable("customerTable")
    Register each UDF: sqlContext.udf.register("westernState", westernState _)
    Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
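
The register-then-call pattern in the Spark UDF notes above, and the point that the registration lasts only as long as the program, can be mimicked with Python's standard `sqlite3` module. This is an analogy under stated assumptions, not Spark code: the table contents are invented, and `sqlite3` scopes a registered function to one connection rather than to one `sqlContext`.

```python
import sqlite3


def western_state(state):
    """Same predicate as the westernState example, returned as 0/1 for SQLite."""
    return int(state in ("CA", "OR", "WA", "AK"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customerTable (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customerTable VALUES (?, ?)",
    [("Ann", "WA"), ("Bob", "NY"), ("Cy", "CA")],  # invented sample rows
)

# Register the Python function under a SQL-visible name, then call it by that name.
conn.create_function("westernState", 1, western_state)
western = conn.execute(
    "SELECT name FROM customerTable WHERE westernState(state) ORDER BY name"
).fetchall()

# A second connection never saw the registration, so the name is unknown there.
other = sqlite3.connect(":memory:")
try:
    other.execute("SELECT westernState('WA')")
    visible_elsewhere = True
except sqlite3.OperationalError:
    visible_elsewhere = False
```

The second connection shows the "no reuse/sharing" limitation: every new session has to register the function again before queries can use it.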
  • Offers Auto-scaling and performance
    Operates on unstructured data without tables needed
    Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg.
    Easy to query remote sources even without external tables

    U-SQL UDAgg
    Code and compile the .cs file:
    Implement IAggregate’s 3 methods: Init(), Accumulate(), Terminate()
    C# takes care of type checking, generics, etc.
    Deploy:
    Tooling: one-click registration of the assembly in the user database
    By hand:
    Copy the file to ADL
    CREATE ASSEMBLY to register the assembly
    Use via AGG<MyNamespace.MyAggregate<T>>(a)

    U-SQL UDF
    Code in C#, register assembly once, call by C# name.

  • Extensions require .NET assemblies to be registered with a database
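
The U-SQL UDAgg notes above describe an aggregate as three methods: Init(), Accumulate(), and Terminate(). A real U-SQL aggregate is a C# class in a registered assembly; the following is only a plain-Python sketch of that three-method contract and of how a grouping step might drive it. The class name `MySum`, the `aggregate_by_key` driver, and the sample rows are hypothetical.

```python
from collections import defaultdict


class MySum:
    """Hypothetical aggregate following the Init/Accumulate/Terminate shape."""

    def init(self):
        # Called once per group before any value is seen.
        self.total = 0.0

    def accumulate(self, value):
        # Called once per row in the group.
        self.total += value

    def terminate(self):
        # Called once per group to produce the final result.
        return self.total


def aggregate_by_key(rows, key, value, agg_class):
    """Group rows by `key`, then run one aggregate instance over each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])

    results = {}
    for group_key, values in groups.items():
        agg = agg_class()
        agg.init()
        for item in values:
            agg.accumulate(item)
        results[group_key] = agg.terminate()
    return results


sample_rows = [
    {"cid": 1, "amount": 20.0},
    {"cid": 1, "amount": 30.0},
    {"cid": 2, "amount": 5.0},
]
totals = aggregate_by_key(sample_rows, "cid", "amount", MySum)
```

Because the aggregate only ever sees one group's values, the engine is free to process groups in parallel, which is the property the notes attribute to U-SQL's handling of custom code.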
