Big Data Analytics in the Cloud with Microsoft Azure


  1. 1. Twitter : @bigdataconf
  2. 2. Big Data Analytics in the Cloud: Microsoft Azure Cortana Intelligence Suite
      Mark Kromer, Microsoft Azure Cloud Data Architect (@kromerbigdata, @mssqldude)
  3. 3. What is Big Data Analytics?
      Tech Target: “… the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.”
      Techopedia: “… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions.”
      • Requires lots of data wrangling and Data Engineers
      • Requires Data Scientists to uncover patterns from complex raw data
      • Requires Business Analysts to provide business value from multiple data sources
      • Requires additional tools and infrastructure not provided by traditional database and BI technologies
      Why Cloud for Big Data Analytics?
      • Quick and easy to stand up new, large big data architectures
      • Elastic scale
      • Metered pricing
      • Quickly evolve architectures to rapidly changing landscapes
      • Prototype, tear down
  4. 4. Big Data Analytics Tools & Use Cases vs. “Traditional BI”
      Traditional BI
      • Sales reports
      • Post-campaign marketing research & analysis
      • CRM reports
      • Enterprise data assets
      • Can’t miss any transactions, records or rows
      • DWs
      • Relational Databases
      • Well-defined and formatted data sources
      • Direct connections to OLTP and LOB data sources
      • Excel
      • Well-defined business semantic models
      • OLAP cubes
      • MDM, Data Quality, Data Governance
      Big Data Analytics
      • Sentiment Analysis
      • Predictive Maintenance
      • Churn Analytics
      • Customer Analytics
      • Real-time marketing
      • Avoid simply siphoning off data for BI tools
      • Architect multiple paths for data pipelines: speed, batch, analytical
      • Plan for data of varying types, volumes and formats
      • Data can/will land at any time, any speed, any format
      • It’s OK to miss a few records and data points
      • NoSQL
      • MPP DWs
      • Hadoop, Spark, Storm
      • R & ML to find patterns in masses of data lakes
  5. 5. A few basic fundamentals: Big Data Analytics in the Cloud
      Collect and land data in lake
      • Key Values / JSON / CSV
      • Compress files
      • Columnar
      • Land raw data fast
      Process data pipelines (stream, batch, analysis)
      • Data Wrangle/Munge/Engineer
      • Find patterns
      • Prepare for business models
      Presentation Layer: Surface knowledge to business decision makers
      • Present to business decision makers
  6. 6. Azure Data Platform-at-a-glance
  7. 7. Cortana Intelligence Suite layers (diagram)
      • Data Sources: Apps, Sensors and devices, Data
      • Information Management: Event Hubs, Data Catalog, Data Factory
      • Big Data Stores: SQL Data Warehouse, Data Lake Store
      • Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
      • Intelligence: Cortana, Bot Framework, Cognitive Services
      • Dashboards & Visualizations: Power BI
      • Action: People, Automated Systems, Apps (Web, Mobile, Bots)
  8. 8. Azure Data Factory
      What it is: A pipeline system to move data in, perform activities on data, move data around, and move data out
      When to use it:
      • Create solutions using multiple tools as a single process
      • Orchestrate processes (scheduling)
      • Monitor and manage pipelines
      • Call and re-train Azure ML models
  9. 9. ADF Components
  10. 10. ADF Logical Flow
  11. 11. Example – Customer Churn (Azure Data Factory)
      • Data Sources: Call Log Files, Customer Table
      • Ingest: Call Log Files, Customer Table
      • Transform & Analyze: Customer Call Details, Customers Likely to Churn
      • Publish: Customer Churn Table
  12. 12. Simple ADF
      • Business Goal: Transform and analyze web logs each month
      • Design Process: Transform raw web logs using a Hive query, storing the results in Blob Storage
      Flow: Web logs loaded to Blob → HDInsight Hive query to transform log entries → Files ready for analysis and use in Azure ML
  13. 13. Azure SQL Data Warehouse
      What it is: A scaling data warehouse service in the cloud
      When to use it:
      • When you need a large-data BI solution in the cloud
      • MPP SQL Server in the cloud
      • Elastic scale data warehousing
      • When you need pause-able scale-out compute
  14. 14. Elastic scale & performance
      • Real-time elasticity: resize in <1 minute
      • On-demand compute: expand or reduce as needed
      • Pause the data warehouse to save on compute costs, for example during non-business hours
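
To make the pause point concrete, here is a small arithmetic sketch. The schedule (12 active hours a day, 5 days a week) is an assumption chosen for illustration, not a figure from the deck, and only compute hours are counted because the slide refers to compute costs.

```python
# Hypothetical schedule: the warehouse runs only during business hours.
HOURS_PER_WEEK = 24 * 7


def paused_hours(active_hours_per_day: int, active_days_per_week: int):
    """Return (paused hours per week, fraction of the week spent paused)."""
    active = active_hours_per_day * active_days_per_week
    paused = HOURS_PER_WEEK - active
    return paused, paused / HOURS_PER_WEEK


paused, fraction = paused_hours(active_hours_per_day=12, active_days_per_week=5)
print(f"Paused {paused} of {HOURS_PER_WEEK} hours ({fraction:.0%})")
```

Under that assumed schedule the warehouse is paused for 108 of 168 weekly hours, so roughly two thirds of the compute hours are not consumed.
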
  15. 15. Elastic scale & performance: Scale
      • Storage can be as big or small as required
      • Users can execute niche workloads without re-scanning data
  16. 16. Logical overview
  17. 17. Control and Compute nodes (diagram), each labeled with the same query:
          SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales];
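
One way to read the repeated query is as an MPP count: each compute node counts the rows it holds and the control node adds up the partial counts. The sketch below illustrates that idea in plain Python; the function names, the `SalesOrderNumber` key, and the choice of six distributions are hypothetical and do not describe the service's internals.

```python
from typing import Dict, List


def distribute(rows: List[dict], key: str, n_distributions: int) -> Dict[int, List[dict]]:
    """Hash-distribute rows across a fixed number of distributions."""
    buckets: Dict[int, List[dict]] = {i: [] for i in range(n_distributions)}
    for row in rows:
        buckets[hash(row[key]) % n_distributions].append(row)
    return buckets


def count_big(buckets: Dict[int, List[dict]]) -> int:
    """Each compute node counts its own rows; the control node sums the results."""
    partial_counts = [len(rows) for rows in buckets.values()]  # compute nodes
    return sum(partial_counts)                                 # control node


# Hypothetical fact table with 1000 rows.
rows = [{"SalesOrderNumber": i} for i in range(1000)]
buckets = distribute(rows, "SalesOrderNumber", n_distributions=6)
total = count_big(buckets)
print(total)
```

The total is the same no matter how the rows are spread, which is why the aggregation can be split across nodes.
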
  18. 18. Azure Data Lake
      What it is: Data storage (Web-HDFS) and Distributed Data Processing (HIVE, Spark, HBase, Storm, U-SQL) Engines
      When to use it:
      • Low-cost, high-throughput data store
      • Non-relational data
      • Larger storage limits than Blobs
  19. 19. Ingest all data regardless of requirements; store all data in native format without schema definition; do analysis using analytic engines like Hadoop and ADLA (interactive queries, batch queries, machine learning, data warehouse, real-time analytics). Data source shown: devices.
  20. 20. Azure Data Lake (Store, HDInsight, Analytics)
      • Analytics: ADL Analytics (U-SQL) and HDInsight (Hive), on YARN
      • Storage: ADL Store (WebHDFS)
  21. 21. Introducing ADLS: Azure Data Lake Store, a hyperscale repository for big data analytics workloads
      • No limits to SCALE
      • Store ANY DATA in its native format
      • HADOOP FILE SYSTEM (HDFS) for the cloud
      • Optimized for analytic workload PERFORMANCE
      • ENTERPRISE GRADE authentication, access control, audit, encryption at rest
  22. 22. Azure Data Lake Analytics highlights
      • All data
      • Productivity from day one
      • Easy and powerful data preparation
      • Limitless scale
      • Enterprise-grade
  23. 23. Developing big data apps
      • Author, debug, & optimize big data apps in Visual Studio
      • Multiple languages: U-SQL, Hive, & Pig
      • Seamlessly integrate .NET
  24. 24. Work across all cloud data
      Azure Data Lake Analytics over:
      • Azure SQL DW
      • Azure SQL DB
      • Azure Storage Blobs
      • Azure Data Lake Store
      • SQL DB in an Azure VM
  25. 25. What is U-SQL?
      • A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data
      • Allows users to focus on the what, not the how, of business problems
      • Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
      • Built for data developers & scientists
  26. 26. U-SQL language philosophy
      Declarative query and transformation language:
      • Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL Analytics functions
      • Optimizable, scalable
      Operates on unstructured & structured data:
      • Schema on read over files
      • Relational metadata objects (e.g. database, table)
      Extensible from ground up:
      • Type system is based on C#
      • Expression language is C#
      • User-defined functions (U-SQL and C#)
      • User-defined types (U-SQL/C#) (future)
      • User-defined aggregators (C#)
      • User-defined operators (UDO) (C#)
      U-SQL provides the parallelization and scale-out framework for user code:
      • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER
      Expression-flow programming style:
      • Easy to use functional lambda composition
      • Composable, globally optimizable
      Federated query across distributed data sources (soon)
      Example:
          // Make user code in the assembly available to the script.
          REFERENCE ASSEMBLY MyDB.MyAssembly;
          CREATE TABLE T( cid int, first_order DateTime, last_order DateTime, order_count int, order_amount float );
          // Schema on read: column names and types are applied while extracting.
          @o = EXTRACT oid int, cid int, odate DateTime, amount float
               FROM "/input/orders.txt"
               USING Extractors.Csv();
          @c = EXTRACT cid int, name string, city string
               FROM "/input/customers.txt"
               USING Extractors.Csv();
          // Per-customer order statistics; amount comes from the orders rowset (@o).
          @j = SELECT c.cid, MIN(o.odate) AS firstorder, MAX([…]) AS lastorder,
                      COUNT(o.oid) AS ordercnt, SUM(o.amount) AS totalamount
               FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
               WHERE […]"New") && MyNamespace.MyFunction(o.odate) > 10
               GROUP BY c.cid;
          // Write the result with a custom outputter, then load it into the table.
          OUTPUT @j TO "/output/result.txt" USING new MyData.Write();
          INSERT INTO T SELECT * FROM @j;
  27. 27. Expression-flow programming style
      • Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model
      • Execution plan that is optimized out of the box and without user intervention
      • Per-job and user-driven parallelization
      • Detailed visibility into execution steps, for debugging
      • Heat map functionality to identify performance bottlenecks
  28. 28. “Unstructured” Files
      • Schema on Read
      • Write to File
      • Built-in and custom Extractors and Outputters
      • ADL Storage and Azure Blob Storage
      EXTRACT Expression
          @s = EXTRACT a string, b int
               FROM "filepath/file.csv"
               USING Extractors.Csv();
      • Built-in Extractors: Csv, Tsv, Text with lots of options
      • Custom Extractors: e.g., JSON, XML, etc.
      OUTPUT Expression
          OUTPUT @s
          TO "filepath/file.csv"
          USING Outputters.Csv();
      • Built-in Outputters: Csv, Tsv, Text
      • Custom Outputters: e.g., JSON, XML, etc. (see […])
      Filepath URIs
      • Relative URI to default ADL Storage account: "filepath/file.csv"
      • Absolute URIs:
        • ADLS: "adl://"
        • WASB: "wasb://container@account/filepath/file.csv"
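
The "schema on read" idea from the slide above can be sketched outside U-SQL: the file stores untyped text, and column names and types are supplied only at read time. The Python below is an analogy under that assumption; `extract_csv` and its schema dictionary are hypothetical helpers, not part of U-SQL or any Azure SDK.

```python
import csv
import io
from typing import Callable, Dict, Iterator


def extract_csv(text: str, schema: Dict[str, Callable]) -> Iterator[dict]:
    """Apply a column schema while reading; the file itself carries no types."""
    names = list(schema)
    for record in csv.reader(io.StringIO(text)):
        # Convert each raw string with the type declared for its column.
        yield {name: schema[name](value) for name, value in zip(names, record)}


# Mirrors the slide's "a string, b int" column list.
raw = "alpha,1\nbeta,2\n"
rows = list(extract_csv(raw, {"a": str, "b": int}))
print(rows)
```

The same raw text could be read again with a different schema, which is the practical benefit of deferring the schema until read time.
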
  29. 29. Expression-flow Programming Style
      • Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model.
      • Execution plan that is optimized out of the box and without user intervention.
      • Per-job and user-driven level of parallelization.
      • Detailed visibility into execution steps, for debugging.
      • Heatmap-like functionality to identify performance bottlenecks.
  30. 30. Visual Studio integration
  31. 31. What can you do with Visual Studio?
      • Visualize and replay progress of job
      • Fine-tune query performance
      • Visualize physical plan of U-SQL query
      • Browse metadata catalog
      • Author U-SQL scripts (with C# code)
      • Create metadata objects
      • Submit and cancel U-SQL jobs
      • Debug U-SQL and C# code
  32. 32. Plug-in
  33. 33. Authoring U-SQL queries
      Visual Studio fully supports authoring U-SQL scripts. While editing, it provides:
      • IntelliSense
      • Syntax color coding
      • Syntax checking
      • …
      • Contextual menu
  34. 34. Job execution graph
      • After a job is submitted, its progress through the different execution stages is shown and updated continuously
      • Important stats about the job are also displayed and updated continuously
  35. 35. Job diagnostics: Diagnostics information is shown to help with debugging and performance issues
  36. 36. HDInsight: Cloud Managed Hadoop
      What it is: Microsoft’s implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage
      When to use it:
      • When you need to process large-scale data (PB+)
      • When you want to use Hadoop or Spark as a service
      • When you want to compute data and retire the servers, but retain the results
      • When your team is familiar with the Hadoop Zoo
  37. 37. Hadoop and HDInsight: Using the Hadoop ecosystem to process and query data
  38. 38. HDInsight Tools for Visual Studio
  39. 39. Deploying HDInsight Clusters
      • Cluster Type: Hadoop, Spark, HBase and Storm
        • Hadoop clusters: for query and analysis workloads
        • HBase clusters: for NoSQL workloads
        • Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
      • Operating System: Windows or Linux
      • Can be deployed from Azure portal, Azure Command Line Interface (CLI), or Azure PowerShell and Visual Studio
      • A UI dashboard is provided to the cluster through Ambari
      • Remote access through SSH, REST API, ODBC, JDBC
      • Remote Desktop (RDP) access for Windows clusters
  40. 40. Azure Machine Learning
      What it is: A multi-platform environment and engine to create and deploy Machine Learning models and APIs
      When to use it:
      • When you need to create predictive analytics
      • When you need to share Data Science experiments across teams
      • When you need to create callable APIs for ML functions
      • When you also have R and Python experience on your Data Science team
  41. 41. Creating an Experiment
      Create Workspace → Build and Model (Get/Prepare Data → Build/Edit Experiment → Create/Update Model → Evaluate Model Results) → Deploy Model → Consume Model
  42. 42. Basic Azure ML Elements: Import Data, Preprocess, Algorithm, Train Model, Split Data, Score Model
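
The elements named on the slide above can be read as stages of a small training pipeline. The sketch below walks through them in plain Python rather than in Azure ML itself; the data, the every-fourth-row split, and the midpoint-threshold "algorithm" are all assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def import_data():
    """Import Data: hypothetical (feature, label) rows plus one incomplete record."""
    rows = [(x, 1 if x >= 100 else 0) for x in range(200)]
    rows.append((None, 0))  # a record with a missing feature
    return rows


def preprocess(rows):
    """Preprocess: drop records with missing values."""
    return [(x, y) for x, y in rows if x is not None]


def split_data(rows, every=4):
    """Split Data: hold out every 4th row for scoring, train on the rest."""
    train = [r for i, r in enumerate(rows) if i % every != 0]
    test = [r for i, r in enumerate(rows) if i % every == 0]
    return train, test


def train_model(train):
    """Algorithm + Train Model: a threshold midway between the two class means."""
    mean0 = mean(x for x, y in train if y == 0)
    mean1 = mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def score_model(threshold, test):
    """Score Model: fraction of held-out rows classified correctly."""
    hits = sum((1 if x >= threshold else 0) == y for x, y in test)
    return hits / len(test)


train, test = split_data(preprocess(import_data()))
threshold = train_model(train)
accuracy = score_model(threshold, test)
print(threshold, accuracy)
```

Scoring uses only rows that the model did not see during training, which is the reason for the split stage.
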
  43. 43. Power BI
      What it is: Interactive report and visualization creation for computing and mobile platforms
      When to use it:
      • When you need to create and view interactive reports that combine multiple datasets
      • When you need to embed reporting into an application
      • When you need customizable visualizations
      • When you need to create shared datasets, reports, and dashboards that you publish to your team
  44. 44. Common architectural patterns
  45. 45. Big Data Analytics – Data Flow
  46. 46. Event Ingestion Patterns
      • Event producers: Business apps, Custom apps, Sensors and devices
      • Ingestion: Azure Event Hubs, Kafka
      • Azure Data Lake Store: Raw Events, Transformed Data
  47. 47. Bulk Ingestion and Preparation
      • Sources: Business apps, Custom apps, Sensors and devices
      • Bulk Load: Azure Data Factory
  48. 48. Big Data Lambda Architecture
      • Data Collection: Event & data producers (Applications, Web and social, Devices, Sensors), Field gateways, Cloud gateways (web APIs)
      • Queuing System: Event hubs, Kafka/RabbitMQ/ActiveMQ
      • Data Transformation: Storm / Stream Analytics, Hive / U-SQL, Pig, Data Factory, Azure ML
      • Data Storage: DocumentDB, MongoDB, SQL Azure, ADW, HBase, Blob Storage
      • Presentation and action: Azure Search, Data analytics (Excel, Power BI, Looker, Tableau), Web/thick client dashboards, Live Dashboards, Devices to take action
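
The Lambda Architecture named on this slide combines a batch path and a speed path over the same events. A minimal in-memory sketch, assuming simple per-key event counts as the query, might look like this; the `LambdaCounts` class and the sensor names are hypothetical and stand in for the real queues, stores, and engines listed above.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda Architecture: a batch view plus a speed view, merged at query time."""

    def __init__(self):
        self.master = []             # append-only log of every event
        self.batch_view = Counter()  # recomputed from the full log
        self.speed_view = Counter()  # covers events since the last batch run

    def ingest(self, event_key: str) -> None:
        self.master.append(event_key)
        self.speed_view[event_key] += 1  # speed layer updates immediately

    def run_batch(self) -> None:
        # Batch layer: rebuild the view from all data, then reset the speed view.
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, event_key: str) -> int:
        # Serving layer: merge batch and real-time results.
        return self.batch_view[event_key] + self.speed_view[event_key]


counts = LambdaCounts()
for key in ["sensor-a", "sensor-a", "sensor-b"]:
    counts.ingest(key)
counts.run_batch()
counts.ingest("sensor-a")  # arrives after the batch run
print(counts.query("sensor-a"))
```

The query stays correct between batch runs because recent events are served from the speed view until the next batch recomputation absorbs them.
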
  49. 49. Get started today! Cortana Intelligence Solutions
  50. 50. Cortana Intelligence Solutions: Discover
  51. 51. Cortana Intelligence Solutions: Try
  52. 52. Cortana Intelligence Solutions: Deploy

Editor's notes

  • What you can do with it:
    Virtual Machines: and
  • Azure Data Factory:
  • Pricing:
  • Learning Path:
    Quick Example:
  • Video of this process:
  • More options: Prepare System: - Follow steps
    Another Lab:
  • Azure SQL Data Warehouse:
  • Azure Data Lake:
  • All data
    Unstructured, Semi structured, Structured
    Domain-specific user defined types using C#
    Queries over Data Lake and Azure Blobs
    Federated Queries over Operational and DW SQL stores removing the complexity of ETL

    Productive from day one
    Effortless scale and performance without need to manually tune/configure
    Best developer experience throughout development lifecycle for both novices and experts
    Leverage your existing skills with SQL and .NET

    Easy and powerful data preparation
    Easy to use built-in connectors for common data formats
    Simple and rich extensibility model for adding customer-specific data transformations, both existing and new

    No limits to scale
    Scales on demand with no change to code
    Automatically parallelizes SQL and custom code
    Designed to process petabytes of data

    Enterprise grade
    Managing, securing, sharing, and discovery of familiar data and code objects (tables, functions etc.)
    Role based authorization of Catalogs and storage accounts using AAD security
    Auditing of catalog objects (databases, tables etc.)
  • ADLA allows you to compute on data anywhere and join data from multiple cloud sources.
  • Use for language experts
  • Azure HDInsight:
  • Primary site:
    Quick overview:
    4-week online course through the edX platform:
    11 minute introductory video:
    Microsoft Virtual Academy Training (4 hours) -
    Learning path for HDInsight:
    Azure Feature Pack for SQL Server 2016, i.e., SSIS (SQL Server Integration Services):
  • Azure Portal:
    Provisioning Clusters:
    Different clusters have different node types, number of nodes, and node sizes.
  • Azure Machine Learning:
  • Beginning Series:
  • Designing an experiment in the Studio:
  • Power BI:
  • Customize yourself or with featured partners