1 Introduction to Microsoft data platform analytics for release

  1. Introduction to Analytics with the Microsoft Data Platform – Jen Stirrup, Data Whisperer, Data Relish. Level: 300
  2. Jen Stirrup · Consultant · Postgraduate degrees in Artificial Intelligence and Cognitive Science · Twenty-year career in industry · Author
  3. Contact Details
  4. Agenda • Azure Data Explorer • Azure Data Factory • Streaming Analytics • Event Hubs • Azure SQL Database
  5. Agenda • Analysis Services • Data Lake Analytics • HDInsight • Azure Databricks • Azure SQL Datawarehouse • Azure Synapse
  6. What to use and when?
● Azure Synapse Analytics – a fully managed, elastic data warehouse with security at every level of scale at no extra cost
● Azure Databricks – a fast, easy and collaborative Apache Spark-based analytics platform
● HDInsight – a fully managed cloud Hadoop and Spark service backed by a 99.9% SLA for your enterprise
● Machine Learning – a fully managed cloud service that enables you to easily build, deploy and share predictive analytics solutions
● Stream Analytics – an on-demand, real-time stream processing service with enterprise-grade security, auditing and support
  7. What to use and when?
● Data Lake Store – a no-limits data lake built to support massively parallel analytics
● Data Lake Analytics – a fully managed, on-demand, pay-per-job analytics service with enterprise-grade security, auditing and support
● Azure Data Catalog – an enterprise-wide metadata catalogue that makes data asset discovery simple
● Data Factory – a data integration service to orchestrate and automate data movement and transformation
  8. Modern Data Warehouse (architecture diagram): INGEST → STORE → PREPARE → TRANSFORM & ENRICH → SERVE → VISUALIZE, fed by on-premises data, cloud data and SaaS data.
  9. Azure Data Explorer
  10. Azure Data Explorer Jupyter Notebook lets you create and share documents that contain live code, equations, visualizations and explanatory text. KQL magic commands extend the Python kernel in Jupyter Notebook so that you can write KQL queries natively and query data from Azure Data Explorer. You can interchange freely between Python and KQL, and visualize data using the charting library integrated with the KQL render command. KQL magic supports Azure Data Explorer, Application Insights and Log Analytics as data sources to run queries against.
  11. Azure Data Explorer
  12. Azure Data Explorer Fast and highly scalable data exploration service. Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis on large volumes of data streaming from applications, websites, IoT devices and more.
  13. Azure Data Explorer ● Low-latency ingestion ● Fast read-only queries with high concurrency ● Query large amounts of structured, semi-structured (JSON-like nested types) and unstructured (free-text) data.
  14. Credit: Microsoft
  15. Azure Data Explorer Demo
  16. Azure Data Factory Data Ingestion
  17. Data Factory ● No code or maintenance required to build hybrid ETL and ELT pipelines within the Data Factory visual environment. ● Cost-efficient and fully managed serverless cloud data integration tool that scales on demand.
  18. Data Factory ● Azure security measures to connect to on-premises, cloud-based and software-as-a-service apps with peace of mind. ● SSIS integration runtime to move SSIS ETL workloads into the cloud with minimal effort.
  19. Data Factory ● Ingest, move, prepare, transform and process your data in a few clicks, and complete your data modelling within the accessible visual environment.
  20. Why Data Factory? ● Orchestrate, monitor & schedule data pipelines ● Automatic cloud resource management ● Single pane of glass
  21. Module Overview • Data Sources, Linked Services & Datasets • Activities • Pipelines • Supported data sources • Supported activity types
  22. Streaming Analytics
  23. Stream Analytics
  24. Stream Analytics
● Build streaming pipelines in minutes – run complex analytics with no need to learn new processing frameworks or provision virtual machines (VMs) or clusters. Use familiar SQL language that is extensible with JavaScript and C# custom code for more advanced use cases. Easily enable scenarios such as low-latency dashboarding, streaming ETL and real-time alerting with one-click integration across sources and sinks.
● Run mission-critical workloads with subsecond latencies – get guaranteed, “exactly once” event processing with 99.9% availability and built-in recovery capabilities. Easily set up a continuous integration and continuous delivery (CI/CD) pipeline and achieve subsecond latencies on your most demanding workloads.
● Deploy in the cloud and on the edge – bring real-time insights and analytics capabilities closer to where your data originates. Enable new scenarios with true hybrid architectures for stream processing and run the same query in the cloud or on the edge.
● Power real-time analytics with artificial intelligence – take advantage of built-in machine learning (ML) models to shorten time to insights. Use ML-based capabilities to perform anomaly detection directly in your streaming jobs with Azure Stream Analytics.
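Stream Analytics queries group events with SQL windowing functions such as TumblingWindow. As a rough illustration of the idea only (not the Stream Analytics engine or its SQL dialect), here is a minimal pure-Python sketch of a tumbling-window count over timestamped events; all names and data are made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp_seconds, payload) events into fixed,
    non-overlapping windows and count events per window."""
    counts = defaultdict(int)
    for ts, _payload in events:
        # Every event falls into exactly one window, keyed by its start
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Events at t=1, 2, 11, 12, 13 with a 10-second tumbling window
events = [(1, "a"), (2, "b"), (11, "c"), (12, "d"), (13, "e")]
print(tumbling_window_counts(events, 10))  # {0: 2, 10: 3}
```

The same grouping in Stream Analytics would be expressed declaratively, e.g. a COUNT(*) with GROUP BY TumblingWindow, without any cluster provisioning.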
  25. Credit: Microsoft
  26. Event Hubs
  27. Event Hubs ● A hyper-scale telemetry ingestion service that collects, transforms and stores millions of events. ● Event Hubs is a fully managed, real-time data ingestion service that’s simple, trusted and scalable.
  28. Event Hubs ● Integrate seamlessly with other Azure services to unlock valuable insights. ● Experience real-time data ingestion and microbatching on the same stream.
  29. Event Hubs ● Focus on drawing insights from your data instead of managing infrastructure. Build real-time big data pipelines and respond to business challenges right away. ● Build real-time data pipelines with just a couple of clicks. Seamlessly integrate with Azure data services to uncover insights faster.
  30. Event Hubs ● Ingest millions of events per second - Continuously ingress data from hundreds of thousands of sources with low latency and configurable time retention.
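Event Hubs spreads ingested events across partitions, and events that share a partition key always land on the same partition, which preserves per-key ordering. A toy sketch of that routing idea (the hash function here is illustrative; Event Hubs uses its own internal hashing scheme):

```python
def assign_partition(partition_key: str, partition_count: int) -> int:
    """Route an event to a partition by hashing its key.
    A stable hash keeps all events for one key on one partition."""
    # Deterministic toy hash: sum of code points, modulo partition count
    h = sum(ord(c) for c in partition_key)
    return h % partition_count

devices = ["device-1", "device-2", "device-1", "device-3"]
routes = [assign_partition(d, 4) for d in devices]
# The same key always maps to the same partition
assert routes[0] == routes[2]
print(routes)
```

This is why choosing a good partition key matters: a skewed key distribution concentrates load on a few partitions.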
  31. Analysis Services
  32. Analysis Services Focus on solving business problems, not learning new skills, when you use the familiar, integrated development environment of Visual Studio. Easily deploy your existing SQL Server 2016 tabular models to the cloud.
  33. Data Lake Analytics
  34. Data Lake Analytics ● Easily develop and run massively parallel data transformation and processing programs in U-SQL, R, Python and .NET over petabytes of data. With no infrastructure to manage, you can process data on demand, scale instantly and only pay per job.
  35. Data Lake Analytics ● Process big data jobs in seconds with Azure Data Lake Analytics. There is no infrastructure to worry about because there are no servers, virtual machines or clusters to wait for, manage or tune.
  36. Data Lake Analytics ● Instantly scale the processing power, measured in Azure Data Lake Analytics Units (AU), from one to thousands for each job. You only pay for the processing that you use per job.
  37. Data Lake Analytics ● U-SQL is a simple, expressive and extensible language that allows you to write code once and have it automatically parallelised for the scale you need.
  38. Data Lake Analytics ● Process petabytes of data for diverse workload categories such as querying, ETL, analytics, machine learning, machine translation, image processing and sentiment analysis by leveraging existing libraries written in .NET languages, R or Python.
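Data Lake Analytics parallelises a U-SQL job across many nodes automatically. The underlying pattern – split the input, process the pieces independently, then combine the partial results – can be sketched in pure Python as a conceptual illustration (this is not U-SQL or the ADLA runtime):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: str) -> Counter:
    """Map step: count words in one independent chunk of input."""
    return Counter(chunk.split())

def parallel_word_count(chunks):
    """Process chunks in parallel, then reduce the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(count_words, chunks)
    total = Counter()
    for partial in partials:
        total += partial  # reduce step: merge per-chunk counts
    return total

chunks = ["big data big", "data lake analytics", "big lake"]
print(parallel_word_count(chunks))
```

In ADLA the "pool" is the pool of Analytics Units you request per job, which is also what you pay for.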
  39. Apache Spark
  40. Three Ages of Databases (timeline diagram): RDBMS (1985–1995) → RDBMS + Data Warehouse (1995–2010) → RDBMS + Data Warehouse + NoSQL (2010–now).
  41. Pressures on the single-node RDBMS: scalability, OLAP/BI/data warehousing, social networks, agile development, schema-free data.
  42. NoSQL families: key-value stores, graph/triple stores, XML databases, column-family stores, object stores.
  43. (Tongue-in-cheek flowchart) Start → Does it look like a document? Yes → Use Microsoft Office. No → Use the RDBMS.
  44. What is big data? “When you have to innovate to collect, store, organize, analyse and share it.” – Werner Vogels, Amazon CTO
  45. What is Big Data? • Traditionally….. – Physics Experiments – Sensor data – Satellite data
  46. Now? • Zettabytes in the cloud expected by the end of next year • the-future-for-cloud-data-storage-clouds-of-glass/
  47. Azure Synapse ● Azure Synapse delivers insights from all your data, across data warehouses and big data analytics systems, with blazing speed. ● Data professionals can query both relational and non-relational data at petabyte scale using the familiar SQL language. ● Credit to the Microsoft team for help with these decks
  48. Azure Synapse ● Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and big data analytics. ● Query data on your terms, using either serverless on-demand or provisioned resources, at scale.
  49. Azure Synapse • Discover powerful insights across your most important data • Unified analytics experience
  50. Azure Synapse • support for SSDT with Visual Studio 2019 • native platform integration with Azure DevOps • built-in continuous integration and deployment (CI/CD) capabilities for enterprise-level deployments
  51. Azure Synapse (architecture diagram): store and visualize on-premises data, cloud data and SaaS data.
  52. Azure Synapse
● Best-in-class price per performance: up to 94% less expensive than competitors
● Intelligent workload management: prioritize resources for the most valuable workloads
● Data flexibility: ingest a variety of data sources to derive the maximum benefit
● Developer productivity: use preferred tooling for SQL data warehouse development
● Industry-leading security: defense-in-depth security and a 99.9% financially backed availability SLA
  53. Power BI storage modes
● DirectQuery – the enterprise solution: avoid data movement; delegate query work to the back-end source and take advantage of Azure SQL Data Warehouse’s advanced features
● Composite models & aggregation tables – why choose? Combine Import and DirectQuery in a single model; keep summarized data local and get detail data from the source
● Import – great for small data sources and personal data discovery; fine for CSV files, spreadsheet data and summarized OLTP data
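The aggregation-table idea is simply: answer summary-level queries from a small, locally imported table, and delegate detail-level queries to the remote source. A toy Python sketch of that routing decision (every name and number here is made up for illustration; this is not the Power BI engine):

```python
# Imported summary table: yearly sales totals held locally
LOCAL_AGG = {"2019": 1200, "2020": 1500}

def detail_source_query(year, month):
    """Stand-in for a DirectQuery round trip to the warehouse."""
    detail = {("2020", "01"): 130, ("2020", "02"): 110}
    return detail.get((year, month), 0)

def sales(year, month=None):
    """Serve from the local aggregate when the query is summary-level,
    otherwise fall back to the (slow, remote) detail source."""
    if month is None and year in LOCAL_AGG:
        return LOCAL_AGG[year]              # answered locally, no data movement
    return detail_source_query(year, month)  # delegated to the source

print(sales("2020"))        # summary hit from the local table
print(sales("2020", "02"))  # detail fetched from the source
```

The composite-model value is exactly this: most dashboard queries hit the cheap local path, and only drill-downs pay the remote cost.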
  54. Azure SQL Data Warehouse
● Best-in-class price per performance: up to 94% less expensive than competitors
● Intelligent workload management: prioritize resources for the most valuable workloads
● Data flexibility: ingest a variety of data sources to derive the maximum benefit
● Developer productivity: use preferred tooling for SQL data warehouse development
● Industry-leading security: defense-in-depth security and a 99.9% financially backed availability SLA
  55. Complete data security
● Data protection: data in transit, data encryption at rest, data discovery and classification
● Access control: object-level security (tables/views), row-level security, column-level security, dynamic data masking, SQL login authentication, Azure Active Directory, multi-factor authentication
● Network security: virtual networks, firewall, Azure ExpressRoute
● Threat protection: threat detection, auditing, vulnerability assessment
  56. Workload management – intra-cluster workload isolation (scale-in)
● Predictable cost ● Online elasticity ● Efficient for unpredictable workloads ● No cache eviction for scaling
Example at 1000c DWU: a Sales workload group reserves 60% of compute, leaving 40% for Marketing:
CREATE WORKLOAD GROUP Sales
WITH (
  MIN_PERCENTAGE_RESOURCE = 60,
  CAP_PERCENTAGE_RESOURCE = 100,
  MAX_CONCURRENCY = 6
);
  57. Heterogeneous data – scale-in isolation: predictable cost, online elasticity, efficient for unpredictable workloads, no cache eviction for scaling.
  58. Developer Productivity
  59. Performance-optimized storage and elastic architecture: columnar storage, columnar ordering, table partitioning, nonclustered indexes, hash distribution, materialized views, result-set cache.
  60. Overview: Azure SQL Data Warehouse performance advantage.
  61. Complete security – SQL Data Warehouse supports all of the following: data in transit; data encryption at rest (service- and user-managed keys); data discovery and classification; native row-level security; table and view security (GRANT/DENY); column-level security; dynamic data masking; SQL authentication; native Azure Active Directory integrated security; multi-factor authentication; virtual network (VNET); SQL firewall (server); integration with ExpressRoute; SQL threat detection; SQL auditing; vulnerability assessment.
  63. What is PolyBase? Reaching unstructured data.
  64. What is PolyBase? Available in SQL Server, Azure SQL Data Warehouse, Parallel Data Warehouse and Azure SQL Database.
  66. Making business data accessible – PolyBase provides a scalable, T-SQL-compatible query processing framework for combining data from both universes.
  67. PolyBase personas
● Consumer: data volume medium to low; degree of structure very high; number of users very high; transformation complexity low; analytics complexity low
● Analyst: data volume reasonable; degree of structure some; number of users medium; transformation complexity medium to high; analytics complexity medium
● Scientist: data volume high to huge; degree of structure low to none; number of users low; transformation complexity high; analytics complexity very high
  68. Machine Learning
  69. Agenda • Why machine learning in SQL Server? • How to leverage: – SQL Compute context – sp_execute_external_script features – PREDICT T-SQL Function • Call to action • Questions
  70. Why machine learning with SQL Server? Reduce or eliminate data movement with in-database analytics Operationalize machine learning models Get enterprise scale, performance, and security
  71. Machine Learning Services • R/Python Integration Design – Invokes runtime outside of SQL Server process – Batch-oriented operations • SQL Compute context
  72. Typical machine learning workflow against a database – any R/Python IDE on a data scientist workstation, talking to SQL Server: (1) pull data: train <- sqlQuery(connection, “select * from nyctaxi_sample”); (2) execution: model <- glm(formula, train); (3) model output.
  73. Machine learning workflow using the SQL compute context – any R/Python IDE on a data scientist workstation, talking to SQL Server 2017 Machine Learning Services (R/Python runtime): (1) script: cc <- RxInSqlServer(connectionString, computeContext); rxLogit(formula, cc); (2) execution inside SQL Server; (3) rx* output; (4) model or predictions returned.
  74. SQL compute context from an R/Python client • Requirement: use rx* functions
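The point of the SQL compute context is data movement: only the script goes in and only the fitted model comes back, instead of every training row crossing the wire. A toy pure-Python illustration of that difference (row counts and byte sizes are invented for the sketch; this is not RevoScaleR):

```python
# "In the database": rows that live on the server
SERVER_ROWS = [{"x": i, "y": 2 * i} for i in range(10_000)]
ROW_BYTES = 16    # assumed size of one serialized row
MODEL_BYTES = 64  # assumed size of a serialized model

def pull_and_train():
    """Client-side training: every row travels to the workstation."""
    transferred = len(SERVER_ROWS) * ROW_BYTES
    return transferred

def train_in_compute_context():
    """Server-side training: only the fitted model comes back."""
    transferred = MODEL_BYTES
    return transferred

print(pull_and_train())            # 160000 bytes moved
print(train_in_compute_context())  # 64 bytes moved
```

With real tables of millions of rows, this gap is the difference between minutes of network transfer and effectively none.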
  75. Push data from SQL Server to the external runtime: sp_execute_external_script @input_data_1 = N'SELECT * FROM TrainingData'. The input dataset arrives as an R data.frame or a pandas DataFrame.
  76. Read Files with R Server • R can read almost all flat text files (CSV, TXT) as well as formats such as SPSS. • Provide the file path and R reads the file from that location.
  77. RStudio & XDF File • RStudio makes it easy to convert text files, whether CSV or other text formats, into the XDF format. • XDF files can only be read by R, and they are very small in size compared to other files.
  78. RStudio
  79. Convert File to XDF • TXT or CSV files can be converted to XDF format. • XDF files can only be read by R and they are very small in size compared to other files.
  80. rxCrossTabs rxCrossTabs is used to create contingency tables from cross-classifying factors using a formula interface. rxCrossTabs() is also used to compute sums according to combinations of different variables.
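A contingency table just counts how often each pair of factor levels occurs together. A stdlib-Python sketch of what rxCrossTabs computes (illustration only; rxCrossTabs itself is an R/RevoScaleR function, and the taxi-style data below is invented):

```python
from collections import Counter

def crosstab(rows, row_key, col_key):
    """Count co-occurrences of two categorical variables."""
    return Counter((r[row_key], r[col_key]) for r in rows)

trips = [
    {"pickup": "Midtown", "dropoff": "Harlem"},
    {"pickup": "Midtown", "dropoff": "Harlem"},
    {"pickup": "Chelsea", "dropoff": "Midtown"},
]
table = crosstab(trips, "pickup", "dropoff")
print(table[("Midtown", "Harlem")])  # 2
print(table[("Chelsea", "Midtown")])  # 1
```

In R the same result comes from a formula such as ~ pickup:dropoff, with RevoScaleR doing the counting out-of-memory over an XDF file.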
  81. rxCube • rxCube() performs a very similar function to rxCrossTabs(). • It computes tabulated sums or means. • rxCube() produces the sums or means in long format rather than a table. • This can be useful when we want to aggregate data for further analysis within R.
  82. dplyrXdf Package The dplyr package is a popular toolkit for data transformation and manipulation. dplyr supports data frames, data tables (from the data.table package) The dplyrXdf package implements such a backend for the xdf file format, a technology supplied as part of Revolution R Enterprise.
  83. ggplot2 ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner. The package can be used to create the most common graph types, and a very wide range of useful plots.
  84. Custom visualizations with rxSummary & rxCube • rxSummary: the rxSummary function provides descriptive statistics using a formula argument.
  85. Custom visualizations with rxSummary & rxCube • rxCube: rxCube is similar to rxSummary but returns fewer statistical summaries and therefore runs faster. • With y ~ u:v as the formula, rxCube returns counts and averages for column y.
  86. Custom visualizations with rxSummary & rxCube • rxCube • Code: rxc1 <- rxCube(trip_distance ~ pickup_nb:dropoff_nb, mht_xdf)
  87. rxHistogram • rxHistogram() is used to create a histogram, for example for the Close variable. • Syntax: rxHistogram(formula, data, …) • Formula: a formula containing the variable that you want to visualize.
  88. rxLinePlot • Line or scatter plots use data from an .xdf file or data frame. • Syntax: rxLinePlot(formula, data, …) • Formula: for this function, the formula should have one variable on the left side of the ~ for the Y-axis and one on the right side for the X-axis.
  89. rxDataStep • The rxDataStep function can be used to process data in chunks. • rxDataStep can be used to create and transform subsets of data.
  90. rxDataStep • rxDataStep can be used to modify existing columns or add new columns to the data.
  91. Subset Rows Of Data Using Transform Argument • A common use of rxDataStep is to create a new data set with a subset of rows and variables. • For this purpose, we use the data frame of our data as the input data set.
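Chunked processing with a row subset and a column transform – the pattern rxDataStep implements over XDF files – can be sketched with nothing but the Python standard library (a conceptual illustration, not RevoScaleR; the data and the 1.609 miles-to-km factor are just an example):

```python
def data_step(rows, chunk_size, row_filter, transform):
    """Process rows chunk by chunk: keep rows passing row_filter,
    apply transform to each kept row, and collect the result."""
    out = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]  # only one chunk in memory at a time
        out.extend(transform(r) for r in chunk if row_filter(r))
    return out

rows = [{"miles": m} for m in (0.5, 3.0, 7.5, 12.0)]
result = data_step(
    rows,
    chunk_size=2,
    row_filter=lambda r: r["miles"] > 1,                   # subset rows
    transform=lambda r: {**r, "km": r["miles"] * 1.609},   # add a new column
)
print(len(result))  # 3
```

The chunking is what lets rxDataStep handle data larger than memory: each pass touches only one block of rows.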
  92. On-the-fly transformation • Analytical functions within the RevoScaleR package use a formal transformation-function framework for generating on-the-fly variables.
  93. In-data transformation • There are two main approaches to in-data transformation: • Define an external R function and reference it. • Define an embedded transformation as an input to a transforms argument on another function.
  94. Generate a data frame • A data frame is a table or a two- dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
  95. Generate a Data Frame • Code:
SalesData <- file.path("D:/773Demo", "CustomerSalesInfo.xdf")
SalesDataFrame <- rxImport(inData = SalesData)
  96. POSIXct & POSIXlt • R provides several options for dealing with date and date/time data. The POSIXct and POSIXlt classes allow for dates and times with control for time zones.
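The Python standard-library counterpart of POSIXct-style timezone-aware timestamps is datetime with an attached tzinfo. A small sketch (the fixed -5 h offset is just for illustration; real code would use a proper zone database):

```python
from datetime import datetime, timezone, timedelta

# A timestamp with explicit time-zone control, analogous to what
# POSIXct/POSIXlt provide in R.
utc_time = datetime(2019, 7, 1, 12, 0, tzinfo=timezone.utc)
eastern = timezone(timedelta(hours=-5), "EST")  # fixed offset, illustration only
local_time = utc_time.astimezone(eastern)

print(utc_time.isoformat())    # 2019-07-01T12:00:00+00:00
print(local_time.isoformat())  # 2019-07-01T07:00:00-05:00
print(utc_time == local_time)  # True: same instant, different zones
```

As in R, the two representations describe the same instant; only the rendering changes with the zone.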
  97. Transform functions ▪ transform() is a generic function that does useful things with data frames. ▪ Embedded transformations provide instructions within a formula, through arguments on a function. ▪ Using just arguments, you can manipulate data with transformations.
  98. Summary Improve performance of your ML scripts by using: – SQL Compute context from client (rx* functions) – Streaming to reduce memory usage – Trivial parallelism for scoring (predict or rxPredict) – Parallel training and scoring using rx* functions – Native PREDICT function for low latency scoring
  99. Call to action • Resources – SQL Server samples on GitHub – R Services & ML Services – Getting started tutorials: AKA.MS/MLSQLDEV – Configure instance: SSMS reports for ML Services – ML cheat sheet – Microsoft documentation: SQL Server Machine Learning Services