Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft

Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Cargando en…3
×

Eche un vistazo a continuación

1 de 53 Anuncio

The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft

Descargar para leer sin conexión

Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.

Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.

While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state of the art external-facing services such as Azure Data Lake and more. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.

Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.

Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.

While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state of the art external-facing services such as Azure Data Lake and more. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.

Anuncio
Anuncio

Más Contenido Relacionado

Presentaciones para usted (20)

A los espectadores también les gustó (20)

Anuncio

Similares a The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft (20)

Más de The Hive (20)

Anuncio

Más reciente (20)

The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft

  1. 1. Big Data @ Microsoft Raghu Ramakrishnan CTO for Data, Technical Fellow Microsoft
  2. 2. Data and Analytics – 3 Pillars SQL 2016 Azure SQL DB Azure SQL DW SQL Server R services On-prem and cloud (Windows, Linux) Cortana Intelligence Suite Hadoop, Data Lake, Machine learning, PowerBI, Data Factory, Streaming, Perceptual Intelligence On-prem connectivity Microsoft R server Hadoop Teradata On-prem and cloud (Windows, Linux)
  3. 3. SQL Server 2016: Everything Built-In The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. Consistent experience from on-premises to cloud Microsoft Tableau Oracle $120 $480 $2,230 Self-service BI per user In-memory across all workloads TPC-H non-clustered 10TB Oracle is #4#2 SQL Server #1 SQL Server #3 SQL Server built-inbuilt-in built-in built-in built-in 0 1 4 0 0 3 34 29 22 15 5 22 6 43 20 69 18 49 3 -80 -70 -60 -50 -40 -30 -20 -10 0 2010 2011 2012 2013 2014 2015 SQL Server Oracle MySQL2 SAP HANA TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster at massive scale National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015
  4. 4. In-Database Advanced Analytics No need to move the data Open source R with in- memory & massive scale – multi-threading & massive parallel processing Data Scientist Interact directly with data R built-in to SQL Server Data Developer/DBA Manage data and analytics together Example Solutions • Sales forecasting • Warehouse efficiency • Predictive maintenance Extensibility ? R R Integration Relational data Analytic Library T-SQL interface 010010 100100 010101 New R scripts 010010 100100 010101 010010 100100 010101 010010 100100 010101 • Credit risk protection 010010 100100 010101 Microsoft Azure Marketplace Real-time operational analytics without moving the data NEW NEW End-to-end mobile BI Advanced AnalyticsMission critical OLTP
  5. 5. High-performance open source R plus:  Enterprise Scale & Performance – Scales from workstations to large clusters – Scales to large data sizes – Growing portfolio of Parallelized algorithms  Secure, Scalable R Deployment/Operationalization  Write Once Deploy Anywhere for multiple platforms  IDE for data scientists and developers  Enterprise Class Support DistributedR DeployR DevelopR ScaleR ConnectR
  6. 6. Cloud – SQL Server/SQL Azure Shifting how you purchase and manage machines Increased focus on Total Cost of Ownership and continuous improvements Built from the same code base We increased surface area compatibility with V12 Azure SQL Database We’re learning how to run our own code – the good and the bad We’re using that to improve both product and service Microsoft is the only provider both on-premises and in the cloud
  7. 7. Order history Name SSN Date Jane Doe cm61ba906fd 2/28/2005 Jim Gray ox7ff654ae6d 3/18/2005 John Smith i2y36cg776rg 4/10/2005 Bill Brown nx290pldo90l 4/27/2005 Order history Name SSN Date Jane Doe cm61ba906fd 2/28/2005 Jim Gray ox7ff654ae6d 3/18/2005 John Smith i2y36cg776rg 4/10/2005 Bill Brown nx290pldo90l 4/27/2005 Customer data Product data Order History Stretch to cloud Stretch SQL Server into Azure Stretch warm and cold tables to Azure with remote query processing App Query Microsoft Azure  Jim Gray ox7ff654ae6d 3/18/2005 SQL Server 2016
  8. 8. Azure SQL DW Fully managed relational data warehouse-as-a-service First elastic cloud data warehouse with proven SQL Server capabilities Support your smallest to your largest data storage needs Scales to petabytes of data Massively Parallel Processing Instant-on compute scales in seconds Query Relational / Non-Relational Saas Azure Public Cloud Office 365Office 365 Get started in minutes Integrated with Azure ML, PowerBI & ADF Simple billing compute & storage Pay for what you need, when you need it with dynamic pause AzureAzure
  9. 9. Store any data relations Do any analysis SQL queries Hive, At any speed Batch Hive At any scale … elastic! Anywhere Data to Intelligent Action
  10. 10. Web Logs, Omniture logs On-Premise SQL Server (customer and product data) In-Store Activity with Kinect sensors Social Data Diagnostic streaming Event hubs Machine Learning Stream Analytics Azure DataLake Data Factory: Move Data, Orchestrate, Schedule, and Monitor HDInsight HDInsight Machine Learning Azure SQL Data Warehouse Power BI INGEST PREPARE ANALYZE PUBLISH Stream Analytics CONSUMEDATA SOURCES Cortana Web/LOB Dashboards
  11. 11. Azure Data Analytics Stack REEF library STORAGE YARN HDFS/WebHDFS API Compute-tier Cache Clusters (Local ENs + CSM) RAM / SSD / HDD WAS-based Remote Storage Cosmos Store API CLUSTER-WIDE RM (YARN++) YARN + Federation YARN + Rayon (Capacity reservation) YARN + Mercury Shared micro- services for all metadata (extent map, logical name space, secure store) based on Hekaton/RSL rings YARN + Mercury YARN + Mercury Application Engines Per-job RM and runtimeM/R U-SQL Batch Spark Tez Spark Runtime Spark HiveU-SQL Azure ML Azure SA COMPUTE TIER SQL-DW HDInsight IaaS Services
  12. 12. Windows SMSG Live Ads CRM/Dynamics Windows Phone Xbox Live Office365 STB Malware Protection Microsoft Stores STBCommerceRisk Messenger LCA Exchange Yammer Skype Bing data managed: EBs cluster sizes: 10s of Ks # machines: 100s of Ks daily I/O: >100 PBs # internal developers: 1000s # daily jobs: 100s of Ks
  13. 13. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation
  14. 14. Implement Data Warehouse Physical Design ETL Development Reporting & Analytics Development Install and Tune Reporting & Analytics Design Dimension Modelling ETL Design Setup Infrastructure Understand Corporate Strategy Data sources ETL BI and analytic Data warehouse Gather Requirements Business Requirements Technical Requirements
  15. 15. Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  16. 16. What happened? What is happening? Why did it happen? What are key relationships? What will happen? What if? How risky is it? What should happen? What is the best option? How can I optimize? Data sources
  17. 17. Handling failures Sharing data, resources Parallelism Data-aware Optimization Security, Compliance, Governance Enterprise
  18. 18. Forrester Wave Big Data Hadoop Cloud Solutions Q2 2016
  19. 19. • Interactive and Real-Time Analytics requires i • Massive data volumes require scale-out stores using commodity servers, even archival storage Tiered Storage Seamlessly move data across tiers, mirroring life-cycle and usage patterns Schedule compute near low-latency copies of data How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
  20. 20. • Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming) • Many users’ jobs (across these job types) run on the same machines (where the data lives) Resource Management with Multitenancy and SLAs Policy-driven management of vast compute pools co-located with data Schedule computation “near” data How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
  21. 21. Azure Data Lake Store Fully managed cloud data store designed for analytics Supports HDFS compliant analytics applications and tools Petabyte files, unlimited account size High throughput for analytics performance Low latency ingestion with read as you write AAD-based authentication, access auditing File and folder-level ACLs, Encryption at rest
  22. 22. ADLS Security: Encryption-at-Rest Transparently encrypts data flowing to and from public networks as well as at rest Transparent server-side encryption User can manage their own encryption keys or let Azure Data Lake Store manage the key using Azure Key Vault 28
  23. 23. ADLS Security: Role-Based Access Control Each file and directory is associated with an owner and a group Files or directories have separate permissions (read(r), write(w), execute(x)) for owners, members of the group, and for all other users Fine-grained access control lists (ACLs) can be specified for specific named users or named groups 29
  24. 24. ADL Store: Ingress Data can be ingested into Azure Data Lake Store from a variety of sources Server logs Azure Event Hub Apache Flume Azure Storage Blobs Custom programs .NET SDK JavaScript CLI Azure Portal Azure PowerShell Azure Data Factory Apache Sqoop Azure SQL DB Azure SQL DW Azure tables Table Storage On-premises databases SQL 30 ADL Store Built-in copy service
  25. 25. ADL Store: Egress Data can be exported from Azure Data Lake Store into numerous targets/sinks Azure SQL DB SQL Azure SQL DW Azure Tables Table Storage On-premises databases Azure Data Factory Apache Sqoop Azure Storage Blobs Custom programs .NET SDK JavaScript CLI Azure Portal Azure PowerShell 31 Built-in copy service ADL Store
  26. 26. Extent Metadata Data Data Data… Remote Storage Naming Service Secret Store 1) Filename Translation 3) Find Extents 4) Data access Remote storage tier builds securely on WAS Secure Works with YARN! COMPUTE TIER Secure Store Service Intelligent ingest Massively parallel 2) Azure Access Keys
  27. 27. • Interactive and Real-Time Analytics requires i • Massive data volumes require scale-out stores using commodity servers, even archival storage Tiered Storage Scale storage independently of compute Seamlessly move data across tiers, mirroring life-cycle and usage patterns Schedule compute near low-latency copies of data Data Lifecycle Management How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
  28. 28. Extent Metadata Data Data Data… Remote Storage Naming Service Secret Store 1) Filename Translation 3) Find Extents 4) Data access Remote storage tier builds securely on WAS Secure Works with YARN! COMPUTE TIER Data Data Data … Secure Store Service Local Storage Intelligent ingest Massively parallel 2) Azure Access Keys
  29. 29. Azure HDInsight—Linux and Windows Managed, Monitored, Supported • Cluster customization – Install your favorite project • Harness existing .Net & Java skills to write customer extensions • Supports broad ecosystem of ISVs (Hadoop and Traditional) Full Apache Hadoop • Batch – MapReduce, PIG, Hive, Spark • Stream Processing and Analytics – Storm, SparkStreaming • Interactive SQL – Hive (Tez), and SparkSQL • Table Serving – Hbase • Machine Learning – SparkML, Mahout
  30. 30. Batch MapReduce, PIG, Hive, Spark Interactive SQL Hive (Tez), SparkSQL Stream Analytics Storm, SparkStreaming Machine Learning SparkML, Mahout Table Serving Hbase Exploratory Visualization Jupyter, Zeppelin Interactive SQL SQL DW Stream Analytics Azure Stream Analytics Machine Learning Azure ML Table Serving Azure SQL DB Exploratory Visualization Power BI
  31. 31. • > 14 million field hours http://www.ebird.org
  32. 32. Tree Swallow
  33. 33. Azure Data Lake Analytics Service A new distributed analytics service Built on Apache YARN Scales dynamically with a dial Pay by the query Supports Azure AD for access control, roles, and integration with on-prem identity systems U-SQL language unifies the benefits of SQL with the power of C# Hive etc. will be added over time Processes data across Azure 41
  34. 34. Get started Log in to Azure Create an ADLA account Write and submit an ADLA job with U-SQL (or Hive/Pig) The job reads and writes data from storage 1 2 3 4 30 seconds ADLS Azure Blobs Azure DB …
  35. 35. ADLA Complements HDInsight HDInsight Dedicated managed clusters for developers familiar with the Open Source: Java, Eclipse, Hive, etc. Clusters offer customization, control, and flexibility in a managed Hadoop cluster ADLA Enables customers to leverage existing experience with C#, SQL & PowerShell Offers convenience, efficiency, and automatic scale in a “job service” form factor over a system-managed shared resource pool
  36. 36. U-SQL A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data Allows users to focus on the what— not the how—of business problems Built on familiar languages (SQL and C#) and supported by a fully integrated development environment Built for data developers & scientists 44
  37. 37. U-SQL Language Philosophy Declarative query and transformation language: • Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL Analytics functions • Optimizable, scalable Operates on unstructured & structured data • Schema on read over files • Relational metadata objects (e.g. database, table) Extensible from ground up: • Type system is based on C# • Expression language is C# 21 User-defined functions (U-SQL and C#) User-defined types (U-SQL/C#) (future) User-defined aggregators (C#) User-defined operators (UDO) (C#) U-SQL provides the parallelization and scale-out framework for usercode • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS Expression-flow programming style: • Easy to use functional lambda composition • Composable, globally optimizable Federated query across distributed data sources (soon) REFERENCE MyDB.MyAssembly; CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float ); @o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt“ USING Extractors.Csv(); @c = EXTRACT cid int, name string, city string FROM "/input/customers.txt“ USING Extractors.Csv(); @j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , SUM(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid; OUTPUT @j TO "/output/result.txt" USING new MyData.Write(); INSERT INTO T SELECT * FROM @j;
  38. 38. Federated Queries: Query Data Where It Lives Easily query data in multiple Azure data stores without moving it to a single store Benefits Avoid moving large amounts of data across the network between stores Single view of data irrespective of physical location Minimize data proliferation issues caused by maintaining multiple copies Single query language for all data Each data store maintains its own sovereignty Design choices based on the need U-SQL Query Result Query 46 Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics
  39. 39. Join Local (ADLS) and External Data 1. Create two tables. • An external table ‘PurchaseOrders’ that refers to the PurchaseOrders table in the external SQL Azure DB. • A ‘local’ table ‘UserIdsTable’ created by ‘extracting’ User Ids and region fields from the WebLogRecords.txt file stored in Azure Data Lake. 2. Join the PurchaseOrders table with UserIds table on the common UserId column. Purchase orders table Azure SQL DB External purchase orders table Local user IDs table JOIN (on User IDs) Azure Data Lake Analytics Find sum of all purchases by users in the ‘en-us’ region Query 9 47 WebLogRecords.txt
  40. 40. Concepts: Jobs, Stages and Vertexes Each job is broken into a number of vertexes Each vertex is some work that needs to be done Input Output Output 6 Stages 8 Vertexes Vertexes are organized into stages – Vertexes in each stage do the same work on the same data – Vertex in one stage may depend on a vertex in a earlier stage Stages themselves are organized into an acyclic graph 49
  41. 41. • Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming) • Many users’ jobs (across these job types) run on the same machines (where the data lives) Resource Management with Multitenancy and SLAs Policy-driven management of vast compute pools co-located with data Schedule computation “near” data How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
  42. 42. Resource Managers for Big Data Allocate compute containers to competing jobs Multiple job engines shared pool Containers YARN: Resource manager for Hadoop2.x Corona, Mesos, Omega
  43. 43. Shared Data and Compute Tiered Storage Relational Query Engine Machine Learning Compute Fabric (Resource Management) Multiple analytic engines sharing same resource pool Compute and store/cache on same machines
  44. 44. What’s Behind a U-SQL Query . . . . . . … … …
  45. 45. YARN Gaps resource allocation SLOs scalability limitations • High allocation latency • Support for specialized execution frameworks • Interactive environments, long-running services
  46. 46. • Amoeba Rayon • Status: shipping in Apache Hadoop 2.6 • Mercury and Yaq • Status: Now in Apache Hadoop trunk! • Federation • Status: prototype and JIRA • Framework-level Pooling • Enable frameworks that want to take over resource allocation to support millisecond- level response and adaptation times • Status: spec Microsoft Contributions to OSS Apache YARN
  47. 47. REEF http://ww.reef-project.org http://reef.incubator.apache.org
  48. 48. http://aka.ms/adltechblog/ http://ww.reef-project.org and http://reef.incubator.apache.org

×