SlideShare una empresa de Scribd logo
1 de 30
Microsoft's Big Play for Big
                       Data
                 Andrew J. Brust
                    CEO and Founder
                  Blue Badge Insights
                      Level: Intermediate
Meet Andrew
 •   CEO and Founder, Blue Badge Insights
 •   Big Data blogger for ZDNet
 •   Microsoft Regional Director, MVP
 •   Co-chair VSLive! and 17 years as a speaker
 •   Founder, Microsoft BI User Group of NYC
     – http://www.msbinyc.com
 •   Co-moderator, NYC .NET Developers Group
     – http://www.nycdotnetdev.com
 •   “Redmond Review” columnist for
     Visual Studio Magazine and Redmond Developer
     News
 •   brustblog.com, Twitter: @andrewbrust
My New Blog (bit.ly/bigondata)
Read All About It!
What is Big Data?
•   100s of TB into PB and higher
•   Involving data from: financial data, sensors,
    Web logs, social media, etc.
•   Distributed/parallel processing often involved
    – Hadoop is emblematic, but other technologies are Big Data
      too
•   Processing of data sets too large for
    transactional databases
    – Analyzing interactions, rather than transactions
    – The three V’s: Volume, Velocity, Variety
•   Big Data tech sometimes imposed on small
    data problems
What’s MapReduce?
•   “Big” input data as key-value pair series
•   Partition the data and send to mappers
    (nodes in cluster)
•   Mappers pre-aggregate by key, then all
    output for (a) given key(s) goes to a
    reducer
•   Reducer completes aggregations; one
    output per key, with value
•   Map and Reduce code natively written as
    Java functions
MapReduce, in a Diagram


        Input   mapper   Output

                                  K1

        Input   mapper   Output   Input   reducer   Output

                                  K2
                mapper   Output                              Output
        Input                     Input   reducer   Output
Input
                                  K3
        Input   mapper   Output
                                  Input   reducer   Output


        Input   mapper   Output


        Input   mapper   Output
What’s a Distributed File System?
•   One where data gets distributed over
    commodity drives on commodity servers
•   Data is replicated
•   If one box goes down, no data lost
    – Except the name node = SPOF!
•   BUT: HDFS is immutable
    – Files can only be written to once
    – So updates require drop + re-write (slow)
Hadoop = MapReduce + HDFS
•   Modeled after Google MapReduce + GFS
•   Have more data? Just add more nodes to
    cluster.
    – Mappers execute in parallel
    – Hardware is commodity
    – “Scaling out”
•   Use of HDFS means data may well be local
    to mapper processing
•   So, not just parallel, but minimal data
    movement, which avoids network
    bottlenecks
What’s NoSQL?
•   Databases that are non-relational (don’t let
    name fool you, some actually use SQL)
•   Four kinds:
    – Key-Value Store
      Schema-free
      FYI: Azure Table Storage is an example
    – Document Store
      All data stored in JSON objects
    – Wide-Column Store
      Define column families, but not columns
    – Graph database
      Manage relationships between objects
What’s HBase?
•   A Wide-Column Store
•   Modeled after Google BigTable
•   Born at Powerset in 2007
    – Powerset acquired by Microsoft in 2008
    – Adopted in 2010 by Facebook for messaging platform
•   Uses HDFS
    – Therefore, Hadoop-compatible
•   Hadoop often used with HBase
    – But you can use either without the other
The Hadoop Stack
•   Hadoop
    – MapReduce, HDFS
•   HBase
    – Lesser extent: Cassandra, HyperTable
•   Hive, Pig
    – SQL-like “data warehouse” system
    – Data transformation language
•   Sqoop
    – Import/export between HDFS, HBase,
      Hive and relational data warehouses
•   Flume
    – Log file integration
•   Mahout
    – Data Mining
What’s Hive?
•   Began as Hadoop sub-project
    – Now top-level Apache project
•   Provides a SQL-like (“HiveQL”)
    abstraction over MapReduce
•   Has its own HDFS table file format (and it’s
    fully schema-bound)
•   Can also work over HBase
•   Acts as a bridge to many BI products
    which expect tabular data
Hadoop Distributions
•   Cloudera
•   Hortonworks
    – HCatalog: Hive/Pig/MR Interop
•   MapR
    – Network File System replaces HDFS
•   IBM InfoSphere BigInsights
    – HDFS<->DB2 integration
•   And now Microsoft…
Project “Isotope”
•   Work with Hortonworks to create “distro”
    of Hadoop that runs on Windows Server
    and Windows Azure
    – Hortonworks are ex-Yahoo FTEs who are Hadoop
      pioneers
•   Create ODBC Driver for Hive
    – And Excel Add-In that uses it
•   Build JavaScript command line and
    MapReduce framework
•   Contribute it all back to open source
    Apache project
Hadoop on Azure
•   Install onto your own Azure VMs and build
    a cluster, or…
•   Provision a cluster in one step
    – Give it a name
    – Choose number of nodes and storage size in cluster
    – Wait for it to provision
    – Go!
Provisioning a Cluster
Submitting, Running and
Monitoring Jobs
•   Upload a JAR
•   Use .NET
•   Use the JavaScript Console
•   Use the Hive Console
Running MapReduce
Jobs
Hadoop on Azure Data Sources
•   Files in HDFS
•   Azure Blob Storage
•   Amazon S3 Storage
•   Hive Tables
•   HBase?
Review: ODBC Connection Types
•   Registry-based
    – User Data Source Name (DSN)
    – System DSN
•   File-based
    – File DSN
•   String-based
    – DSN-less connection
•   We need file-based
•   Wizard obfuscates how to do this
•   Don’t forget to open the ODBC port!
Hive ODBC Setup,
Excel Add-In
ODBC Driver’s Untold Story
•   Works with any Hive install/Hadoop
    cluster, not just Windows-based ones.
How Does SQL Server Fit In?
•   RDBMS + PDW: Sqoop connectors
•   RDBMS: Columnstore Indexes
    – Enterprise Edition only
•   Analysis Services: Tabular Mode
    – Compatible with ODBC Driver
      Multidimensional mode is not
•   RDBMS + SSAS Tabular: DirectQuery
•   PowerPivot (as with SSAS Tabular)
•   Power View
    – Works against PowerPivot and SSAS Tabular
Querying Hadoop from
SQL Server BI
The “Data-Refinery” Idea
•   Use Hadoop to “on-board” unstructured
    data, then extract manageable subsets
•   Load the subsets into conventional DW/BI
    servers and use familiar analytics tools to
    examine
•   This is the current rationalization of
    Hadoop + BI tools’ coexistence
•   Will it stay this way?
Usability Impact
•   PowerPivot makes analysis much easier,
    self-service
•   Power View is great for discovery and
    visualization; also self-service
•   Combine with the Hive ODBC driver and
    suddenly Hadoop is accessible to
    business users
•   Caveats
    – Someone has to write the HiveQL
    – Can query Big Data, but must have smaller result
Other Relevant MS Technologies
•   SQL Server Components:
    – SQL Server Parallel Data Warehouse
    – StreamInsight
•   Azure Components:
    – Data Explorer
    – DataMarket
•   Deprecated MSR Project
    – Dryad
Resources
•   Big On Data blog
    – http://www.zdnet.com/blog/big-data
•   Apache Hadoop home page
    – http://hadoop.apache.org/
•   Hive & Pig home pages
    – http://hive.apache.org/
    – http://pig.apache.org/
•   Hadoop on Azure home page
    – https://www.hadooponazure.com/
•   SQL Server 2012 Big Data
    – http://bit.ly/sql2012bigdata
Thank you



•   andrew.brust@bluebadgeinsights.com
•   @andrewbrust on twitter
•   Want to get the free “Redmond Roundup
    Plus?”
    – Text “bluebadge” to 22828

Más contenido relacionado

La actualidad más candente

A Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data HullabalooA Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data HullabalooAndrew Brust
 
Nonrelational Databases
Nonrelational DatabasesNonrelational Databases
Nonrelational DatabasesUdi Bauman
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms Andrew Brust
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big dataSteven Francia
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7abdulrahmanhelan
 
Hadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in ActionHadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in ActionAndrew Brust
 
Non Relational Databases
Non Relational DatabasesNon Relational Databases
Non Relational DatabasesChris Baglieri
 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Managementsameerfaizan
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL DatabasesBADR
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 

La actualidad más candente (20)

Relational vs. Non-Relational
Relational vs. Non-RelationalRelational vs. Non-Relational
Relational vs. Non-Relational
 
A Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data HullabalooA Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data Hullabaloo
 
Nonrelational Databases
Nonrelational DatabasesNonrelational Databases
Nonrelational Databases
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big data
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
 
Selecting best NoSQL
Selecting best NoSQL Selecting best NoSQL
Selecting best NoSQL
 
Hadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in ActionHadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in Action
 
NoSQL Seminer
NoSQL SeminerNoSQL Seminer
NoSQL Seminer
 
Non Relational Databases
Non Relational DatabasesNon Relational Databases
Non Relational Databases
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Management
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Rdbms vs. no sql
Rdbms vs. no sqlRdbms vs. no sql
Rdbms vs. no sql
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
NoSQL
NoSQLNoSQL
NoSQL
 

Similar a Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012

Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackAndrew Brust
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 

Similar a Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012 (20)

Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Apache drill
Apache drillApache drill
Apache drill
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
hadoop overview.pptx
hadoop overview.pptxhadoop overview.pptx
hadoop overview.pptx
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 

Más de Andrew Brust

Azure ml screen grabs
Azure ml screen grabsAzure ml screen grabs
Azure ml screen grabsAndrew Brust
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIAndrew Brust
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystemAndrew Brust
 
SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012Andrew Brust
 
Power View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s DataPower View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s DataAndrew Brust
 
Evolved BI with SQL Server 2012
Evolved BIwith SQL Server 2012Evolved BIwith SQL Server 2012
Evolved BI with SQL Server 2012Andrew Brust
 
Grasping The LightSwitch Paradigm
Grasping The LightSwitch ParadigmGrasping The LightSwitch Paradigm
Grasping The LightSwitch ParadigmAndrew Brust
 
Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis Andrew Brust
 

Más de Andrew Brust (8)

Azure ml screen grabs
Azure ml screen grabsAzure ml screen grabs
Azure ml screen grabs
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BI
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
 
SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012
 
Power View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s DataPower View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s Data
 
Evolved BI with SQL Server 2012
Evolved BIwith SQL Server 2012Evolved BIwith SQL Server 2012
Evolved BI with SQL Server 2012
 
Grasping The LightSwitch Paradigm
Grasping The LightSwitch ParadigmGrasping The LightSwitch Paradigm
Grasping The LightSwitch Paradigm
 
Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis
 

Último

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012

  • 1. Microsoft's Big Play for Big Data Andrew J. Brust CEO and Founder Blue Badge Insights Level: Intermediate
  • 2. Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust
  • 3. My New Blog (bit.ly/bigondata)
  • 5. What is Big Data? • 100s of TB into PB and higher • Involving data from: financial data, sensors, Web logs, social media, etc. • Distributed/parallel processing often involved – Hadoop is emblematic, but other technologies are Big Data too • Processing of data sets too large for transactional databases – Analyzing interactions, rather than transactions – The three V’s: Volume, Velocity, Variety • Big Data tech sometimes imposed on small data problems
  • 6. What’s MapReduce? • “Big” input data as key-value pair series • Partition the data and send to mappers (nodes in cluster) • Mappers pre-aggregate by key, then all output for (a) given key(s) goes to a reducer • Reducer completes aggregations; one output per key, with value • Map and Reduce code natively written as Java functions
  • 7. MapReduce, in a Diagram Input mapper Output K1 Input mapper Output Input reducer Output K2 mapper Output Output Input Input reducer Output Input K3 Input mapper Output Input reducer Output Input mapper Output Input mapper Output
  • 8. What’s a Distributed File System? • One where data gets distributed over commodity drives on commodity servers • Data is replicated • If one box goes down, no data lost – Except the name node = SPOF! • BUT: HDFS is immutable – Files can only be written to once – So updates require drop + re-write (slow)
  • 9. Hadoop = MapReduce + HDFS • Modeled after Google MapReduce + GFS • Have more data? Just add more nodes to cluster. – Mappers execute in parallel – Hardware is commodity – “Scaling out” • Use of HDFS means data may well be local to mapper processing • So, not just parallel, but minimal data movement, which avoids network bottlenecks
  • 10. What’s NoSQL? • Databases that are non-relational (don’t let name fool you, some actually use SQL) • Four kinds: – Key-Value Store Schema-free FYI: Azure Table Storage is an example – Document Store All data stored in JSON objects – Wide-Column Store Define column families, but not columns – Graph database Manage relationships between objects
  • 11. What’s HBase? • A Wide-Column Store • Modeled after Google BigTable • Born at Powerset in 2007 – Powerset acquired by Microsoft in 2008 – Adopted in 2010 by Facebook for messaging platform • Uses HDFS – Therefore, Hadoop-compatible • Hadoop often used with HBase – But you can use either without the other
  • 12. The Hadoop Stack • Hadoop – MapReduce, HDFS • HBase – Lesser extent: Cassandra, HyperTable • Hive, Pig – SQL-like “data warehouse” system – Data transformation language • Sqoop – Import/export between HDFS, HBase, Hive and relational data warehouses • Flume – Log file integration • Mahout – Data Mining
  • 13. What’s Hive? • Began as Hadoop sub-project – Now top-level Apache project • Provides a SQL-like (“HiveQL”) abstraction over MapReduce • Has its own HDFS table file format (and it’s fully schema-bound) • Can also work over HBase • Acts as a bridge to many BI products which expect tabular data
  • 14. Hadoop Distributions • Cloudera • Hortonworks – HCatalog: Hive/Pig/MR Interop • MapR – Network File System replaces HDFS • IBM InfoSphere BigInsights – HDFS<->DB2 integration • And now Microsoft…
  • 15. Project “Isotope” • Work with Hortonworks to create “distro” of Hadoop that runs on Windows Server and Windows Azure – Hortonworks are ex-Yahoo FTEs who are Hadoop pioneers • Create ODBC Driver for Hive – And Excel Add-In that uses it • Build JavaScript command line and MapReduce framework • Contribute it all back to open source Apache project
  • 16. Hadoop on Azure • Install onto your own Azure VMs and build a cluster, or… • Provision a cluster in one step – Give it a name – Choose number of nodes and storage size in cluster – Wait for it to provision – Go!
  • 18. Submitting, Running and Monitoring Jobs • Upload a JAR • Use .NET • Use the JavaScript Console • Use the Hive Console
  • 20. Hadoop on Azure Data Sources • Files in HDFS • Azure Blob Storage • Amazon S3 Storage • Hive Tables • HBase?
  • 21. Review: ODBC Connection Types • Registry-based – User Data Source Name (DSN) – System DSN • File-based – File DSN • String-based – DSN-less connection • We need file-based • Wizard obfuscates how to do this • Don’t forget to open the ODBC port!
  • 23. ODBC Driver’s Untold Story • Works with any Hive install/Hadoop cluster, not just Windows-based ones.
  • 24. How Does SQL Server Fit In? • RDBMS + PDW: Sqoop connectors • RDBMS: Columnstore Indexes – Enterprise Edition only • Analysis Services: Tabular Mode – Compatible with ODBC Driver Multidimensional mode is not • RDBMS + SSAS Tabular: DirectQuery • PowerPivot (as with SSAS Tabular) • Power View – Works against PowerPivot and SSAS Tabular
  • 26. The “Data-Refinery” Idea • Use Hadoop to “on-board” unstructured data, then extract manageable subsets • Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine • This is the current rationalization of Hadoop + BI tools’ coexistence • Will it stay this way?
  • 27. Usability Impact • PowerPivot makes analysis much easier, self-service • Power View is great for discovery and visualization; also self-service • Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users • Caveats – Someone has to write the HiveQL – Can query Big Data, but must have smaller result
  • 28. Other Relevant MS Technologies • SQL Server Components: – SQL Server Parallel Data Warehouse – StreamInsight • Azure Components: – Data Explorer – DataMarket • Deprecated MSR Project – Dryad
  • 29. Resources • Big On Data blog – http://www.zdnet.com/blog/big-data • Apache Hadoop home page – http://hadoop.apache.org/ • Hive & Pig home pages – http://hive.apache.org/ – http://pig.apache.org/ • Hadoop on Azure home page – https://www.hadooponazure.com/ • SQL Server 2012 Big Data – http://bit.ly/sql2012bigdata
  • 30. Thank you • andrew.brust@bluebadgeinsights.com • @andrewbrust on twitter • Want to get the free “Redmond Roundup Plus?” – Text “bluebadge” to 22828

Notas del editor

  1. Visual Studio Live! New York 2012 © 2012 Visual Studio Live! All rights reserved.
  2. Visual Studio Live! Las Vegas 2011 © 2012 Visual Studio Live! All rights reserved.