SlideShare a Scribd company logo
1 of 24
Big Data in the Real World
Orlando PASS
October 2013

http://www.pssug.org
Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
What we’ll (try) to cover today

‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ Microsoft + Hortonworks (Yahoo!) = HDInsights

2
Big Data 101
‣ 3 V’s
‣ Volume – Terabyte records, transactions, tables, files
‣ Velocity – Batch, near-time, real-time (analytics), streams.
‣ Variety – Structures, unstructured, semi-structured, and all the above in a mix

‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights
‣ Distributed File System & Programming
Mark’s Big Data Myths
‣ Big Data ≠ NoSQL
‣ NoSQL has similar Internet-scale Web origins of Hadoop stack (Yahoo!,
Google, Facebook, et al) but not the same thing

‣ Facebook, for example, uses Hbase from the Hadoop stack
‣ NoSQL does not have to be Big Data

‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner
and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files for discovery of new patterns and
insights

‣ Use the DW to aggregate and store the summaries of those calculations for
reporting
‣
‣
‣
‣
‣
‣

Batch Processing
Commodity Hardware

Data Locality, no shared storage
Scales linearly
Great for large text file processing, not so great on small files
Distributed programming paradigm
Popular Hadoop Distributions
Hosted PaaS Hadoop platforms: Amazon
EMR, Pivotal, Microsoft Hadoop on Azure
Popular NoSQL Distributions
Transactional-based, not analytics schemas
Popular MPP Distributions
Big Data as distributed, scale-out, sharded data stores
Big Data Analytics Web Platform - Example
MapReduce Framework (Map)
using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;
public class TotalHitsForPageMap : MapperBase
{
public override void Map(string inputLine, MapperContext context)
{
context.Log(inputLine);
var parts = Regex.Split(inputLine, "s+");
if (parts.Length != expected) //only take records with all values
{
return;
}
context.EmitKeyValue(parts[pagePos], hit);
}
}
MapReduce Framework (Reduce & Job)
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext
context)
{
context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString());
}
}
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner>
{
public override HadoopJobConfiguration Configure(ExecutorContext context)
{
var retVal = new HadoopJobConfiguration();
retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
retVal.DeleteOutputFolder = true;
return retVal;
}
}
Get Data into Hadoop
‣
‣
‣
‣

Linux shell commands to access data in HDFS
Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
List files in HDFS:
c:Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup

114 2013-05-07 12:11 /import/sales.csv

‣ View file in HDFS:
c:Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99

James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11

‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
Use Hive for Data Schema and Analysis
create external table ext_sales
(
lastname string,
productid int,
quantity int,
sales_amount float

)
row format delimited fields terminated by ',' stored as
textfile location '/user/makromer/hiveext/input';
LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE
INTO TABLE ext_sales;
Sqoop
Data transfer to & from Hadoop & SQL Server
‣ sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password
password –table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password
password -m 1 –table customers –export-dir /user/mark/data/employees3

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in
32.6364 seconds (6.1588 bytes/sec)

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
SQL Server Big Data – Data Loading

Amazon HDFS & EMR

Data Loading

Amazon S3 Bucket
Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of RDBMS
‣ Hbase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few

‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later

‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a
SQL, SAS, MR interface

‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)
SQL Server Big Data Environment
‣ SQL Server Database
‣
‣
‣
‣
‣
‣
‣
‣

SQL 2012 Enterprise Edition
Page Compression
2012 Columnar Compression on Fact Tables
Clustered Index on all tables
Auto-update Stats Asynch

Partition Fact Tables by month and archive data with sliding window technique
Drop all indexes before nightly ETL load jobs
Rebuild all indexes when ETL completes

‣ SQL Server Analysis Services
‣
‣
‣
‣

SSAS 2012 Enterprise Edition
2008 R2 OLAP cubes partition-aligned with DW
2012 cubes in-memory tabular cubes
All access through MSMDPUMP or SharePoint
SQL Server Big Data Analytics Features

‣ Columnstore
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP
Microsoft’s Data Solution – Big Data & PDW
Excel with PowerPivot

Predictive Analytics

Embedded BI

Data Market Place

Familiar End User Tools

S
S
R
S

SSAS

BI Platform

Hundreds of TB of Data
(structured)

Petabytes of Data
(Unstructured)
Hadoop On
Windows
Azure

Sensors

Hadoop On
Windows
Server

Devices

Data Market

Bots

Connectors
Parallel Data Warehouse
Crawlers

ERP

CRM

LOB

APPs

Unstructured and Structured Data
19

19
MICROSOFT BIG DATA
immersive data
experiences

PowerPivot

Self-Service

Power View

Collaboration

Corporate Apps

Devices

connecting with
worlds data
Combine

Discover

Refine

Microsoft HDInsight Server

any data, any
size, anywhere

StreamInsight
Parallel Data Warehouse

Relational

HDInsight Service

Non-relational

Analytical

Streaming
Microsoft .NET Hadoop APIs
‣ WebHDFS
‣ Linq to Hive
‣ MapReduce
‣ C#
‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
‣ Cloudera Impala
‣ Teradata SQL-H
‣ Microsoft Polybase
‣ Hadapt
Data Movement to the Cloud

‣ Use Windows Azure Blob Storage
• Already stored in 3 copies
• Hadoop can read from Azure blob storage
• Allows you to upload while using no Hadoop network or CPU resources

‣ Compress files
•
•
•
•

Hadoop can read Gzip
Uses less network resources than uncompressed
Costs less for direct storage costs
Compress directories where source files are created as well.

23
Wrap-up
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
‣ TBs of change data in each subject area
‣ The data in the sources are variable and unstructured
‣ SSIS ETL alone couldn’t keep up or handle complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL
Server for Big Data

‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse

More Related Content

What's hot

Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
DataWorks Summit
 

What's hot (20)

Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Big Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace ImagesBig Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace Images
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for Architects
 
Big Data with Azure
Big Data with AzureBig Data with Azure
Big Data with Azure
 
Azure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsAzure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analytics
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 

Viewers also liked

Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Mark Kromer
 

Viewers also liked (19)

Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life Revolution
 
Big data in the real world opportunities and challenges facing healthcare -...
Big data in the real world   opportunities and challenges facing healthcare -...Big data in the real world   opportunities and challenges facing healthcare -...
Big data in the real world opportunities and challenges facing healthcare -...
 
ETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft AzureETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft Azure
 
Definitions for Real World of Big Data Marketing
Definitions for Real World of Big Data MarketingDefinitions for Real World of Big Data Marketing
Definitions for Real World of Big Data Marketing
 
Analytics: The Real-world Use of Big Data
Analytics: The Real-world Use of Big DataAnalytics: The Real-world Use of Big Data
Analytics: The Real-world Use of Big Data
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Microsoft Event Registration System Hosted on Windows Azure
Microsoft Event Registration System Hosted on Windows AzureMicrosoft Event Registration System Hosted on Windows Azure
Microsoft Event Registration System Hosted on Windows Azure
 
PSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL ServerPSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL Server
 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
 
What's new in SQL Server 2012 for philly code camp 2012.1
What's new in SQL Server 2012 for philly code camp 2012.1What's new in SQL Server 2012 for philly code camp 2012.1
What's new in SQL Server 2012 for philly code camp 2012.1
 
MEC Data sheet
MEC Data sheetMEC Data sheet
MEC Data sheet
 
Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data Solutions
 
Sql server 2012 roadshow masd overview 003
Sql server 2012 roadshow masd overview 003Sql server 2012 roadshow masd overview 003
Sql server 2012 roadshow masd overview 003
 
Microsoft SQL Server Data Warehouses for SQL Server DBAs
Microsoft SQL Server Data Warehouses for SQL Server DBAsMicrosoft SQL Server Data Warehouses for SQL Server DBAs
Microsoft SQL Server Data Warehouses for SQL Server DBAs
 
Azure vs. amazon
Azure vs. amazonAzure vs. amazon
Azure vs. amazon
 
AWS vs Azure - Cloud Services Comparison
AWS vs Azure - Cloud Services ComparisonAWS vs Azure - Cloud Services Comparison
AWS vs Azure - Cloud Services Comparison
 
The real world use of Big Data to change business
The real world use of Big Data to change businessThe real world use of Big Data to change business
The real world use of Big Data to change business
 
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature MappingMicrosoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
 

Similar to Big Data in the Real World

Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
Jesus Rodriguez
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013
Jean-Pierre König
 

Similar to Big Data in the Real World (20)

Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
 

More from Mark Kromer

More from Mark Kromer (20)

Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptx
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 
Data Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFData Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADF
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Data Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFData Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADF
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
 
Data quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFData quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADF
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
 
ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1
 
ADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview Migration
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 

Recently uploaded

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Big Data in the Real World

  • 1. Big Data in the Real World Orlando PASS October 2013 http://www.pssug.org Mark Kromer http://www.kromerbigdata.com @kromerbigdata @mssqldude
  • 2. What we’ll (try) to cover today ‣ What is Big Data? ‣ The Big Data and Apache Hadoop environment ‣ Big Data Analytics ‣ SQL Server in the Big Data world ‣ Microsoft + Hortonworks (Yahoo!) = HDInsights 2
  • 3. Big Data 101 ‣ 3 V’s ‣ Volume – Terabyte records, transactions, tables, files ‣ Velocity – Batch, near-time, real-time (analytics), streams. ‣ Variety – Structures, unstructured, semi-structured, and all the above in a mix ‣ Text Processing ‣ Techniques for processing and analyzing unstructured (and structured) LARGE files ‣ Analytics & Insights ‣ Distributed File System & Programming
  • 4. Mark’s Big Data Myths ‣ Big Data ≠ NoSQL ‣ NoSQL has similar Internet-scale Web origins of Hadoop stack (Yahoo!, Google, Facebook, et al) but not the same thing ‣ Facebook, for example, uses Hbase from the Hadoop stack ‣ NoSQL does not have to be Big Data ‣ Big Data ≠ Real Time ‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value ‣ Use in-memory analytics for real time insights ‣ Big Data ≠ Data Warehouse ‣ I still refer to large multi-TB DWs as “VLDB” ‣ Big Data is about crunching stats in text files for discovery of new patterns and insights ‣ Use the DW to aggregate and store the summaries of those calculations for reporting
  • 5. ‣ ‣ ‣ ‣ ‣ ‣ Batch Processing Commodity Hardware Data Locality, no shared storage Scales linearly Great for large text file processing, not so great on small files Distributed programming paradigm
  • 6. Popular Hadoop Distributions Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
  • 8. Popular MPP Distributions Big Data as distributed, scale-out, sharded data stores
  • 9. Big Data Analytics Web Platform - Example
  • 10. MapReduce Framework (Map) using Microsoft.Hadoop.MapReduce; using System.Text.RegularExpressions; public class TotalHitsForPageMap : MapperBase { public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, "s+"); if (parts.Length != expected) //only take records with all values { return; } context.EmitKeyValue(parts[pagePos], hit); } }
  • 11. MapReduce Framework (Reduce & Job) public class TotalHitsForPageReducerCombiner : ReducerCombinerBase { public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) { context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString()); } } public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner> { public override HadoopJobConfiguration Configure(ExecutorContext context) { var retVal = new HadoopJobConfiguration(); retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); retVal.DeleteOutputFolder = true; return retVal; } }
  • 12. Get Data into Hadoop ‣ ‣ ‣ ‣ Linux shell commands to access data in HDFS Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv List files in HDFS: c:Hadoop>hadoop fs -ls /import Found 1 items -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv ‣ View file in HDFS: c:Hadoop>hadoop fs -cat /import/sales.csv Kromer,123,5,55 Smith,567,1,25 Jones,123,9,99 James,11,12,1 Johnson,456,2,2.5 Singh,456,1,3.25 Yu,123,1,11 ‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
  • 13. Use Hive for Data Schema and Analysis create external table ext_sales ( lastname string, productid int, quantity int, sales_amount float ) row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input'; LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
  • 14. Sqoop Data transfer to & from Hadoop & SQL Server ‣ sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1 ‣ > hadoop fs -cat /user/mark/customers/part-m-00000 ‣ > 5,Bob Smith ‣ sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3 ‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec) ‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
  • 15. SQL Server Big Data – Data Loading Amazon HDFS & EMR Data Loading Amazon S3 Bucket
  • 16. Role of NoSQL in a Big Data Analytics Solution ‣ Use NoSQL to store data quickly without the overhead of RDBMS ‣ Hbase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few ‣ Why NoSQL? ‣ In the world of “Big Data” ‣ “Schema later” ‣ Ignore ACID properties ‣ Drop data into key-value store quick & dirty ‣ Worry about query & read later ‣ Why NOT NoSQL? ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface ‣ SQL Server and NoSQL ‣ Not a natural fit ‣ Use HDFS or your favorite NoSQL database ‣ Consider turning off SQL Server locking mechanisms ‣ Focus on writes, not reads (read uncommitted)
  • 17. SQL Server Big Data Environment ‣ SQL Server Database ‣ ‣ ‣ ‣ ‣ ‣ ‣ ‣ SQL 2012 Enterprise Edition Page Compression 2012 Columnar Compression on Fact Tables Clustered Index on all tables Auto-update Stats Asynch Partition Fact Tables by month and archive data with sliding window technique Drop all indexes before nightly ETL load jobs Rebuild all indexes when ETL completes ‣ SQL Server Analysis Services ‣ ‣ ‣ ‣ SSAS 2012 Enterprise Edition 2008 R2 OLAP cubes partition-aligned with DW 2012 cubes in-memory tabular cubes All access through MSMDPUMP or SharePoint
  • 18. SQL Server Big Data Analytics Features ‣ Columnstore ‣ Sqoop adapter ‣ PolyBase ‣ Hive ‣ In-memory analytics ‣ Scale-out MPP
  • 19. Microsoft’s Data Solution – Big Data & PDW Excel with PowerPivot Predictive Analytics Embedded BI Data Market Place Familiar End User Tools S S R S SSAS BI Platform Hundreds of TB of Data (structured) Petabytes of Data (Unstructured) Hadoop On Windows Azure Sensors Hadoop On Windows Server Devices Data Market Bots Connectors Parallel Data Warehouse Crawlers ERP CRM LOB APPs Unstructured and Structured Data 19 19
  • 20. MICROSOFT BIG DATA immersive data experiences PowerPivot Self-Service Power View Collaboration Corporate Apps Devices connecting with worlds data Combine Discover Refine Microsoft HDInsight Server any data, any size, anywhere StreamInsight Parallel Data Warehouse Relational HDInsight Service Non-relational Analytical Streaming
  • 21.
  • 22. Microsoft .NET Hadoop APIs ‣ WebHDFS ‣ Linq to Hive ‣ MapReduce ‣ C# ‣ Java ‣ Hive ‣ Pig ‣ http://hadoopsdk.codeplex.com/ ‣ SQL on Hadoop ‣ Cloudera Impala ‣ Teradata SQL-H ‣ Microsoft Polybase ‣ Hadapt
  • 23. Data Movement to the Cloud ‣ Use Windows Azure Blob Storage • Already stored in 3 copies • Hadoop can read from Azure blob storage • Allows you to upload while using no Hadoop network or CPU resources ‣ Compress files • • • • Hadoop can read Gzip Uses less network resources than uncompressed Costs less for direct storage costs Compress directories where source files are created as well. 23
  • 24. Wrap-up ‣ What is a Big Data approach to Analytics? ‣ Massive scale ‣ Data discovery & research ‣ Self-service ‣ Reporting & BI ‣ Why do we take this Big Data Analytics approach? ‣ TBs of change data in each subject area ‣ The data in the sources are variable and unstructured ‣ SSIS ETL alone couldn’t keep up or handle complexity ‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL Server for Big Data ‣ With the configs mentioned previously, SQL Server works great ‣ Analytics on Big Data also requires Big Data Analytics tools ‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse