1. | © Copyright 2015 Hitachi Consulting1
Introducing Big Data
with Microsoft Azure
Khalid M. Salama
Microsoft Business Intelligence
Hitachi Consulting UK
We Make it Happen. Better.
2. | © Copyright 2015 Hitachi Consulting2
Outline
What is Big Data?
Why Big Data Platforms?
Fundamentals of a Big Data Platform
Distributed Processing & CAP Theorem
Big Data Solutions vs. Traditional RDBMS
Where Big Data Fits in Enterprise Data Platforms?
Hadoop Ecosystem: Apache Tools for Big Data
Big Data on Microsoft Azure
How to Get Started with Big Data?
4. | © Copyright 2015 Hitachi Consulting4
What is Big Data?
“Data that is too complex to process efficiently and cost-effectively
using traditional relational databases.”
In a nutshell…
Complex (3 V’s)
Volume – Huge amounts of data to process
Variety – A mixture of structured and unstructured data
Velocity – High-frequency or (near) real-time data processing
Processing
Stream (operational)
Batch (analytical)
Efficiently
Availability/Scalability
Performance/Throughput
Cost-Effectively
Acquiring
Scaling up/down
7. | © Copyright 2015 Hitachi Consulting7
What is Big Data?
Common examples and applications
Clickstream
• User experience improvement
• Recommendations & targeted advertising
Sensors/Devices
• Predictive maintenance
• Energy efficiency – smart cities
Social Media
• Sentiment analysis
• Crisis management
Spatial & GPS
• Push notifications
• Process optimisation
Images/Audio/Video
• Proactive security
Free Text
• Analysis of customer reviews/feedback/complaints
• Automatic news summarisation/analysis
8. | © Copyright 2015 Hitachi Consulting8
Why Big Data Platforms?
Traditional Data Platforms
9. | © Copyright 2015 Hitachi Consulting9
Why Big Data Platforms?
Breaking points of traditional Data Platforms – Volume
10. | © Copyright 2015 Hitachi Consulting10
Why Big Data Platforms?
Breaking points of traditional Data Platforms – Variety
11. | © Copyright 2015 Hitachi Consulting11
Why Big Data Platforms?
Breaking points of traditional Data Platforms – Velocity
12. | © Copyright 2015 Hitachi Consulting12
Enterprise-wide data scale
[Chart: data volumes growing from gigabytes to terabytes, across
transactional and non-transactional data]
16. | © Copyright 2015 Hitachi Consulting16
Addressing Big Data Challenges
Addressing the three “V”s…
Challenges: Volume, Variety, Velocity
Solutions: Distributed Computing, Batch Processing, Stream Processing,
In-Memory Processing, NoSQL
Trade-off: Consistency / Availability / Fault Tolerance (CAP)
24. | © Copyright 2015 Hitachi Consulting24
Addressing Big Data Challenges
Tell me more…
Distributed Computing
  Cluster of many data/compute nodes (commodity hardware)
  Data partitioning (sharding)
  Data partitions are processed in parallel
  Easy/cheap to scale out
Batch Processing
  Processes massive amounts of data
  Write once / read many
  High latency
  Distributed, available/fault-tolerant, eventually consistent
In-Memory Processing
  Iterative processing of the same data in memory
  Data size must fit into memory
  Low latency
  Distributed, available/fault-tolerant, eventually consistent
Stream Processing
  Processes a continuous stream of data
  Small data chunks
  Low latency
  Distributed, available/fault-tolerant, eventually consistent
NoSQL
  Key-value stores, column-family stores, document stores, graph stores
  Distributed, available/fault-tolerant
  Random read/write access
  Supports batch & stream workloads
  High throughput
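The sharding idea above can be sketched in a few lines of Python. This is a toy illustration only; the helper names (`shard_for`, `partition`) are made up, not from any real system:

```python
from collections import defaultdict

def shard_for(key: str, num_nodes: int) -> int:
    """Pick a node for a key by hashing (simple modulo sharding)."""
    return hash(key) % num_nodes

def partition(records, num_nodes):
    """Distribute records across nodes; each partition can then be
    processed in parallel on its own node."""
    shards = defaultdict(list)
    for key, value in records:
        shards[shard_for(key, num_nodes)].append((key, value))
    return shards

records = [("user1", 10), ("user2", 5), ("user1", 7)]
shards = partition(records, num_nodes=4)

# Every record lands on exactly one node, and records with the
# same key always land on the same node.
assert sum(len(v) for v in shards.values()) == len(records)
assert shard_for("user1", 4) == shard_for("user1", 4)
```

Real systems use consistent hashing rather than plain modulo, so that adding or removing a node moves only a fraction of the keys.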
25. | © Copyright 2015 Hitachi Consulting25
Fundamental Components
26. | © Copyright 2015 Hitachi Consulting26
Fundamentals of a Big Data Platform
Basic Architectural Components
Distributed File System
  Data files are stored in raw form (no schema)
  Partitioned across data nodes (disks)
  Each partition is replicated to M nodes – fault tolerance
Compute Cluster
  Head node, plus an extra failover head node – availability
  Compute nodes 1 … N
  Resource Manager – manages and executes jobs via a distributed execution model
Applications
  Batch, In-Memory, Stream, SQL, NoSQL – support batch/speed workloads
Acquisition
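The “partition + replicate to M nodes” idea above can be sketched as follows. This is purely illustrative (real HDFS placement is rack-aware and pipeline-based); the round-robin placement is an assumption for the sketch:

```python
def place_replicas(block_id: int, nodes: list, m: int) -> list:
    """Assign a block to m distinct nodes, round-robin style."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(m)]

nodes = ["node1", "node2", "node3", "node4"]
# Replicate each of 8 blocks to M = 3 of the 4 nodes.
placement = {b: place_replicas(b, nodes, m=3) for b in range(8)}

# Fault tolerance: if any single node dies, every block still has
# at least one surviving replica.
dead = "node2"
for block, replicas in placement.items():
    assert len(set(replicas)) == 3
    assert any(n != dead for n in replicas)
```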
32. | © Copyright 2015 Hitachi Consulting32
Fundamentals of a Big Data Platform
Lambda Architecture
Data is dispatched to both the batch layer and the
speed layer for processing.
The batch layer (cold path) manages the master dataset (write
once, read many) and pre-computes the batch views. It handles
large data volumes at high latency.
The speed layer (hot path) compensates for that latency by
computing real-time views over a recent, limited window of data only.
The serving layer indexes the batch views so that they can be
queried in a low-latency, ad-hoc way, and answers an incoming
query by merging results from the batch views and real-time views.
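The serving-layer merge described above can be sketched with a toy example (the views and figures are made up for illustration):

```python
# Batch view: pre-computed over the master dataset (complete, but high latency).
batch_view = {"2015-10": 120, "2015-11": 300}

# Real-time view: covers only the recent window the last batch run hasn't seen.
realtime_view = {"2015-11": 15, "2015-12": 8}

def serve(month: str) -> int:
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view.get(month, 0) + realtime_view.get(month, 0)

assert serve("2015-11") == 315  # cold path + hot path merged
assert serve("2015-12") == 8    # only the speed layer has seen this data yet
```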
34. | © Copyright 2015 Hitachi Consulting34
Distributed Processing & CAP Theorem
The trade-off…
To handle large-volume data processing efficiently, we need to scale out, i.e.,
partition the data and distribute the computation.
We then face a trade-off between Consistency, Availability, and Partition Tolerance:
Consistency: Data is in a consistent state across all the nodes.
That is, all reads return the same, most recent write.
Availability: Every request to the system gets a response (success or failure).
That is, the system stays responsive.
Partition Tolerance: The system continues to work despite message loss or
node failure. That is, the system can sustain partial network failures.
CAP Theorem: only two of the three properties can be satisfied in a distributed
data system. In effect, it is consistency vs. availability, given partition tolerance!
40. | © Copyright 2015 Hitachi Consulting40
Distributed Processing & CAP Theorem
The trade-off…
P (partition tolerance): the system continues working even if a
partition is not reachable.
Big Data Systems
BASE Mode – Eventual Consistency
Remain available (operational & responsive) and partition-tolerant,
i.e., sacrifice consistency.
Transactional RDBMS
ACID Mode – Strong Consistency
Commits are atomic across the entire system.
Not partition-tolerant, i.e., sacrifices availability.
ACID
Atomic: Everything in a transaction succeeds,
or the entire transaction is rolled back.
Consistent: A transaction cannot leave the
database in an inconsistent state.
Isolated: Transactions cannot interfere with
each other.
Durable: Completed transactions persist,
even when servers restart, etc.
BASE
Basically Available
Soft state
Eventual consistency
NoSQL: Strong vs. Eventual Consistency
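The BASE behaviour above can be illustrated with a toy pair of replicas where writes are acknowledged immediately and replicated asynchronously (the `Replica`/`sync` names are invented for this sketch):

```python
class Replica:
    """A trivially simple in-memory key-value replica."""
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()
pending = []  # replication log, applied later (asynchronously)

def write(key, value):
    primary.data[key] = value    # acknowledged immediately: the system stays available
    pending.append((key, value)) # replication to the secondary is deferred

def sync():
    """Drain the replication log, bringing the secondary up to date."""
    while pending:
        k, v = pending.pop(0)
        secondary.data[k] = v

write("x", 1)
# Soft state: a read from the secondary may miss the latest write...
assert secondary.data.get("x") is None
sync()
# ...but the replicas eventually converge (eventual consistency).
assert secondary.data["x"] == 1
```

A strongly consistent (ACID-style) system would instead block the write until both replicas had applied it, trading availability for consistency.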
45. | © Copyright 2015 Hitachi Consulting45
Big Data Solutions vs. Traditional RDBMS
The face-off…
Feature | RDBMS | Big Data (Batch) | Big Data (Stream & NoSQL)
Data integrity | Strong consistency – ACID transactions | Eventual consistency – BASE model | Depends on the technology (strong vs. eventual consistency)
Schema | Static – required on write | Dynamic – schema on read | Flexible – extensible
Data types and formats | Structured | Structured, semi-structured, and unstructured | Semi-structured
Read and write pattern | Fully repeatable read/write | Write once, repeatable read | Fully repeatable read/write
Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond | Terabytes, petabytes, and beyond (small data chunks for stream processing)
Scalability | Scale up with more powerful hardware | Scale out with additional servers | Scale out with additional servers
Data processing distribution | Limited or none | Distributed across the cluster | Distributed across the cluster
Economics | Expensive hardware and software | Commodity hardware and open-source software | Commodity hardware and open-source software
Microsoft Patterns & Practices
46. | © Copyright 2015 Hitachi Consulting46
Enterprise Big Data Platform
47. | © Copyright 2015 Hitachi Consulting47
Big Data Fit in Enterprise Data Platform
Enterprise Data Platform
48. | © Copyright 2015 Hitachi Consulting48
Big Data Fit in Enterprise Data Platform
Use Case 1: Data Exploration/ Experiments Platform
Microsoft Patterns & Practices
49. | © Copyright 2015 Hitachi Consulting49
Big Data Fit in Enterprise Data Platform
Use Case 2: Data Processing (ETL)
MPP
MPP
Microsoft Patterns & Practices
50. | © Copyright 2015 Hitachi Consulting50
Big Data Fit in Enterprise Data Platform
Use Case 3: Data Warehouse
Microsoft Patterns & Practices
51. | © Copyright 2015 Hitachi Consulting51
Big Data Fit in Enterprise Data Platform
Use Case 4: Full Data/BI Integration
Microsoft Patterns & Practices
1 – ETL Level Integration
2 – DW Level Integration
3 – BI Level Integration
Corporate Data Model
Reports/Dashboard (Mashup)
MPP
Operational Apps
54. | © Copyright 2015 Hitachi Consulting54
Introducing Hadoop
Apache Hadoop Ecosystem - “A” Big Data Platform
Hadoop Distributed File System (HDFS)
  NameNode
  DataNode 1, DataNode 2, DataNode 3, … DataNode N
Yet Another Resource Negotiator (YARN)
Applications
  Batch, In-Memory, Stream, SQL, Spark-SQL, NoSQL, Machine Learning,
  Search, Orchestration, Management, Acquisition, …
55. | © Copyright 2015 Hitachi Consulting55
Introducing Hadoop
Apache Hadoop Ecosystem - “A” Big Data Platform
MapReduce – A programming model for distributed
processing of large data on a cluster
Pig – A scripting platform for processing and
analysing large data sets
Hive – The de facto standard for SQL queries in
Hadoop
Sqoop – Efficiently transfers bulk data between
Apache Hadoop and relational data stores
Mahout – An algorithm library for scalable machine
learning on Hadoop
Oozie – Provides workflow scheduling services to
manage Hadoop jobs
Storm – A system for processing streaming data in
real time
Kafka – A fast, scalable, fault-tolerant messaging
system
Spark – In-memory compute for ETL, machine
learning, SQL, and streaming
Accumulo – A distributed key-value store with cell-based
access control
CouchDB – JSON document-oriented data
store
HBase – Provides random read/write access to a
distributed, fault-tolerant NoSQL data store
56. | © Copyright 2015 Hitachi Consulting56
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
Oct-03 Dec-04 Jan-06 Feb-06 Apr-06 May-06 Apr-07 Jun-07 Oct-07 Jan-08 Feb-08 Jul-08 Oct-08 Nov-08 Mar-09 Apr-09 May-10 Jun-10 Sep-10 Jan-11 Mar-11 Jun-11 Jan-12 Nov-12 Feb-14 Jun-15
Introducing Hadoop
History
57. | © Copyright 2015 Hitachi Consulting57
Introducing Hadoop
MapReduce - Distributed Programming Model
Map:
  Read lines from file
  Convert each line to key-value pair(s)
  Filter (by key/value)
  Combine values with similar keys
  Shuffle data across nodes for the reducers, by key
Reduce:
  Sort by key
  Aggregate (reduce)
  Filter (based on aggregated value)
  Write results to file
Input → Mappers → Hash Shuffling → Reducers → Output
e.g., the mappers emit (Key1, Value1), (Key2, Value2), (Key1, Value3);
after shuffling, the reducers receive Key1: {Value1, Value3} and Key2: {Value2}.
63. | © Copyright 2015 Hitachi Consulting63
Introducing Hadoop
MapReduce - Example
SELECT Month, City, SUM(SalesValue) FROM Sales WHERE Product = ‘Bike’
GROUP BY Month, City HAVING SUM(SalesValue) > 50,000
Map:
  Read lines from file
  Discard lines where Product is not ‘Bike’
  Convert each line to a (Month-City, Value) pair
  Combine values with similar keys
  Shuffle data across nodes for the reducers, by key
Reduce:
  Sort by key
  Sum all the values for a given key
  Discard records where the sum <= 50,000
  Write results to file
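That job can be sketched in plain Python: the mapper applies the WHERE filter and emits (Month-City, value) pairs, the shuffle groups values by key, and the reducer sums and applies the HAVING filter. The input rows are made up for the example:

```python
from collections import defaultdict

rows = [
    ("2015-10", "London", "Bike", 30_000),
    ("2015-10", "London", "Bike", 25_000),
    ("2015-10", "Leeds",  "Bike", 10_000),
    ("2015-10", "London", "Car",  90_000),  # dropped by the mapper's filter
]

def mapper(row):
    month, city, product, value = row
    if product == "Bike":                 # WHERE Product = 'Bike'
        yield (f"{month}-{city}", value)  # emit key-value pair

def reducer(key, values):
    total = sum(values)                   # SUM(SalesValue)
    if total > 50_000:                    # HAVING SUM(SalesValue) > 50,000
        yield (key, total)

# Shuffle: group mapper outputs by key (done across nodes in a real cluster).
groups = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        groups[key].append(value)

results = [out for key in sorted(groups) for out in reducer(key, groups[key])]
assert results == [("2015-10-London", 55_000)]
```

Leeds is filtered out in the reduce phase (10,000 ≤ 50,000); only London, with 30,000 + 25,000 = 55,000, survives the HAVING clause.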
65. | © Copyright 2015 Hitachi Consulting65
Big Data with Microsoft Azure
66. | © Copyright 2015 Hitachi Consulting66
Big Data on Microsoft Azure
Virtual Machines
(IaaS)
Azure Services
(Data Acquisition, Stream Processing, Machine Learning, NoSQL)
Azure HDInsight
(IaaS+)
Azure Data Lake
(PaaS)
67. | © Copyright 2015 Hitachi Consulting67
Big Data on Microsoft Azure
Infrastructure as a Service (IaaS).
Different distributions of Hadoop, still 100% Hadoop
(plus distribution-specific extra tools).
You are responsible for provisioning, configuring, managing,
and updating the cluster with new tools.
The Distributed File System is part of the compute cluster;
that is, killing the cluster means losing the data.
Hortonworks/Cloudera/MapR Virtual Machines
68. | © Copyright 2015 Hitachi Consulting68
Big Data on Microsoft Azure
Azure HDInsight
Infrastructure as a Service+ (IaaS+).
Hortonworks distribution of Hadoop.
You pay for the cluster (infrastructure), and the Blob Storage, rather than the jobs.
Yet, you are NOT responsible for configuring, managing,
and updating the cluster with new tools (Managed by Microsoft).
On-demand Provisioning/shutting down.
Independent of the Distributed File System (Azure Blob Storage);
that is, killing the cluster does not lose the data.
Data can be shared by multiple clusters.
69. | © Copyright 2015 Hitachi Consulting69
Big Data on Microsoft Azure
Azure HDInsight
Windows Azure Blob Storage (WABS) Distributed File System
Yet Another Resource Negotiator (YARN)
Applications (by cluster type): Hadoop, Spark, Storm, HBase, …
Acquisition: Azure Data Factory
Stream Processing: Stream Analytics, Event Hubs
Machine Learning: Azure Machine Learning
NoSQL: Table Storage, DocumentDB
71. | © Copyright 2015 Hitachi Consulting71
Big Data on Microsoft Azure
The PaaS zoo on the cloud…
Data Factory - Defines and automates the movement, processing,
and transformation of data through data flow pipelines.
Stream Analytics - Real-time event processing engine for
real-time analytic computations on data streams.
Event Hubs - Highly scalable data ingress (message queuing)
service that can ingest millions of events per second for
downstream processing.
Machine Learning - Cloud-based predictive analytics service for
rapid creation and deployment of predictive models as analytics
solutions.
Table Storage - A structured key/attribute NoSQL data store in
the cloud.
DocumentDB - Fully managed NoSQL JSON database service offering
high performance, high availability, automatic scaling, and ease
of development.
72. | © Copyright 2015 Hitachi Consulting72
Data Lake Analytics
Big Data on Microsoft Azure
Azure Data Lake
Data Lake Storage
….
U-SQL
Acquisition
Azure Data Factory
Stream Processing
• Stream Analytics
• Event Hub
Machine Learning
Azure Machine
Learning
NoSQL
Table Storage
DocumentDB
Yet Another Resource Negotiator (YARN)
73. | © Copyright 2015 Hitachi Consulting73
Big Data on Microsoft Azure
Azure Data Lake
Platform as a Service (PaaS).
Microsoft’s own implementation of a Big Data platform, as Google (GCP) and
Amazon (AWS) have, rather than a distribution of Hadoop.
U-SQL for batch data processing.
You pay for the jobs, and the data lake storage.
Optimized Distributed File System (Data Lake) for analytical workloads.
74. | © Copyright 2015 Hitachi Consulting74
Big Data on Microsoft Azure
Microsoft Azure Big Data Analytics Options
Microsoft Advanced Analytics laboratory
75. | © Copyright 2015 Hitachi Consulting75
Big Data on Microsoft Azure
Microsoft Azure – Cortana Analytical Suite
Microsoft
76. | © Copyright 2015 Hitachi Consulting76
How to Get Started with Big Data?
Read these slides!
Coursera – Big Data Specialization
https://www.coursera.org/specializations/big-data
Azure Documentation – HDInsight Emulator
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started
MVA – Big Data Analytics
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-8255?l=ogCizYKy_9604984382
MVA – Big Data Analytics with HDInsight: Hadoop on Azure
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551
MVA – Implementing Big Data Analysis
https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382
Azure Documentation – Getting Started with HDInsight
https://azure.microsoft.com/en-gb/documentation/services/hdinsight/
Microsoft Patterns & Practice – Developing big data solutions on Microsoft Azure HDInsight
https://msdn.microsoft.com/en-gb/library/dn749874.aspx
Azure Documentation – Data Lake
https://azure.microsoft.com/en-gb/documentation/services/data-lake-analytics/
Apache Hadoop http://hadoop.apache.org/
O’Reilly Books – Hadoop: The Definitive Guide, 4th Edition
77. | © Copyright 2015 Hitachi Consulting77
Useful Hadoop Commands
To list the contents of a directory: hadoop fs -ls /<DirectoryPath>
To see contents of a file: hadoop fs -cat /<FilePath>
To create a directory in HDFS: hadoop fs -mkdir /<DirectoryPath>
To upload files from the local file system to HDFS: hadoop fs -put <localSrcPath> /<hdfsDstPath>
To download files from HDFS to the local file system: hadoop fs -get /<FilePath>
To copy a file from source to destination: hadoop fs -cp /<SrcFilePath> /<DstFilePath>
To copy a file from Local file system to HDFS: hadoop fs -copyFromLocal <LocalSrcPath> /<hdfsDstPath>
To copy a file to Local file system from HDFS: hadoop fs -copyToLocal /<hdfsSrcFilePath> /<DstFilePath>
To remove a file from HDFS: hadoop fs -rm /<FilePath>
To remove a directory from HDFS: hadoop fs -rm -r /<DirectoryPath>
78. | © Copyright 2015 Hitachi Consulting78
Coming soon…
Introduction to Azure Data Factory, and Data Lake Analytics with U-SQL
Introduction to Hive on HDInsight
Event & Stream Processing on Microsoft Azure
NoSQL on Microsoft Azure
Introduction to Spark on HDInsight
Introduction to Azure Batch
Stay tuned
79. | © Copyright 2015 Hitachi Consulting79
Acknowledgement
Thanks to Paul Lineham for answering
all my stupid big data questions, patiently…