1. | © Copyright 2015 Hitachi Consulting1
Introducing Big Data
with Microsoft Azure
Khalid M. Salama
Microsoft Business Intelligence
Hitachi Consulting UK
We Make it Happen. Better.
2. | © Copyright 2015 Hitachi Consulting2
Outline
What is Big Data?
Why Big Data Platforms?
Fundamentals of a Big Data Platform
Distributed Processing & CAP Theorem
Big Data Solutions vs. Traditional RDBMS
Where Big Data Fits in Enterprise Data Platforms?
Hadoop Ecosystem: Apache Tools for Big Data
Big Data on Microsoft Azure
How to Get Started with Big Data?
4. | © Copyright 2015 Hitachi Consulting4
What is Big Data?
“Data that is too complex to process efficiently and cost-effectively
using traditional relational databases.”
In a nutshell…
Complex (3 V’s)
Volume – Huge amounts of data to process
Variety – A mixture of structured and unstructured data
Velocity – High-frequency or (near) real-time data processing
Processing
Stream (operational)
Batch (analytical)
Efficiently
Availability/Scalability
Performance/Throughput
Cost-Effectively
Acquiring
Scaling up/down
7. | © Copyright 2015 Hitachi Consulting7
What is Big Data?
Common examples and applications
Clickstream
• User experience improvement
• Recommendations & targeted advertising
Sensors/Devices
• Predictive maintenance
• Energy efficiency – smart cities
Social Media
• Sentiment analysis
• Crisis management
Spatial & GPS
• Push notifications
• Process optimisation
Images/Audio/Video
• Proactive security
Free Text
• Analysis of customer reviews/feedback/complaints
• Automatic news summarisation/analysis
8. | © Copyright 2015 Hitachi Consulting8
Why Big Data Platforms?
Traditional Data Platforms
9. | © Copyright 2015 Hitachi Consulting9
Why Big Data Platforms?
Breaking points of traditional Data Platforms – Volume
10. | © Copyright 2015 Hitachi Consulting10
Why Big Data Platforms?
Breaking points of traditional Data Platforms – Variety
11. | © Copyright 2015 Hitachi Consulting11
Why Big Data Platforms?
Breaking points of traditional Data Platforms – Velocity
12. | © Copyright 2015 Hitachi Consulting12
Enterprise-wide data scale
[Chart: data volumes growing from gigabytes to terabytes, across
transactional and non-transactional data]
16. | © Copyright 2015 Hitachi Consulting16
Addressing Big Data Challenges
Addressing the three “V”s…
Challenges: Volume, Variety, Velocity
Solutions: Distributed Computing, Batch Processing, Stream Processing,
In-Memory Processing, NoSQL
Trade-off: Consistency / Availability / Fault Tolerance (CAP)
24. | © Copyright 2015 Hitachi Consulting24
Addressing Big Data Challenges
Tell me more…
Distributed Computing
  Cluster of many data/compute nodes (commodity hardware)
  Data partitioning (sharding)
  Data partitions are processed in parallel
  Easy/cheap to scale out
Batch Processing
  Processes massive amounts of data
  Write once / read many
  High latency
  Distributed, available/fault-tolerant, eventually consistent
In-Memory Processing
  Iterative processing of the same data in memory
  Data size must fit into memory
  Low latency
  Distributed, available/fault-tolerant, eventually consistent
Stream Processing
  Processes a continuous stream of data
  Small data chunks
  Low latency
  Distributed, available/fault-tolerant, eventually consistent
NoSQL
  Key-value stores, column-family stores, document stores, graph stores
  Distributed, available/fault-tolerant
  Random read/write access
  Supports batch & stream workloads
  High throughput
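The sharding idea above can be sketched in a few lines of Python. This is a toy illustration only; the helper names (`shard_for`, `partition`) are made up, not from any real system:

```python
from collections import defaultdict

def shard_for(key: str, num_nodes: int) -> int:
    """Pick a node for a key by hashing (simple modulo sharding)."""
    return hash(key) % num_nodes

def partition(records, num_nodes):
    """Distribute records across nodes; each partition can then be
    processed in parallel on its own node."""
    shards = defaultdict(list)
    for key, value in records:
        shards[shard_for(key, num_nodes)].append((key, value))
    return shards

records = [("user1", 10), ("user2", 5), ("user1", 7)]
shards = partition(records, num_nodes=4)

# Every record lands on exactly one node, and records with the
# same key always land on the same node.
assert sum(len(v) for v in shards.values()) == len(records)
assert shard_for("user1", 4) == shard_for("user1", 4)
```

Real systems use consistent hashing rather than plain modulo, so that adding or removing a node moves only a fraction of the keys.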
25. | © Copyright 2015 Hitachi Consulting25
Fundamental Components
26. | © Copyright 2015 Hitachi Consulting26
Fundamentals of a Big Data Platform
Basic Architectural Components
Distributed File System
  Data files are stored in raw form (no schema)
  Partitioned across data nodes (disks)
  Each partition is replicated to M nodes – fault tolerance
Compute Cluster
  Head node, plus an extra failover head node – availability
  Compute nodes 1 … N
  Resource Manager – manages and executes jobs via a distributed execution model
Applications
  Batch, In-Memory, Stream, SQL, NoSQL – support batch/speed workloads
Acquisition
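The “partition + replicate to M nodes” idea above can be sketched as follows. This is purely illustrative (real HDFS placement is rack-aware and pipeline-based); the round-robin placement is an assumption for the sketch:

```python
def place_replicas(block_id: int, nodes: list, m: int) -> list:
    """Assign a block to m distinct nodes, round-robin style."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(m)]

nodes = ["node1", "node2", "node3", "node4"]
# Replicate each of 8 blocks to M = 3 of the 4 nodes.
placement = {b: place_replicas(b, nodes, m=3) for b in range(8)}

# Fault tolerance: if any single node dies, every block still has
# at least one surviving replica.
dead = "node2"
for block, replicas in placement.items():
    assert len(set(replicas)) == 3
    assert any(n != dead for n in replicas)
```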
32. | © Copyright 2015 Hitachi Consulting32
Fundamentals of a Big Data Platform
Lambda Architecture
Data is dispatched to both the batch layer and the
speed layer for processing.
The batch layer (cold path) manages the master dataset (write
once, read many) and pre-computes the batch views. It handles
large data volumes at high latency.
The speed layer (hot path) compensates for that latency by
computing real-time views over a recent, limited window of data only.
The serving layer indexes the batch views so that they can be
queried in a low-latency, ad-hoc way, and answers an incoming
query by merging results from the batch views and real-time views.
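The serving-layer merge described above can be sketched with a toy example (the views and figures are made up for illustration):

```python
# Batch view: pre-computed over the master dataset (complete, but high latency).
batch_view = {"2015-10": 120, "2015-11": 300}

# Real-time view: covers only the recent window the last batch run hasn't seen.
realtime_view = {"2015-11": 15, "2015-12": 8}

def serve(month: str) -> int:
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view.get(month, 0) + realtime_view.get(month, 0)

assert serve("2015-11") == 315  # cold path + hot path merged
assert serve("2015-12") == 8    # only the speed layer has seen this data yet
```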
34. | © Copyright 2015 Hitachi Consulting34
Distributed Processing & CAP Theorem
The trade-off…
To handle large-volume data processing efficiently, we need to scale out, i.e.,
partition the data and distribute the computation.
We then face a trade-off between Consistency, Availability, and Partition Tolerance:
Consistency: Data is in a consistent state across all the nodes.
That is, all reads return the same, most recent write.
Availability: Every request to the system gets a response (success or failure).
That is, the system stays responsive.
Partition Tolerance: The system continues to work despite message loss or
node failure. That is, the system can sustain partial network failures.
CAP Theorem: only two of the three properties can be satisfied in a distributed
data system. In effect, it is consistency vs. availability, given partition tolerance!
40. | © Copyright 2015 Hitachi Consulting40
Distributed Processing & CAP Theorem
The trade-off…
P (partition tolerance): the system continues working even if a
partition is not reachable.
Big Data Systems
BASE Mode – Eventual Consistency
Remain available (operational & responsive) and partition-tolerant,
i.e., sacrifice consistency.
Transactional RDBMS
ACID Mode – Strong Consistency
Commits are atomic across the entire system.
Not partition-tolerant, i.e., sacrifices availability.
ACID
Atomic: Everything in a transaction succeeds,
or the entire transaction is rolled back.
Consistent: A transaction cannot leave the
database in an inconsistent state.
Isolated: Transactions cannot interfere with
each other.
Durable: Completed transactions persist,
even when servers restart, etc.
BASE
Basically Available
Soft state
Eventual consistency
NoSQL: Strong vs. Eventual Consistency
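The BASE behaviour above can be illustrated with a toy pair of replicas where writes are acknowledged immediately and replicated asynchronously (the `Replica`/`sync` names are invented for this sketch):

```python
class Replica:
    """A trivially simple in-memory key-value replica."""
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()
pending = []  # replication log, applied later (asynchronously)

def write(key, value):
    primary.data[key] = value    # acknowledged immediately: the system stays available
    pending.append((key, value)) # replication to the secondary is deferred

def sync():
    """Drain the replication log, bringing the secondary up to date."""
    while pending:
        k, v = pending.pop(0)
        secondary.data[k] = v

write("x", 1)
# Soft state: a read from the secondary may miss the latest write...
assert secondary.data.get("x") is None
sync()
# ...but the replicas eventually converge (eventual consistency).
assert secondary.data["x"] == 1
```

A strongly consistent (ACID-style) system would instead block the write until both replicas had applied it, trading availability for consistency.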
45. | © Copyright 2015 Hitachi Consulting45
Big Data Solutions vs. Traditional RDBMS
The face-off…
Feature | RDBMS | Big Data (Batch) | Big Data (Stream & NoSQL)
Data integrity | Strong consistency – ACID transactions | Eventual consistency – BASE model | Depends on the technology (strong vs. eventual consistency)
Schema | Static – required on write | Dynamic – schema on read | Flexible – extensible
Data types and formats | Structured | Structured, semi-structured, and unstructured | Semi-structured
Read and write pattern | Fully repeatable read/write | Write once, repeatable read | Fully repeatable read/write
Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond | Terabytes, petabytes, and beyond (small data chunks for stream processing)
Scalability | Scale up with more powerful hardware | Scale out with additional servers | Scale out with additional servers
Data processing distribution | Limited or none | Distributed across the cluster | Distributed across the cluster
Economics | Expensive hardware and software | Commodity hardware and open-source software | Commodity hardware and open-source software
Microsoft Patterns & Practices
46. | © Copyright 2015 Hitachi Consulting46
Enterprise Big Data Platform
47. | © Copyright 2015 Hitachi Consulting47
Big Data Fit in Enterprise Data Platform
Enterprise Data Platform
48. | © Copyright 2015 Hitachi Consulting48
Big Data Fit in Enterprise Data Platform
Use Case 1: Data Exploration/ Experiments Platform
Microsoft Patterns & Practices
49. | © Copyright 2015 Hitachi Consulting49
Big Data Fit in Enterprise Data Platform
Use Case 2: Data Processing (ETL)
MPP
MPP
Microsoft Patterns & Practices
50. | © Copyright 2015 Hitachi Consulting50
Big Data Fit in Enterprise Data Platform
Use Case 3: Data Warehouse
Microsoft Patterns & Practices
51. | © Copyright 2015 Hitachi Consulting51
Big Data Fit in Enterprise Data Platform
Use Case 4: Full Data/BI Integration
Microsoft Patterns & Practices
1 – ETL Level Integration
2 – DW Level Integration
3 – BI Level Integration
Corporate Data Model
Reports/Dashboard (Mashup)
MPP
Operational Apps
54. | © Copyright 2015 Hitachi Consulting54
Introducing Hadoop
Apache Hadoop Ecosystem - “A” Big Data Platform
Hadoop Distributed File System (HDFS)
  NameNode
  DataNode 1, DataNode 2, DataNode 3, … DataNode N
Yet Another Resource Negotiator (YARN)
Applications
  Batch, In-Memory, Stream, SQL, Spark-SQL, NoSQL, Machine Learning,
  Search, Orchestration, Management, Acquisition, …
55. | © Copyright 2015 Hitachi Consulting55
Introducing Hadoop
Apache Hadoop Ecosystem - “A” Big Data Platform
MapReduce – A programming model for distributed
processing of large data on a cluster
Pig – A scripting platform for processing and
analysing large data sets
Hive – The de facto standard for SQL queries in
Hadoop
Sqoop – Efficiently transfers bulk data between
Apache Hadoop and relational data stores
Mahout – An algorithm library for scalable machine
learning on Hadoop
Oozie – Provides workflow scheduling services to
manage Hadoop jobs
Storm – A system for processing streaming data in
real time
Kafka – A fast, scalable, fault-tolerant messaging
system
Spark – In-memory compute for ETL, machine
learning, SQL, and streaming
Accumulo – A distributed key-value store with cell-based
access control
CouchDB – JSON document-oriented data
store
HBase – Provides random read/write access to a
distributed, fault-tolerant NoSQL data store
56. | © Copyright 2015 Hitachi Consulting56
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
[CELLRANGE]
Oct-03 Dec-04 Jan-06 Feb-06 Apr-06 May-06 Apr-07 Jun-07 Oct-07 Jan-08 Feb-08 Jul-08 Oct-08 Nov-08 Mar-09 Apr-09 May-10 Jun-10 Sep-10 Jan-11 Mar-11 Jun-11 Jan-12 Nov-12 Feb-14 Jun-15
Introducing Hadoop
History
57. | © Copyright 2015 Hitachi Consulting57
Introducing Hadoop
MapReduce - Distributed Programming Model
Map:
  Read lines from file
  Convert each line to key-value pair(s)
  Filter (by key/value)
  Combine values with similar keys
  Shuffle data across nodes for the reducers, by key
Reduce:
  Sort by key
  Aggregate (reduce)
  Filter (based on aggregated value)
  Write results to file
Input → Mappers → Hash Shuffling → Reducers → Output
e.g., the mappers emit (Key1, Value1), (Key2, Value2), (Key1, Value3);
after shuffling, the reducers receive Key1: {Value1, Value3} and Key2: {Value2}.
63. | © Copyright 2015 Hitachi Consulting63
Introducing Hadoop
MapReduce - Example
SELECT Month, City, SUM(SalesValue) FROM Sales WHERE Product = ‘Bike’
GROUP BY Month, City HAVING SUM(SalesValue) > 50,000
Map:
  Read lines from file
  Discard lines where Product is not ‘Bike’
  Convert each line to a (Month-City, Value) pair
  Combine values with similar keys
  Shuffle data across nodes for the reducers, by key
Reduce:
  Sort by key
  Sum all the values for a given key
  Discard records where the sum <= 50,000
  Write results to file
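That job can be sketched in plain Python: the mapper applies the WHERE filter and emits (Month-City, value) pairs, the shuffle groups values by key, and the reducer sums and applies the HAVING filter. The input rows are made up for the example:

```python
from collections import defaultdict

rows = [
    ("2015-10", "London", "Bike", 30_000),
    ("2015-10", "London", "Bike", 25_000),
    ("2015-10", "Leeds",  "Bike", 10_000),
    ("2015-10", "London", "Car",  90_000),  # dropped by the mapper's filter
]

def mapper(row):
    month, city, product, value = row
    if product == "Bike":                 # WHERE Product = 'Bike'
        yield (f"{month}-{city}", value)  # emit key-value pair

def reducer(key, values):
    total = sum(values)                   # SUM(SalesValue)
    if total > 50_000:                    # HAVING SUM(SalesValue) > 50,000
        yield (key, total)

# Shuffle: group mapper outputs by key (done across nodes in a real cluster).
groups = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        groups[key].append(value)

results = [out for key in sorted(groups) for out in reducer(key, groups[key])]
assert results == [("2015-10-London", 55_000)]
```

Leeds is filtered out in the reduce phase (10,000 ≤ 50,000); only London, with 30,000 + 25,000 = 55,000, survives the HAVING clause.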
65. | © Copyright 2015 Hitachi Consulting65
Big Data with Microsoft Azure
66. | © Copyright 2015 Hitachi Consulting66
Big Data on Microsoft Azure
Virtual Machines
(IaaS)
Azure Services
(Data Acquisition, Stream Processing, Machine Learning, NoSQL)
Azure HDInsight
(IaaS+)
Azure Data Lake
(PaaS)
67. | © Copyright 2015 Hitachi Consulting67
Big Data on Microsoft Azure
Infrastructure as a Service (IaaS).
Different distributions of Hadoop, still 100% Hadoop
(plus distribution-specific extra tools).
You are responsible for provisioning, configuring, managing,
and updating the cluster with new tools.
The Distributed File System is part of the compute cluster;
that is, killing the cluster means losing the data.
Hortonworks/Cloudera/MapR Virtual Machines
68. | © Copyright 2015 Hitachi Consulting68
Big Data on Microsoft Azure
Azure HDInsight
Infrastructure as a Service+ (IaaS+).
Hortonworks distribution of Hadoop.
You pay for the cluster (infrastructure), and the Blob Storage, rather than the jobs.
Yet, you are NOT responsible for configuring, managing,
and updating the cluster with new tools (Managed by Microsoft).
On-demand Provisioning/shutting down.
Independent of the Distributed File System (Azure Blob Storage);
that is, killing the cluster does not lose the data.
Data can be shared by multiple clusters.
69. | © Copyright 2015 Hitachi Consulting69
Big Data on Microsoft Azure
Azure HDInsight
Windows Azure Blob Storage (WABS) Distributed File System
Yet Another Resource Negotiator (YARN)
Applications (by cluster type): Hadoop, Spark, Storm, HBase, …
Acquisition: Azure Data Factory
Stream Processing: Stream Analytics, Event Hubs
Machine Learning: Azure Machine Learning
NoSQL: Table Storage, DocumentDB
71. | © Copyright 2015 Hitachi Consulting71
Big Data on Microsoft Azure
The PaaS zoo on the cloud…
Data Factory - Defines and automates the movement, processing,
and transformation of data through data flow pipelines.
Stream Analytics - Real-time event processing engine for
real-time analytic computations on data streams.
Event Hubs - Highly scalable data ingress (message queuing)
service that can ingest millions of events per second for
downstream processing.
Machine Learning - Cloud-based predictive analytics service for
rapid creation and deployment of predictive models as analytics
solutions.
Table Storage - A structured key/attribute NoSQL data store in
the cloud.
DocumentDB - Fully managed NoSQL JSON database service offering
high performance, high availability, automatic scaling, and ease
of development.
72. | © Copyright 2015 Hitachi Consulting72
Data Lake Analytics
Big Data on Microsoft Azure
Azure Data Lake
Data Lake Storage
….
U-SQL
Acquisition
Azure Data Factory
Stream Processing
• Stream Analytics
• Event Hub
Machine Learning
Azure Machine
Learning
NoSQL
Table Storage
DocumentDB
Yet Another Resource Negotiator (YARN)
73. | © Copyright 2015 Hitachi Consulting73
Big Data on Microsoft Azure
Azure Data Lake
Platform as a Service (PaaS).
Microsoft’s own implementation of a Big Data platform, as Google (GCP) and
Amazon (AWS) have, rather than a distribution of Hadoop.
U-SQL for batch data processing.
You pay for the jobs, and the data lake storage.
Optimized Distributed File System (Data Lake) for analytical workloads.
74. | © Copyright 2015 Hitachi Consulting74
Big Data on Microsoft Azure
Microsoft Azure Big Data Analytics Options
Microsoft Advanced Analytics laboratory
75. | © Copyright 2015 Hitachi Consulting75
Big Data on Microsoft Azure
Microsoft Azure – Cortana Analytical Suite
Microsoft
76. | © Copyright 2015 Hitachi Consulting76
How to Get Started with Big Data?
Read these slides!
Coursera – Big Data Specialization
https://www.coursera.org/specializations/big-data
Azure Documentation – HDInsight Emulator
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started
MVA – Big Data Analytics
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-8255?l=ogCizYKy_9604984382
MVA – Big Data Analytics with HDInsight: Hadoop on Azure
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551
MVA – Implementing Big Data Analysis
https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382
Azure Documentation – Getting Started with HDInsight
https://azure.microsoft.com/en-gb/documentation/services/hdinsight/
Microsoft Patterns & Practice – Developing big data solutions on Microsoft Azure HDInsight
https://msdn.microsoft.com/en-gb/library/dn749874.aspx
Azure Documentation – Data Lake
https://azure.microsoft.com/en-gb/documentation/services/data-lake-analytics/
Apache Hadoop http://hadoop.apache.org/
O’Reilly Books – Hadoop: The Definitive Guide, 4th Edition
77. | © Copyright 2015 Hitachi Consulting77
Useful Hadoop Commands
To list the contents of a directory: hadoop fs -ls /<DirectoryPath>
To see contents of a file: hadoop fs -cat /<FilePath>
To create a directory in HDFS: hadoop fs -mkdir /<DirectoryPath>
To upload files from the local file system to HDFS: hadoop fs -put <localSrcPath> /<hdfsDstPath>
To download files from HDFS to the local file system: hadoop fs -get /<FilePath>
To copy a file from source to destination: hadoop fs -cp /<SrcFilePath> /<DstFilePath>
To copy a file from Local file system to HDFS: hadoop fs -copyFromLocal <LocalSrcPath> /<hdfsDstPath>
To copy a file to Local file system from HDFS: hadoop fs -copyToLocal /<hdfsSrcFilePath> /<DstFilePath>
To remove a file from HDFS: hadoop fs -rm /<FilePath>
To remove a directory from HDFS: hadoop fs -rm -r /<DirectoryPath>
78. | © Copyright 2015 Hitachi Consulting78
Coming soon…
Introduction to Azure Data Factory, and Data Lake Analytics with U-SQL
Introduction to Hive on HDInsight
Event & Stream Processing on Microsoft Azure
NoSQL on Microsoft Azure
Introduction to Spark on HDInsight
Introduction to Azure Batch
Stay tuned
79. | © Copyright 2015 Hitachi Consulting79
Acknowledgement
Thanks to Paul Lineham for answering
all my stupid big data questions, patiently…