Data Science with Windows Azure is an introduction to HDInsight and the Hadoop offerings on Microsoft's cloud-based machine learning and big data platform. It was presented at the Microsoft Data Science Group – Tampa Analytics Professionals.
Data science with Windows Azure - A Brief Introduction
1. Adnan Masood, PhD
Systems Architect / Software Engineer
adnan.masood@owasp.org
(http://blog.adnanmasood.com)
GitHub (github.com/adnanmasood),
Twitter (@adnanmasood).
Presented at Microsoft Data Science Group – Tampa Analytics Professionals
http://www.meetup.com/Analytics-Professionals-of-Tampa/events/228796343/
Data Science with Windows Azure
2. About the Speaker
Adnan Masood, Ph.D., is a developer, software architect, and researcher who specializes in machine
learning and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid
financial technology institution), he enjoyed life as a principal engineer of a start-up and worked for a
leading UK-based nonprofit organization as a solutions architect.
A strong believer in the development community, Adnan is an active member of the Open Web
Application Security Project (OWASP), an organization dedicated to software security. In the .NET
community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been
successfully leading for 8 years. He led a number of successful enterprise solutions and consulted for
several Fortune 500 company projects.
Adnan devotes himself to his own continual, practical education. He holds certifications in big data,
machine learning, and systems architecture from Massachusetts Institute of Technology; an Application
Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon
University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions
Developer, and Sun Certified Java Developer.
3. Key Takeaways from this Talk
Understand what Microsoft offers for data science in Windows Azure (or
how to write MapReduce jobs in C#)
Diagrams are Courtesy of Microsoft Corporation
12. What is Hadoop?
At Google, MapReduce operations run on a
special file system called the Google File System (GFS)
that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and others, working largely at Yahoo!, built
an open source equivalent based on Google's published GFS
design and called it the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS,
MapReduce and other related entities is called the
Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.
13. MapReduce
MapReduce is a framework for processing parallelizable
problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster
or a grid.
6/16/2015
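The definition above can be made concrete with a single-process sketch of the three phases (map, group by key, reduce). This is only an illustration, not how Hadoop is implemented: the function name `map_reduce` and the temperature readings are made up for the example, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run a MapReduce-style job in one process (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: every record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others,
    # which is what makes the problem parallelizable.
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: "year,temperature" readings.
readings = ["2014,31", "2015,35", "2014,28", "2015,33"]

def year_and_temp(line):
    year, temp = line.split(",")
    yield year, int(temp)

def hottest(year, temps):
    return max(temps)

result = map_reduce(readings, year_and_temp, hottest)
print(result)
```

Because each key is reduced on its own, the reduce calls could be spread over as many nodes as there are distinct keys.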
14. Classes of “MapReducible” problems
Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
Google uses it for word count, AdWords, PageRank, and indexing data.
Simple algorithms such as grep, text indexing, and reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations, e.g. demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extraterrestrial objects
Expected to play a critical role in the semantic web and in Web 3.0
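Reverse (inverted) indexing from the list above is a convenient example of such a problem: the map step emits a (word, document id) pair for each word, and the reduce step collects the ids per word. A minimal sketch, assuming made-up document ids and contents and hypothetical function names:

```python
from collections import defaultdict

def map_doc(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    # Turn the collected ids into a sorted, duplicate-free posting list.
    return sorted(set(doc_ids))

def build_index(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, found_in in map_doc(doc_id, text):
            grouped[word].append(found_in)
    return {word: reduce_postings(word, ids) for word, ids in grouped.items()}

docs = {"d1": "hadoop stores data", "d2": "spark reads hadoop data"}
index = build_index(docs)
print(index["hadoop"])
```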
15. Apache Spark
Apache Spark is an open source cluster
computing framework originally developed in the
AMPLab at UC Berkeley.
Spark's in-memory processing can be up to
100 times faster than disk-based MapReduce for certain applications.
Spark is well suited for machine learning
algorithms.
Spark requires a cluster manager and a
distributed storage system.
Spark supports Hadoop YARN as its cluster manager.
17. Example: counting the number of occurrences for each word
in a collection of documents
The input file is a repository of documents, and each
document is an element. The Map function for this example
uses keys that are of type String (the words) and values
that are integers. The Map task reads a document and
breaks it into its sequence of words w1,w2, . . . ,wn. It then
emits a sequence of key-value pairs where the value is
always 1. That is, the output of the Map task for this
document is the sequence of key-value pairs:
(w1, 1), (w2, 1), . . . , (wn, 1)
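The Map task described above can be sketched in a few lines. The function name `map_document` and the sample sentence are assumptions for illustration, and splitting on whitespace stands in for whatever tokenization a real job would use.

```python
def map_document(document):
    """Return the (word, 1) pairs a Map task would emit for one document."""
    return [(word, 1) for word in document.split()]

pairs = map_document("the quick fox jumps over the lazy dog")
print(pairs)
```

A word that appears twice (here "the") produces two separate pairs; adding them up is left to the Reduce step.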
18. Key Players in Hadoop World
Hortonworks
Cloudera
MapR
19. Hortonworks
Hortonworks is a business computer software company based in Palo
Alto, California.
Hortonworks develops and supports the Apache Hadoop framework, which
allows distributed processing of large data sets across clusters of
computers.
It is a sponsor of the Apache Software Foundation.
Founded in June 2011 by Yahoo and Benchmark Capital as an
independent company, it went public in December 2014.
Companies that have collaborated with Hortonworks:
Microsoft, in October 2011, to develop Hadoop for Azure and Windows Server
Informatica, in November 2011, to develop HParser
Teradata, in February 2012, to develop the Aster data system
SAP AG, which announced in September 2012 that it would resell the Hortonworks
distribution
20. About Cloudera
Cloudera is “The commercial Hadoop company”
Founded by leading experts on Hadoop from
Facebook, Google, Oracle and Yahoo
Provides consulting and training services for
Hadoop users
Staff includes several committers to Hadoop
projects
21. HaaS example
Amazon Web Services (AWS): Amazon Elastic
MapReduce (EMR) provides a Hadoop-based
platform for data analysis, with S3 as the storage
system and EC2 as the compute system.
Microsoft HDInsight, Cloudera CDH3, IBM
InfoSphere BigInsights, EMC Greenplum HD and the
Windows Azure HDInsight Service are the primary
HaaS offerings from global IT giants.
22. What is MapReduce Used For?
In research:
Analyzing Wikipedia conflicts (PARC)
Natural language processing (CMU)
Climate simulation (Washington)
Bioinformatics (Maryland)
Particle physics (Nebraska)
<Your application here>
23. Example: Word Count
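def mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Emit the word together with the sum of its collected counts.
    yield (key, sum(values))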
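The example leaves out the step between the two functions: the framework sorts the mapper output so that all values for a key reach the reducer together. A minimal local sketch of that pipeline, assuming hypothetical names (`map_line`, `reduce_word`, `word_count`) and using a sort plus `itertools.groupby` in place of the distributed shuffle:

```python
from itertools import groupby
from operator import itemgetter

def map_line(line):
    # Same idea as the mapper above: one (word, 1) pair per word.
    for word in line.split():
        yield word, 1

def reduce_word(key, values):
    # Same idea as the reducer above: add up the counts for one word.
    return key, sum(values)

def word_count(lines):
    # Map every line, then sort by key to imitate the shuffle/sort phase.
    pairs = sorted((pair for line in lines for pair in map_line(line)),
                   key=itemgetter(0))
    # groupby hands each reducer call all values for exactly one key.
    return dict(
        reduce_word(key, (count for _, count in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

counts = word_count(["to be or not", "to be"])
print(counts)
```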
24. Key Cloud Solution Providers for Hadoop as a Service
• Windows Azure
• AWS
• Google
25. Windows Azure
Enterprise-level on-demand capacity builder
Fabric of cycles and storage available on request for
a cost
You use the Azure API to work with the
infrastructure offered by Microsoft
Significant features: web role, worker role, blob
storage, table storage and drive storage
26. Amazon EC2
EC2 provides an API for instantiating computing
instances with any of the supported operating
systems.
Excellent distribution, load balancing, and cloud
monitoring tools
27. Google App Engine
Google offers the same reliability, availability and
scalability as Google’s own applications
28. MapReduce Engine
MapReduce requires a distributed file system and an
engine that can distribute, coordinate, monitor and
gather the results.
Hadoop provides that engine through HDFS (the file system
we discussed earlier) and the JobTracker +
TaskTracker system.
The JobTracker is simply a scheduler.
A TaskTracker is assigned Map or Reduce tasks (or other
operations). The tasks run on the same node as the
TaskTracker, and each task runs in its own JVM on
that node.
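The division of labour between the scheduler and the workers can be illustrated with a toy model. This is a sketch under loud assumptions: the class names `ToyJobTracker` and `ToyTaskTracker` are invented here and do not mirror Hadoop's classes, tasks are plain Python callables, and round-robin assignment stands in for Hadoop's real scheduling policy, which also takes data locality and free task slots into account.

```python
class ToyTaskTracker:
    """Stands in for a worker node that runs the tasks it is given."""

    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task_id, task, split):
        # In Hadoop, each task would run in its own JVM on the node.
        result = task(split)
        self.completed.append(task_id)
        return result


class ToyJobTracker:
    """Stands in for the scheduler: it assigns tasks, trackers do the work."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_map_phase(self, task, splits):
        results = []
        for task_id, split in enumerate(splits):
            # Round-robin assignment (an assumption made for this sketch).
            tracker = self.trackers[task_id % len(self.trackers)]
            results.append(tracker.run(task_id, task, split))
        return results


trackers = [ToyTaskTracker("node-a"), ToyTaskTracker("node-b")]
job_tracker = ToyJobTracker(trackers)
lengths = job_tracker.run_map_phase(len, ["split one", "split two!", "s3"])
print(lengths, [t.completed for t in trackers])
```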
29. Building a Custom MapReduce Job in .NET
A .NET MapReduce program comprises a number of
parts:
Job definition
Mapper, Reducer, and Combiner classes
Input data
Job executor
The Map() function alone is enough for a simple calculation like determining square roots, so in this case your Reducer
class would not have any processing code or logic. You can choose to omit it because Reduce and Combine are
optional operations in a MapReduce job. However, it is good practice to keep a skeleton class for the Reducer,
which derives from the ReducerCombinerBase .NET class. You can then write your code
in the overridden Reduce() method later if you need to implement any reduce operations.
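The map-only square-root job described above can be mimicked in Python to show its shape. This is an analogy, not the .NET API: the class names `SqrtMapper` and `SkeletonReducer` and the `run_map_only_job` helper are hypothetical, and only the idea (a mapper that does all the work, plus an empty reducer skeleton kept for later) comes from the text.

```python
import math

class SqrtMapper:
    def map(self, input_line):
        # Emit (original text, square root) for one input line.
        yield input_line, math.sqrt(float(input_line))


class SkeletonReducer:
    def reduce(self, key, values):
        # Intentionally empty: a placeholder for future reduce logic.
        # It is not wired into the job below.
        return iter(())


def run_map_only_job(lines, mapper):
    # No reducer is configured, so the mapper output is the job output.
    return [pair for line in lines for pair in mapper.map(line)]


results = run_map_only_job(["4", "9", "2.25"], SqrtMapper())
print(results)
```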