Adnan Masood, PhD
Systems Architect / Software Engineer
adnan.masood@owasp.org
(http://blog.adnanmasood.com)
GitHub (github.com/adnanmasood),
Twitter (@adnanmasood).
Presented at Microsoft Data Science Group – Tampa Analytics Professionals
http://www.meetup.com/Analytics-Professionals-of-Tampa/events/228796343/
Data Science with Windows Azure
About the Speaker
Adnan Masood, Ph.D., is a developer, software architect, and researcher who specializes in machine
learning and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid
financial technology institution), he enjoyed life as a principal engineer of a start-up and worked for a
leading UK-based nonprofit organization as a solutions architect.
A strong believer in the development community, Adnan is an active member of the Open Web
Application Security Project (OWASP), an organization dedicated to software security. In the .NET
community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been
successfully leading for 8 years. He led a number of successful enterprise solutions and consulted for
several Fortune 500 company projects.
Adnan devotes himself to his own continual, practical education. He holds certifications in big data,
machine learning, and systems architecture from Massachusetts Institute of Technology; an Application
Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon
University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions
Developer, and Sun Certified Java Developer.
Key Takeaways from this Talk
Understand what Microsoft offers for data science in Windows Azure (or
how to write MapReduce jobs in C#).
Diagrams are Courtesy of Microsoft Corporation
Data science with Windows Azure - A Brief Introduction
6/16/2015
What is Hadoop?
 At Google, MapReduce operations run on a
special file system called the Google File System (GFS),
which is highly optimized for this purpose.
 GFS is not open source.
 Doug Cutting and others, with major backing from
Yahoo!, built an open source implementation based on
Google's published GFS design and called it the Hadoop
Distributed File System (HDFS).
 The software framework that supports HDFS,
MapReduce, and other related components is called the
Hadoop project, or simply Hadoop.
 It is open source and distributed by the Apache
Software Foundation.
MapReduce
MapReduce is a framework for processing parallelizable
problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster
or a grid.
Classes of “MapReducible” problems
 Benchmark for comparison: Jim Gray’s challenge on
data-intensive computing, e.g. “Sort”
 Google uses it for word count, AdWords, PageRank, and indexing data.
 Simple algorithms such as grep, text indexing, and reverse indexing
 Bayesian classification in the data mining domain
 Facebook uses it for various operations, such as demographics
 Financial services use it for analytics
 Astronomy: Gaussian analysis for locating extraterrestrial
objects
 Expected to play a critical role in the semantic web and Web 3.0
Apache Spark
 Apache Spark is an open source cluster
computing framework originally developed in the
AMPLab at UC Berkeley.
 Spark's in-memory processing can run certain
applications up to 100 times faster than Hadoop MapReduce.
 Spark is well suited to machine learning
algorithms.
 Spark requires a cluster manager and a
distributed storage system.
 Spark supports Hadoop YARN as a cluster manager.
How Hadoop Operates
Example: counting the number of occurrences for each word
in a collection of documents
 The input file is a repository of documents, and each
document is an element. The Map function for this example
uses keys that are of type String (the words) and values
that are integers. The Map task reads a document and
breaks it into its sequence of words w1,w2, . . . ,wn. It then
emits a sequence of key-value pairs where the value is
always 1. That is, the output of the Map task for this
document is the sequence of key-value pairs:
 (w1, 1), (w2, 1), . . . , (wn, 1)
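The Map step described above can be sketched in a few lines of Python. This is only an illustration: the function name `map_document` and the sample sentence are assumptions, and splitting on whitespace is a simplification of real tokenization.

```python
from typing import Iterator, Tuple


def map_document(document: str) -> Iterator[Tuple[str, int]]:
    """Map task: break one document into words and emit (word, 1) for each."""
    for word in document.split():
        # The value is always 1; counting happens later, in the Reduce step.
        yield (word, 1)


# Hypothetical sample document, used only for illustration.
pairs = list(map_document("the quick fox jumps over the lazy dog"))
print(pairs)
```

A word that occurs twice, such as "the", is emitted twice with the value 1 each time; the Map task does no aggregation.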
Key Players in the Hadoop World
 Hortonworks
 Cloudera
 MapR
Hortonworks
 Hortonworks is a business software company based in Palo
Alto, California.
 Hortonworks develops and supports the Apache Hadoop framework, which
allows distributed processing of large data sets across clusters of
computers.
 It is a sponsor of the Apache Software Foundation.
 Founded in June 2011 by Yahoo and Benchmark Capital as an
independent company, it went public in December 2014.
 Companies that have collaborated with Hortonworks:
 Microsoft, in October 2011, to bring Hadoop to Azure and Windows Server
 Informatica, in November 2011, to develop HParser
 Teradata, in February 2012, to develop the Aster data system
 SAP AG, in September 2012, announced it would resell the Hortonworks
distribution
About Cloudera
 Cloudera is “The commercial Hadoop company”
 Founded by leading experts on Hadoop from
Facebook, Google, Oracle and Yahoo
 Provides consulting and training services for
Hadoop users
 Staff includes several committers to Hadoop
projects
HaaS (Hadoop as a Service) examples
Amazon Web Services (AWS): Amazon Elastic
MapReduce (EMR) provides a Hadoop-based
platform for data analysis, with S3 as the storage
system and EC2 as the compute system.
Microsoft HDInsight (the Windows Azure HDInsight
Service), Cloudera CDH3, IBM InfoSphere BigInsights,
and EMC Greenplum HD are the primary
HaaS offerings from global IT giants.
What is MapReduce Used For?
 In research:
 Analyzing Wikipedia conflicts (PARC)
 Natural language processing (CMU)
 Climate simulation (Washington)
 Bioinformatics (Maryland)
 Particle physics (Nebraska)
 <Your application here>
Example: Word Count
def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Sum all the counts collected for one word.
    return key, sum(values)
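To see how a mapper and reducer like these fit together, here is a minimal local sketch. The `run_word_count` helper and its in-memory grouping are assumptions for illustration; they stand in for the shuffle that Hadoop performs across a cluster, and the sample lines are made up.

```python
from collections import defaultdict


def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word, 1


def reducer(key, values):
    # Sum all the counts collected for one word.
    return key, sum(values)


def run_word_count(lines):
    # Shuffle step: group every emitted value under its key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Reduce step: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in grouped.items())


counts = run_word_count(["to be or not to be", "to do"])
print(counts)
```

The reducer only ever sees the values for a single key, which is what lets a real framework run many reducers in parallel.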
Key Cloud Solution Providers for Hadoop as a Service
• Windows Azure
• AWS
• Google
Windows Azure
 Enterprise-level on-demand capacity builder
 Fabric of cycles and storage available on-request for
a cost
 You have to use the Azure API to work with the
infrastructure offered by Microsoft
 Significant features: web role, worker role, blob
storage, table storage, and drive storage
Amazon EC2
 EC2 provides an API for instantiating computing
instances with any of the supported operating
systems.
 Excellent distribution, load balancing, cloud
monitoring tools
Google App Engine
 Google offers reliability, availability, and
scalability on par with Google’s own applications
MapReduce Engine
 MapReduce requires a distributed file system and an
engine that can distribute, coordinate, and monitor tasks
and gather the results.
 Hadoop provides that engine through HDFS (the file system
we discussed earlier) and the JobTracker +
TaskTracker system.
 The JobTracker is essentially a scheduler.
 A TaskTracker is assigned Map or Reduce tasks (or other
operations); each task runs on the same node as its
TaskTracker, in its own JVM.
Building a Custom MapReduce Job in .NET
 A .NET map-reduce program comprises a number of
parts
 Job definition
 Mapper, Reducer, and Combiner classes
 Input data
 Job executor
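The parts listed above can be illustrated with a small local analogue. This sketch is in Python, not .NET, and it does not use the HDInsight SDK: `JobDefinition`, `execute`, `group`, and the sample input are all hypothetical names chosen to mirror the list, under the assumption that the combiner pre-aggregates each map task's output before the shuffle.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def word_mapper(line: str) -> Iterator[Pair]:
    # Mapper: emit (word, 1) for each lower-cased word.
    for word in line.lower().split():
        yield word, 1


def sum_combiner(key: str, values: Iterable[int]) -> Pair:
    # Combiner: pre-aggregate one map task's output before the shuffle.
    return key, sum(values)


def sum_reducer(key: str, values: Iterable[int]) -> Pair:
    # Reducer: produce the final total for one key.
    return key, sum(values)


@dataclass
class JobDefinition:
    # Job definition: names the classes (here, functions) the job uses.
    mapper: Callable[[str], Iterator[Pair]]
    combiner: Callable[[str, Iterable[int]], Pair]
    reducer: Callable[[str, Iterable[int]], Pair]


def group(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    # Collect all values that share a key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def execute(job: JobDefinition, splits: List[List[str]]) -> Dict[str, int]:
    # Job executor: run map + combine per split, then reduce across splits.
    shuffled: List[Pair] = []
    for split in splits:  # each split stands in for one map task's input
        mapped = [pair for line in split for pair in job.mapper(line)]
        shuffled.extend(job.combiner(k, vs) for k, vs in group(mapped).items())
    return dict(job.reducer(k, vs) for k, vs in group(shuffled).items())


# Input data: two hypothetical splits of text lines.
input_splits = [["Azure runs Hadoop", "Hadoop runs MapReduce"], ["MapReduce on Azure"]]
job = JobDefinition(word_mapper, sum_combiner, sum_reducer)
result = execute(job, input_splits)
print(result)
```

Because summation is associative, the combiner and reducer can share the same logic; a combiner for a non-associative operation would need more care.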
References & Further Reading
 https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
 https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
 https://azure.microsoft.com/en-us/services/machine-learning/
Questions

UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 

Data science with Windows Azure - A Brief Introduction

  • 1. Adnan Masood, PhD. Systems Architect / Software Engineer. adnan.masood@owasp.org (http://blog.adnanmasood.com), GitHub (github.com/adnanmasood), Twitter (@adnanmasood). Presented at Microsoft Data Science Group – Tampa Analytics Professionals (http://www.meetup.com/Analytics-Professionals-of-Tampa/events/228796343/). Data Science with Windows Azure
  • 2. About the Speaker Adnan Masood, Ph.D. is a developer, software architect, and researcher and specializes in machine learning and Bayesian belief networks. Before joining PDS Health care, and GDC (a leading prepaid financial technology institution), he enjoyed life as a principal engineer of a start-up and worked for a leading UK based nonprofit organization as a solutions architect. A strong believer in the development community, Adnan is an active member of the Open Web Application Security Project (OWASP), an organization dedicated to software security. In the .NET community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been successfully leading for 8 years. He led a number of successful enterprise solutions and consulted for several Fortune 500 company projects. Adnan devotes himself to his own continual, practical education. He holds certifications in big data, machine learning, and systems architecture from Massachusetts Institute of Technology; an Application Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions Developer, and Sun Certified Java Developer.
  • 3. Key Takeaways from this Talk: Understand what Microsoft offers for data science in Windows Azure (or how to write MapReduce jobs in C#). Diagrams are courtesy of Microsoft Corporation.
  • 12. What is Hadoop?
      • At Google, MapReduce operations run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
      • GFS is not open source.
      • Doug Cutting and others at Yahoo! built an open source implementation modeled on Google's published GFS design and called it the Hadoop Distributed File System (HDFS).
      • The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
      • It is open source and distributed by Apache.
  • 13. MapReduce: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. 6/16/2015
  • 14. Classes of “mapreducable” problems
      • Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Example: “Sort”.
      • Google uses it for word count, AdWords, PageRank, and indexing data.
      • Simple algorithms such as grep, text indexing, and reverse indexing.
      • Bayesian classification: data mining domain.
      • Facebook uses it for various operations, such as demographics.
      • Financial services use it for analytics.
      • Astronomy: Gaussian analysis for locating extraterrestrial objects.
      • Expected to play a critical role in the Semantic Web and in Web 3.0.
  • 15. Apache Spark
      • Apache Spark is an open source cluster computing framework originally developed in the AMPLab at UC Berkeley.
      • Spark's in-memory processing provides performance up to 100 times faster for certain applications.
      • Spark is well suited for machine learning algorithms.
      • Spark requires a cluster manager and a distributed storage system.
      • Spark supports Hadoop YARN.
  • 17. Example: counting the number of occurrences of each word in a collection of documents
      • The input file is a repository of documents, and each document is an element. The Map function for this example uses keys of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:
      • (w1, 1), (w2, 1), ..., (wn, 1)
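The Map task from slide 17 can be sketched in a few lines of Python. This is only an illustration, not Hadoop API code: the function name `map_document` and the whitespace-based word splitting are assumptions made for the example.

```python
from typing import Iterator, Tuple


def map_document(document: str) -> Iterator[Tuple[str, int]]:
    """Map task: break one document into words and emit (word, 1) per word."""
    # Assumption: words are separated by whitespace; a real job would
    # likely normalize case and strip punctuation as well.
    for word in document.split():
        yield (word, 1)


if __name__ == "__main__":
    # Every occurrence is emitted separately; summing happens in Reduce.
    for pair in map_document("the quick brown the"):
        print(pair)
```

Note that a repeated word is emitted once per occurrence. The Map task does no counting; the framework later groups the pairs by key so that a Reduce task can sum them.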
  • 18. Key Players in the Hadoop World
      • Hortonworks
      • Cloudera
      • MapR
  • 19. Hortonworks
      • Hortonworks is a business computer software company based in Palo Alto, California.
      • Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers.
      • It is a sponsor of the Apache Software Foundation.
      • Founded in June 2011 by Yahoo and Benchmark Capital as an independent company. It went public in December 2014.
      • Companies that have collaborated with Hortonworks:
          • Microsoft, in October 2011, to develop for Azure and Windows Server
          • Informatica, in November 2011, to develop HParser
          • Teradata, in February 2012, to develop the Aster data system
          • SAP AG, which announced in September 2012 that it would resell the Hortonworks distribution
  • 20. About Cloudera
      • Cloudera is “The commercial Hadoop company”.
      • Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo.
      • Provides consulting and training services for Hadoop users.
      • Staff includes several committers to Hadoop projects.
  • 21. HaaS (Hadoop as a Service) example
      • Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system.
      • Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD, and the Windows Azure HDInsight Service are the primary HaaS offerings from global IT giants.
  • 22. What is MapReduce Used For?
      • In research:
          • Analyzing Wikipedia conflicts (PARC)
          • Natural language processing (CMU)
          • Climate simulation (Washington)
          • Bioinformatics (Maryland)
          • Particle physics (Nebraska)
          • <Your application here>
  • 23. Example: Word Count

        def mapper(line):
            # Emit (word, 1) for every word on the input line.
            for word in line.split():
                yield (word, 1)

        def reducer(key, values):
            # Sum all the counts collected for one word.
            yield (key, sum(values))
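To see how the word count mapper and reducer fit together, the sketch below runs them locally and fills in the step the slide leaves implicit: grouping the mapped pairs by key (the shuffle) before reducing. The driver `run_word_count` is a hypothetical helper written for this illustration. A real Hadoop cluster performs the shuffle across nodes rather than in one in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word on the line."""
    for word in line.split():
        yield (word, 1)


def reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Sum the counts collected for a single word."""
    return (key, sum(values))


def run_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical local driver: map, shuffle (group by key), then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)  # shuffle: collect values per key
    return dict(reducer(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    print(run_word_count(["to be or not", "to be"]))
```

The reducer only ever sees one key together with all of its values, which is what lets the framework run many reducers in parallel on different keys.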
  • 24. Key Cloud Solution Providers for Hadoop as a Service
      • Windows Azure
      • AWS
      • Google
  • 25. Windows Azure
      • Enterprise-level, on-demand capacity builder.
      • A fabric of cycles and storage, available on request for a cost.
      • You have to use the Azure API to work with the infrastructure offered by Microsoft.
      • Significant features: web role, worker role, blob storage, table storage, and drive storage.
  • 26. Amazon EC2
      • EC2 provides an API for instantiating computing instances with any of the supported operating systems.
      • Excellent distribution, load balancing, and cloud monitoring tools.
  • 27. Google App Engine
      • Google offers reliability, availability, and scalability on par with Google’s own applications.
  • 28. MapReduce Engine
      • MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor, and gather the results.
      • Hadoop provides that engine through HDFS (the file system discussed earlier) and the JobTracker + TaskTracker system.
      • The JobTracker is simply a scheduler.
      • A TaskTracker is assigned a Map or Reduce task (or other operations). The task runs on the same node as its TaskTracker, and each task runs in its own JVM on that node.
  • 29. Building a Custom MapReduce Job in .NET
      • A .NET MapReduce program comprises a number of parts:
          • Job definition
          • Mapper, Reducer, and Combiner classes
          • Input data
          • Job executor
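The parts listed on slide 29 can be illustrated with a small, language-neutral sketch. The code below is Python rather than C#, and every name in it (`JobDefinition`, `execute`, `group_by_key`, `word_mapper`, `sum_reducer`) is hypothetical. It does not reproduce the .NET SDK's API. It only shows how a job definition ties a mapper, an optional combiner, and a reducer to the input data, and what a job executor does with them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Pair = Tuple[str, int]
Aggregator = Callable[[str, List[int]], Iterator[Pair]]


@dataclass
class JobDefinition:
    """Job definition: names the mapper, reducer, and optional combiner."""
    mapper: Callable[[str], Iterator[Pair]]
    reducer: Aggregator
    combiner: Optional[Aggregator] = None


def group_by_key(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def execute(job: JobDefinition, splits: List[List[str]]) -> Dict[str, int]:
    """Job executor: runs the job over input splits (one list per split)."""
    shuffled: List[Pair] = []
    for split in splits:
        mapped = [pair for record in split for pair in job.mapper(record)]
        if job.combiner is not None:
            # Combiner: pre-aggregate each split's output before the shuffle.
            mapped = [out for key, values in group_by_key(mapped).items()
                      for out in job.combiner(key, values)]
        shuffled.extend(mapped)
    result: Dict[str, int] = {}
    for key, values in group_by_key(shuffled).items():
        for out_key, out_value in job.reducer(key, values):
            result[out_key] = out_value
    return result


def word_mapper(record: str) -> Iterator[Pair]:
    for word in record.split():
        yield (word, 1)


def sum_reducer(key: str, values: List[int]) -> Iterator[Pair]:
    yield (key, sum(values))


if __name__ == "__main__":
    job = JobDefinition(mapper=word_mapper, reducer=sum_reducer, combiner=sum_reducer)
    print(execute(job, [["a b a"], ["b c"]]))
```

Summation is associative and commutative, so the same function can serve as both combiner and reducer here. That does not hold for every reduce operation (an average, for example), so a combiner has to be chosen with care.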
  • 39. References & Further Reading
      • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
      • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
      • https://azure.microsoft.com/en-us/services/machine-learning/

Editor's notes

  1. The Map() function alone is enough for a simple calculation such as determining square roots, so the Reducer class needs no processing logic in this case. You can omit it entirely, because Reduce and Combine are optional operations in a MapReduce job. It is still good practice to keep a skeleton Reducer class that derives from the ReducerCombinerBase class. You can then add code to the overridden Reduce() method later if you need to implement any reduce operations.
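The idea in this note, a map-only job with an optional do-nothing reducer, can be sketched as follows. This is a Python illustration and not the .NET code the note refers to. `run_job`, `sqrt_mapper`, and `skeleton_reducer` are hypothetical names, and the input is assumed to be one number per line.

```python
import math
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Pair = Tuple[str, float]


def sqrt_mapper(line: str) -> Iterator[Pair]:
    """Map: emit (input text, square root). Assumes one number per line."""
    yield (line.strip(), math.sqrt(float(line)))


def skeleton_reducer(key: str, values: List[float]) -> Iterator[Pair]:
    """Placeholder reducer: passes values through unchanged for now."""
    for value in values:
        yield (key, value)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Pair]],
    reducer: Optional[Callable[[str, List[float]], Iterator[Pair]]] = None,
) -> List[Pair]:
    mapped = [pair for line in lines for pair in mapper(line)]
    if reducer is None:
        return mapped  # map-only job: the mapper output is the final output
    groups: Dict[str, List[float]] = {}
    for key, value in mapped:
        groups.setdefault(key, []).append(value)
    return [out for key, values in groups.items() for out in reducer(key, values)]


if __name__ == "__main__":
    print(run_job(["4", "9", "16"], sqrt_mapper))
    print(run_job(["4", "9", "16"], sqrt_mapper, skeleton_reducer))
```

Both calls produce the same pairs. The skeleton reducer changes nothing today, but it leaves a place to add aggregation logic later without restructuring the job.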