SlideShare una empresa de Scribd logo
1 de 29
Map-Reduce and Apache
Hadoop for Big Data
Computations
Svetlin Nakov
Training and Inspiration
Software University
http://softuni.bg
BGOUG Seminar, Borovets, 4-June-2016
2
1. Big Data
2. Map-Reduce
 What is Map-Reduce?
 How It Works?
 Mappers and Reducers
 Examples
3. Apache Hadoop
Table of Contents
Big Data
What is Big Data Processing?
4
 Big data == process very large data sets (terabytes / petabytes)
 So large or complex, that traditional data processing is inadequate
 Usually stored and processed by distributed databases
 Often related to analysis of very large data sets
 Typical components of big data systems
 Distributed databases (like Cassandra, HBase and Hive)
 Distributed processing frameworks (like Apache Hadoop)
 Distributed processing systems (like Map-Reduce)
 Distributed file systems (like HDFS)
Big Data
Map-Reduce
6
 Map-Reduce is a distributed processing framework
 Computational model for processing huge data-sets (terabytes)
 Using parallel processing on large clusters (thousands of nodes)
 Relying on distributed infrastructure
 Like Apache Hadoop or MongoDB cluster
 The input and output data is stored in a distributed file system (or
distributed database)
 The framework takes care of scheduling, executing and monitoring
tasks, and re-executes the failed tasks
What is Map-Reduce?
7
 How map-reduce works?
1. Splits the input data-set into independent chunks
2. Process each chunk by "map" tasks in parallel manner
 The "map" function groups the input data into key-value pairs
 Equal keys are processed by the same "reduce" node
3. Outputs of the "map" tasks are
 Sorted, grouped by key, then sent as input to the "reduce" tasks
4. The "reduce" tasks
 Aggregate the results per each key and produces the final output
Map-Reduce: How It Works?
8
 The map-reduce process is a sequence of transformations,
executed on several nodes in parallel
 Map: groups input chunks of data to key-value pairs
 E.g. splits documents(id, chunk-content) into words(word, count)
 Combine: sorts and combines all values by the same key
 E.g. produce a list of counts for each word
 Reduce: combines (aggregates) all values for certain key
The Map-Reduce Process
(input) <key1, val1>  map  <key2, val2>  combine 
<key2, list<val2>>  reduce  <key3, val3> (output)
9
 We have a very large set of documents (e.g. 200 terabytes)
 We want to count how many times each word occurs
 Input: set of documents {key + content}
 Mapper:
 Extract the words from each document (words are used as keys)
 Transforms documents {key + content}  word-count-pairs {word, count}
 Reducer:
 Sums the counts for each word
 Transforms {word, list<count>}  word-count-pairs {word, count}
Example: Counting Words
10
Counting Words: Mapper and Reducer
public void map(Object offset, Text docText, Context context)
throws IOException, InterruptedException {
String[] words = docText.toString().toLowerCase().split("W+");
for (String word : words)
context.write(new Text(word), new IntWritable(1));
}
public void reduce(Text word, Iterable<IntWritable> counts,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts)
sum += count.get();
context.write(word, new IntWritable(sum));
}
Word Count in Apache Hadoop
Live Demo
12
 We are given a CSV file holding real estate sales data:
 Estate address, city, ZIP code, state, # beds, # baths, square foots,
sale date price and GPS coordinates (latitude + longitude)
 Find all cities that have sales in price range [100 000 … 200 000]
 As side effect, find the sum of all sales by city
Example: Extract Data from CSV Report
13
Process CSV Report – How It Works?
SELECT
city,
SUM(price)
FROM
Sales
GROUP BY
city
chunkingchunking
map map
map
city sum(price)
SACRAMENTO 625140
LINCOLN 843620
RIO LINDA 348500
reduce
reduce
reduce
14
Process CSV Report: Mapper and Reducer
public void map(Object offset, Text inputCSVLine, Context context)
throws IOException, InterruptedException {
String[] fields = inputCSVLine.toString().split(",");
String city = fields[1];
int price = Integer.parseInt(fields[9]);
if (price > 100000 && price < 200000)
context.write(new Text(city), new LongWritable(price);
}
public void reduce(Text city, Iterable<LongWritable> prices,
Context context) throws IOException, InterruptedException {
long sum = 0;
for (LongWritable val : prices)
sum += val.get();
context.write(city, new LongWritable(sum));
}
Processing CSV Report
in Apache Hadoop
Live Demo
Apache Hadoop
Distributed Processing Framework
17
 Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing
 Hadoop Distributed File System (HDFS) – a distributed file system that
transparently moves data across Hadoop cluster nodes
 Hadoop MapReduce – the map-reduce framework
 HBase – a scalable, distributed database for large tables
 Hive – SQL-like query for large datasets
 Pig – a high-level data-flow language for parallel computation
 Hadoop is driven by big players like IBM, Microsoft, Facebook,
VMware, LinkedIn, Yahoo, Cloudera, Intel, Twitter, Hortonworks, …
Apache Hadoop
18
Hadoop Ecosystem
HDFS Storage
Redundant (3 copies)
For large files – large blocks
64 MB or 128 MB / block
Can scale to 1000s of nodes
MapReduce API
Batch (Job) processing
Distributed and localized to
clusters
Auto-parallelizable for huge
amounts of data
Fault-tolerant (auto retries)
Adds high availability and
more
Hadoop Libraries
Pig
Hive
HBase
Others
19
Hadoop Cluster HDFS (Physical) Storage
Name Node
Data Node
1
Data Node
2
Data Node
3
Secondary
Name Node
• Contains web site to view cluster
information
• V2 Hadoop uses multiple Name Nodes
for HA
One Name Node
• 3 copies of each node by default
Many Data Nodes
• Using common Linux shell commands
• Block size is 64 or 128 MB
Work with data in HDFS
 Tips:
 sudo means "run as administrator" (super user)
 Some distributions use hadoop dfs rather than hadoop fs
Common Hadoop Shell Commands
hadoop fs –cat file:///file2
hadoop fs –mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs –copyFromLocal <fromDir> <toDir>
hadoop fs –put <localfile> hdfs://nn.example.com/hadoopfile
sudo hadoop jar <jarFileName> <class> <fromDir> <toDir>
hadoop fs –ls /user/hadoop/dir1
hadoop fs –cat hdfs://nn1.example.com/file1
hadoop fs –get /user/hadoop/file <localfile>
Hadoop Shell Commands
Live Demo
22
 Apache Hadoop MapReduce
 The world's leading implementation of the map-reduce
computational model
 Provides parallelized (scalable) computing
 For processing very large data sets
 Fault tolerant
 Runs on commodity of hardware
 Implemented in many cloud platforms: Amazon EMR, Azure
HDInsight, Google Cloud, Cloudera, Rackspace, HP Cloud, …
Hadoop MapReduce
23
Hadoop Map-Reduce Pipeline
24
 Download and install Java and Hadoop
 http://hadoop.apache.org/releases.html
 You will need to install Java first
 Download a pre-installed Hadoop virtual machine (VM)
 Hortonworks Sandbox
 Cloudera QuickStart VM
 You can use Hadoop in the cloud / local emulator
 E.g. Azure HDInsight Emulator
Hadoop: Getting Started
Playing with Apache Hadoop
Live Demo
26
 Big data == processing huge datasets that are too big for
processing on a single machine
 Use a cluster of computing nodes
 Map-reduce == computational paradigm
for parallel data processing of huge data-sets
 Data is chunked, then mapped into groups,
then groups are processed and the results are aggregated
 Highly scalable, can process petabytes of data
 Apache Hadoop – industry's leading Map-Reduce framework
Summary
?
Map-Reduce
https://softuni.bg/trainings/1194/Algorithms-September-2015
License
 This course (slides, examples, labs, videos, homework, etc.)
is licensed under the "Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International" license
28
 Attribution: this work may contain portions from
 "Fundamentals of Computer Programming with C#" book by Svetlin Nakov & Co. under CC-BY-SA license
 "Data Structures and Algorithms" course by Telerik Academy under CC-BY-NC-SA license
Free Trainings @ Software University
 Software University Foundation – softuni.org
 Software University – High-Quality Education,
Profession and Job for Software Developers
 softuni.bg
 Software University @ Facebook
 facebook.com/SoftwareUniversity
 Software University @ YouTube
 youtube.com/SoftwareUniversity
 Software University Forums – forum.softuni.bg

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Spring Data JPA
Spring Data JPASpring Data JPA
Spring Data JPA
 
2001: JNDI Its all in the Context
2001:  JNDI Its all in the Context2001:  JNDI Its all in the Context
2001: JNDI Its all in the Context
 
Jdbc 4.0 New Features And Enhancements
Jdbc 4.0 New Features And EnhancementsJdbc 4.0 New Features And Enhancements
Jdbc 4.0 New Features And Enhancements
 
Simple Jdbc With Spring 2.5
Simple Jdbc With Spring 2.5Simple Jdbc With Spring 2.5
Simple Jdbc With Spring 2.5
 
JDBC
JDBCJDBC
JDBC
 
Jndi
JndiJndi
Jndi
 
ADO.NET
ADO.NETADO.NET
ADO.NET
 
Jndi (1)
Jndi (1)Jndi (1)
Jndi (1)
 
Slickdemo
SlickdemoSlickdemo
Slickdemo
 
04 data accesstechnologies
04 data accesstechnologies04 data accesstechnologies
04 data accesstechnologies
 
ADO .Net
ADO .Net ADO .Net
ADO .Net
 
ADO.NET by ASP.NET Development Company in india
ADO.NET by ASP.NET  Development Company in indiaADO.NET by ASP.NET  Development Company in india
ADO.NET by ASP.NET Development Company in india
 
Database programming
Database programmingDatabase programming
Database programming
 
Http4s, Doobie and Circe: The Functional Web Stack
Http4s, Doobie and Circe: The Functional Web StackHttp4s, Doobie and Circe: The Functional Web Stack
Http4s, Doobie and Circe: The Functional Web Stack
 
Java OOP Programming language (Part 3) - Class and Object
Java OOP Programming language (Part 3) - Class and ObjectJava OOP Programming language (Part 3) - Class and Object
Java OOP Programming language (Part 3) - Class and Object
 
JAXP
JAXPJAXP
JAXP
 
Spring framework part 2
Spring framework part 2Spring framework part 2
Spring framework part 2
 
Entity Framework: Code First and Magic Unicorns
Entity Framework: Code First and Magic UnicornsEntity Framework: Code First and Magic Unicorns
Entity Framework: Code First and Magic Unicorns
 
Wed 1630 greene_robert_color
Wed 1630 greene_robert_colorWed 1630 greene_robert_color
Wed 1630 greene_robert_color
 
Jdbc
JdbcJdbc
Jdbc
 

Similar a Map-Reduce and Apache Hadoop

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache HadoopOleksiy Krotov
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014rpbrehm
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones WebSantiago Coffey
 

Similar a Map-Reduce and Apache Hadoop (20)

Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
מיכאל
מיכאלמיכאל
מיכאל
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop
HadoopHadoop
Hadoop
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Hadoop
HadoopHadoop
Hadoop
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Map reducefunnyslide
Map reducefunnyslideMap reducefunnyslide
Map reducefunnyslide
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 

Más de Svetlin Nakov

BG-IT-Edu: отворено учебно съдържание за ИТ учители
BG-IT-Edu: отворено учебно съдържание за ИТ учителиBG-IT-Edu: отворено учебно съдържание за ИТ учители
BG-IT-Edu: отворено учебно съдържание за ИТ учителиSvetlin Nakov
 
Programming World in 2024
Programming World in 2024Programming World in 2024
Programming World in 2024Svetlin Nakov
 
AI Tools for Business and Startups
AI Tools for Business and StartupsAI Tools for Business and Startups
AI Tools for Business and StartupsSvetlin Nakov
 
AI Tools for Scientists - Nakov (Oct 2023)
AI Tools for Scientists - Nakov (Oct 2023)AI Tools for Scientists - Nakov (Oct 2023)
AI Tools for Scientists - Nakov (Oct 2023)Svetlin Nakov
 
AI Tools for Entrepreneurs
AI Tools for EntrepreneursAI Tools for Entrepreneurs
AI Tools for EntrepreneursSvetlin Nakov
 
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023Svetlin Nakov
 
AI Tools for Business and Personal Life
AI Tools for Business and Personal LifeAI Tools for Business and Personal Life
AI Tools for Business and Personal LifeSvetlin Nakov
 
Дипломна работа: учебно съдържание по ООП - Светлин Наков
Дипломна работа: учебно съдържание по ООП - Светлин НаковДипломна работа: учебно съдържание по ООП - Светлин Наков
Дипломна работа: учебно съдържание по ООП - Светлин НаковSvetlin Nakov
 
Дипломна работа: учебно съдържание по ООП
Дипломна работа: учебно съдържание по ООПДипломна работа: учебно съдържание по ООП
Дипломна работа: учебно съдържание по ООПSvetlin Nakov
 
Свободно ИТ учебно съдържание за учители по програмиране и ИТ
Свободно ИТ учебно съдържание за учители по програмиране и ИТСвободно ИТ учебно съдържание за учители по програмиране и ИТ
Свободно ИТ учебно съдържание за учители по програмиране и ИТSvetlin Nakov
 
AI and the Professions of the Future
AI and the Professions of the FutureAI and the Professions of the Future
AI and the Professions of the FutureSvetlin Nakov
 
Programming Languages Trends for 2023
Programming Languages Trends for 2023Programming Languages Trends for 2023
Programming Languages Trends for 2023Svetlin Nakov
 
IT Professions and How to Become a Developer
IT Professions and How to Become a DeveloperIT Professions and How to Become a Developer
IT Professions and How to Become a DeveloperSvetlin Nakov
 
GitHub Actions (Nakov at RuseConf, Sept 2022)
GitHub Actions (Nakov at RuseConf, Sept 2022)GitHub Actions (Nakov at RuseConf, Sept 2022)
GitHub Actions (Nakov at RuseConf, Sept 2022)Svetlin Nakov
 
IT Professions and Their Future
IT Professions and Their FutureIT Professions and Their Future
IT Professions and Their FutureSvetlin Nakov
 
How to Become a QA Engineer and Start a Job
How to Become a QA Engineer and Start a JobHow to Become a QA Engineer and Start a Job
How to Become a QA Engineer and Start a JobSvetlin Nakov
 
Призвание и цели: моята рецепта
Призвание и цели: моята рецептаПризвание и цели: моята рецепта
Призвание и цели: моята рецептаSvetlin Nakov
 
What Mongolian IT Industry Can Learn from Bulgaria?
What Mongolian IT Industry Can Learn from Bulgaria?What Mongolian IT Industry Can Learn from Bulgaria?
What Mongolian IT Industry Can Learn from Bulgaria?Svetlin Nakov
 
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)How to Become a Software Developer - Nakov in Mongolia (Oct 2022)
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)Svetlin Nakov
 
Blockchain and DeFi Overview (Nakov, Sept 2021)
Blockchain and DeFi Overview (Nakov, Sept 2021)Blockchain and DeFi Overview (Nakov, Sept 2021)
Blockchain and DeFi Overview (Nakov, Sept 2021)Svetlin Nakov
 

Más de Svetlin Nakov (20)

BG-IT-Edu: отворено учебно съдържание за ИТ учители
BG-IT-Edu: отворено учебно съдържание за ИТ учителиBG-IT-Edu: отворено учебно съдържание за ИТ учители
BG-IT-Edu: отворено учебно съдържание за ИТ учители
 
Programming World in 2024
Programming World in 2024Programming World in 2024
Programming World in 2024
 
AI Tools for Business and Startups
AI Tools for Business and StartupsAI Tools for Business and Startups
AI Tools for Business and Startups
 
AI Tools for Scientists - Nakov (Oct 2023)
AI Tools for Scientists - Nakov (Oct 2023)AI Tools for Scientists - Nakov (Oct 2023)
AI Tools for Scientists - Nakov (Oct 2023)
 
AI Tools for Entrepreneurs
AI Tools for EntrepreneursAI Tools for Entrepreneurs
AI Tools for Entrepreneurs
 
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023
 
AI Tools for Business and Personal Life
AI Tools for Business and Personal LifeAI Tools for Business and Personal Life
AI Tools for Business and Personal Life
 
Дипломна работа: учебно съдържание по ООП - Светлин Наков
Дипломна работа: учебно съдържание по ООП - Светлин НаковДипломна работа: учебно съдържание по ООП - Светлин Наков
Дипломна работа: учебно съдържание по ООП - Светлин Наков
 
Дипломна работа: учебно съдържание по ООП
Дипломна работа: учебно съдържание по ООПДипломна работа: учебно съдържание по ООП
Дипломна работа: учебно съдържание по ООП
 
Свободно ИТ учебно съдържание за учители по програмиране и ИТ
Свободно ИТ учебно съдържание за учители по програмиране и ИТСвободно ИТ учебно съдържание за учители по програмиране и ИТ
Свободно ИТ учебно съдържание за учители по програмиране и ИТ
 
AI and the Professions of the Future
AI and the Professions of the FutureAI and the Professions of the Future
AI and the Professions of the Future
 
Programming Languages Trends for 2023
Programming Languages Trends for 2023Programming Languages Trends for 2023
Programming Languages Trends for 2023
 
IT Professions and How to Become a Developer
IT Professions and How to Become a DeveloperIT Professions and How to Become a Developer
IT Professions and How to Become a Developer
 
GitHub Actions (Nakov at RuseConf, Sept 2022)
GitHub Actions (Nakov at RuseConf, Sept 2022)GitHub Actions (Nakov at RuseConf, Sept 2022)
GitHub Actions (Nakov at RuseConf, Sept 2022)
 
IT Professions and Their Future
IT Professions and Their FutureIT Professions and Their Future
IT Professions and Their Future
 
How to Become a QA Engineer and Start a Job
How to Become a QA Engineer and Start a JobHow to Become a QA Engineer and Start a Job
How to Become a QA Engineer and Start a Job
 
Призвание и цели: моята рецепта
Призвание и цели: моята рецептаПризвание и цели: моята рецепта
Призвание и цели: моята рецепта
 
What Mongolian IT Industry Can Learn from Bulgaria?
What Mongolian IT Industry Can Learn from Bulgaria?What Mongolian IT Industry Can Learn from Bulgaria?
What Mongolian IT Industry Can Learn from Bulgaria?
 
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)How to Become a Software Developer - Nakov in Mongolia (Oct 2022)
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)
 
Blockchain and DeFi Overview (Nakov, Sept 2021)
Blockchain and DeFi Overview (Nakov, Sept 2021)Blockchain and DeFi Overview (Nakov, Sept 2021)
Blockchain and DeFi Overview (Nakov, Sept 2021)
 

Último

Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxannathomasp01
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 

Último (20)

Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 

Map-Reduce and Apache Hadoop

  • 1. Map-Reduce and Apache Hadoop for Big Data Computations Svetlin Nakov Training and Inspiration Software University http://softuni.bg BGOUG Seminar, Borovets, 4-June-2016
  • 2. 2 1. Big Data 2. Map-Reduce  What is Map-Reduce?  How It Works?  Mappers and Reducers  Examples 3. Apache Hadoop Table of Contents
  • 3. Big Data What is Big Data Processing?
  • 4. 4  Big data == process very large data sets (terabytes / petabytes)  So large or complex, that traditional data processing is inadequate  Usually stored and processed by distributed databases  Often related to analysis of very large data sets  Typical components of big data systems  Distributed databases (like Cassandra, HBase and Hive)  Distributed processing frameworks (like Apache Hadoop)  Distributed processing systems (like Map-Reduce)  Distributed file systems (like HDFS) Big Data
  • 6. 6  Map-Reduce is a distributed processing framework  Computational model for processing huge data-sets (terabytes)  Using parallel processing on large clusters (thousands of nodes)  Relying on distributed infrastructure  Like Apache Hadoop or MongoDB cluster  The input and output data is stored in a distributed file system (or distributed database)  The framework takes care of scheduling, executing and monitoring tasks, and re-executes the failed tasks What is Map-Reduce?
  • 7. 7  How map-reduce works? 1. Splits the input data-set into independent chunks 2. Process each chunk by "map" tasks in parallel manner  The "map" function groups the input data into key-value pairs  Equal keys are processed by the same "reduce" node 3. Outputs of the "map" tasks are  Sorted, grouped by key, then sent as input to the "reduce" tasks 4. The "reduce" tasks  Aggregate the results per each key and produces the final output Map-Reduce: How It Works?
  • 8. 8  The map-reduce process is a sequence of transformations, executed on several nodes in parallel  Map: groups input chunks of data to key-value pairs  E.g. splits documents(id, chunk-content) into words(word, count)  Combine: sorts and combines all values by the same key  E.g. produce a list of counts for each word  Reduce: combines (aggregates) all values for certain key The Map-Reduce Process (input) <key1, val1>  map  <key2, val2>  combine  <key2, list<val2>>  reduce  <key3, val3> (output)
  • 9. 9  We have a very large set of documents (e.g. 200 terabytes)  We want to count how many times each word occurs  Input: set of documents {key + content}  Mapper:  Extract the words from each document (words are used as keys)  Transforms documents {key + content}  word-count-pairs {word, count}  Reducer:  Sums the counts for each word  Transforms {word, list<count>}  word-count-pairs {word, count} Example: Counting Words
  • 10. 10 Counting Words: Mapper and Reducer public void map(Object offset, Text docText, Context context) throws IOException, InterruptedException { String[] words = docText.toString().toLowerCase().split("W+"); for (String word : words) context.write(new Text(word), new IntWritable(1)); } public void reduce(Text word, Iterable<IntWritable> counts, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable count : counts) sum += count.get(); context.write(word, new IntWritable(sum)); }
  • 11. Word Count in Apache Hadoop Live Demo
  • 12. 12  We are given a CSV file holding real estate sales data:  Estate address, city, ZIP code, state, # beds, # baths, square foots, sale date price and GPS coordinates (latitude + longitude)  Find all cities that have sales in price range [100 000 … 200 000]  As side effect, find the sum of all sales by city Example: Extract Data from CSV Report
  • 13. 13 Process CSV Report – How It Works? SELECT city, SUM(price) FROM Sales GROUP BY city chunkingchunking map map map city sum(price) SACRAMENTO 625140 LINCOLN 843620 RIO LINDA 348500 reduce reduce reduce
  • 14. 14 Process CSV Report: Mapper and Reducer public void map(Object offset, Text inputCSVLine, Context context) throws IOException, InterruptedException { String[] fields = inputCSVLine.toString().split(","); String city = fields[1]; int price = Integer.parseInt(fields[9]); if (price > 100000 && price < 200000) context.write(new Text(city), new LongWritable(price); } public void reduce(Text city, Iterable<LongWritable> prices, Context context) throws IOException, InterruptedException { long sum = 0; for (LongWritable val : prices) sum += val.get(); context.write(city, new LongWritable(sum)); }
  • 15. Processing CSV Report in Apache Hadoop Live Demo
  • 17. 17  Apache Hadoop project develops open-source software for reliable, scalable, distributed computing  Hadoop Distributed File System (HDFS) – a distributed file system that transparently moves data across Hadoop cluster nodes  Hadoop MapReduce – the map-reduce framework  HBase – a scalable, distributed database for large tables  Hive – SQL-like query for large datasets  Pig – a high-level data-flow language for parallel computation  Hadoop is driven by big players like IBM, Microsoft, Facebook, VMware, LinkedIn, Yahoo, Cloudera, Intel, Twitter, Hortonworks, … Apache Hadoop
  • 18. 18 Hadoop Ecosystem HDFS Storage Redundant (3 copies) For large files – large blocks 64 MB or 128 MB / block Can scale to 1000s of nodes MapReduce API Batch (Job) processing Distributed and localized to clusters Auto-parallelizable for huge amounts of data Fault-tolerant (auto retries) Adds high availability and more Hadoop Libraries Pig Hive HBase Others
  • 19. 19 Hadoop Cluster HDFS (Physical) Storage Name Node Data Node 1 Data Node 2 Data Node 3 Secondary Name Node • Contains web site to view cluster information • V2 Hadoop uses multiple Name Nodes for HA One Name Node • 3 copies of each node by default Many Data Nodes • Using common Linux shell commands • Block size is 64 or 128 MB Work with data in HDFS
  • 20.  Tips:  sudo means "run as administrator" (super user)  Some distributions use hadoop dfs rather than hadoop fs Common Hadoop Shell Commands hadoop fs –cat file:///file2 hadoop fs –mkdir /user/hadoop/dir1 /user/hadoop/dir2 hadoop fs –copyFromLocal <fromDir> <toDir> hadoop fs –put <localfile> hdfs://nn.example.com/hadoopfile sudo hadoop jar <jarFileName> <class> <fromDir> <toDir> hadoop fs –ls /user/hadoop/dir1 hadoop fs –cat hdfs://nn1.example.com/file1 hadoop fs –get /user/hadoop/file <localfile>
  • 22. 22  Apache Hadoop MapReduce  The world's leading implementation of the map-reduce computational model  Provides parallelized (scalable) computing  For processing very large data sets  Fault tolerant  Runs on commodity of hardware  Implemented in many cloud platforms: Amazon EMR, Azure HDInsight, Google Cloud, Cloudera, Rackspace, HP Cloud, … Hadoop MapReduce
  • 24. 24  Download and install Java and Hadoop  http://hadoop.apache.org/releases.html  You will need to install Java first  Download a pre-installed Hadoop virtual machine (VM)  Hortonworks Sandbox  Cloudera QuickStart VM  You can use Hadoop in the cloud / local emulator  E.g. Azure HDInsight Emulator Hadoop: Getting Started
  • 25. Playing with Apache Hadoop Live Demo
  • 26. 26  Big data == processing huge datasets that are too big for processing on a single machine  Use a cluster of computing nodes  Map-reduce == computational paradigm for parallel data processing of huge data-sets  Data is chunked, then mapped into groups, then groups are processed and the results are aggregated  Highly scalable, can process petabytes of data  Apache Hadoop – industry's leading Map-Reduce framework Summary
  • 28. License  This course (slides, examples, labs, videos, homework, etc.) is licensed under the "Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International" license 28  Attribution: this work may contain portions from  "Fundamentals of Computer Programming with C#" book by Svetlin Nakov & Co. under CC-BY-SA license  "Data Structures and Algorithms" course by Telerik Academy under CC-BY-NC-SA license
  • 29. Free Trainings @ Software University  Software University Foundation – softuni.org  Software University – High-Quality Education, Profession and Job for Software Developers  softuni.bg  Software University @ Facebook  facebook.com/SoftwareUniversity  Software University @ YouTube  youtube.com/SoftwareUniversity  Software University Forums – forum.softuni.bg

Notas del editor

  1. http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html