Hadoop


    Rahul Jain
    Software Engineer
    http://www.linkedin.com/in/rahuldausa



Agenda
• Hadoop
   –   Introduction
   –   Hadoop (Why)
   –   Hadoop History
   –   Uses of Hadoop
   –   High Level Architecture
   –   Map-Reduce
• HDFS
   – GFS (Google File System)
   – HDFS Architecture
• Installation/Configuration
• Examples

Introduction
 An open-source software framework.
 Supports data-intensive distributed applications.
 Enables applications to work with thousands of independent computers
  and petabytes of data.
 Derived from Google’s Map-Reduce and Google File System papers.
 Written in the Java programming language.
 Started by Doug Cutting, who named it after his son’s toy elephant, to
  support distribution for Nutch (a sub-project of Lucene).




Hadoop (Why)
• Need to process huge datasets on a large number of computers.
• It is expensive to build reliability into each application.
• Nodes fail every day
        - Failure is expected, rather than exceptional.
• Need common infrastructure
   - Efficient, reliable, easy to use.
   - Open source, Apache License.




Hadoop History
•   Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
•   July 2005 – Nutch uses Map-Reduce
•   Jan 2006 – Doug Cutting joins Yahoo!
•   Feb 2006 – Becomes a Lucene subproject
•   Apr 2007 – Yahoo! runs it on a 1000-node cluster
•   Jan 2008 – Becomes an Apache top-level project
•   Feb 2008 – Yahoo! production search index runs on Hadoop

What is Hadoop Used For?
•   Searching (Yahoo)
•   Log Processing
•   Recommendation Systems (Facebook, LinkedIn, eBay, Amazon)
•   Analytics (Facebook, LinkedIn)
•   Video and Image Analysis (NASA)
•   Data Retention



Hadoop High Level Architecture




Map-Reduce
A framework for processing parallelizable problems across huge datasets
using a large number of computers (nodes), collectively referred to as a:

Cluster – if all nodes are on the same local network and use similar hardware,
or a
Grid – if the nodes are shared across geographically distributed sites and
use more heterogeneous hardware.

It consists of two steps:
1. Map step – The master node takes the input, divides it into smaller
sub-problems, and distributes them to worker nodes. A worker node may do
this again in turn, leading to a multi-level tree structure. Each worker
node processes its smaller problem and passes the answer back to its
master node.
2. Reduce step – The master node then collects the answers to all the
sub-problems and combines them in some way to form the output – the answer
to the problem it was originally trying to solve.

(Figure: multiple Map-Reduce phases)
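The two steps above can be sketched in plain Java, with no Hadoop dependency. This is a conceptual word-count sketch: the class and method names are illustrative, not Hadoop's actual Mapper/Reducer API, and the "distribution" to workers is simulated by looping over input splits.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch of the Map and Reduce steps (word count).
public class WordCountSketch {

    // Map step: each "worker" turns its input split into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce step: the "master" groups the pairs by key and sums the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs)
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        // Two input "splits", as if distributed to two worker nodes.
        List<String> splits = List.of("the quick brown fox", "the lazy dog");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String s : splits) mapped.addAll(map(s)); // collect workers' answers
        Map<String, Integer> counts = reduce(mapped);
        System.out.println(counts.get("the")); // 2
    }
}
```

In real Hadoop the framework, not your code, handles the split distribution, shuffling of pairs to reducers, and fault tolerance; only the map and reduce logic is yours.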
Map-Reduce Life-Cycle




Credit : http://code.google.com/edu/parallel/mapreduce-tutorial.html
HDFS
Hadoop Distributed File System




Let’s understand GFS first …
               Google File System




GFS Architecture




Goals of HDFS
1. Very large distributed file system
      - 10K nodes, 100 million files, 10 PB
2. Assumes commodity hardware
      - Files are replicated to handle hardware failure
      - Detects failures and recovers from them
3. Optimized for batch processing
      - Data locations are exposed so that computation can move to
      where the data resides.
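Replication, for instance, is set per cluster in conf/hdfs-site.xml. An illustrative fragment (3 is HDFS's default replication factor; the single-node setup later in this deck uses 1):

```xml
<!-- hdfs-site.xml: keep 3 copies of each block so the loss of a
     commodity node does not lose data -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```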


Installation / Configuration

[rjain@10.1.110.12 hadoop-1.0.3]$ pwd
/home/rjain/hadoop-1.0.3

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/rjain/rahul/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/rjain/rahul/hdfs/name</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-all.sh    (runs both of the following)
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-dfs.sh
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-mapred.sh

[rjain@10.1.110.12 hadoop-1.0.3]$ jps
29756 SecondaryNameNode
29483 NameNode
29619 DataNode
19711 JobTracker
19847 TaskTracker
18756 Jps

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs
Usage: java FsShell
      [-ls <path>]
      [-lsr <path>] : Recursive version of ls. Similar to Unix ls -R.
      [-du <path>] : Displays aggregate length of files contained in the directory, or the length of a file.
      [-dus <path>] : Displays a summary of file lengths.
      [-count [-q] <path>]
      [-mv <src> <dst>]
      [-cp <src> <dst>]
      [-rm [-skipTrash] <path>]
      [-rmr [-skipTrash] <path>] : Recursive version of delete (rm).
      [-expunge] : Empty the Trash.
      [-put <localsrc> ... <dst>] : Copy a single src, or multiple srcs, from the local file system to the destination filesystem.
      [-copyFromLocal <localsrc> ... <dst>]
      [-moveFromLocal <localsrc> ... <dst>]
      [-get [-ignoreCrc] [-crc] <src> <localdst>]
      [-getmerge <src> <localdst> [addnl]]
      [-cat <src>]
      [-text <src>] : Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
      [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
      [-moveToLocal [-crc] <src> <localdst>]
      [-mkdir <path>]
      [-setrep [-R] [-w] <rep> <path/file>] : Changes the replication factor of a file.
      [-touchz <path>] : Create a file of zero length.
      [-test -[ezd] <path>] : -e checks whether the file exists (returns 0 if true); -z checks whether the file is zero length; -d checks whether the path is a directory.
      [-stat [format] <path>] : Returns stat information on the path, such as the creation time of a directory.
      [-tail [-f] <file>] : Displays the last kilobyte of the file to stdout.
      [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
      [-chown [-R] [OWNER][:[GROUP]] PATH...]
      [-chgrp [-R] GROUP PATH...]
      [-help [cmd]]
HDFS- Read/Write Example
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Given input/output file names as strings, we construct inFile/outFile Path objects.
Most of the FileSystem APIs accept Path objects.

Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);

Validate the input/output paths before reading/writing.

if (!fs.exists(inFile))
  printAndExit("Input file not found");
if (!fs.isFile(inFile))
  printAndExit("Input should be a file");
if (fs.exists(outFile))
  printAndExit("Output already exists");

Open inFile for reading.

FSDataInputStream in = fs.open(inFile);

Open outFile for writing.

FSDataOutputStream out = fs.create(outFile);

Read from the input stream and write to the output stream until EOF.

byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}

Close the streams when done.

in.close();
out.close();
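The copy loop above is plain Java stream I/O: FSDataInputStream and FSDataOutputStream extend the standard InputStream/OutputStream classes, so the same pattern can be tried locally without a cluster. A minimal, self-contained sketch (class and method names here are illustrative):

```java
import java.io.*;

// The HDFS example's copy loop, demonstrated with plain java.io streams.
public class StreamCopy {

    // Copies everything from in to out; returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int bytesRead;
        // read() returns -1 at EOF, which ends the loop.
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
            total += bytesRead;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello hdfs".getBytes();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(data), out);
        System.out.println(out.toString()); // hello hdfs
    }
}
```

On a real cluster you would obtain the streams from fs.open()/fs.create() as shown above; the loop itself is unchanged.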
Hadoop Sub-Projects
•   Hadoop Common: The common utilities that support the other Hadoop subprojects.
•   Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput
    access to application data.
•   Hadoop MapReduce: A software framework for distributed processing of large data sets on
    compute clusters.

Other Hadoop-related projects at Apache include:

•   Avro™: A data serialization system.
•   Cassandra™: A scalable multi-master database with no single points of failure.
•   Chukwa™: A data collection system for managing large distributed systems.
•   HBase™: A scalable, distributed database that supports structured data storage for large tables.
•   Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•   Mahout™: A scalable machine learning and data mining library.
•   Pig™: A high-level data-flow language and execution framework for parallel computation.
•   ZooKeeper™: A high-performance coordination service for distributed applications.




Questions ?




