SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
∞
Agenda
   Need for a new processing platform (BigData)
   Origin of Hadoop
   What is Hadoop & what it is not ?
   Hadoop architecture
   Hadoop components
    (Common/HDFS/MapReduce)
   Hadoop ecosystem
   When should we go for Hadoop ?
   Real world use cases
   Questions
Need for a new processing
platform (Big Data)
   What is BigData ?
       - Twitter (over 7~ TB/day)
       - Facebook (over 10~ TB/day)
       - Google (over 20~ PB/day)
   Where does it come from ?
   Why to take so much of pain ?
        - Information everywhere, but where is the
          knowledge?
   Existing systems (vertical scalibility)
   Why Hadoop (horizontal scalibility)?
Origin of Hadoop
   Seminal whitepapers by Google in 2004
    on a new programming paradigm to
    handle data at internet scale
   Hadoop started as a part of the Nutch
    project.
   In Jan 2006 Doug Cutting started working
    on Hadoop at Yahoo
   Factored out of Nutch in Feb 2006
   First release of Apache Hadoop in
    September 2007
   Jan 2008 - Hadoop became a top level
    Apache project
Hadoop distributions

   Amazon
   Cloudera
   MapR
   HortonWorks
   Microsoft Windows Azure.
   IBM InfoSphere Biginsights
   Datameer
   EMC Greenplum HD Hadoop distribution
   Hadapt
What is Hadoop ?
 Flexibleinfrastructure for large
  scale computation & data
  processing on a network of
  commodity hardware
 Completely written in java
 Open source & distributed under
  Apache license
 Hadoop Common, HDFS &
  MapReduce
What Hadoop is not

A  replacement for existing data
  warehouse systems
 A File system
 An online transaction
  processing (OLTP) system
 Replacement of all
  programming logic
 A database
Hadoop architecture
   High level view (NN, DN, JT, TT) –
HDFS (Hadoop Distributed File
         System)
   Hadoop distributed file system
   Default storage for the Hadoop cluster
   NameNode/DataNode
   The File System Namespace(similar to our local
    file system)
   Master/slave architecture (1 master 'n' slaves)
   Virtual not physical
   Provides configurable replication (user specific)
   Data is stored as chunks (64 MB default, but
    configurable) across all the nodes
HDFS architecture
Data replication in HDFS.
Rack awareness




Typically large Hadoop clusters are arranged in racks and
network traffic between different nodes with in the same rack
is much more desirable than network traffic across the racks.
In addition Namenode tries to place replicas of block on
multiple racks for improved fault tolerance. A default
installation assumes all the nodes belong to the same rack.
MapReduce
   Framework provided by Hadoop to process
    large amount of data across a cluster of
    machines in a parallel manner
   Comprises of three classes –
    Mapper class
    Reducer class
    Driver class
   Tasktracker/ Jobtracker
   Reducer phase will start only after mapper is
    done
   Takes (k,v) pairs and emits (k,v) pair
   public static class Map extends Mapper<LongWritable,
    Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text(); public void
      map(LongWritable key, Text value, Context context)
throws
       IOException, InterruptedException {
               String line = value.toString();
         StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one); } } }
MapReduce job flow
Modes of operation

 Standalone   mode


 Pseudo-distributed    mode


 Fully-distributed   mode
Hadoop ecosystem
When should we go for
       Hadoop?
 Data   is too huge
 Processes    are independent
 Online   analytical processing
 (OLAP)
 Better   scalability
 Parallelism

 Unstructured    data
Real world use cases

Clickstream   analysis
Sentiment   analysis
Recommendation         engines
Ad   Targeting
Search   Quality
   What I have been doing…
     Seismic   Data Management & Processing
     WITSML    Server & Drilling Analytics
     Orchestra      Permission Map management for
      Search
     SDIS   (just started)
   Next steps: Get your hands dirty with
    code in a workshop on …
     Hadoop     Configuration
     HDFS    Data loading
     Map    Reduce programming
     Hbase

     Hive   & Pig
QUESTIONS ?

Más contenido relacionado

La actualidad más candente

Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
prabakaranbrick
 

La actualidad más candente (20)

Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Word count program execution steps in hadoop
Word count program execution steps in hadoopWord count program execution steps in hadoop
Word count program execution steps in hadoop
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop Administration
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Administer Hadoop Cluster
Administer Hadoop ClusterAdminister Hadoop Cluster
Administer Hadoop Cluster
 
Hadoop admin
Hadoop adminHadoop admin
Hadoop admin
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With Kerberos
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 

Destacado

Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
Christopher Pezza
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
nathanmarz
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
DataWorks Summit
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 

Destacado (20)

Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Welcome to Tolteq: The Leader in MWD Technology
Welcome to Tolteq: The Leader in MWD TechnologyWelcome to Tolteq: The Leader in MWD Technology
Welcome to Tolteq: The Leader in MWD Technology
 
Cath preso
Cath presoCath preso
Cath preso
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
The Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data SystemsThe Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data Systems
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Big Data
Big DataBig Data
Big Data
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar a Introduction to apache hadoop

Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
BOSC 2010
 

Similar a Introduction to apache hadoop (20)

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop
HadoopHadoop
Hadoop
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 

Más de Shashwat Shriparv

LibreOffice 7.3.pptx
LibreOffice 7.3.pptxLibreOffice 7.3.pptx
LibreOffice 7.3.pptx
Shashwat Shriparv
 

Más de Shashwat Shriparv (20)

Learning Linux Series Administrator Commands.pptx
Learning Linux Series Administrator Commands.pptxLearning Linux Series Administrator Commands.pptx
Learning Linux Series Administrator Commands.pptx
 
LibreOffice 7.3.pptx
LibreOffice 7.3.pptxLibreOffice 7.3.pptx
LibreOffice 7.3.pptx
 
Kerberos Architecture.pptx
Kerberos Architecture.pptxKerberos Architecture.pptx
Kerberos Architecture.pptx
 
Suspending a Process in Linux.pptx
Suspending a Process in Linux.pptxSuspending a Process in Linux.pptx
Suspending a Process in Linux.pptx
 
Kerberos Architecture.pptx
Kerberos Architecture.pptxKerberos Architecture.pptx
Kerberos Architecture.pptx
 
Command Seperators.pptx
Command Seperators.pptxCommand Seperators.pptx
Command Seperators.pptx
 
R language introduction
R language introductionR language introduction
R language introduction
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & development
 
Hbase interact with shell
Hbase interact with shellHbase interact with shell
Hbase interact with shell
 
H base development
H base developmentH base development
H base development
 
Hbase
HbaseHbase
Hbase
 
H base
H baseH base
H base
 
My sql
My sqlMy sql
My sql
 
Apache tomcat
Apache tomcatApache tomcat
Apache tomcat
 
Linux 4 you
Linux 4 youLinux 4 you
Linux 4 you
 
Java interview questions
Java interview questionsJava interview questions
Java interview questions
 
C# interview quesions
C# interview quesionsC# interview quesions
C# interview quesions
 
I pv6
I pv6I pv6
I pv6
 
Inventory system
Inventory systemInventory system
Inventory system
 

Último

Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 

Último (20)

Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 

Introduction to apache hadoop

  • 1.
  • 2. Agenda  Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)  Hadoop ecosystem  When should we go for Hadoop ?  Real world use cases  Questions
  • 3. Need for a new processing platform (Big Data)  What is BigData ? - Twitter (over 7~ TB/day) - Facebook (over 10~ TB/day) - Google (over 20~ PB/day)  Where does it come from ?  Why to take so much of pain ?  - Information everywhere, but where is the  knowledge?  Existing systems (vertical scalibility)  Why Hadoop (horizontal scalibility)?
  • 4. Origin of Hadoop  Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale  Hadoop started as a part of the Nutch project.  In Jan 2006 Doug Cutting started working on Hadoop at Yahoo  Factored out of Nutch in Feb 2006  First release of Apache Hadoop in September 2007  Jan 2008 - Hadoop became a top level Apache project
  • 5. Hadoop distributions  Amazon  Cloudera  MapR  HortonWorks  Microsoft Windows Azure.  IBM InfoSphere Biginsights  Datameer  EMC Greenplum HD Hadoop distribution  Hadapt
  • 6. What is Hadoop ?  Flexibleinfrastructure for large scale computation & data processing on a network of commodity hardware  Completely written in java  Open source & distributed under Apache license  Hadoop Common, HDFS & MapReduce
  • 7. What Hadoop is not A replacement for existing data warehouse systems  A File system  An online transaction processing (OLTP) system  Replacement of all programming logic  A database
  • 8. Hadoop architecture  High level view (NN, DN, JT, TT) –
  • 9. HDFS (Hadoop Distributed File System)  Hadoop distributed file system  Default storage for the Hadoop cluster  NameNode/DataNode  The File System Namespace(similar to our local file system)  Master/slave architecture (1 master 'n' slaves)  Virtual not physical  Provides configurable replication (user specific)  Data is stored as chunks (64 MB default, but configurable) across all the nodes
  • 12. Rack awareness Typically large Hadoop clusters are arranged in racks and network traffic between different nodes with in the same rack is much more desirable than network traffic across the racks. In addition Namenode tries to place replicas of block on multiple racks for improved fault tolerance. A default installation assumes all the nodes belong to the same rack.
  • 13. MapReduce  Framework provided by Hadoop to process large amount of data across a cluster of machines in a parallel manner  Comprises of three classes – Mapper class Reducer class Driver class  Tasktracker/ Jobtracker  Reducer phase will start only after mapper is done  Takes (k,v) pairs and emits (k,v) pair
  • 14.
  • 15. public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } }
  • 17. Modes of operation  Standalone mode  Pseudo-distributed mode  Fully-distributed mode
  • 19. When should we go for Hadoop?  Data is too huge  Processes are independent  Online analytical processing (OLAP)  Better scalability  Parallelism  Unstructured data
  • 20. Real world use cases Clickstream analysis Sentiment analysis Recommendation engines Ad Targeting Search Quality
  • 21. What I have been doing…  Seismic Data Management & Processing  WITSML Server & Drilling Analytics  Orchestra Permission Map management for Search  SDIS (just started)  Next steps: Get your hands dirty with code in a workshop on …  Hadoop Configuration  HDFS Data loading  Map Reduce programming  Hbase  Hive & Pig