Hadoop

Tech share
• Hadoop Core, our flagship sub-project,
  provides a distributed filesystem (HDFS) and
  support for the MapReduce distributed
  computing metaphor.
• Pig is a high-level data-flow language and
  execution framework for parallel computation.
  It is built on top of Hadoop Core.
ZooKeeper
• ZooKeeper is a highly available and reliable
  coordination system. Distributed applications
  use ZooKeeper to store and mediate updates
  for critical shared state.
JobTracker
• JobTracker: The JobTracker provides command
  and control for job management. It supplies
  the primary user interface to a MapReduce
  cluster. It also handles the distribution and
  management of tasks. There is one instance of
  this server running on a cluster. The machine
  running the JobTracker server is the
  MapReduce master.
TaskTracker
• TaskTracker: The TaskTracker provides
  execution services for the submitted jobs.
  Each TaskTracker manages the execution of
  tasks on an individual compute node in the
  MapReduce cluster. The JobTracker manages
  all of the TaskTracker processes. There is one
  instance of this server per compute node.
NameNode
• NameNode: The NameNode provides metadata
  storage for the shared file system. The
  NameNode supplies the primary user interface to
  the HDFS. It also manages all of the metadata for
  the HDFS. There is one instance of this server
  running on a cluster. The metadata includes such
  critical information as the file directory structure
  and which DataNodes have copies of the data
  blocks that contain each file’s data. The machine
  running the NameNode server process is the
  HDFS master.
Secondary NameNode
• Secondary NameNode: The secondary
  NameNode provides both file system metadata
  backup and metadata compaction. It supplies
  near real-time backup of the metadata for the
  NameNode. There is at least one instance of this
  server running on a cluster, ideally on a separate
  physical machine than the one running the
  NameNode. The secondary NameNode also
  merges the metadata change history, the edit log,
  into the NameNode’s file system image.
Design of HDFS
• Good fit
  – Very large files
  – Streaming data access
  – Commodity hardware
• Not a good fit
  – Low-latency data access
  – Lots of small files
  – Multiple writers, arbitrary file modifications
Blocks
• Disk blocks: normally 512 bytes
• HDFS blocks: 64 MB by default (see the sketch below)
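A minimal sketch of reading the configured block size; it assumes the classic pre-2.x property name dfs.block.size and the 64 MB default above:

  import org.apache.hadoop.conf.Configuration;

  public class ShowBlockSize {
    public static void main(String[] args) {
      // *-site.xml files on the classpath are loaded automatically
      Configuration conf = new Configuration();
      long blockSize = conf.getLong("dfs.block.size", 64 * 1024 * 1024);
      System.out.println("HDFS block size: " + blockSize + " bytes");
    }
  }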
HDFS File Read
• (diagram slide showing the read path; only the label "memory" survives)
HDFS File Write
• OutputStream.write()
• OutputStream.flush(): flushes the stream, but written data only becomes
  visible to readers once more than a block has been written.
• OutputStream.sync(): forces synchronization.
• OutputStream.close(): includes a sync() (see the sketch below).
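A minimal sketch of these visibility semantics, assuming a hypothetical path and the sync() call of the 0.20-era API (later releases renamed it hflush()):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteVisibility {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path p = new Path("/t1/visibility.txt");   // hypothetical path
      FSDataOutputStream out = fs.create(p);
      out.write("content".getBytes("UTF-8"));
      out.flush();                               // flushed, but readers may still see length 0
      System.out.println(fs.getFileStatus(p).getLen());
      out.sync();                                // force the written data to be visible
      out.close();                               // close() also syncs
    }
  }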
DistCp Distributed Copy
• hadoop distcp -update hdfs://namenode1/foo
  hdfs://namenode2/bar

• hadoop distcp -update ……
  – only copies files that have changed
• hadoop distcp -overwrite ……
  – overwrites files at the destination
• hadoop distcp -m 100 ……
  – splits the copy job into 100 map tasks (-m sets the map count)
Hadoop File Archives
• HAR files

• hadoop archive -archiveName file.har
  /myfiles /outpath

• hadoop fs -ls /outpath/file.har
• hadoop fs -lsr har:///outpath/file.har
File Operations
• hadoop fs -rm hdfs://192.168.126.133:9000/xxx


   • cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du,
     dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal,
     mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
     (a Java equivalent is sketched below)
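The same shell operations can be driven from Java through FsShell; a minimal sketch, not from the deck:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FsShell;
  import org.apache.hadoop.util.ToolRunner;

  public class ListRoot {
    public static void main(String[] args) throws Exception {
      // Equivalent to: hadoop fs -ls /
      int exitCode = ToolRunner.run(new FsShell(new Configuration()),
                                    new String[]{"-ls", "/"});
      System.exit(exitCode);
    }
  }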
Distributed Deployment
• Master & slave 192.168.0.10
• Slave 192.168.0.20

• Edit conf/masters
  – 192.168.0.10
• Edit conf/slaves
  – 192.168.0.10
  – 192.168.0.20
Installing Hadoop
• ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

• cat ~/.ssh/id_dsa.pub >>
  ~/.ssh/authorized_keys

• Disable the firewall: sudo ufw disable
Distributed Deployment: core-site.xml
(identical on master & slaves)

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/tony/tmp/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.0.10:9000</value>
  </property>
</configuration>
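A quick sanity check, sketched here and not part of the deck, that these files are on the classpath and being picked up:

  import org.apache.hadoop.conf.Configuration;

  public class ShowConf {
    public static void main(String[] args) {
      // Configuration loads core-site.xml (and the other *-site.xml files)
      // from the conf/ directory on the classpath.
      Configuration conf = new Configuration();
      System.out.println("fs.default.name = " + conf.get("fs.default.name"));
      System.out.println("hadoop.tmp.dir  = " + conf.get("hadoop.tmp.dir"));
    }
  }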
Distributed Deployment: hdfs-site.xml
(master & slaves)

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/tony/tmp/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/tony/tmp/data</value>
  </property>
</configuration>
• Also make sure these directories exist on each machine.
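dfs.replication only sets the default; replication can also be changed per file. A minimal sketch with a hypothetical path:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SetReplication {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Equivalent to: hadoop fs -setrep 2 /t1/a1.txt
      boolean changed = fs.setReplication(new Path("/t1/a1.txt"), (short) 2);
      System.out.println("replication changed: " + changed);
    }
  }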
Distributed Deployment: mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.10:9001</value>
  </property>
</configuration>
• Configure every machine with the master's address.
Run
• hadoop namenode -format
  – Before every format, run stop-all.sh first and clear all
    directories under tmp.
• start-all.sh
• Check the cluster status:
  – http://192.168.0.20:50070/dfshealth.jsp
  – or: hadoop dfsadmin -report
could only be replicated
• java.io.IOException: could only be replicated
  to 0 nodes, instead of 1.

• Fix:
  – The XML configuration is wrong: make sure the addresses in each
    slave's mapred-site.xml and core-site.xml match the master's.
Incompatible namespaceIDs
• java.io.IOException: Incompatible
  namespaceIDs in /home/hadoop/data:
  namenode namespaceID = 1214734841;
  datanode namespaceID = 1600742075
• Cause:
  – tmp was not cleared before reformatting, so the namespaceIDs no
    longer match.
• Fix:
  – Edit the namenode's
    /home/hadoop/name/current/VERSION
UnknownHostException
• # hostname
• vi /etc/hostname: change the hostname
• vi /etc/hosts: add the IP mapping for that hostname
error in shuffle in fetcher
• org.apache.hadoop.mapreduce.task.reduce.Sh
  uffle$ShuffleError: error in shuffle in fetcher
• Fix:
  – The problem is in the hosts file: on every node, add the hostnames
    and IP mappings of all the other nodes to /etc/hosts.
Auto sync
• (screenshot slide)
Dynamically Adding a DataNode
• In the master's conf/slaves, add the new node's address.

• Start the daemons on the new node:
  – bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker

• Once started, Hadoop recognizes the new node automatically.
Fault Tolerance
• If a node is unresponsive for too long, it is removed from the cluster,
  and the other nodes make up the lost replication.
Running MapReduce
• hadoop jar a.jar com.Map1
  hdfs://192.168.126.133:9000/hadoopconf/
  hdfs://192.168.126.133:9000/output2/
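A minimal driver sketch for an invocation like the above, using the old mapred API of this era; the package-less class name and the identity-style mapper body are assumptions, not the deck's actual a.jar code:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class Map1 {
    // Identity-style mapper: re-emits each input line keyed by its byte offset.
    public static class LineMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<LongWritable, Text> out, Reporter reporter)
          throws IOException {
        out.collect(key, value);
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(Map1.class);
      conf.setJobName("map1");
      conf.setMapperClass(LineMapper.class);
      conf.setNumReduceTasks(0);                  // map-only job
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }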
Read From Hadoop URL
// execute: hadoop ReadFromHDFS
import java.io.InputStream;
import java.io.IOException;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHDFS {
  static {
    // setURLStreamHandlerFactory may be called only once per JVM
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }
  public static void main(String[] args) {
    InputStream in = null;
    try {
      in = new URL("hdfs://192.168.126.133:9000/t1/a1.txt").openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } catch (IOException e) {   // covers FileNotFoundException as well
      e.printStackTrace();
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
Read By FileSystem API
// execute: hadoop ReadByFileSystemAPI
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadByFileSystemAPI {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://192.168.126.133:9000/t1/a2.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
FileSystemAPI
// fs is a FileSystem handle obtained as on the previous slide
Path path = new Path(URI.create("hdfs://192.168.126.133:9000/t1/tt/"));
if (fs.exists(path)) {
  fs.delete(path, true);    // true = recursive
  System.out.println("deleted");
} else {
  fs.mkdirs(path);
  System.out.println("created");
}

// List files
FileStatus[] fileStatuses = fs.listStatus(new Path(URI.create("hdfs://192.168.126.133:9000/")));
for (FileStatus fileStatus : fileStatuses) {
  System.out.println(fileStatus.getPath().toUri().toString() + " dir:" + fileStatus.isDirectory());
}

// A PathFilter decides which paths a listing or glob returns (usage below)
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
    return true;    // accept everything
  }
};
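The slide defines pathFilter but never applies it; a minimal usage sketch with a hypothetical glob, reusing the same fs handle:

  // Only paths the filter accepts are returned from the glob.
  FileStatus[] filtered = fs.globStatus(
      new Path("hdfs://192.168.126.133:9000/t1/*"), pathFilter);
  for (FileStatus status : filtered) {
    System.out.println(status.getPath());
  }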
File Write Visibility
• After a file is created, it is visible in the filesystem namespace:
    Path p = new Path("p");
    fs.create(p);
    assertThat(fs.exists(p), is(true));
• However, content written to the file is not guaranteed to be visible even
  after the stream has been flushed, so the file length still shows as 0:
    Path p = new Path("p");
    OutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    assertThat(fs.getFileStatus(p).getLen(), is(0L));
• Once more than a block of data has been written, the first block becomes
  visible to new readers, and the same holds for each later block: the block
  currently being written is never visible to other readers.
• out.sync() forces synchronization; close() calls sync() automatically.
Cluster Copy & Archives
• hadoop distcp -update hdfs://n1/foo
  hdfs://n2/bar/foo
• Archiving
  – hadoop archive -archiveName files.har /my/files
    /my
• Using an archive
  – hadoop fs -lsr har:///my/files.har
  – hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/di
• Archive drawback: modifying, adding, or deleting files all require
  re-archiving.
SequenceFile Reader&Writer
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
SequenceFile.Writer writer = null;
try {
  System.out.println("start....................");
  FileSystem fileSystem = FileSystem.newInstance(conf);
  IntWritable key = new IntWritable(1);
  Text value = new Text("");
  Path path = new Path("hdfs://192.168.126.133:9000/t1/seq");
  if (!fileSystem.exists(path)) {
    // createWriter creates the file itself; no separate create() call needed
    writer = SequenceFile.createWriter(fileSystem, conf, path, key.getClass(), value.getClass());
    for (int i = 1; i < 10; i++) {
      writer.append(new IntWritable(i), new Text("value" + i));
    }
    writer.close();
  } else {
    SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, path, conf);
    while (reader.next(key, value)) {
      System.out.println("key:" + key.get() + " value:" + value + " position:" + reader.getPosition());
    }
    reader.close();
  }
} catch (IOException e) {
  e.printStackTrace();
} finally {
  IOUtils.closeStream(writer);
}
SequenceFile
1 value1
2 value2
3 value3
4 value4
5 value5
6 value6
7 value7
8 value8
9 value9
• Each record consists of a key and a value.
• Use hadoop fs -text hdfs://……… to display the file contents.
MapFile
• Rebuild the index:
  MapFile.fix(fileSystem, path, key.getClass(), value.getClass(), true, conf);

• MapFile.Writer writer = new MapFile.Writer(conf, fileSystem,
    path.toString(), key.getClass(), value.getClass());

• MapFile.Reader reader = new MapFile.Reader(fileSystem, path.toString(), conf);
  (a full write/read sketch follows)
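A minimal end-to-end sketch of those Writer/Reader handles with a hypothetical path; note that MapFile keys must be appended in ascending order:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/t1/map");            // hypothetical directory

      // Write: append() throws if keys arrive out of order.
      MapFile.Writer writer = new MapFile.Writer(conf, fs, path.toString(),
          IntWritable.class, Text.class);
      for (int i = 1; i < 10; i++) {
        writer.append(new IntWritable(i), new Text("value" + i));
      }
      writer.close();

      // Read: get() seeks via the in-memory index, then scans the data file.
      MapFile.Reader reader = new MapFile.Reader(fs, path.toString(), conf);
      Text value = new Text();
      reader.get(new IntWritable(5), value);
      System.out.println("key 5 -> " + value);
      reader.close();
    }
  }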
