SlideShare una empresa de Scribd logo
1 de 21
Descargar para leer sin conexión
Scalable Computing
        with
      Hadoop

          Doug Cutting
      cutting@apache.org
    dcutting@yahoo-inc.com

            5/4/06
Seek versus Transfer
●   B-Tree
    –   requires seek per access
    –   unless to recent, cached page
    –   so can buffer & pre-sort accesses
    –   but, w/ fragmentation, must still seek per page


                                                G O U


    A B C D E F                H   I    J K L M           ...
Seek versus Transfer
●   update by merging
    –   merge sort takes log(updates), at transfer rate
    –   merging updates is linear in db size, at transfer rate
●   if 10MB/s xfer, 10ms seek, 1% update of TB db
    –   100b entries, 10kb pages, 10B entries, 1B pages
    –   seek per update requires 1000 days!
    –   seek per page requires 100 days!
    –   transfer entire db takes 1 day
Hadoop DFS
●   modelled after Google's GFS
●   single namenode
    –   maps name → <blockId>*
    –   maps blockId → <host:port>replication_level
●   many datanodes, one per disk generally
    –   map blockId → <byte>*
    –   poll namenode for replication, deletion, etc. requests
●   client code talks to both
Hadoop MapReduce
●   Platform for reliable, scalable computing.
●   All data is sequences of <key,value> pairs.
●   Programmer specifies two primary methods:
    –   map(k, v) → <k', v'>*
    –   reduce(k', <v'>*) → <k', v'>*
    –   also partition(), compare(), & others
●   All v' with same k' are reduced together, in order.
    –   bonus: built-in support for sort/merge!
MapReduce job processing

input     map tasks   reduce tasks   output
split 0   map()
split 1   map()        reduce()      part 0

split 2   map()        reduce()      part 1

split 3   map()        reduce()      part 2

split 4   map()
Example: RegexMapper
public class RegexMapper implements Mapper {
  private Pattern pattern;
  private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("mapred.mapper.regex"));
      group = job.getInt("mapred.mapper.regex.group", 0);
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
      throws IOException {
      String text = ((UTF8)value).toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new UTF8(matcher.group(group)),
                       new LongWritable(1));
      }
    }
}
Example: LongSumReducer
public class LongSumReducer implements Reducer {

    public void configure(JobConf job) {}

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
      throws IOException {

        long sum = 0;

        while (values.hasNext()) {
          sum += ((LongWritable)values.next()).get();
        }

        output.collect(key, new LongWritable(sum));
    }
}
Example: main()
public static void main(String[] args) throws IOException {
  NutchConf defaults = NutchConf.get();
  JobConf job = new JobConf(defaults);

    job.setInputDir(new File(args[0]));

    job.setMapperClass(RegexMapper.class);
    job.set("mapred.mapper.regex", args[2]);
    job.set("mapred.mapper.regex.group", args[3]);

    job.setReducerClass(LongSumReducer.class);

    job.setOutputDir(args[1]);
    job.setOutputKeyClass(UTF8.class);
    job.setOutputValueClass(LongWritable.class);

    JobClient.runJob(job);
}
Nutch Algorithms
●   inject urls into a crawl db, to bootstrap it.
●   loop:
    –   generate a set of urls to fetch from crawl db;
    –   fetch a set of urls into a segment;
    –   parse fetched content of a segment;
    –   update crawl db with data parsed from a segment.
●   invert links parsed from segments
●   index segment text & inlink anchor text
Nutch on MapReduce & NDFS
●   Nutch's major algorithms converted in 2 weeks.
●   Before:
    –   several were undistributed scalabilty bottlenecks
    –   distributable algorithms were complex to manage
    –   collections larger than 100M pages impractical
●   After:
    –   all are scalable, distributed, easy to operate
    –   code is substantially smaller & simpler
    –   should permit multi-billion page collections
Data Structure: Crawl DB
●   CrawlDb is a directory of files containing:
    <URL, CrawlDatum>
●   CrawlDatum:
    <status, date, interval, failures, linkCount, ...>
●   Status:
    {db_unfetched, db_fetched, db_gone,
      linked,
      fetch_success, fetch_fail, fetch_gone}
Algorithm: Inject
●   MapReduce1: Convert input to DB format
    In: flat text file of urls
    Map(line) → <url, CrawlDatum>; status=db_unfetched
    Reduce() is identity;
    Output: directory of temporary files
●   MapReduce2: Merge into existing DB
    Input: output of Step1 and existing DB files
    Map() is identity.
    Reduce: merge CrawlDatum's into single entry
    Out: new version of DB
Algorithm: Generate
●   MapReduce1: select urls due for fetch
    In: Crawl DB files
    Map() → if date≥now, invert to <CrawlDatum, url>
    Partition by value hash (!) to randomize
    Reduce:
       compare() order by decreasing CrawlDatum.linkCount
       output only top-N most-linked entries
●   MapReduce2: prepare for fetch
    Map() is invert; Partition() by host, Reduce() is identity.
    Out: Set of <url,CrawlDatum> files to fetch in parallel
Algorithm: Fetch
●   MapReduce: fetch a set of urls
    In: <url,CrawlDatum>, partition by host, sort by hash
    Map(url,CrawlDatum) → <url, FetcherOutput>
       multi-threaded, async map implementation
       calls existing Nutch protocol plugins
    FetcherOutput: <CrawlDatum, Content>
    Reduce is identity
    Out: two files: <url,CrawlDatum>, <url,Content>
Algorithm: Parse
●   MapReduce: parse content
    In: <url, Content> files from Fetch
    Map(url, Content) → <url, Parse>
     calls existing Nutch parser plugins
    Reduce is identity.
    Parse: <ParseText, ParseData>
    Out: split in three: <url,ParseText>, <url,ParseData>
     and <url,CrawlDatum> for outlinks.
Algorithm: Update Crawl DB
●   MapReduce: integrate fetch & parse out into db
    In: <url,CrawlDatum> existing db plus fetch & parse
      out
    Map() is identity
    Reduce() merges all entries into a single new entry
       overwrite previous db status w/ new from fetch
       sum count of links from parse w/ previous from db
    Out: new crawl db
Algorithm: Invert Links
●   MapReduce: compute inlinks for all urls
    In: <url,ParseData>, containing page outlinks
    Map(srcUrl, ParseData> → <destUrl, Inlinks>
       collect a single-element Inlinks for each outlink
       limit number of outlinks per page
    Inlinks: <srcUrl, anchorText>*
    Reduce() appends inlinks
    Out: <url, Inlinks>, a complete link inversion
Algorithm: Index
●   MapReduce: create Lucene indexes
    In: multiple files, values wrapped in <Class, Object>
       <url, ParseData> from parse, for title, metadata, etc.
       <url, ParseText> from parse, for text
       <url, Inlinks> from invert, for anchors
       <url, CrawlDatum> from fetch, for fetch date
    Map() is identity
    Reduce() create a Lucene Document
       call existing Nutch indexing plugins
    Out: build Lucene index; copy to fs at end
Hadoop MapReduce Extensions
●   Split output to multiple files
    –   saves subsequent i/o, since inputs are smaller
●   Mix input value types
    –   saves MapReduce passes to convert values
●   Async Map
    –   permits multi-threaded Fetcher
●   Partition by Value
    –   facilitates selecting subsets w/ maximum key values
Thanks!




http://lucene.apache.org/hadoop/

Más contenido relacionado

La actualidad más candente

Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAJISC GECO
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKANChengjen Lee
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelJavaDayUA
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2PoguttuezhiniVP
 
Jsonnet, terraform & packer
Jsonnet, terraform & packerJsonnet, terraform & packer
Jsonnet, terraform & packerDavid Cunningham
 
Save Java memory
Save Java memorySave Java memory
Save Java memoryJavaDayUA
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWJonathan Katz
 
Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012Mack Hardy
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Аліна Шепшелей
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareAltinity Ltd
 
RethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime webRethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime webAlex Ivanov
 
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Altinity Ltd
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
 

La actualidad más candente (20)

Getting started with PostGIS geographic database
Getting started with PostGIS geographic databaseGetting started with PostGIS geographic database
Getting started with PostGIS geographic database
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
 
Learning postgresql
Learning postgresqlLearning postgresql
Learning postgresql
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next level
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
Jsonnet, terraform & packer
Jsonnet, terraform & packerJsonnet, terraform & packer
Jsonnet, terraform & packer
 
Save Java memory
Save Java memorySave Java memory
Save Java memory
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDW
 
Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
 
RethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime webRethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime web
 
PgconfSV compression
PgconfSV compressionPgconfSV compression
PgconfSV compression
 
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
 
Postgresql Federation
Postgresql FederationPostgresql Federation
Postgresql Federation
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 

Destacado

Dean Keynote Ladis2009
Dean Keynote Ladis2009Dean Keynote Ladis2009
Dean Keynote Ladis2009longhao
 
Vim get start_1.0
Vim get start_1.0Vim get start_1.0
Vim get start_1.0longhao
 
产品部内部交流平台
产品部内部交流平台产品部内部交流平台
产品部内部交流平台longhao
 
高性能网站最佳实践
高性能网站最佳实践高性能网站最佳实践
高性能网站最佳实践longhao
 
Netputer
NetputerNetputer
Netputerlonghao
 
透明计算与云计算
透明计算与云计算透明计算与云计算
透明计算与云计算longhao
 
手机上的硬件设备及典型App
手机上的硬件设备及典型App手机上的硬件设备及典型App
手机上的硬件设备及典型Applonghao
 
Node.js长连接开发实践
Node.js长连接开发实践Node.js长连接开发实践
Node.js长连接开发实践longhao
 
软件重构
软件重构软件重构
软件重构longhao
 
借助社会化媒体的个人成长
借助社会化媒体的个人成长借助社会化媒体的个人成长
借助社会化媒体的个人成长longhao
 
Java并发编程培训
Java并发编程培训Java并发编程培训
Java并发编程培训longhao
 
无Ued产品的易用性讨论
无Ued产品的易用性讨论无Ued产品的易用性讨论
无Ued产品的易用性讨论longhao
 
并发编程实践
并发编程实践并发编程实践
并发编程实践longhao
 
Fashionary by WGSN
Fashionary by WGSNFashionary by WGSN
Fashionary by WGSNPenter
 
What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...Sameer Mathur
 
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...Idabagus Mahartana
 

Destacado (20)

Dean Keynote Ladis2009
Dean Keynote Ladis2009Dean Keynote Ladis2009
Dean Keynote Ladis2009
 
Vim get start_1.0
Vim get start_1.0Vim get start_1.0
Vim get start_1.0
 
产品部内部交流平台
产品部内部交流平台产品部内部交流平台
产品部内部交流平台
 
高性能网站最佳实践
高性能网站最佳实践高性能网站最佳实践
高性能网站最佳实践
 
Netputer
NetputerNetputer
Netputer
 
透明计算与云计算
透明计算与云计算透明计算与云计算
透明计算与云计算
 
手机上的硬件设备及典型App
手机上的硬件设备及典型App手机上的硬件设备及典型App
手机上的硬件设备及典型App
 
Node.js长连接开发实践
Node.js长连接开发实践Node.js长连接开发实践
Node.js长连接开发实践
 
软件重构
软件重构软件重构
软件重构
 
借助社会化媒体的个人成长
借助社会化媒体的个人成长借助社会化媒体的个人成长
借助社会化媒体的个人成长
 
Java并发编程培训
Java并发编程培训Java并发编程培训
Java并发编程培训
 
无Ued产品的易用性讨论
无Ued产品的易用性讨论无Ued产品的易用性讨论
无Ued产品的易用性讨论
 
并发编程实践
并发编程实践并发编程实践
并发编程实践
 
Cw concept Cocoa
Cw concept CocoaCw concept Cocoa
Cw concept Cocoa
 
Tiempo del verbo 3
Tiempo del verbo 3Tiempo del verbo 3
Tiempo del verbo 3
 
Los tic
Los ticLos tic
Los tic
 
Exeter Presentation November_2016_Web
Exeter Presentation November_2016_WebExeter Presentation November_2016_Web
Exeter Presentation November_2016_Web
 
Fashionary by WGSN
Fashionary by WGSNFashionary by WGSN
Fashionary by WGSN
 
What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...
 
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
 

Similar a hadoop

Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationNorberto Leite
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop IntroductionSNEHAL MASNE
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014MongoDB
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Introduction To MongoDB
Introduction To MongoDBIntroduction To MongoDB
Introduction To MongoDBElieHannouch
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 

Similar a hadoop (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop Presentation
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Introduction To MongoDB
Introduction To MongoDBIntroduction To MongoDB
Introduction To MongoDB
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 

Último

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Último (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

hadoop

  • 1. Scalable Computing with Hadoop Doug Cutting cutting@apache.org dcutting@yahoo-inc.com 5/4/06
  • 2. Seek versus Transfer ● B-Tree – requires seek per access – unless to recent, cached page – so can buffer & pre-sort accesses – but, w/ fragmentation, must still seek per page G O U A B C D E F H I J K L M ...
  • 3. Seek versus Transfer ● update by merging – merge sort takes log(updates), at transfer rate – merging updates is linear in db size, at transfer rate ● if 10MB/s xfer, 10ms seek, 1% update of TB db – 100b entries, 10kb pages, 10B entries, 1B pages – seek per update requires 1000 days! – seek per page requires 100 days! – transfer entire db takes 1 day
  • 4. Hadoop DFS ● modelled after Google's GFS ● single namenode – maps name → <blockId>* – maps blockId → <host:port>replication_level ● many datanodes, one per disk generally – map blockId → <byte>* – poll namenode for replication, deletion, etc. requests ● client code talks to both
  • 5. Hadoop MapReduce ● Platform for reliable, scalable computing. ● All data is sequences of <key,value> pairs. ● Programmer specifies two primary methods: – map(k, v) → <k', v'>* – reduce(k', <v'>*) → <k', v'>* – also partition(), compare(), & others ● All v' with same k' are reduced together, in order. – bonus: built-in support for sort/merge!
  • 6. MapReduce job processing input map tasks reduce tasks output split 0 map() split 1 map() reduce() part 0 split 2 map() reduce() part 1 split 3 map() reduce() part 2 split 4 map()
  • 7. Example: RegexMapper public class RegexMapper implements Mapper { private Pattern pattern; private int group; public void configure(JobConf job) { pattern = Pattern.compile(job.get("mapred.mapper.regex")); group = job.getInt("mapred.mapper.regex.group", 0); } public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String text = ((UTF8)value).toString(); Matcher matcher = pattern.matcher(text); while (matcher.find()) { output.collect(new UTF8(matcher.group(group)), new LongWritable(1)); } } }
  • 8. Example: LongSumReducer public class LongSumReducer implements Reducer { public void configure(JobConf job) {} public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { long sum = 0; while (values.hasNext()) { sum += ((LongWritable)values.next()).get(); } output.collect(key, new LongWritable(sum)); } }
  • 9. Example: main() public static void main(String[] args) throws IOException { NutchConf defaults = NutchConf.get(); JobConf job = new JobConf(defaults); job.setInputDir(new File(args[0])); job.setMapperClass(RegexMapper.class); job.set("mapred.mapper.regex", args[2]); job.set("mapred.mapper.regex.group", args[3]); job.setReducerClass(LongSumReducer.class); job.setOutputDir(args[1]); job.setOutputKeyClass(UTF8.class); job.setOutputValueClass(LongWritable.class); JobClient.runJob(job); }
  • 10. Nutch Algorithms ● inject urls into a crawl db, to bootstrap it. ● loop: – generate a set of urls to fetch from crawl db; – fetch a set of urls into a segment; – parse fetched content of a segment; – update crawl db with data parsed from a segment. ● invert links parsed from segments ● index segment text & inlink anchor text
  • 11. Nutch on MapReduce & NDFS ● Nutch's major algorithms converted in 2 weeks. ● Before: – several were undistributed scalabilty bottlenecks – distributable algorithms were complex to manage – collections larger than 100M pages impractical ● After: – all are scalable, distributed, easy to operate – code is substantially smaller & simpler – should permit multi-billion page collections
  • 12. Data Structure: Crawl DB ● CrawlDb is a directory of files containing: <URL, CrawlDatum> ● CrawlDatum: <status, date, interval, failures, linkCount, ...> ● Status: {db_unfetched, db_fetched, db_gone, linked, fetch_success, fetch_fail, fetch_gone}
  • 13. Algorithm: Inject ● MapReduce1: Convert input to DB format In: flat text file of urls Map(line) → <url, CrawlDatum>; status=db_unfetched Reduce() is identity; Output: directory of temporary files ● MapReduce2: Merge into existing DB Input: output of Step1 and existing DB files Map() is identity. Reduce: merge CrawlDatum's into single entry Out: new version of DB
  • 14. Algorithm: Generate ● MapReduce1: select urls due for fetch In: Crawl DB files Map() → if date≥now, invert to <CrawlDatum, url> Partition by value hash (!) to randomize Reduce: compare() order by decreasing CrawlDatum.linkCount output only top-N most-linked entries ● MapReduce2: prepare for fetch Map() is invert; Partition() by host, Reduce() is identity. Out: Set of <url,CrawlDatum> files to fetch in parallel
  • 15. Algorithm: Fetch ● MapReduce: fetch a set of urls In: <url,CrawlDatum>, partition by host, sort by hash Map(url,CrawlDatum) → <url, FetcherOutput> multi-threaded, async map implementation calls existing Nutch protocol plugins FetcherOutput: <CrawlDatum, Content> Reduce is identity Out: two files: <url,CrawlDatum>, <url,Content>
  • 16. Algorithm: Parse ● MapReduce: parse content In: <url, Content> files from Fetch Map(url, Content) → <url, Parse> calls existing Nutch parser plugins Reduce is identity. Parse: <ParseText, ParseData> Out: split in three: <url,ParseText>, <url,ParseData> and <url,CrawlDatum> for outlinks.
  • 17. Algorithm: Update Crawl DB ● MapReduce: integrate fetch & parse out into db In: <url,CrawlDatum> existing db plus fetch & parse out Map() is identity Reduce() merges all entries into a single new entry overwrite previous db status w/ new from fetch sum count of links from parse w/ previous from db Out: new crawl db
  • 18. Algorithm: Invert Links ● MapReduce: compute inlinks for all urls In: <url,ParseData>, containing page outlinks Map(srcUrl, ParseData> → <destUrl, Inlinks> collect a single-element Inlinks for each outlink limit number of outlinks per page Inlinks: <srcUrl, anchorText>* Reduce() appends inlinks Out: <url, Inlinks>, a complete link inversion
  • 19. Algorithm: Index ● MapReduce: create Lucene indexes In: multiple files, values wrapped in <Class, Object> <url, ParseData> from parse, for title, metadata, etc. <url, ParseText> from parse, for text <url, Inlinks> from invert, for anchors <url, CrawlDatum> from fetch, for fetch date Map() is identity Reduce() create a Lucene Document call existing Nutch indexing plugins Out: build Lucene index; copy to fs at end
  • 20. Hadoop MapReduce Extensions ● Split output to multiple files – saves subsequent i/o, since inputs are smaller ● Mix input value types – saves MapReduce passes to convert values ● Async Map – permits multi-threaded Fetcher ● Partition by Value – facilitates selecting subsets w/ maximum key values