Sort of Vinyl: Ordered Record Collection
Chris Douglas, 01.18.2010
Obligatory MapReduce Flow Slide
HDFS (hdfs://host:8020/input/data) → Split 0 / Split 1 / Split 2 → Map 0 / Map 1 / Map 2 → Combine* → Reduce → HDFS (hdfs://host:8020/output/data)
This talk focuses on one stage of that flow: map output collection, i.e. what happens between map() and the sorted, partitioned output each reduce fetches.
Overview
- Lucene: Hadoop (∞, 0.10)
- HADOOP-331: Hadoop [0.10, 0.17)
- HADOOP-2919: Hadoop [0.17, 0.22]
(the Cretaceous, Jurassic, and Triassic of map output collection)
Awesome!
Problem Description
map(K1,V1) calls collect(K2,V2) zero or more times. For each collected record:
- partition(key0, val0) returns an int, p0
- Serialization: K2.write(DataOutput) emits the bytes of key0 through one or more write(byte[], int, int) calls (a byte[])
- V2.write(DataOutput) emits the bytes of val0 the same way (a byte[])
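The collect path can be modeled outside Hadoop. The sketch below is a minimal, hypothetical stand-in (SimpleRecordCollector is not a Hadoop class); it uses the same hash-partitioning scheme as Hadoop's default HashPartitioner and Writable-style serialization that bottoms out in raw byte writes:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal model of the collect(K2,V2) path: partition the record,
// then serialize key and value as raw bytes. Hypothetical class name.
class SimpleRecordCollector {
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buf);
    private final int numPartitions;

    SimpleRecordCollector(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Same scheme as Hadoop's default HashPartitioner.
    int partition(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Stand-in for K2.write(DataOutput) / V2.write(DataOutput):
    // everything resolves to byte writes on a stream.
    int collect(String key, String val) {
        int p = partition(key);
        try {
            out.writeUTF(key);  // serialized key bytes
            out.writeUTF(val);  // serialized value bytes
        } catch (IOException e) {
            throw new RuntimeException(e); // BAOS never actually throws
        }
        return p;
    }

    int bytesCollected() {
        return buf.size();
    }
}
```

Note that once collect returns, all the framework holds is the partition number and opaque serialized bytes, which is exactly what makes the sort interesting.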
Problem Description
For all calls to collect(K2 keyn, V2 valn):
- Ordered set of write(byte[], int, int) for keyn
- Ordered set of write(byte[], int, int) for valn
Challenges:
- Records must be grouped for efficient fetch from reduce
- Sort occurs after the records are serialized
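The second challenge, sorting records that are already serialized, is usually met by permuting an index of offsets rather than moving the bytes. A minimal illustration (not Hadoop code; it assumes keys that compare correctly as raw unsigned bytes):

```java
import java.util.Arrays;

// Sort serialized records by permuting an index of (offset, length)
// pairs over the raw buffer, never deserializing a key.
class RawKeySort {
    // Compare two byte ranges lexicographically (unsigned).
    static int compareBytes(byte[] b, int off1, int len1, int off2, int len2) {
        int n = Math.min(len1, len2);
        for (int i = 0; i < n; i++) {
            int d = (b[off1 + i] & 0xff) - (b[off2 + i] & 0xff);
            if (d != 0) return d;
        }
        return len1 - len2;
    }

    // offsets[i]/lengths[i] locate the i-th serialized key in buf.
    // Returns the sorted permutation of record indices.
    static Integer[] sortIndices(byte[] buf, int[] offsets, int[] lengths) {
        Integer[] perm = new Integer[offsets.length];
        for (int i = 0; i < perm.length; i++) perm[i] = i;
        Arrays.sort(perm, (a, c) ->
            compareBytes(buf, offsets[a], lengths[a], offsets[c], lengths[c]));
        return perm;
    }
}
```

This is the shape all three eras of the collection framework converge on: record bytes stay put, and only small per-record indices move during the sort.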
Hadoop (∞, 0.10)
collect(K2,V2): p0 = partition(key0,val0), then each record is appended to a per-partition writer: SequenceFile::Writer[p0].append(key0, val0), i.e. key0.write(localFS) and val0.write(localFS).
(Not necessarily a direct write to localFS: SequenceFile may buffer a configurable amount of data to effect block compression, stream buffering, etc.)
Hadoop (∞, 0.10)
With a combiner, collect(K2,V2) instead clones each record (clone(key0, val0)) into an in-memory structure keyed by key0, key1, key2, … On flush(), the combiner runs as reduce(keyn, val*) and its output is appended: SequenceFile::Writer[p0].append(keyn', valn').
Note that the combiner may change the partition and ordering of input records. This is no longer supported.
Hadoop (∞, 0.10)
Each reduce (Reduce 0 … Reduce k) fetches its partition files from every TaskTracker, then performs the sort/merge on its own localFS.
Hadoop (∞, 0.10)
Pro:
- Very versatile Combiner semantics (change sort order, partition)
Con:
- Job cleanup is expensive (e.g. a 7k-reducer job must delete 7k files per map on that TaskTracker)
- Combiner is expensive to use and its memory usage is difficult to track
- OOMExceptions from untracked memory in buffers, particularly when using compression (HADOOP-570)
Hadoop [0.10, 0.17)
collect(K2,V2): p0 = partition(key0,val0); K2.write(DataOutput) and V2.write(DataOutput) append the serialized record to a shared keyValBuffer, and BufferSorter[p0].addKeyValue(recOff, keylen, vallen) keeps the offset into the buffer and the lengths of the key and value.
The memory used by all BufferSorter implementations and the keyValBuffer is tracked. If the spill threshold is exceeded, sortAndSpillToDisk() spills the contents to disk:
- The sort permutes the offsets into (offset, keylen, vallen) order. Once ordered, each record is deserialized (K2.readFields(DataInput), V2.readFields(DataInput)) and appended to a SequenceFile (SequenceFile::append(K2,V2)), and the partition offsets (0 … k) are recorded.
- If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition.
Hadoop [0.10, 0.17)
mergeParts() merges the spill files, each holding partitions 0 … k, into a single output file with partitions 0 … k.
Hadoop [0.10, 0.17)
Each reduce (Reduce 0 … Reduce k) fetches its partition (0 … k) of the merged file from the TaskTracker.
Hadoop [0.10, 0.17)
Pro:
- Much more predictable memory footprint
- Shared, in-memory buffer across all partitions w/ efficient sort
- Combines over each spill, defined by memory usage, instead of record count
- Running the combiner doesn't require storing a clone of each record (fewer serializations)
- In 0.16, spill was made concurrent with collection (HADOOP-1965)
Con:
- MergeSort copies indices on each level of recursion
- Deserializing the key/value before appending to the SequenceFile is avoidable
- Combiner weakened by requiring sort order and partition to remain consistent
- Though tracked, BufferSorter instances take non-negligible space (HADOOP-1698)
Hadoop [0.17, 0.22)
collect(K2,V2): p0 = partition(key0,val0); KS.serialize(K2) and VS.serialize(V2) write into a single serialization buffer of io.sort.mb megabytes.
Instead of explicitly tracking the space used by record metadata, a configurable fraction of that space (io.sort.mb * io.sort.record.percent) is allocated for metadata at the beginning of the task.
The serialization buffer (kvbuffer) is tracked by the indices bufstart, bufend, bufindex, and bufmark; record metadata by kvstart, kvend, and kvindex over two arrays, kvoffsets and kvindices. The partition is no longer implicitly tracked: (partition, keystart, valstart) is stored for every record collected.
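That metadata layout can be made concrete with a small model: three ints per record in a flat kvindices array, and a kvoffsets-style permutation sorted by partition, then by raw key bytes. This is a sketch under those assumptions, not the MapTask code:

```java
import java.util.Arrays;

// Model of the [0.17, 0.22) metadata layout: kvindices holds
// (partition, keystart, valstart) per record; the returned permutation
// plays the role of kvoffsets, sorted by partition then raw key bytes.
class KvIndexModel {
    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, NMETA = 3;

    // Keys are assumed to run from KEYSTART to VALSTART in kvbuffer.
    static Integer[] sortOffsets(byte[] kvbuffer, int[] kvindices, int nrecords) {
        Integer[] kvoffsets = new Integer[nrecords];
        for (int i = 0; i < nrecords; i++) kvoffsets[i] = i;
        Arrays.sort(kvoffsets, (a, b) -> {
            int pa = kvindices[a * NMETA + PARTITION];
            int pb = kvindices[b * NMETA + PARTITION];
            if (pa != pb) return pa - pb;        // group by partition first
            int ka = kvindices[a * NMETA + KEYSTART];
            int la = kvindices[a * NMETA + VALSTART] - ka;
            int kb = kvindices[b * NMETA + KEYSTART];
            int lb = kvindices[b * NMETA + VALSTART] - kb;
            int n = Math.min(la, lb);
            for (int i = 0; i < n; i++) {        // then by raw key bytes
                int d = (kvbuffer[ka + i] & 0xff) - (kvbuffer[kb + i] & 0xff);
                if (d != 0) return d;
            }
            return la - lb;
        });
        return kvoffsets;
    }
}
```

Sorting by (partition, key) in one pass is what lets a spill emit every partition's records in order without per-partition sorters.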
Hadoop [0.17, 0.22)
During collect, the indices advance: the key and value are serialized at bufindex and bufmark is set at the end of the completed record, while (p0, keystart, valstart) is recorded at kvindex. When collected data passes the io.sort.spill.percent threshold, a spill begins: kvstart/kvend and bufstart/bufend bracket the records being written to disk while collection continues at kvindex and bufindex, both buffers being treated as circular.
Hadoop [0.17, 0.22)
The RawComparator interface requires that the key be contiguous in the byte[]. A key that would wrap around the end of the circular serialization buffer is therefore re-written at the front of the buffer, and the invalid segment left at the end is marked by bufvoid.
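The bufvoid mechanism can be illustrated with a simplified buffer (hypothetical names, not the MapOutputBuffer implementation; it assumes the front of the buffer has already been reclaimed by a finished spill before a key is re-written there):

```java
// Simplified model of keeping keys contiguous in a circular buffer:
// if a key would wrap past capacity, mark the tail invalid (bufvoid)
// and restart the key at offset 0, since RawComparator needs the key
// as one contiguous byte range.
class VoidingBuffer {
    final byte[] buf;
    int bufindex = 0;   // next write position
    int bufvoid;        // end of valid data (== capacity unless a key wrapped)

    VoidingBuffer(int capacity) {
        buf = new byte[capacity];
        bufvoid = capacity;
    }

    // Returns the offset at which the key was stored contiguously.
    // Assumes [0, key.length) has been reclaimed if a wrap occurs.
    int writeKey(byte[] key) {
        if (bufindex + key.length > buf.length) {
            bufvoid = bufindex;  // tail [bufindex, capacity) is now invalid
            bufindex = 0;        // re-write the key at the front
        }
        int keystart = bufindex;
        System.arraycopy(key, 0, buf, bufindex, key.length);
        bufindex += key.length;
        return keystart;
    }
}
```

The cost of a wrap is one extra copy of a single key; in exchange, the comparator never has to stitch a key back together across the buffer boundary.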
Hadoop [0.17, 0.22)
Pro:
- No resizing of buffers, copying of serialized record data or metadata

 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 

Último

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Ordered Record Collection

  • 1. Sort of Vinyl: Ordered Record Collection Chris Douglas 01.18.2010
  • 2. Obligatory MapReduce Flow Slide Split 2 Map 2 Combine* Reduce 1 Split 1 Map 1 hdfs://host:8020/input/data hdfs://host:8020/output/data HDFS HDFS Combine* Reduce 1 Split 0 Map 0 Combine*
  • 3. Obligatory MapReduce Flow Slide Map Output Collection Split 2 Map 2 Combine* Reduce 1 Split 1 Map 1 hdfs://host:8020/input/data hdfs://host:8020/output/data HDFS HDFS Combine* Reduce 1 Split 0 Map 0 Combine*
  • 4. Overview Hadoop (∞, 0.10) Hadoop [ 0.10, 0.17) Hadoop [0.17, 0.22] Lucene HADOOP-331 HADOOP-2919
  • 5. Overview Hadoop (∞, 0.10) Hadoop [ 0.10, 0.17) Hadoop [0.17, 0.22] Lucene HADOOP-331 HADOOP-2919 Cretaceous Jurassic Triassic
  • 8. Problem Description p0  partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int)
  • 9. Problem Description p0  partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int)
  • 10. Problem Description p0  partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int)
  • 11. Problem Description p0 partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int) key0
  • 12. Problem Description p0 partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int) key0
  • 13. Problem Description p0 partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int) key0 val0
  • 14. Problem Description p0 partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int) key0 val0
  • 15. Problem Description int p0 partition(key0,val0) map(K1,V1) * Serialization collect(K2,V2) * K2.write(DataOutput) write(byte[], int, int) * V2.write(DataOutput) write(byte[], int, int) key0 val0 byte[] byte[]
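The collect path on these slides can be sketched without Hadoop itself: a hash-style partitioner picks p from the key, then key and value are serialized through a DataOutput so the framework only ever handles byte ranges. Class and helper names below (CollectSketch, partition) are illustrative, not Hadoop's API.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal sketch of the collect path: partition from the key, then
// Writable-style serialization into a growing byte buffer.
public class CollectSketch {
    // HashPartitioner-style: mask off the sign bit, then mod by #reduces
    static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        String key = "key0", val = "val0";
        int p = partition(key, 4);
        out.writeUTF(key);            // stand-in for K2.write(DataOutput)
        int keyEnd = buf.size();      // key bytes occupy [0, keyEnd)
        out.writeUTF(val);            // stand-in for V2.write(DataOutput)
        System.out.println(p + " " + keyEnd + " " + buf.size());
    }
}
```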
  • 16.
  • 17. Ordered set of write(byte[], int, int) for keyn
  • 18.
  • 19. Records must be grouped for efficient fetch from reduce
  • 20.
  • 21. Hadoop (∞, 0.10) p0 partition(key0,val0) map(K1,V1) * collect(K2,V2) collect(K2,V2) SequenceFile::Writer[p0].append(key0, val0) … …
  • 22. Hadoop (∞, 0.10) p0 partition(key0,val0) map(K1,V1) * collect(K2,V2) collect(K2,V2) key0.write(localFS) SequenceFile::Writer[p0].append(key0, val0) val0.write(localFS) … …
  • 23. Hadoop (∞, 0.10) p0 partition(key0,val0) map(K1,V1) * collect(K2,V2) collect(K2,V2) key0.write(localFS) SequenceFile::Writer[p0].append(key0, val0) val0.write(localFS) … …
  • 24. Hadoop (∞, 0.10) Not necessarily true. SeqFile may buffer a configurable amount of data to effect block compression, stream buffering, etc. p0 partition(key0,val0) map(K1,V1) * collect(K2,V2) collect(K2,V2) key0.write(localFS) SequenceFile::Writer[p0].append(key0, val0) val0.write(localFS) … …
  • 25. Hadoop (∞, 0.10) key0 key1 clone(key0, val0) map(K1,V1) key2 * flush() collect(K2,V2) collect(K2,V2) reduce(keyn, val*) SequenceFile::Writer[p0].append(keyn’, valn’) … p0 partition(key0,val0) …
  • 26. Hadoop (∞, 0.10) key0 key1 clone(key0, val0) map(K1,V1) key2 * flush() collect(K2,V2) collect(K2,V2) reduce(keyn, val*) SequenceFile::Writer[p0].append(keyn’, valn’) … p0 partition(key0,val0) …
  • 27. Hadoop (∞, 0.10) key0 key1 clone(key0, val0) map(K1,V1) key2 * flush() collect(K2,V2) collect(K2,V2) reduce(keyn, val*) SequenceFile::Writer[p0].append(keyn’, valn’) … p0 partition(key0,val0) … Combiner may change the partition and ordering of input records. This is no longer supported
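The clone/flush era shown in slides 25-27 can be modeled roughly as follows. This is a sketch with illustrative names, not the actual pre-0.10 code: records are cloned into a sorted in-memory map per partition, and flush() drains each key's values through a combiner-like step (here simply a sum) before appending key'/val' pairs to that partition's SequenceFile, modeled as a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough model of the pre-0.10 collector: clone records into sorted
// per-partition maps, combine on flush, append to per-partition output.
public class OldCollector {
    final int numParts;
    final List<TreeMap<String, List<Integer>>> parts = new ArrayList<>();

    OldCollector(int numParts) {
        this.numParts = numParts;
        for (int i = 0; i < numParts; i++) parts.add(new TreeMap<>());
    }

    void collect(String key, int val) {   // clone(key0, val0) into the map
        int p = (key.hashCode() & Integer.MAX_VALUE) % numParts;
        parts.get(p).computeIfAbsent(key, k -> new ArrayList<>()).add(val);
    }

    List<String> flush(int p) {           // reduce(keyn, val*) then append
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : parts.get(p).entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.add(e.getKey() + "=" + sum);
        }
        return out;
    }

    public static void main(String[] args) {
        OldCollector c = new OldCollector(2);
        c.collect("b", 1); c.collect("a", 2); c.collect("b", 3);
        for (int p = 0; p < 2; p++) System.out.println(p + ": " + c.flush(p));
    }
}
```

Cloning every record is what made the combiner's memory footprint hard to track, as the later slides note.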
  • 28. Hadoop (∞, 0.10) Reduce k Reduce 0 … TaskTracker …
  • 29. Hadoop (∞, 0.10) Reduce k Reduce 0 … TaskTracker …
  • 30. Hadoop (∞, 0.10) Reduce 0 sort/merge  localFS …
  • 31.
  • 32.
  • 33. Job cleanup is expensive (e.g. 7k reducer job must delete 7k files per map on that TT)
  • 34. Combiner is expensive to use and its memory usage is difficult to track
  • 35.
  • 36. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk()
  • 37. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk()
  • 38. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk()
  • 39. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk()
  • 40. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk()
  • 41. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) Add memory used by all BufferSorter implementations and keyValBuffer. If spill threshold exceeded, then spill contents to disk * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk() Keep offset into buffer, length of key, value.
  • 42. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … * 0 1 k-1 k sortAndSpillToDisk() *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded 0
  • 43. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … * 0 1 k-1 k sortAndSpillToDisk() *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded 0 K2.readFields(DataInput) V2.readFields(DataInput) SequenceFile::append(K2,V2)
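The offset-permuting sort described in slide 42 can be sketched as below: serialized records stay put in one shared byte[], and only the (offset, keylen, vallen) metadata triples are reordered, by comparing raw key bytes. Names here are illustrative; the byte comparison is in the spirit of WritableComparator.compareBytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the [0.10, 0.17) spill sort: permute metadata, never record data.
public class OffsetSort {
    // Byte-wise comparison of two key ranges in (possibly the same) buffers
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int d = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
            if (d != 0) return d;
        }
        return l1 - l2;
    }

    // Sort the metadata triples, then read the keys back in order
    static String sortedKeys(byte[] buf, int[][] meta) {
        Arrays.sort(meta, (a, b) -> compareRaw(buf, a[0], a[1], buf, b[0], b[1]));
        StringBuilder sb = new StringBuilder();
        for (int[] m : meta) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(new String(buf, m[0], m[1], StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] buf = "cherryXappleYbananaZ".getBytes(StandardCharsets.UTF_8);
        // (offset, keylen, vallen) for "cherry"/"X", "apple"/"Y", "banana"/"Z"
        int[][] meta = { {0, 6, 1}, {7, 5, 1}, {13, 6, 1} };
        System.out.println(sortedKeys(buf, meta));  // keys emerge in sorted order
    }
}
```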
  • 44. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk() *If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition. 0 K2.readFields(DataInput) V2.readFields(DataInput) * << Combiner >> SequenceFile::append(K2,V2)
  • 45. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … * 0 1 k-1 k sortAndSpillToDisk() 0 1
  • 46. Hadoop [0.10, 0.17) map(K1,V1) p0 partition(key0,val0) * collect(K2,V2) K2.write(DataOutput) V2.write(DataOutput) BufferSorter[p0].addKeyValue(recOff, keylen, vallen) … 0 1 k-1 k sortAndSpillToDisk() 0 1 … … k
  • 47. Hadoop [0.10, 0.17) mergeParts() 0 0 0 1 1 1 … … … … … … k k k
  • 48. Hadoop [0.10, 0.17) mergeParts() 0 0 0 0 1 1 1 … … … … … … k k k
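The mergeParts() step in slides 47-48 is a k-way merge: the sorted runs for one partition, one per spill file, are combined through a priority queue into a single sorted segment. The sketch below models spills as in-memory lists of keys for brevity; the real code merges SequenceFile segments.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of mergeParts(): k sorted spill runs -> one sorted segment.
public class MergeParts {
    static List<String> merge(List<List<String>> spills) {
        // heap entry: {spill id, index within that spill}
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> spills.get(e[0]).get(e[1])));
        for (int s = 0; s < spills.size(); s++)
            if (!spills.get(s).isEmpty()) heap.add(new int[] { s, 0 });
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();               // smallest current key
            merged.add(spills.get(top[0]).get(top[1]));
            if (top[1] + 1 < spills.get(top[0]).size())
                heap.add(new int[] { top[0], top[1] + 1 });
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = List.of(
            List.of("apple", "melon"), List.of("banana"), List.of("cherry", "kiwi"));
        System.out.println(merge(spills));
    }
}
```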
  • 49. Hadoop [0.10, 0.17) Reduce 0 0 1 … TaskTracker … … k Reduce k
  • 50. Hadoop [0.10, 0.17) Reduce 0 0 1 … TaskTracker … … k Reduce k
  • 51.
  • 52. Much more predictable memory footprint
  • 53. Shared, in-memory buffer across all partitions w/ efficient sort
  • 54. Combines over each spill, defined by memory usage, instead of record count
  • 55. Running the combiner doesn’t require storing a clone of each record (fewer serializations)
  • 56.
  • 57. MergeSort copies indices on each level of recursion
  • 58. Deserializing the key/value before appending to the SequenceFile is avoidable
  • 59. Combiner weakened by requiring sort order and partition to remain consistent
  • 60.
  • 61. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2)
  • 62. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) io.sort.mb * io.sort.record.percent … io.sort.mb
  • 63. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) Instead of explicitly tracking space used by record metadata, allocate a configurable amount of space at the beginning of the task io.sort.mb * io.sort.record.percent … io.sort.mb
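The a-priori allocation above is simple arithmetic. The sketch assumes the defaults io.sort.mb=100 and io.sort.record.percent=0.05, and 16 bytes of metadata per record (one int in kvoffsets plus three ints in kvindices); with those numbers, a full metadata region caps the record count per spill no matter how small the records are.

```java
// Back-of-envelope split of the [0.17, 0.22) collection buffer.
public class BufferBudget {
    // returns {metadata bytes, record-data bytes, max records trackable}
    static long[] split(long sortBytes, double recordPercent) {
        long metaBytes = (long) (sortBytes * recordPercent);
        return new long[] { metaBytes, sortBytes - metaBytes, metaBytes / 16 };
    }

    public static void main(String[] args) {
        long[] b = split(100L << 20, 0.05);   // io.sort.mb in bytes, record %
        System.out.println(b[0] + " " + b[1] + " " + b[2]);
    }
}
```

This is why io.sort.record.percent mattered so much: too low and small records exhaust metadata early, too high and large records exhaust the data region early, forcing spills either way.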
  • 64. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend bufindex bufmark io.sort.mb * io.sort.record.percent kvstart kvend kvindex io.sort.mb
  • 65. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend bufindex bufmark io.sort.mb * io.sort.record.percent kvstart kvend kvindex io.sort.mb kvoffsets kvindices Partition no longer implicitly tracked. Store (partition, keystart,valstart) for every record collected kvbuffer
  • 66. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend kvstart kvend kvindex bufindex bufmark
  • 67. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend kvstart kvend kvindex bufmark bufindex
  • 68. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend kvstart kvend kvindex bufmark bufindex
  • 69. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend kvstart kvend kvindex p0 bufmark bufindex
  • 70. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufend kvstart kvend kvindex io.sort.spill.percent bufindex bufmark
  • 71. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart kvstart kvend kvindex bufend bufindex bufmark
  • 72. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart kvstart kvend kvindex bufindex bufmark bufend
  • 73. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufstart bufindex bufmark kvstart kvindex kvend bufend
  • 74. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufindex bufmark kvindex kvstart kvend bufstart bufend
  • 75. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) Invalid segments in the serialization buffer are marked by bufvoid RawComparator interface requires that the key be contiguous in the byte[] bufmark bufvoid bufindex kvindex kvstart kvend bufstart bufend
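The bufvoid mechanism on slide 75 can be shown with a simplified sketch: if a key's serialization would run past the end of the circular buffer, the valid region is truncated at bufvoid and the key is placed at the front instead, so the RawComparator always sees the key as one contiguous byte range. Names mirror the slide, but this is a toy; the real code shifts the already-written wrapped bytes rather than rewriting the key.

```java
import java.nio.charset.StandardCharsets;

// Toy model of keeping keys contiguous across the buffer boundary.
public class KeyWrap {
    // returns {bufvoid, keyStart} after placing the key contiguously
    static int[] place(byte[] buf, int bufindex, byte[] key) {
        int bufvoid = buf.length;
        if (bufindex + key.length > buf.length) {
            bufvoid = bufindex;      // bytes at and past bufindex are invalid
            bufindex = 0;            // restart the key at the front
        }
        System.arraycopy(key, 0, buf, bufindex, key.length);
        return new int[] { bufvoid, bufindex };
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        byte[] key = "key99!".getBytes(StandardCharsets.UTF_8); // 6 bytes, 4 left
        int[] r = place(buf, 12, key);
        System.out.println(r[0] + " " + new String(buf, r[1], key.length,
            StandardCharsets.UTF_8));
    }
}
```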
  • 76. Hadoop [0.17, 0.22) map(K1,V1) p0  partition(key0,val0) * Serialization KS.serialize(K2) collect(K2,V2) VS.serialize(V2) bufvoid bufmark bufindex kvindex kvstart kvend bufstart bufend
  • 77.
  • 78. No resizing of buffers, copying of serialized record data or metadata
  • 79. Uses SequenceFile::appendRaw to avoid deserialization/serialization pass
  • 80.
  • 81. Caching of spill indices (HADOOP-3638)
  • 82. Run combiner during the merge (HADOOP-3226)
  • 83.
  • 85. io.sort.record.percent is obscure, critical to performance, and awkward
  • 86. While predictable, memory usage is arguably too restricted
  • 87.
  • 88. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufindex bufmark
  • 89. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufmark bufindex
  • 90. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufmark bufindex
  • 91. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufmark bufindex
  • 92. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufmark bufindex p0 kvoffsets and kvindices information interlaced into metadata blocks. The sort is effected in a manner identical to 0.17, but metadata is allocated per-record, rather than a priori (kvoffsets) (kvindices)
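The interlaced layout on slide 92 amounts to two cursors diverging from the equator in one byte[]: serialized data grows forward while fixed-size (16-byte) metadata records grow backward. The sketch below is a toy under that assumption; the real MapOutputBuffer wraps both regions circularly and packs (partition, keystart, valstart, vallen) into each metadata slot.

```java
// Toy model of the 0.22 single-buffer layout with per-record metadata.
public class EquatorSketch {
    // returns {data bytes used, metadata bytes used} after nrec records
    static int[] fill(byte[] buf, int equator, int nrec) {
        int bufindex = equator;          // data cursor, moves forward
        int kvindex = equator;           // metadata cursor, moves backward
        for (int rec = 0; rec < nrec; rec++) {
            buf[bufindex++] = (byte) ('a' + rec);  // one byte of "data"
            kvindex -= 16;                          // reserve a metadata slot
            buf[kvindex] = (byte) rec;              // e.g. the partition number
        }
        return new int[] { bufindex - equator, equator - kvindex };
    }

    public static void main(String[] args) {
        int[] used = fill(new byte[128], 64, 3);
        System.out.println(used[0] + " " + used[1]);
    }
}
```

Allocating metadata per record this way removes the a-priori io.sort.record.percent split: neither region can starve while the other sits empty.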
  • 93. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufindex bufmark
  • 94. Hadoop [0.22] bufstart kvstart kvend bufend kvindex equator bufindex bufmark
  • 95. Hadoop [0.22] bufstart kvstart kvend bufend bufindex bufmark kvindex equator
  • 96. Hadoop [0.22] kvstart kvend bufstart bufend bufindex bufmark kvindex equator
  • 97. Hadoop [0.22] bufindex bufmark kvindex equator bufstart bufend kvstart kvend
  • 98. Hadoop [0.22] bufstart bufend equator kvstart kvend kvindex bufindex bufmark
  • 99. Hadoop [0.22] bufstart kvstart kvend kvindex bufindex bufmark bufend equator
  • 100. Hadoop [0.22] bufindex kvindex kvstart kvend bufmark bufstart bufend equator

Editor's Notes

  1. Every presenter must include a slide like this one, and protocol demands that it contain no fewer than 5 inaccuracies