27. Hadoop (∞, 0.10): map(K1,V1) output is collected (collect(K2,V2)) into an in-memory buffer, cloning each key and value (clone(key0, val0)). On flush(), the combiner (reduce(keyn, val*)) runs over each key's values and the results are written out via SequenceFile::Writer[p0].append(keyn’, valn’), with partition(key0, val0) selecting the writer. The combiner may change the partition and ordering of input records; this is no longer supported.
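The flush path above can be sketched as follows. All names are illustrative stand-ins (per-partition lists replace the real SequenceFile::Writer array, and the combiner is modeled as a sum); the point is that the combiner runs before the partitioned append, so a combiner that rewrote the key could move a record to a different partition:

```java
import java.util.*;

// Illustrative sketch of the pre-0.10 collect/flush path. The
// per-partition lists stand in for SequenceFile::Writer[p]; the
// combiner ("reduce") is modeled as summing each key's values.
public class Pre010Flush {
    static List<List<String>> flush() {
        int numPartitions = 2;
        // collect(K2,V2): keys and values are cloned into an in-memory map
        SortedMap<String, List<Integer>> buffer = new TreeMap<>();
        buffer.computeIfAbsent("b", k -> new ArrayList<>()).add(1);
        buffer.computeIfAbsent("a", k -> new ArrayList<>()).add(2);
        buffer.computeIfAbsent("a", k -> new ArrayList<>()).add(3);
        // flush(): run the combiner over each key, then append the result
        // to the writer of the partition the *combined* key hashes to
        List<List<String>> writers = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) writers.add(new ArrayList<>());
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;          // reduce(keyn, val*)
            int p = (e.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
            writers.get(p).add(e.getKey() + "=" + sum);   // Writer[p].append
        }
        return writers;
    }

    public static void main(String[] args) {
        System.out.println(flush());
    }
}
```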
41. Hadoop [0.10, 0.17): In collect(K2,V2), the key and value serialize themselves into keyValBuffer (K2.write(DataOutput), V2.write(DataOutput)), while BufferSorter[p0].addKeyValue(recOff, keylen, vallen) keeps only the offset into the buffer and the lengths of the key and value. The memory used by all BufferSorter implementations and by keyValBuffer is added up; if the spill threshold is exceeded, the contents are spilled to disk with sortAndSpillToDisk().
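A hedged sketch of that accounting, assuming three 4-byte integers of metadata per record (the real layout differed); the names echo the slide but are not the actual Hadoop classes:

```java
import java.util.Arrays;

// Sketch of the [0.10, 0.17) accounting: the task sums the memory held
// by keyValBuffer (serialized records) and by each partition's
// BufferSorter metadata, spilling when the total crosses a threshold.
// Names and sizes are illustrative, not the actual Hadoop types.
public class SpillAccounting {
    final long spillThreshold;
    long keyValBufferBytes;     // serialized key+value bytes
    long[] sorterBytes;         // per-partition metadata bytes
    int spills;

    SpillAccounting(int partitions, long threshold) {
        sorterBytes = new long[partitions];
        spillThreshold = threshold;
    }

    // BufferSorter[p].addKeyValue(recOff, keylen, vallen)
    void addKeyValue(int partition, int keylen, int vallen) {
        keyValBufferBytes += keylen + vallen;
        sorterBytes[partition] += 3 * 4;  // (recOff, keylen, vallen) as ints
        long total = keyValBufferBytes;
        for (long b : sorterBytes) total += b;
        if (total > spillThreshold) sortAndSpillToDisk();
    }

    void sortAndSpillToDisk() {  // stand-in: just reset the counters
        spills++;
        keyValBufferBytes = 0;
        Arrays.fill(sorterBytes, 0);
    }

    public static void main(String[] args) {
        SpillAccounting acct = new SpillAccounting(2, 1 << 20);
        for (int i = 0; i < 2048; i++) acct.addKeyValue(i % 2, 512, 512);
        System.out.println(acct.spills + " spills");
    }
}
```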
42–44. Hadoop [0.10, 0.17), sortAndSpillToDisk(): The sort permutes the (offset, keylen, vallen) entries rather than the serialized records. Once ordered, each record is deserialized from the buffer (K2.readFields(DataInput), V2.readFields(DataInput)) and appended to a SequenceFile (SequenceFile::append(K2,V2)), and the partition offsets are recorded. If defined, the combiner is now run during the spill, separately over each partition; values emitted from the combiner are written directly to the output partition.
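That permutation can be illustrated with a toy indirect sort. The byte layout and record format here are invented for the example; a real spill compares serialized keys with a comparator over the raw bytes and writes SequenceFile records:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Toy illustration of the spill-time indirect sort: serialized records
// stay where collect() wrote them; only the (offset, keylen, vallen)
// metadata entries are permuted. Records are then copied out of the
// buffer in sorted order, as the real spill appends them to a SeqFile.
public class IndirectSort {
    static String sortAndSpill() {
        byte[] buf = "cherry1apple2banana3".getBytes(StandardCharsets.UTF_8);
        int[][] meta = { {0, 6, 1}, {7, 5, 1}, {13, 6, 1} }; // (offset, keylen, vallen)
        // The sort permutes metadata, comparing keys in place in the buffer.
        Arrays.sort(meta, (a, b) ->
            new String(buf, a[0], a[1], StandardCharsets.UTF_8)
                .compareTo(new String(buf, b[0], b[1], StandardCharsets.UTF_8)));
        StringBuilder spill = new StringBuilder();
        for (int[] m : meta)                    // output records in key order
            spill.append(new String(buf, m[0], m[1] + m[2], StandardCharsets.UTF_8));
        return spill.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortAndSpill()); // keys now in order: apple, banana, cherry
    }
}
```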
63. Hadoop [0.17, 0.22): collect(K2,V2) now goes through pluggable serializers (KS.serialize(K2), VS.serialize(V2)) rather than writing Writables directly. Instead of explicitly tracking the space used by record metadata, a configurable slice of the io.sort.mb buffer, io.sort.mb * io.sort.record.percent, is allocated for metadata at the beginning of the task.
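The arithmetic behind that split, using the historical defaults (io.sort.mb = 100, io.sort.record.percent = 0.05) as an assumption, with 16 bytes of metadata per record (one kvoffsets int plus three kvindices ints):

```java
// Arithmetic behind the [0.17, 0.22) buffer split. Defaults
// (io.sort.mb = 100, io.sort.record.percent = 0.05) and the 16-byte
// metadata entry are stated as assumptions from that era's config.
public class SortBufferSplit {
    static long[] split(int ioSortMb, double recordPercent) {
        long total  = (long) ioSortMb << 20;             // io.sort.mb in bytes
        long meta   = (long) (total * recordPercent);    // reserved a priori for metadata
        long serial = total - meta;                      // left for serialized records
        return new long[] { meta, serial, meta / 16 };   // 16 metadata bytes per record
    }

    public static void main(String[] args) {
        long[] s = split(100, 0.05);
        System.out.println(s[0] + " metadata bytes, " + s[1]
            + " serialization bytes, room for " + s[2] + " records");
    }
}
```

Exceeding either region (metadata entries or serialized bytes) forces a spill, which is why io.sort.record.percent had to be tuned to the expected record size.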
75. Hadoop [0.17, 0.22): The collect buffer is circular, tracked by bufstart, bufend, bufmark, and bufindex, with record metadata tracked by kvstart, kvend, and kvindex. Invalid segments in the serialization buffer are marked by bufvoid: the RawComparator interface requires that each key be contiguous in the byte[], so a key that wraps past the end of the buffer must be recopied and the tail marked invalid.
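A sketch of the bufvoid convention, with sizes invented for illustration: when a serialized key would wrap past the end of the circular buffer, it is recopied to the front so a raw byte-range comparison can see it as one contiguous span, and the bytes from the last record boundary (bufmark) to the end are marked void:

```java
// Sketch of the bufvoid convention. RawComparator.compare(byte[] b1,
// int s1, int l1, ...) needs each key contiguous in the array, so a
// key that would wrap past the end of the circular buffer is recopied
// to the front; bufvoid marks the start of the now-invalid tail.
// Buffer size, mark position, and key bytes are invented for the demo.
public class BufVoid {
    static int demo() {
        byte[] buf = new byte[16];
        int bufmark = 12;                  // end of the last complete record
        byte[] key = {1, 2, 3, 4, 5, 6};   // 6-byte key, only 4 bytes fit at the tail
        // The serializer wrote the bytes that fit at the tail first...
        System.arraycopy(key, 0, buf, bufmark, buf.length - bufmark);
        // ...then the collector notices the wrap: recopy the whole key
        // to the front of the array and mark [bufmark, length) as void.
        int bufvoid = bufmark;
        System.arraycopy(key, 0, buf, 0, key.length);
        return bufvoid;                    // comparisons now read key at offset 0
    }

    public static void main(String[] args) {
        System.out.println("bufvoid = " + demo());
    }
}
```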
92. Hadoop [0.22]: A single buffer, tracked by bufstart, bufend, equator, kvstart, kvend, kvindex, bufmark, and bufindex, holds both serialized records and metadata: the kvoffsets and kvindices information is interlaced into metadata blocks. The sort is effected in a manner identical to 0.17, but metadata is allocated per record rather than a priori.
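The interlaced layout can be sketched as two fronts growing away from the equator (capacity and record sizes here are invented; the wraparound the real circular buffer performs is ignored for clarity):

```java
// Sketch of the 0.22 layout: serialized records and their metadata
// share one array. Record bytes grow upward from the equator while
// 16-byte metadata entries grow downward from it, so metadata space is
// consumed per record instead of being reserved a priori.
public class EquatorBuffer {
    static final int METASIZE = 16;  // four ints of metadata per record

    // Free bytes left after collecting `records` records of `recBytes`
    // serialized bytes each (ignoring the real buffer's wraparound).
    static int freeAfter(int capacity, int equator, int records, int recBytes) {
        int bufindex = equator + records * recBytes;  // serialized-data front
        int kvindex  = equator - records * METASIZE;  // metadata front
        return capacity - (bufindex - kvindex);       // gap outside the two fronts
    }

    public static void main(String[] args) {
        System.out.println(freeAfter(1024, 512, 3, 100) + " bytes free");
    }
}
```

Because both fronts draw on the same pool, a workload of many small records or few large ones no longer needs io.sort.record.percent tuning; the split adjusts itself record by record.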