27. Hadoop (∞, 0.10): map(K1,V1) output is collected (collect(K2,V2)) into an in-memory buffer, cloning each key and value (clone(key0, val0)). On flush(), the combiner (reduce(keyn, val*)) runs over each key's values and the results are written out via SequenceFile::Writer[p0].append(keyn’, valn’), with partition(key0, val0) selecting the writer. The combiner may change the partition and ordering of input records; this is no longer supported.
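The flush path above can be sketched as follows. All names are illustrative stand-ins (per-partition lists replace the real SequenceFile::Writer array, and the combiner is modeled as a sum); the point is that the combiner runs before the partitioned append, so a combiner that rewrote the key could move a record to a different partition:

```java
import java.util.*;

// Illustrative sketch of the pre-0.10 collect/flush path. The
// per-partition lists stand in for SequenceFile::Writer[p]; the
// combiner ("reduce") is modeled as summing each key's values.
public class Pre010Flush {
    static List<List<String>> flush() {
        int numPartitions = 2;
        // collect(K2,V2): keys and values are cloned into an in-memory map
        SortedMap<String, List<Integer>> buffer = new TreeMap<>();
        buffer.computeIfAbsent("b", k -> new ArrayList<>()).add(1);
        buffer.computeIfAbsent("a", k -> new ArrayList<>()).add(2);
        buffer.computeIfAbsent("a", k -> new ArrayList<>()).add(3);
        // flush(): run the combiner over each key, then append the result
        // to the writer of the partition the *combined* key hashes to
        List<List<String>> writers = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) writers.add(new ArrayList<>());
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;          // reduce(keyn, val*)
            int p = (e.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
            writers.get(p).add(e.getKey() + "=" + sum);   // Writer[p].append
        }
        return writers;
    }

    public static void main(String[] args) {
        System.out.println(flush());
    }
}
```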
41. Hadoop [0.10, 0.17): In collect(K2,V2), the key and value serialize themselves into keyValBuffer (K2.write(DataOutput), V2.write(DataOutput)), while BufferSorter[p0].addKeyValue(recOff, keylen, vallen) keeps only the offset into the buffer and the lengths of the key and value. The memory used by all BufferSorter implementations and by keyValBuffer is added up; if the spill threshold is exceeded, the contents are spilled to disk with sortAndSpillToDisk().
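A hedged sketch of that accounting, assuming three 4-byte integers of metadata per record (the real layout differed); the names echo the slide but are not the actual Hadoop classes:

```java
import java.util.Arrays;

// Sketch of the [0.10, 0.17) accounting: the task sums the memory held
// by keyValBuffer (serialized records) and by each partition's
// BufferSorter metadata, spilling when the total crosses a threshold.
// Names and sizes are illustrative, not the actual Hadoop types.
public class SpillAccounting {
    final long spillThreshold;
    long keyValBufferBytes;     // serialized key+value bytes
    long[] sorterBytes;         // per-partition metadata bytes
    int spills;

    SpillAccounting(int partitions, long threshold) {
        sorterBytes = new long[partitions];
        spillThreshold = threshold;
    }

    // BufferSorter[p].addKeyValue(recOff, keylen, vallen)
    void addKeyValue(int partition, int keylen, int vallen) {
        keyValBufferBytes += keylen + vallen;
        sorterBytes[partition] += 3 * 4;  // (recOff, keylen, vallen) as ints
        long total = keyValBufferBytes;
        for (long b : sorterBytes) total += b;
        if (total > spillThreshold) sortAndSpillToDisk();
    }

    void sortAndSpillToDisk() {  // stand-in: just reset the counters
        spills++;
        keyValBufferBytes = 0;
        Arrays.fill(sorterBytes, 0);
    }

    public static void main(String[] args) {
        SpillAccounting acct = new SpillAccounting(2, 1 << 20);
        for (int i = 0; i < 2048; i++) acct.addKeyValue(i % 2, 512, 512);
        System.out.println(acct.spills + " spills");
    }
}
```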
42–44. Hadoop [0.10, 0.17), sortAndSpillToDisk(): The sort permutes the (offset, keylen, vallen) entries rather than the serialized records. Once ordered, each record is deserialized from the buffer (K2.readFields(DataInput), V2.readFields(DataInput)) and appended to a SequenceFile (SequenceFile::append(K2,V2)), and the partition offsets are recorded. If defined, the combiner is now run during the spill, separately over each partition; values emitted from the combiner are written directly to the output partition.
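That permutation can be illustrated with a toy indirect sort. The byte layout and record format here are invented for the example; a real spill compares serialized keys with a comparator over the raw bytes and writes SequenceFile records:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Toy illustration of the spill-time indirect sort: serialized records
// stay where collect() wrote them; only the (offset, keylen, vallen)
// metadata entries are permuted. Records are then copied out of the
// buffer in sorted order, as the real spill appends them to a SeqFile.
public class IndirectSort {
    static String sortAndSpill() {
        byte[] buf = "cherry1apple2banana3".getBytes(StandardCharsets.UTF_8);
        int[][] meta = { {0, 6, 1}, {7, 5, 1}, {13, 6, 1} }; // (offset, keylen, vallen)
        // The sort permutes metadata, comparing keys in place in the buffer.
        Arrays.sort(meta, (a, b) ->
            new String(buf, a[0], a[1], StandardCharsets.UTF_8)
                .compareTo(new String(buf, b[0], b[1], StandardCharsets.UTF_8)));
        StringBuilder spill = new StringBuilder();
        for (int[] m : meta)                    // output records in key order
            spill.append(new String(buf, m[0], m[1] + m[2], StandardCharsets.UTF_8));
        return spill.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortAndSpill()); // keys now in order: apple, banana, cherry
    }
}
```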
63. Hadoop [0.17, 0.22): collect(K2,V2) now goes through pluggable serializers (KS.serialize(K2), VS.serialize(V2)) rather than writing Writables directly. Instead of explicitly tracking the space used by record metadata, a configurable slice of the io.sort.mb buffer, io.sort.mb * io.sort.record.percent, is allocated for metadata at the beginning of the task.
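The arithmetic behind that split, using the historical defaults (io.sort.mb = 100, io.sort.record.percent = 0.05) as an assumption, with 16 bytes of metadata per record (one kvoffsets int plus three kvindices ints):

```java
// Arithmetic behind the [0.17, 0.22) buffer split. Defaults
// (io.sort.mb = 100, io.sort.record.percent = 0.05) and the 16-byte
// metadata entry are stated as assumptions from that era's config.
public class SortBufferSplit {
    static long[] split(int ioSortMb, double recordPercent) {
        long total  = (long) ioSortMb << 20;             // io.sort.mb in bytes
        long meta   = (long) (total * recordPercent);    // reserved a priori for metadata
        long serial = total - meta;                      // left for serialized records
        return new long[] { meta, serial, meta / 16 };   // 16 metadata bytes per record
    }

    public static void main(String[] args) {
        long[] s = split(100, 0.05);
        System.out.println(s[0] + " metadata bytes, " + s[1]
            + " serialization bytes, room for " + s[2] + " records");
    }
}
```

Exceeding either region (metadata entries or serialized bytes) forces a spill, which is why io.sort.record.percent had to be tuned to the expected record size.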
75. Hadoop [0.17, 0.22): The collect buffer is circular, tracked by bufstart, bufend, bufmark, and bufindex, with record metadata tracked by kvstart, kvend, and kvindex. Invalid segments in the serialization buffer are marked by bufvoid: the RawComparator interface requires that each key be contiguous in the byte[], so a key that wraps past the end of the buffer must be recopied and the tail marked invalid.
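A sketch of the bufvoid convention, with sizes invented for illustration: when a serialized key would wrap past the end of the circular buffer, it is recopied to the front so a raw byte-range comparison can see it as one contiguous span, and the bytes from the last record boundary (bufmark) to the end are marked void:

```java
// Sketch of the bufvoid convention. RawComparator.compare(byte[] b1,
// int s1, int l1, ...) needs each key contiguous in the array, so a
// key that would wrap past the end of the circular buffer is recopied
// to the front; bufvoid marks the start of the now-invalid tail.
// Buffer size, mark position, and key bytes are invented for the demo.
public class BufVoid {
    static int demo() {
        byte[] buf = new byte[16];
        int bufmark = 12;                  // end of the last complete record
        byte[] key = {1, 2, 3, 4, 5, 6};   // 6-byte key, only 4 bytes fit at the tail
        // The serializer wrote the bytes that fit at the tail first...
        System.arraycopy(key, 0, buf, bufmark, buf.length - bufmark);
        // ...then the collector notices the wrap: recopy the whole key
        // to the front of the array and mark [bufmark, length) as void.
        int bufvoid = bufmark;
        System.arraycopy(key, 0, buf, 0, key.length);
        return bufvoid;                    // comparisons now read key at offset 0
    }

    public static void main(String[] args) {
        System.out.println("bufvoid = " + demo());
    }
}
```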
92. Hadoop [0.22]: A single buffer, tracked by bufstart, bufend, equator, kvstart, kvend, kvindex, bufmark, and bufindex, holds both serialized records and metadata: the kvoffsets and kvindices information is interlaced into metadata blocks. The sort is effected in a manner identical to 0.17, but metadata is allocated per record rather than a priori.
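The interlaced layout can be sketched as two fronts growing away from the equator (capacity and record sizes here are invented; the wraparound the real circular buffer performs is ignored for clarity):

```java
// Sketch of the 0.22 layout: serialized records and their metadata
// share one array. Record bytes grow upward from the equator while
// 16-byte metadata entries grow downward from it, so metadata space is
// consumed per record instead of being reserved a priori.
public class EquatorBuffer {
    static final int METASIZE = 16;  // four ints of metadata per record

    // Free bytes left after collecting `records` records of `recBytes`
    // serialized bytes each (ignoring the real buffer's wraparound).
    static int freeAfter(int capacity, int equator, int records, int recBytes) {
        int bufindex = equator + records * recBytes;  // serialized-data front
        int kvindex  = equator - records * METASIZE;  // metadata front
        return capacity - (bufindex - kvindex);       // gap outside the two fronts
    }

    public static void main(String[] args) {
        System.out.println(freeAfter(1024, 512, 3, 100) + " bytes free");
    }
}
```

Because both fronts draw on the same pool, a workload of many small records or few large ones no longer needs io.sort.record.percent tuning; the split adjusts itself record by record.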