2. About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Apache Spark committer since September 2018 (SQL module)
▪ Working on IBM Java (now OpenJ9) since 1996
– Technical lead for the just-in-time compiler for PowerPC
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– https://www.slideshare.net/ishizaki/
2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
3. Table of Contents
▪ What are vectorization and SIMD?
– How can SIMD improve performance?
▪ What is VectorAPI?
– Why can’t the current Spark use SIMD?
▪ How to use SIMD with performance analysis
1. Replace external libraries
2. Use vectorized runtime routines such as sort
3. Generate vectorized Java code from a given SQL query by Catalyst
4. What is Vectorization?
▪ Do multiple jobs in a batch to improve performance
– Read multiple rows at a time
– Compute multiple rows at a time
(Figure: scalar processing reads one row of the table at a time; vectorized processing reads four rows at a time)
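The read-multiple-rows idea above can be sketched in plain Java. This is a hypothetical illustration only; the 4-row batch size and the sum operation are assumptions, not Spark's actual reader interface:

```java
// Hypothetical sketch of scalar vs. batched ("vectorized") row processing.
public class BatchReadDemo {
    // Scalar style: process one row per iteration.
    static long sumScalar(long[] table) {
        long sum = 0;
        for (int row = 0; row < table.length; row++) {
            sum += table[row];
        }
        return sum;
    }

    // Vectorized style: process a batch of four rows per iteration,
    // amortizing per-row overhead (calls, bounds checks) over the batch.
    static long sumBatched(long[] table) {
        final int BATCH = 4; // assumed batch size
        long sum = 0;
        int i = 0;
        for (; i + BATCH <= table.length; i += BATCH) {
            sum += table[i] + table[i + 1] + table[i + 2] + table[i + 3];
        }
        for (; i < table.length; i++) { // residual rows
            sum += table[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] t = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumScalar(t) + " " + sumBatched(t));
    }
}
```

Both styles compute the same result; the batched loop simply does more work per iteration.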
5. What is Vectorization?
▪ Do multiple jobs in a batch to improve performance
– Read multiple rows at a time
– Compute multiple rows at a time
▪ Spark has already implemented several vectorizations
– Vectorized Parquet Reader
– Vectorized ORC Reader
– Pandas UDF (a.k.a. vectorized UDF)
6. ▪ Apply the same operation to multiple primitive-type data elements in one
instruction (Single Instruction Multiple Data: SIMD)
– Boolean, Short, Integer, Long, Float, and Double
What is SIMD?
7. ▪ Apply the same operation to multiple primitive-type data elements in one
instruction (Single Instruction Multiple Data: SIMD)
– Boolean, Short, Integer, Long, Float, and Double
– Increase the parallelism in an instruction (8x in the example)
What is SIMD?
(Figure: a scalar instruction "add gr1,gr2,gr3" adds one pair of inputs at a time (A0 + B0 = C0), while a SIMD instruction "vadd vr1,vr2,vr3" adds eight pairs held in vector registers at once (A0..A7 + B0..B7 = C0..C7))
8. ▪ Apply the same operation to multiple primitive-type data elements in one
instruction (Single Instruction Multiple Data: SIMD)
– Boolean, Short, Integer, Long, Float, and Double
– Increase the parallelism in an instruction
▪ SIMD can be used to implement vectorization
What is SIMD?
9. SIMD is Used in Various BigData Software
▪ Database
– DB2, Oracle, PostgreSQL, …
▪ SQL Query Engine
– Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, …
10. Why Doesn't the Current Spark Use SIMD?
▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java
program will use SIMD
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Java code
11. Why Doesn't the Current Spark Use SIMD?
▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java
program will use SIMD
– We rely on the HotSpot compiler in the JVM to decide whether to generate SIMD instructions
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Java code
SIMD may or may not be generated
JVM
12. Why Doesn't the Current Spark Use SIMD?
▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java
program will use SIMD
– We rely on the HotSpot compiler in the JVM to decide whether to generate SIMD instructions
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Java code
SIMD may or may not be generated
for (int i = 0; i < n; i++) {
load r1, a[i * 4]
load r2, b[i * 4]
add r3, r1, r2
store r3, c[i * 4]
}
Slower scalar code
JVM
13. Why Doesn't the Current Spark Use SIMD?
▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java
program will use SIMD
– We rely on the HotSpot compiler in the JVM to decide whether to generate SIMD instructions
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Java code
SIMD may or may not be generated
for (int i = 0; i < n; i++) {
load r1, a[i * 4]
load r2, b[i * 4]
add r3, r1, r2
store r3, c[i * 4]
}
for (int i = 0; i < n / 8; i++) {
vload vr1, a[i * 4 * 8]
vload vr2, b[i * 4 * 8]
vadd vr3, vr1, vr2
vstore vr3, c[i * 4 * 8]
}
Faster SIMD code
Slower scalar code
JVM
14. New Approach: VectorAPI
▪ VectorAPI can guarantee that the generated code uses SIMD
import jdk.incubator.vector.*;
int a[], b[], c[];
...
for (int i = 0; i < n; i += SPECIES.length()) {
var va = IntVector.fromArray(SPECIES, a, i);
var vb = IntVector.fromArray(SPECIES, b, i);
var vc = va.add(vb);
vc.intoArray(c, i);
}
VectorAPI
SIMD is always generated
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Scalar code
SIMD may or may not be generated
SIMD length (e.g. 8)
15. New Approach: VectorAPI
▪ VectorAPI can guarantee that the generated code uses SIMD
import jdk.incubator.vector.*;
int a[], b[], c[];
...
for (int i = 0; i < n; i += SPECIES.length()) {
var va = IntVector.fromArray(SPECIES, a, i);
var vb = IntVector.fromArray(SPECIES, b, i);
var vc = va.add(vb);
vc.intoArray(c, i);
}
VectorAPI
for (int i = 0; i < n / 8; i++) {
vload vr1, a[i * 4 * 8]
vload vr2, b[i * 4 * 8]
vadd vr3, vr1, vr2
vstore vr3, c[i * 4 * 8]
}
Pseudo native SIMD code
16. Where We Can Use SIMD in Spark
17. Where We Can Use SIMD in Spark
▪ External library
– BLAS library (matrix operation)
▪ SPARK-33882
18. Where We Can Use SIMD in Spark
▪ External library
– BLAS library (matrix operation)
▪ SPARK-33882
▪ Internal library
– Sort, Join, …
19. Where We Can Use SIMD in Spark
▪ External library
– BLAS library (matrix operation)
▪ SPARK-33882
▪ Internal library
– Sort, Join, …
▪ Generated code at runtime
– Java program translated from DataFrame program by Catalyst
20. Where and How We Can Use SIMD in Spark
▪ External library – Write VectorAPI code by hand
– BLAS library (matrix operation)
▪ SPARK-33882
▪ Internal library – Write VectorAPI code by hand
– Sort, Join, …
▪ Generated code at runtime – Generate VectorAPI code by Catalyst
– Catalyst translates a DataFrame program into a Java program
22. Three Approaches
▪ JNI (Java Native Interface) library
– Call a highly optimized binary (e.g. written in C or Fortran) through a JNI library
▪ SIMD code
– Call Java VectorAPI code if JVM supports VectorAPI
▪ Scalar code
– Call naïve Java code that runs on all JVMs
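The three approaches can be combined into one routine that picks the fastest available implementation. A minimal Java sketch, assuming hypothetical availability flags; this is not Spark's actual dispatch code:

```java
// Hypothetical three-way dispatch: JNI first, then VectorAPI, then scalar.
public class DaxpyDispatch {
    static final boolean NATIVE_AVAILABLE = false;     // assumed: native BLAS loaded via JNI
    static final boolean VECTOR_API_AVAILABLE = false; // assumed: Java 16+ with jdk.incubator.vector

    // y += alpha * x
    static void daxpy(int n, double alpha, double[] x, double[] y) {
        if (NATIVE_AVAILABLE) {
            // would call the JNI wrapper of a highly optimized native binary here
            throw new UnsupportedOperationException("native path not wired up in this sketch");
        } else if (VECTOR_API_AVAILABLE) {
            // would call the VectorAPI implementation here
            throw new UnsupportedOperationException("vector path not wired up in this sketch");
        } else {
            // scalar backup path: runs on any JVM
            for (int i = 0; i < n; i++) {
                y[i] += alpha * x[i];
            }
        }
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3}, y = {10, 20, 30};
        daxpy(3, 2.0, x, y);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```

With both flags false, the call falls through to the scalar backup path, matching the portability column of the summary later in the deck.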
23. Implementation using VectorAPI
▪ An example of matrix operation kernels
// y += alpha * x
public void daxpy(int n, double alpha, double[] x, int incx, double[] y, int incy) {
...
DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha);
int i = 0;
// vectorized part
for (; i < DMAX.loopBound(n); i += DMAX.length()) {
DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
DoubleVector vy = DoubleVector.fromArray(DMAX, y, i);
vx.fma(valpha, vy).intoArray(y, i);
}
// residual part
for (; i < n; i += 1) {
y[i] += alpha * x[i];
}
...
}
SPARK-33882
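In the daxpy kernel, DMAX.loopBound(n) returns the largest multiple of the species length that does not exceed n; that is what splits the loop into a vectorized part and a scalar residual. A plain-Java sketch of the split, where LEN = 4 is an assumed SIMD width and the inner loop stands in for one vector fma:

```java
// Plain-Java model of the vector-loop / residual-loop split used in daxpy.
public class LoopBoundDemo {
    static final int LEN = 4; // assumed species length (DMAX.length())

    // What loopBound(n) computes: the largest multiple of LEN <= n.
    static int loopBound(int n) {
        return n - (n % LEN);
    }

    // y += alpha * x
    static void daxpy(int n, double alpha, double[] x, double[] y) {
        int i = 0;
        // "vectorized" part: LEN elements per step
        for (; i < loopBound(n); i += LEN) {
            for (int j = 0; j < LEN; j++) { // stands in for one fma on a whole vector
                y[i + j] += alpha * x[i + j];
            }
        }
        // residual part: the remaining n % LEN elements
        for (; i < n; i++) {
            y[i] += alpha * x[i];
        }
    }

    public static void main(String[] args) {
        double[] x = new double[10], y = new double[10];
        java.util.Arrays.fill(x, 1.0);
        daxpy(10, 3.0, x, y);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```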
24. Benchmark for Large-size Data
▪ JNI achieves the best performance
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Algorithm | Data size (double type) | Elapsed time (ms): JNI / VectorAPI / Scalar
daxpy (Y += a * X) | 10,000,000 | 1.3 / 14.6 / 18.2
dgemm (Z = X * Y) | 1000x1000 * 1000x100 | 1.3 / 40.6 / 81.1
25. Benchmark for Small-size Data
▪ VectorAPI achieves the best performance
Algorithm | Data size (double type) | Elapsed time (ns): JNI / VectorAPI / Scalar
daxpy (Y += a * X) | 256 | 118 / 27 / 140
dgemm (Z = X * Y) | 8x8 * 8x8 | 555 / 365 / 679
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
26. Summary of Three Approaches
Approach | Performance | Overhead | Portability | Choice
JNI library | Best | High (data copy between Java heap and native memory) | Readiness of native library | Good for large data
SIMD code | Moderate | No | Java 16 or later | Good for small data, better than scalar code
Scalar code | Slow | No | Any Java version | Backup path
28. Lots of Research for SIMD Sort and Join
29. What Sort Algorithm We Use
▪ Current Spark uses the following without SIMD
– Radix sort
– Tim sort
30. What Sort Algorithm We Can Use
▪ Current Spark uses the following without SIMD
– Radix sort
– Tim sort
▪ SIMD sort algorithms in existing research
– AA-Sort
▪ Comb sort
▪ Merge sort
– Merge sort
– Quick sort
– …
31. What Sort Algorithm We Can Use
▪ Current Spark uses the following without SIMD
– Radix sort
– Tim sort
▪ SIMD sort algorithms in existing research
– AA-Sort
▪ Comb sort
▪ Merge sort
– Merge sort
– Quick sort
– …
Fast for data in CPU data cache
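AA-Sort's in-cache phase is based on comb sort. A scalar Java sketch of the algorithm; the SIMD variant applies the same gap-distance compare-and-swap to several elements per instruction:

```java
// Scalar comb sort: bubble sort with a shrinking gap (shrink factor ~1.3).
public class CombSortDemo {
    static void combSort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            if (gap > 1) {
                gap = (int) (gap / 1.3); // shrink the gap each pass
            }
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) { // compare-and-swap at distance gap
                    long t = a[i];
                    a[i] = a[i + gap];
                    a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        long[] a = {5, 1, 4, 2, 3};
        combSort(a);
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

The fixed-stride compare-and-swap is what makes the algorithm cache-friendly and SIMD-friendly: each pass walks the array sequentially with a predictable access pattern.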
32. Comb Sort is 2.5x Faster than Tim Sort
Sort 1,048,576 {key, value} long pairs (shorter is better):
– Radix sort (Scalar): 84 ms
– Comb sort (SIMD): 117 ms
– Tim sort (Scalar): 292 ms
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
33. Radix Sort is 1.4x Faster than Comb Sort
▪ Radix sort's order is lower than Comb sort's
– O(N) vs. O(N log N)
▪ VectorAPI cannot exploit platform-specific SIMD instructions
Sort 1,048,576 {key, value} long pairs (shorter is better):
– Radix sort (Scalar): 84 ms
– Comb sort (SIMD): 117 ms
– Tim sort (Scalar): 292 ms
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
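The O(N) behavior of radix sort comes from making a constant number of counting passes over the data instead of O(log N) comparison rounds. A least-significant-digit radix sort sketch for non-negative long keys; the deck's sort handles {key, value} pairs and signed keys, which this key-only sketch omits:

```java
// LSD radix sort over non-negative long keys: 8 counting passes (one per byte),
// each pass is O(N), so the whole sort is O(N) with a constant factor of 8.
public class RadixSortDemo {
    static long[] radixSort(long[] a) {
        long[] src = a.clone(), dst = new long[a.length];
        for (int shift = 0; shift < 64; shift += 8) {
            int[] count = new int[257];
            for (long v : src) {                  // histogram of the current byte
                count[(int) ((v >>> shift) & 0xFF) + 1]++;
            }
            for (int i = 0; i < 256; i++) {       // prefix sums -> start offsets
                count[i + 1] += count[i];
            }
            for (long v : src) {                  // stable scatter by current byte
                dst[count[(int) ((v >>> shift) & 0xFF)]++] = v;
            }
            long[] t = src; src = dst; dst = t;   // ping-pong the buffers
        }
        return src; // after an even number of passes, src holds the result
    }

    public static void main(String[] args) {
        long[] sorted = radixSort(new long[]{170, 45, 75, 90, 802, 24, 2, 66});
        System.out.println(java.util.Arrays.toString(sorted));
    }
}
```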
34. Sort a Pair of Key and Value
▪ Compare two 64-bit keys and get the pair with a smaller key
– This is a frequently executed operation
(Figure: in0 = {1, -1}, {7, -7}; in1 = {5, -5}, {3, -3}; out = {1, -1}, {3, -3})
35. Sort a Pair of Key and Value
▪ Sort the first pair
(Figure: comparing the first pairs' keys, 1 < 5, so {1, -1} is selected for the output)
36. Sort a Pair of Key and Value
▪ Sort the second pair
(Figure: comparing the second pairs' keys, 7 > 3, so {3, -3} is selected for the output)
37. Parallel Sort a Pair using SIMD
▪ In parallel, compare the 64-bit keys and get the pairs with the smaller
keys at once
(Figure: with a 256-bit-wide instruction, both key comparisons (1 < 5 and 7 > 3) are performed at once, producing out = {1, -1}, {3, -3})
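The operation in the figures can be written as a scalar Java sketch over interleaved {key, value} arrays; the SIMD version performs all the key comparisons and pair selections with one compare plus one blend:

```java
// Scalar sketch of the pair-min step: for each {key, value} pair position,
// keep the pair whose 64-bit key is smaller.
public class PairMinDemo {
    // Arrays are interleaved as {key0, value0, key1, value1, ...}.
    static long[] pairMin(long[] in0, long[] in1) {
        long[] out = new long[in0.length];
        for (int i = 0; i < in0.length; i += 2) {
            if (in0[i] < in1[i]) {        // compare the keys
                out[i] = in0[i];
                out[i + 1] = in0[i + 1];  // the value travels with its key
            } else {
                out[i] = in1[i];
                out[i + 1] = in1[i + 1];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] in0 = {1, -1, 7, -7}; // pairs {1,-1}, {7,-7}
        long[] in1 = {5, -5, 3, -3}; // pairs {5,-5}, {3,-3}
        System.out.println(java.util.Arrays.toString(pairMin(in0, in1)));
    }
}
```

With the figure's inputs this produces {1, -1}, {3, -3}, matching the diagram.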
38. No shuffle in C Version
▪ The comparison result (a mask) can be logically shifted, so no shuffle is needed
__mmask8 mask = 0b10101010;
void shufflePair(__m256i *x) {
__mmask8 maska, maskb, maskA, maskB;
maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask);
maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask);
maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1));
maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1));
x[0] = _mm256_mask_blend_epi64(maskA, x[8], x[0]);
x[4] = _mm256_mask_blend_epi64(maskA, x[12], x[4]);
x[8] = _mm256_mask_blend_epi64(maskB, x[0], x[8]);
x[12] = _mm256_mask_blend_epi64(maskB, x[4], x[12]);
}
0 shuffle + 6 shift/or + 2 compare instructions
(Figure: the key compare of x[0-3] produces maska; a shift and an or extend it to maskA so that the mask covers both key and value lanes)
Reducing the number of shuffle instructions is an important optimization on x86_64 ("reduce port 5 pressure")
39. 4 Shuffles in VectorAPI Version
▪ Since the result of the comparison (a VectorMask) cannot be shifted,
all four vectors must be shuffled before the comparison
final VectorShuffle pair =
VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2);
private void swapPair(long x[], int i) {
LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt;
xa = load x[i+0 … i+3]; xb = load x[i+4 … i+7];
ya = load x[i+8 … i+11]; yb = load x[i+12 … i+15];
xpa = xa.rearrange(pair);
xpb = xb.rearrange(pair);
ypa = ya.rearrange(pair);
ypb = yb.rearrange(pair);
VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa);
VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb);
xs = xa.blend(ya, maskA);
xt = xb.blend(yb, maskB);
ys = ya.blend(xa, maskA);
yt = yb.blend(xb, maskB);
xs.store(x[i+0 … i+3]); xt.store(x[i+4 … i+7]);
ys.store(x[i+8 … i+11]); yt.store(x[i+12 … i+15]);
}
4 shuffle + 2 compare instructions
(Figure: xa = {1, -1, 7, -7} is rearranged with the {0, 0, 2, 2} shuffle into xpa = {1, 1, 7, 7}, and likewise for the other inputs, so that the compare sees a key in every lane)
40. Where is Bottleneck in Spark Sort Program?
▪ Most of the time is spent outside the sort routine
Sort algorithm | Elapsed time (ms)
Radix sort | 563
Tim sort | 757
val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
41. Where is Bottleneck in Spark Sort Program?
▪ Most of the time is spent outside the sort routine
Sort algorithm | Elapsed time (ms) | Estimated time with SIMD (ms)
Radix sort | 563 | 563
Tim sort | 757 | 587
val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()
Radix sort took 84ms
in the previous benchmark
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
42. Sort Requires Additional Operation
▪ df.sort() always involves a costly exchange operation
– Data transfer among nodes
== Physical Plan ==
Sort [a#5L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54]
+- InMemoryTableScan [a#5L]
+- ...
43. Lessons Learned
▪ SIMD Comb sort is faster than the current Tim sort
▪ Radix sort is smart
– Order is O(N), where N is the number of elements
▪ Sort operation involves other costly operations
▪ There is room to exploit platform-specific SIMD instructions in
VectorAPI
45. How is a DataFrame Program Translated?
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program (above) → Generated Java code (class … { … })
46. Catalyst Translates into Java Code
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program
→ Catalyst: Create Logical Plans → Optimize Logical Plans → Create Physical Plans → Select Physical Plans → Generate Java code
→ Generated Java code (class … { … })
47. Current Generated Code
▪ Data is read in a vector style, but computation is executed
sequentially, one row at a time
class GeneratedCodeGenStage {
void BatchRead() {
if (iterator.hasNext()) {
columnarBatch = iterator.next();
batchIdx = 0;
ColumnVector colA = columnarBatch.column(0);
ColumnVector colB = columnarBatch.column(1);
}
}
void processNext() {
if (columnarBatch == null) { BatchRead(); }
float valA = colA.getFloat(batchIdx);
float valB = colB.getFloat(batchIdx);
float val0 = valA + valB;
float val1 = valA * valB;
appendRow(Row(val0, val1));
batchIdx++;
}
}
Simplified generated code
48. Computation is Inefficient in Current Code
▪ Reading data in a vector style is efficient
class GeneratedCodeGenStage {
void BatchRead() {
if (iterator.hasNext()) {
columnarBatch = iterator.next();
batchIdx = 0;
ColumnVector colA = columnarBatch.column(0);
ColumnVector colB = columnarBatch.column(1);
}
}
void processNext() {
if (columnarBatch == null) { BatchRead(); }
float valA = colA.getFloat(batchIdx);
float valB = colB.getFloat(batchIdx);
float val0 = valA + valB;
float val1 = valA * valB;
appendRow(Row(val0, val1));
batchIdx++;
}
}
Read data in a vector style
Compute data at a row
Put data at a row
49. Prototyped Generated Code
▪ Read and compute data in a vector style; putting data is still
sequential, one row at a time
class GeneratedCodeGenStage {
void BatchRead() {
if (iterator.hasNext()) {
columnarBatch = iterator.next();
batchIdx = 0;
ColumnVector colA = columnarBatch.column(0);
ColumnVector colB = columnarBatch.column(1);
float fa[] = colA.getFloats(), fb[] = colB.getFloats();
// compute data using VectorAPI
for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, fa, i);
FloatVector vb = FloatVector.fromArray(SPECIES, fb, i);
FloatVector v0 = va.add(vb);
FloatVector v1 = va.mul(vb);
v0.intoArray(col0, i);
v1.intoArray(col1, i);
}
}
}
void processNext() {
if (columnarBatch == null) { BatchRead(); }
appendRow(Row(col0[batchIdx], col1[batchIdx]));
batchIdx++;
}
}
Read data in a vector style
Compute data in a vector style
Put data at a row
50. Enhanced Code Generation in Catalyst
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program
→ Catalyst: Create Logical Plans → Optimize Logical Plans → Create Physical Plans → Select Physical Plans → Generate Java code
→ Generated Java code with vectorized computation (class … { … })
51. Prototyped Two Code Generations
▪ Perform computations using scalar variables
▪ Perform computations using VectorAPI
52. Using Scalar Variables
▪ Perform computation for multiple rows in a batch
class GeneratedCodeGenStage {
float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE],
col2[] = new float[COLUMN_BATCH_SIZE];
void BatchRead() {
if (iterator.hasNext()) {
columnarBatch = iterator.next();
batchIdx = 0;
ColumnVector colA = columnarBatch.column(0);
ColumnVector colB = columnarBatch.column(1);
float va[] = colA.getFloats(), vb[] = colB.getFloats();
for (int i = 0; i < columnarBatch.size(); i += 1) {
float valA = va[i];
float valB = vb[i];
col0[i] = valA + valB;
col1[i] = valA * valB;
}
}
}
void processNext() {
if (batchIdx == columnarBatch.size()) { BatchRead(); }
appendRow(Row(col0[batchIdx], col1[batchIdx], col2[batchIdx]));
batchIdx++;
}
}
Simplified generated code
53. Using VectorAPI
▪ Perform computation for multiple rows using SIMD in a batch
class GeneratedCodeGenStage {
float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE],
col2[] = new float[COLUMN_BATCH_SIZE];
void BatchRead() {
if (iterator.hasNext()) {
columnarBatch = iterator.next();
batchIdx = 0;
ColumnVector colA = columnarBatch.column(0);
ColumnVector colB = columnarBatch.column(1);
float fa[] = colA.getFloats(), fb[] = colB.getFloats();
for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, fa, i);
FloatVector vb = FloatVector.fromArray(SPECIES, fb, i);
FloatVector v0 = va.add(vb);
FloatVector v1 = va.mul(vb);
v0.intoArray(col0, i); v1.intoArray(col1, i);
}
}
}
}
void processNext() {
if (batchIdx == columnarBatch.size()) { BatchRead(); }
appendRow(Row(col0[batchIdx], col1[batchIdx], col2[batchIdx]));
batchIdx++;
}
}
Simplified generated code
54. Up to 1.7x Faster at Micro Benchmark
▪ The vectorized version achieves up to a 1.7x performance improvement
▪ The SIMD version achieves about a 1.2x improvement over the
vectorized scalar version
Current version: 34.2 ms
Vectorized (Scalar): 26.6 ms
Vectorized (SIMD): 20.0 ms
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
// start measuring time
df.selectExpr("a+b", "a*b").noop()
Shorter is better
55. 2.8x Faster at Nano Benchmark
▪ Perform the same computation as in the previous benchmark
– Add and multiply operations on 16,384 float elements
void scalar(float a[], float b[],
float c[], float d[],
int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
d[i] = a[i] * b[i];
}
}
void simd(float a[], float b[], float c[],
float d[], int n) {
for (int i = 0; i < n; i += SPECIES.length()) {
FloatVector va = FloatVector
.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector
.fromArray(SPECIES, b, i);
FloatVector vc = va.add(vb);
FloatVector vd = va.mul(vb);
vc.intoArray(c, i);
vd.intoArray(d, i);
}
}
Scalar version SIMD version
2.8x faster
56. Now, Putting Data is the Bottleneck
▪ Read and compute data in a vector style; putting data is still
sequential, one row at a time
class GeneratedCodeGenStage {
void BatchRead() {
if (iterator.hasNext()) {
columnarBatch = iterator.next();
batchIdx = 0;
ColumnVector colA = columnarBatch.column(0);
ColumnVector colB = columnarBatch.column(1);
float fa[] = colA.getFloats(), fb[] = colB.getFloats();
// compute data using VectorAPI
for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, fa, i);
FloatVector vb = FloatVector.fromArray(SPECIES, fb, i);
FloatVector v0 = va.add(vb);
FloatVector v1 = va.mul(vb);
v0.intoArray(col0, i);
v1.intoArray(col1, i);
}
}
}
void processNext() {
if (columnarBatch == null) { BatchRead(); }
appendRow(Row(col0[batchIdx], col1[batchIdx]));
batchIdx++;
}
}
Read data in a vector style
Compute data in a vector style
Put data at a row
57. Lessons Learned
▪ Vectorizing computation is effective
▪ Using SIMD is also effective, but the improvement is not huge
▪ There is room to improve performance at the interface
between the generated code and its successor unit
58. Takeaway
▪ How we can use SIMD instructions in Java
▪ Use SIMD at three areas
– Good result for matrix library (SPARK-33882 has been merged)
▪ Better than Java implementation
▪ Better for small data than native implementation
– Room to improve the performance of sort program
▪ VectorAPI implementation in Java virtual machine
▪ Other parts to be improved in Apache Spark
– Good result for Catalyst
▪ Vectorizing computation is effective
▪ The interface between computation units is important for performance
• c.f. “Vectorized Query Execution in Apache Spark at Facebook”, 2019
Visit https://www.slideshare.net/ishizaki if you are interested in these slides