Enabling Vectorized Engine
in Apache Spark
Kazuaki Ishizaki
IBM Research - Tokyo
About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Apache Spark committer since 2018/9 (SQL module)
▪ Worked on IBM Java (now OpenJ9) since 1996
– Technical lead for the just-in-time compiler for PowerPC
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– https://www.slideshare.net/ishizaki/
2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
Table of Contents
▪ What are vectorization and SIMD?
– How can SIMD improve performance?
▪ What is VectorAPI?
– Why can’t the current Spark use SIMD?
▪ How to use SIMD with performance analysis
1. Replace external libraries
2. Use vectorized runtime routines such as sort
3. Generate vectorized Java code from a given SQL query by Catalyst
What is Vectorization?
▪ Do multiple jobs in a batch to improve performance
– Read multiple rows at a time
– Compute multiple rows at a time
[Figure: scalar execution reads one row of the table at a time; vectorized execution reads four rows at a time]
▪ Spark has already implemented several vectorizations
– Vectorized Parquet Reader
– Vectorized ORC Reader
– Pandas UDF (a.k.a. vectorized UDF)
What is SIMD?
▪ Apply the same operation to multiple primitive-type data elements in one
instruction (Single Instruction, Multiple Data: SIMD)
– Boolean, Short, Integer, Long, Float, and Double
– Increases the parallelism within an instruction (8x in the example below)
▪ SIMD can be used to implement vectorization
[Figure: a scalar instruction (add gr1,gr2,gr3) adds one pair of inputs, A0 + B0 = C0, while one SIMD instruction (vadd vr1,vr2,vr3) adds eight pairs held in vector registers, A0..A7 + B0..B7 = C0..C7, at once]
SIMD is Used in Various BigData Software
▪ Database
– DB2, Oracle, PostgreSQL, …
▪ SQL Query Engine
– Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, …
Why Doesn't the Current Spark Use SIMD?
▪ The Java virtual machine (JVM) cannot guarantee that a given Java
program will use SIMD
– We rely on the HotSpot compiler in the JVM, which may or may not generate
SIMD instructions

Java code — SIMD may or may not be generated:
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}

Slower scalar code the JVM may generate:
for (int i = 0; i < n; i++) {
load r1, a[i * 4]
load r2, b[i * 4]
add r3, r1, r2
store r3, c[i * 4]
}

Faster SIMD code the JVM may generate:
for (int i = 0; i < n / 8; i++) {
vload vr1, a[i * 4 * 8]
vload vr2, b[i * 4 * 8]
vadd vr3, vr1, vr2
vstore vr3, c[i * 4 * 8]
}
New Approach: VectorAPI
▪ VectorAPI can guarantee that the generated code will use SIMD

VectorAPI — SIMD is always generated:
import jdk.incubator.vector.*;
int a[], b[], c[];
...
for (int i = 0; i < n; i += SPECIES.length()) { // SPECIES.length() = SIMD length (e.g. 8)
var va = IntVector.fromArray(SPECIES, a, i);
var vb = IntVector.fromArray(SPECIES, b, i);
var vc = va.add(vb);
vc.intoArray(c, i);
}

Pseudo native SIMD code generated from the VectorAPI loop:
for (int i = 0; i < n / 8; i++) {
vload vr1, a[i * 4 * 8]
vload vr2, b[i * 4 * 8]
vadd vr3, vr1, vr2
vstore vr3, c[i * 4 * 8]
}

Scalar code for comparison — SIMD may or may not be generated:
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
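The VectorAPI loop above processes SPECIES.length() elements per iteration and, as written, assumes n is a multiple of that length; a complete loop also needs a scalar tail for the remainder, as the daxpy example later shows. A plain-Java sketch of that strip-mining structure (the lane count is hard-coded to 8 here as a stand-in for SPECIES.length(), so the sketch runs on any JVM without the incubator module):

```java
class StripMine {
    static final int LANES = 8;  // stand-in for SPECIES.length()

    // c[i] = a[i] + b[i], split into an 8-wide body and a scalar tail
    static void add(int[] a, int[] b, int[] c, int n) {
        int bound = n - (n % LANES);      // like SPECIES.loopBound(n)
        int i = 0;
        for (; i < bound; i += LANES) {   // "vector" body: one iteration per 8 lanes
            for (int l = 0; l < LANES; l++) {
                c[i + l] = a[i + l] + b[i + l];
            }
        }
        for (; i < n; i++) {              // residual (scalar) part
            c[i] = a[i] + b[i];
        }
    }
}
```

The same bound/tail split is the idiom the real VectorAPI code uses via SPECIES.loopBound(n).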
Where and How We Can Use SIMD in Spark
▪ External library – write VectorAPI code by hand
– BLAS library (matrix operations)
▪ SPARK-33882
▪ Internal library – write VectorAPI code by hand
– Sort, Join, …
▪ Generated code at runtime – generate VectorAPI code by Catalyst
– Catalyst translates a DataFrame program into a Java program
External Library
Three Approaches
▪ JNI (Java Native Interface) library
– Call a highly optimized binary (e.g. written in C or Fortran) through JNI
▪ SIMD code
– Call Java VectorAPI code if the JVM supports VectorAPI
▪ Scalar code
– Call naïve Java code that runs on all JVMs
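A minimal sketch of such a three-tier fallback chain (the names and the probing logic here are hypothetical; the actual implementation behind SPARK-33882 selects among the three paths with far more care):

```java
// Hypothetical dispatch: prefer native (JNI), then VectorAPI, then scalar.
interface Daxpy {
    void apply(int n, double alpha, double[] x, double[] y); // y += alpha * x
}

class ScalarDaxpy implements Daxpy {
    public void apply(int n, double alpha, double[] x, double[] y) {
        for (int i = 0; i < n; i++) y[i] += alpha * x[i];
    }
}

class Blas {
    static final Daxpy DAXPY = pick();

    private static Daxpy pick() {
        // 1) a JNI-backed implementation would be probed here (native library present?)
        // 2) a VectorAPI-backed one needs the incubator module to be present:
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            // return new VectorDaxpy();  // omitted in this sketch
        } catch (ClassNotFoundException ignored) { }
        // 3) scalar backup path that runs on any JVM
        return new ScalarDaxpy();
    }
}
```

The chooser runs once at class initialization, so the per-call overhead is a virtual dispatch.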
Implementation using VectorAPI
▪ An example of matrix operation kernels
// y += alpha * x
public void daxpy(int n, double alpha, double[] x, int incx, double[] y, int incy) {
...
DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha);
int i = 0;
// vectorized part
for (; i < DMAX.loopBound(n); i += DMAX.length()) {
DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
DoubleVector vy = DoubleVector.fromArray(DMAX, y, i);
vx.fma(valpha, vy).intoArray(y, i);
}
// residual part
for (; i < n; i += 1) {
y[i] += alpha * x[i];
}
...
}
SPARK-33882
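As a point of reference for the dgemm rows in the benchmarks that follow, the scalar backup path for dgemm (Z = X * Y) is essentially a naïve triple loop over row-major 1-D arrays. This is an illustrative sketch, not the actual library code:

```java
class ScalarGemm {
    // z[i][j] += sum over p of x[i][p] * y[p][j]; matrices stored row-major in 1-D arrays
    static void dgemm(int m, int n, int k, double[] x, double[] y, double[] z) {
        for (int i = 0; i < m; i++) {
            for (int p = 0; p < k; p++) {
                double xv = x[i * k + p];           // hoisted: reused across the inner loop
                for (int j = 0; j < n; j++) {
                    z[i * n + j] += xv * y[p * n + j];
                }
            }
        }
    }
}
```

The i-p-j loop order keeps the inner loop streaming over contiguous rows of y and z, which is also the loop shape a vectorized version widens with SIMD lanes.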
Benchmark for Large-size Data
▪ JNI achieves the best performance

Algorithm            Data size (double type)    Elapsed time (ms)
                                                JNI    VectorAPI    Scalar
daxpy (Y += a * X)   10,000,000                 1.3    14.6         18.2
dgemm (Z = X * Y)    1000x1000 * 1000x100       1.3    40.6         81.1

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Benchmark for Small-size Data
▪ VectorAPI achieves the best performance

Algorithm            Data size (double type)    Elapsed time (ns)
                                                JNI    VectorAPI    Scalar
daxpy (Y += a * X)   256                        118    27           140
dgemm (Z = X * Y)    8x8 * 8x8                  555    365          679

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Summary of Three Approaches

Approach      Performance   Overhead                         Portability            Choice
JNI library   Best          High (data copy between the      Readiness of the       Good for large data
                            Java heap and native memory)     native library
SIMD code     Moderate      No                               Java 16 or later       Good for small data, and better than scalar code
Scalar code   Slow          No                               Any Java version       Backup path
Internal Library
Lots of Research on SIMD Sort and Join
What Sort Algorithm We Can Use
▪ Current Spark uses, without SIMD:
– Radix sort
– Tim sort
▪ SIMD sort algorithms in existing research
– AA-Sort
▪ Comb sort (fast for data in the CPU data cache)
▪ Merge sort
– Merge sort
– Quick sort
– …
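For reference, scalar comb sort (the algorithm AA-Sort applies to cache-resident blocks) is only a few lines; the SIMD variant replaces the inner compare-and-swap with vector compare and blend over several elements at once. A minimal scalar sketch:

```java
class CombSort {
    static void sort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            gap = Math.max(1, (gap * 10) / 13);  // shrink gap by factor ~1.3
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) {         // compare-and-swap: the step SIMD vectorizes
                    long t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }
}
```

Because consecutive compare-and-swaps at a fixed gap are independent, several of them can be executed in one SIMD compare/blend, which is what makes comb sort attractive for vectorization.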
Comb Sort is 2.5x Faster than Tim Sort;
Radix Sort is 1.4x Faster than Comb Sort

Sort 1,048,576 {key, value} pairs of longs (shorter is better):
Tim sort (Scalar)     292 ms
Comb sort (SIMD)      117 ms
Radix sort (Scalar)    84 ms

▪ Radix sort's computational order is lower than Comb sort's
– O(N) vs. O(N log N)
▪ VectorAPI cannot exploit platform-specific SIMD instructions

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
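Radix sort's O(N) order comes from replacing comparisons with counting passes over fixed-size digits. A minimal LSD radix sort, one byte per pass, for non-negative long keys only (Spark's actual radix sort additionally handles signed values and {key, value} pairs — this is an illustrative sketch):

```java
class RadixSort {
    // 8 stable counting passes, one per byte of the key: O(N) total work
    static void sort(long[] a) {
        long[] src = a, dst = new long[a.length];
        for (int shift = 0; shift < 64; shift += 8) {
            int[] count = new int[257];
            for (long v : src) count[(int) ((v >>> shift) & 0xFF) + 1]++;
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];   // prefix sums = bucket offsets
            for (long v : src) dst[count[(int) ((v >>> shift) & 0xFF)]++] = v;
            long[] t = src; src = dst; dst = t;   // ping-pong buffers
        }
        // after 8 (an even number of) passes, the data is back in the original array
    }
}
```

Each pass is a scatter by digit, which is hard to vectorize; that is why radix sort can beat a SIMD comparison sort without using SIMD at all.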
Sort a Pair of Key and Value
▪ Compare two 64-bit keys and take the pair with the smaller key
– This is a frequently executed operation
[Figure: in0 = {1,-1}, {7,-7} and in1 = {5,-5}, {3,-3}. Comparing the first keys (1 < 5) selects {1,-1}; comparing the second keys (7 > 3) selects {3,-3}; out = {1,-1}, {3,-3}]

Parallel Sort of Pairs using SIMD
▪ In parallel, compare two 64-bit keys and take the pair with the smaller
key at once
[Figure: with a 256-bit-wide instruction, both comparisons (1 < 5 and 7 > 3) and both pair selections happen in a single step]
No Shuffle in C Version
▪ The result of the compare can be logically shifted without a shuffle
– This is an important optimization to reduce the number of shuffle instructions
on x86_64 (“reduce port 5 pressure”)

__mmask8 mask = 0b10101010;
void shufflePair(__m256i *x) {
  __mmask8 maska, maskb, maskA, maskB;
  maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask);
  maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask);
  maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1));
  maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1));
  __m256i x0 = x[0], x4 = x[4];
  x[0] = _mm256_mask_blend_epi64(maskA, x[8], x[0]);
  x[4] = _mm256_mask_blend_epi64(maskB, x[12], x[4]);
  x[8] = _mm256_mask_blend_epi64(maskA, x0, x[8]);
  x[12] = _mm256_mask_blend_epi64(maskB, x4, x[12]);
}
0 shuffle + 6 shift/or + 2 compare instructions
[Figure: a compare of the key lanes produces maska; shifting maska left by one and OR-ing it back yields maskA, which covers both the key and the value lanes]
4 Shuffles in VectorAPI Version
▪ Since the result of the comparison (VectorMask) cannot be shifted,
all four vectors must be rearranged before the comparison

final VectorShuffle<Long> pair =
  VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2);
private void swapPair(long x[], int i) {
  LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt;
  xa = load x[i+0 … i+3];  xb = load x[i+4 … i+7];
  ya = load x[i+8 … i+11]; yb = load x[i+12 … i+15];
  xpa = xa.rearrange(pair);
  xpb = xb.rearrange(pair);
  ypa = ya.rearrange(pair);
  ypb = yb.rearrange(pair);
  VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa);
  VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb);
  xs = xa.blend(ya, maskA);
  xt = xb.blend(yb, maskB);
  ys = ya.blend(xa, maskA);
  yt = yb.blend(xb, maskB);
  xs.store(x[i+0 … i+3]);  xt.store(x[i+4 … i+7]);
  ys.store(x[i+8 … i+11]); yt.store(x[i+12 … i+15]);
}
4 shuffle + 2 compare instructions
[Figure: xa = {1,-1,7,-7} is rearranged to the keys-only {1,1,7,7} and compared against the rearranged keys of the other input to produce maskA]
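The compare-and-blend step above can be emulated lane by lane in plain Java; this sketch applies one such step to two blocks of interleaved {key, value} pairs, reusing the key-comparison result (the "mask") for the value lanes, as both the C and VectorAPI versions do:

```java
class PairMinMax {
    // For two inputs of interleaved {key, value} pairs, keep the pair with the
    // smaller key in `lo` and the pair with the larger key in `hi`.
    static void minMaxPairs(long[] in0, long[] in1, long[] lo, long[] hi) {
        for (int p = 0; p < in0.length; p += 2) {     // p = key lane, p+1 = value lane
            boolean gt = in0[p] > in1[p];             // the "compare" producing the mask
            lo[p]     = gt ? in1[p]     : in0[p];     // "blend" keys by the mask
            lo[p + 1] = gt ? in1[p + 1] : in0[p + 1]; // "blend" values with the same mask
            hi[p]     = gt ? in0[p]     : in1[p];
            hi[p + 1] = gt ? in0[p + 1] : in1[p + 1];
        }
    }
}
```

With the slide's data — in0 = {1,-1,7,-7}, in1 = {5,-5,3,-3} — the low output becomes {1,-1,3,-3}, matching the diagram.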
Where is the Bottleneck in a Spark Sort Program?
▪ Most of the time is spent outside the sort routine

val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()

Sort algorithm   Elapsed time (ms)   Estimated time with SIMD (ms)
Radix sort       563                 563
Tim sort         757                 587

(Radix sort itself took 84 ms in the previous benchmark)
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Sort Requires an Additional Operation
▪ df.sort() always involves a costly exchange operation
– Data transfer among nodes

== Physical Plan ==
Sort [a#5L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54]
   +- InMemoryTableScan [a#5L]
      +- ...
Lessons Learned
▪ SIMD Comb sort is faster than the current Tim sort
▪ Radix sort is smart
– Its order is O(N), where N is the number of elements
▪ The sort operation involves other costly operations
▪ There is room for VectorAPI to exploit platform-specific SIMD
instructions
Generated Code
How is a DataFrame Program Translated?
▪ Catalyst translates the DataFrame source program into Java code

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
  .toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program

Catalyst: Create Logical Plans → Optimize Logical Plans → Create Physical Plans →
Select Physical Plans → Generate Java code

class … {
  …
}
Generated Java code
Current Generated Code
▪ Reads data in a vector style, but computation is executed one row
at a time

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();          // read data in a vector style
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    float valA = colA.getFloat(batchIdx);       // compute data one row at a time
    float valB = colB.getFloat(batchIdx);
    float val0 = valA + valB;
    float val1 = valA * valB;
    appendRow(Row(val0, val1));                 // put data one row at a time
    batchIdx++;
  }
}
Simplified generated code
Prototyped Generated Code
▪ Reads and computes data in a vector style; putting data is still done one
row at a time

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();          // read data in a vector style
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float a[] = colA.getFloats(), b[] = colB.getFloats();
      // compute data using VectorAPI
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        FloatVector v0 = va.add(vb);
        FloatVector v1 = va.mul(vb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));  // put data one row at a time
    batchIdx++;
  }
}
Enhanced Code Generation in Catalyst

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
  .toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program

Catalyst: Create Logical Plans → Optimize Logical Plans → Create Physical Plans →
Select Physical Plans → Generate Java code

class … {
  …
}
Generated Java code with vectorized computation
Prototyped Two Code Generations
▪ Perform computations using scalar variables
▪ Perform computations using VectorAPI
Using Scalar Variables
▪ Perform the computation for multiple rows in a batch

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float a[] = colA.getFloats(), b[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += 1) {
        float valA = a[i];
        float valB = b[i];
        col0[i] = valA + valB;
        col1[i] = valA * valB;
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
Using VectorAPI
▪ Perform the computation for multiple rows in a batch using SIMD

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float a[] = colA.getFloats(), b[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        FloatVector v0 = va.add(vb);
        FloatVector v1 = va.mul(vb);
        v0.intoArray(col0, i); v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
Up to 1.7x Faster on a Micro Benchmark
▪ The vectorized version achieves up to a 1.7x performance improvement
▪ The SIMD version achieves about a 1.2x improvement over the vectorized
scalar version

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
  .toDF("a", "b")
df.cache
df.count
// start measuring time
df.selectExpr("a+b", "a*b").noop()

Elapsed time (shorter is better):
Current version        34.2 ms
Vectorized (Scalar)    26.6 ms
Vectorized (SIMD)      20.0 ms

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
2.8x Faster on a Nano Benchmark
▪ Performs the same computation as in the previous benchmark
– Add and multiply operations against 16,384 float elements
void scalar(float a[], float b[],
float c[], float d[],
int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
d[i] = a[i] * b[i];
}
}
void simd(float a[], float b[], float c[],
float d[], int n) {
for (int i = 0; i < n; i += SPECIES.length()) {
FloatVector va = FloatVector
.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector
.fromArray(SPECIES, b, i);
FloatVector vc = va.add(vb);
FloatVector vd = va.mul(vb);
vc.intoArray(c, i);
vd.intoArray(d, i);
}
}
Scalar version SIMD version
2.8x faster
Now, Putting Data is the Bottleneck
▪ Reads and computes data in a vector style; putting data is still done one
row at a time

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();          // read data in a vector style
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float a[] = colA.getFloats(), b[] = colB.getFloats();
      // compute data using VectorAPI
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        FloatVector v0 = va.add(vb);
        FloatVector v1 = va.mul(vb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));  // put data one row at a time
    batchIdx++;
  }
}
Lessons Learned
▪ Vectorizing computation is effective
▪ Using SIMD is also effective, though not a huge improvement
▪ There is room to improve performance at the interface
between the generated code and its successor unit
Takeaway
▪ How we can use SIMD instructions in Java
▪ Use SIMD in three areas
– Good result for the matrix library (SPARK-33882 has been merged)
▪ Better than the Java implementation
▪ Better for small data than the native implementation
– Room to improve the performance of the sort program
▪ VectorAPI implementation in the Java virtual machine
▪ Other parts to be improved in Apache Spark
– Good result for Catalyst
▪ Vectorizing computation is effective
▪ The interface between computation units is important for performance
• c.f. “Vectorized Query Execution in Apache Spark at Facebook”, 2019
Visit https://www.slideshare.net/ishizaki if you are interested in this slide
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Accelerating SDN/NFV with transparent offloading architecture
Accelerating SDN/NFV with transparent offloading architectureAccelerating SDN/NFV with transparent offloading architecture
Accelerating SDN/NFV with transparent offloading architecture
 
Full Stack Scala
Full Stack ScalaFull Stack Scala
Full Stack Scala
 

Más de Kazuaki Ishizaki

20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public
Kazuaki Ishizaki
 

Más de Kazuaki Ishizaki (19)

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf
 
Make AI ecosystem more interoperable
Make AI ecosystem more interoperableMake AI ecosystem more interoperable
Make AI ecosystem more interoperable
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0
 
SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0
 
SparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizakiSparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizaki
 
SparkTokyo2019
SparkTokyo2019SparkTokyo2019
SparkTokyo2019
 
hscj2019_ishizaki_public
hscj2019_ishizaki_publichscj2019_ishizaki_public
hscj2019_ishizaki_public
 
Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki public
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-Timeコンパイラ
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Enabling Vectorized Engine in Apache Spark

  • 1. Enabling Vectorized Engine in Apache Spark Kazuaki Ishizaki IBM Research - Tokyo
  • 2. About Me – Kazuaki Ishizaki ▪ Researcher at IBM Research – Tokyo https://ibm.biz/ishizaki – Compiler optimization, language runtime, and parallel processing ▪ Apache Spark committer from 2018/9 (SQL module) ▪ Work for IBM Java (Open J9, now) from 1996 – Technical lead for Just-in-time compiler for PowerPC ▪ ACM Distinguished Member ▪ SNS – @kiszk – https://www.slideshare.net/ishizaki/ 2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 3. Table of Contents ▪ What are vectorization and SIMD? – How can SIMD improve performance? ▪ What is VectorAPI? – Why can’t the current Spark use SIMD? ▪ How to use SIMD with performance analysis 1. Replace external libraries 2. Use vectorized runtime routines such as sort 3. Generate vectorized Java code from a given SQL query by Catalyst 3 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 4. What is Vectorization? ▪ Do multiple jobs in a batch to improve performance – Read multiple rows at a time – Compute multiple rows at a time 4 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Scalar Vectorization Read one row at a time Read four rows at a time table table
  • 5. What is Vectorization? ▪ Do multiple jobs in a batch to improve performance – Read multiple rows at a time – Compute multiple rows at a time ▪ Spark already implemented multiple vectorizations – Vectorized Parquet Reader – Vectorized ORC Reader – Pandas UDF (a.k.a. vectorized UDF) 5 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 6. ▪ Apply the same operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double What is SIMD? 6 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 7. ▪ Apply the same operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double – Increase the parallelism in an instruction (8x in the example) What is SIMD? 7 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Vector register SIMD instruction A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 add add add add input 1 input 2 output add gr1,gr2,gr3 vadd vr1,vr2,vr3 Scalar instruction SIMD instruction A4 A5 A6 A7 B4 B5 B6 B7 C4 C5 C6 C7 add add add add A0 B0 C0 add input 1 input 2 output
  • 8. ▪ Apply the same operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double – Increase the parallelism in an instruction ▪ SIMD can be used to implement vectorization What is SIMD? 8 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 9. SIMD is Used in Various Big Data Software ▪ Database – DB2, Oracle, PostgreSQL, … ▪ SQL Query Engine – Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, … 9 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 10. Why Doesn't Current Spark Use SIMD? ▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java program will use SIMD 10 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code
  • 11. Why Doesn't Current Spark Use SIMD? ▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java program will use SIMD – We rely on the HotSpot compiler in the JVM to decide whether to generate SIMD instructions 11 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not JVM
  • 12. Why Doesn't Current Spark Use SIMD? ▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java program will use SIMD – We rely on the HotSpot compiler in the JVM to decide whether to generate SIMD instructions 12 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not for (int i = 0; i < n; i++) { load r1, a[i * 4] load r2, b[i * 4] add r3, r1, r2 store r3, c[i * 4] } Slower scalar code JVM
  • 13. Why Doesn't Current Spark Use SIMD? ▪ The Java Virtual Machine (JVM) cannot guarantee that a given Java program will use SIMD – We rely on the HotSpot compiler in the JVM to decide whether to generate SIMD instructions 13 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not for (int i = 0; i < n; i++) { load r1, a[i * 4] load r2, b[i * 4] add r3, r1, r2 store r3, c[i * 4] } for (int i = 0; i < n / 8; i++) { vload vr1, a[i * 4 * 8] vload vr2, b[i * 4 * 8] vadd vr3, vr1, vr2 vstore vr3, c[i * 4 * 8] } Faster SIMD code Slower scalar code JVM
  • 14. New Approach: VectorAPI ▪ VectorAPI can ensure that the generated code will use SIMD 14 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki import jdk.incubator.vector.*; int a[], b[], c[]; ... for (int i = 0; i < n; i += SPECIES.length()) { var va = IntVector.fromArray(SPECIES, a, i); var vb = IntVector.fromArray(SPECIES, b, i); var vc = va.add(vb); vc.intoArray(c, i); } VectorAPI SIMD is always generated for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Scalar code SIMD may be generated or not SIMD length (e.g. 8)
  • 15. New Approach: VectorAPI ▪ VectorAPI can ensure that the generated code will use SIMD 15 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki import jdk.incubator.vector.*; int a[], b[], c[]; ... for (int i = 0; i < n; i += SPECIES.length()) { var va = IntVector.fromArray(SPECIES, a, i); var vb = IntVector.fromArray(SPECIES, b, i); var vc = va.add(vb); vc.intoArray(c, i); } VectorAPI for (int i = 0; i < n / 8; i++) { vload vr1, a[i * 4 * 8] vload vr2, b[i * 4 * 8] vadd vr3, vr1, vr2 vstore vr3, c[i * 4 * 8] } Pseudo native SIMD code
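VectorAPI itself needs the `jdk.incubator.vector` module, but the loop shape it makes explicit can be sketched in plain Java: a strip-mined main loop that processes a fixed lane count per iteration, plus a residual loop for the tail. This is a shape illustration only; with VectorAPI, each inner group of `LANES` operations becomes a single `fromArray`/`add`/`intoArray` sequence.

```java
// Strip-mined loop: process LANES elements per iteration, then handle the
// tail in a residual scalar loop. This mirrors the structure of the
// VectorAPI version but runs on any JVM without the incubator module.
public class StripMine {
    static final int LANES = 8; // stands in for SPECIES.length()

    public static void add(int[] a, int[] b, int[] c) {
        int i = 0;
        int bound = (a.length / LANES) * LANES; // like SPECIES.loopBound(n)
        for (; i < bound; i += LANES) {
            for (int j = 0; j < LANES; j++) { // one vector op under VectorAPI
                c[i + j] = a[i + j] + b[i + j];
            }
        }
        for (; i < a.length; i++) { // residual part
            c[i] = a[i] + b[i];
        }
    }
}
```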
  • 16. Where We Can Use SIMD in Spark 16 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 17. Where We Can Use SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 17 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 18. Where We Can Use SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Sort, Join, … 18 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 19. Where We Can Use SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Sort, Join, … ▪ Generated code at runtime – Java program translated from DataFrame program by Catalyst 19 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 20. Where and How We Can Use SIMD in Spark ▪ External library – Write VectorAPI code by hand – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Write VectorAPI code by hand – Sort, Join, … ▪ Generated code at runtime – Generate VectorAPI code by Catalyst – Catalyst translates the DataFrame program into a Java program 20 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 21. External Library
  • 22. Three Approaches ▪ JNI (Java Native Interface) library – Call a highly optimized binary (e.g. written in C or Fortran) through a JNI library ▪ SIMD code – Call Java VectorAPI code if the JVM supports VectorAPI ▪ Scalar code – Call naïve Java code that runs on all JVMs 22 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
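The scalar path can be sketched as plain Java that runs on any JVM. A minimal sketch of `daxpy` with unit strides; the real BLAS signature also carries `incx`/`incy` parameters, omitted here for brevity:

```java
// Scalar fallback for daxpy: y += alpha * x, runnable on any JVM.
// Unit strides (incx = incy = 1) are assumed in this sketch.
public class ScalarBlas {
    public static void daxpy(int n, double alpha, double[] x, double[] y) {
        for (int i = 0; i < n; i++) {
            y[i] += alpha * x[i];
        }
    }
}
```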
  • 23. Implementation using VectorAPI ▪ An example of matrix operation kernels 23 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki // y += alpha * x public void daxpy(int n, double alpha, double[] x, int incx, double[] y, int incy) { ... DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha); int i = 0; // vectorized part for (; i < DMAX.loopBound(n); i += DMAX.length()) { DoubleVector vx = DoubleVector.fromArray(DMAX, x, i); DoubleVector vy = DoubleVector.fromArray(DMAX, y, i); vx.fma(valpha, vy).intoArray(y, i); } // residual part for (; i < n; i += 1) { y[i] += alpha * x[i]; } ... } SPARK-33882
  • 24. Benchmark for Large-size Data ▪ JNI achieves the best performance 24 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz Algorithm Data size (double type) elapsed time (ms) JNI VectorAPI Scalar daxpy (Y += a * X ) 10,000,000 1.3 14.6 18.2 dgemm Z = X * Y 1000x1000 * 1000x100 1.3 40.6 81.1
  • 25. Benchmark for Small-size Data ▪ VectorAPI achieves the best performance 25 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Algorithm Data size (double type) elapsed time (ns) JNI VectorAPI Scalar daxpy (Y += a * X ) 256 118 27 140 dgemm Z = X * Y 8x8 * 8x8 555 365 679 OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  • 26. Summary of Three Approaches 26 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Performance Overhead Portability Choice JNI library Best High (Data copy between Java heap and native memory) Readiness of native library Good for large data SIMD code Moderate No Java 16 or later Good for small data and better than scalar code Scalar code Slow No Any Java version Backup path
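One hypothetical way to act on these trade-offs is to dispatch by data size at call time: JNI for large inputs where the copy overhead is amortized, VectorAPI for small ones, scalar as the universal backup. The threshold, flags, and names below are illustrative only, not Spark's actual code:

```java
// Hypothetical dispatcher over the three approaches: JNI wins for large
// arrays (data-copy overhead amortized), VectorAPI wins for small ones,
// and scalar code is the backup path that works everywhere.
public class BlasDispatcher {
    static final int JNI_THRESHOLD = 1 << 16; // illustrative cutoff
    public static boolean jniAvailable = false;       // e.g., native library loaded?
    public static boolean vectorApiAvailable = false; // e.g., running on Java 16+?

    public static String choose(int n) {
        if (jniAvailable && n >= JNI_THRESHOLD) return "jni";
        if (vectorApiAvailable) return "vectorapi";
        return "scalar";
    }
}
```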
  • 27. Internal Library
  • 28. Lots of Research for SIMD Sort and Join 28 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 29. What Sort Algorithms Does Spark Use? ▪ Current Spark uses the following without SIMD – Radix sort – Tim sort 29 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 30. What Sort Algorithms Can We Use? ▪ Current Spark uses the following without SIMD – Radix sort – Tim sort ▪ SIMD sort algorithms in existing research – AA-Sort ▪ Comb sort ▪ Merge sort – Merge sort – Quick sort – … 30 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 31. What Sort Algorithms Can We Use? ▪ Current Spark uses the following without SIMD – Radix sort – Tim sort ▪ SIMD sort algorithms in existing research – AA-Sort ▪ Comb sort ▪ Merge sort – Merge sort – Quick sort – … 31 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Fast for data in CPU data cache
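As a reference point for the benchmarks that follow, a scalar comb sort can be sketched as below. The SIMD variant (as in AA-Sort) vectorizes the inner loop because elements `gap` apart can be compared and swapped independently; this plain-Java sketch is not the AA-Sort implementation itself:

```java
// Scalar comb sort: shrink the gap by ~1.3x each pass; once the gap
// reaches 1 it degenerates into bubble-sort passes until no swap occurs.
// SIMD versions vectorize the inner compare-and-swap because the
// gap-separated comparisons are independent of each other.
public class CombSort {
    public static void sort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            gap = Math.max(1, (gap * 10) / 13); // shrink factor ~1.3
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) {
                    long t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }
}
```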
  • 32. Comb Sort is 2.5x Faster than Tim Sort 32 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Radix sort (Scalar) Comb sort (SIMD) Sort 1,048,576 long pairs {key, value} 84ms 117ms OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64 Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz Tim sort (Scalar) 292ms Shorter is better
  • 33. Radix Sort is 1.4x Faster than Comb Sort ▪ Radix sort's computational complexity is lower than Comb sort's – O(N) vs. O(N log N) ▪ VectorAPI cannot exploit platform-specific SIMD instructions 33 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Radix sort (Scalar) Comb sort (SIMD) Sort 1,048,576 long pairs {key, value} 84ms 117ms OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64 Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz Tim sort (Scalar) 292ms Shorter is better
  • 34. Sort a Pair of Key and Value ▪ Compare two 64-bit keys and get the pair with a smaller key – This is a frequently executed operation 34 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki {key, value} 1 -1 7 -7 5 -5 3 -3 1 -1 3 -3 {key, value} in0 out in1
  • 35. Sort a Pair of Key and Value ▪ Sort the first pair 35 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki {key, value} 1 < 5 1 -1 7 -7 5 -5 3 -3 1 -1 3 -3 {key, value} in0 out in1
  • 36. Sort a Pair of Key and Value ▪ Sort the second pair 36 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki {key, value} 1 -1 7 -7 5 -5 3 -3 1 -1 3 -3 {key, value} 7 > 3 in0 out in1
  • 37. Parallel Sort a Pair using SIMD ▪ In parallel, compare two 64-bit keys and get the pair with a smaller key at once 37 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki {key, value} 1 -1 7 -7 5 -5 3 -3 1 -1 3 -3 {key, value} 7 > 3 in0 out An example of 256-bit width instruction 1 < 5 in1
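The operation in the figures above can be written as a scalar sketch over interleaved {key, value} arrays: for each pair position, compare the two keys and keep the pair with the smaller key. The 256-bit SIMD version performs two of these key comparisons at once and blends both the key and value lanes with a single mask. This is an illustrative sketch, not Spark's pair layout:

```java
// Pairs are stored interleaved: {key0, value0, key1, value1, ...}.
// For each pair position, keep in `out` the pair whose key is smaller.
public class PairMin {
    public static void minPairs(long[] in0, long[] in1, long[] out) {
        for (int i = 0; i < out.length; i += 2) {
            if (in0[i] < in1[i]) {                 // compare keys
                out[i] = in0[i];  out[i + 1] = in0[i + 1];
            } else {
                out[i] = in1[i];  out[i + 1] = in1[i + 1];
            }
        }
    }
}
```

Running it on the pairs from the slides, in0 = {1, -1}, {7, -7} and in1 = {5, -5}, {3, -3}, yields {1, -1} and {3, -3}.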
  • 38. No Shuffle in C Version ▪ The result of the comparison can be logically shifted without a shuffle. 38 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki __mmask8 mask = 0b10101010; void shufflePair(__m256i *x) { __mmask8 maska, maskb, maskA, maskB; maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask); maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask); maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1)); maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1)); x[0] = _mm256_mask_blend_epi64(maskA, x[8], x[0]); x[4] = _mm256_mask_blend_epi64(maskA, x[12], x[4]); x[8] = _mm256_mask_blend_epi64(maskB, x[0], x[8]); x[12] = _mm256_mask_blend_epi64(maskB, x[4], x[12]); } 0 shuffle + 6 shift/or + 2 compare instructions 1 7 x[0-3] maska maskA It is an important optimization to reduce the number of shuffle instructions on x86_64 ("reduce port 5 pressure") 3 -1 -7 -5 -3 5 x[4-7] compare
  • 39. 4 Shuffles in VectorAPI Version ▪ Since the result of the comparison (VectorMask) cannot be shifted, all four values should be shuffled before the comparison 39 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki final VectorShuffle pair = VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2); private void swapPair(long x[], int i) { LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt; xa = load x[i+0 … i+3]; xb = load x[i+4 … i+7]; ya = load x[i+8 … i+11]; yb = load x[i+12 … i+15]; xpa = xa.rearrange(pair); xpb = xb.rearrange(pair); ypa = ya.rearrange(pair); ypb = yb.rearrange(pair); VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa); VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb); xs = xa.blend(ya, maskA); xt = xb.blend(yb, maskB); ys = ya.blend(xa, maskA); yt = yb.blend(xb, maskB); xs.store(x[i+0 … i+3]); xt.store(x[i+4 … i+7]); ys.store(x[i+8 … i+11]); yt.store(x[i+12 … i+15]); } 4 shuffle + 2 compare instructions maskA 1 7 1 7 rearrange 5 3 5 3 5 3 rearrange 1 7 compare xa xb
  • 40. Where Is the Bottleneck in the Spark Sort Program? ▪ Most of the time is spent outside the sort routine 40 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Sort algorithm Elapsed time (ms) Radix sort 563 Tim sort 757 val N = 1048576 val p = spark.sparkContext.parallelize(1 to N, 1) val df = p.map(_ => -1 * rand.nextLong).toDF("a") df.cache df.count // start measuring time df.sort("a").noop() OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64 Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  • 41. Where Is the Bottleneck in the Spark Sort Program? ▪ Most of the time is spent outside the sort routine 41 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Sort algorithm Elapsed time (ms) Estimated time with SIMD (ms) Radix sort 563 563 Tim sort 757 587 val N = 1048576 val p = spark.sparkContext.parallelize(1 to N, 1) val df = p.map(_ => -1 * rand.nextLong).toDF("a") df.cache df.count // start measuring time df.sort("a").noop() Radix sort took 84ms in the previous benchmark OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64 Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  • 42. Sort Requires an Additional Operation ▪ df.sort() always involves a costly exchange operation – Data transfer among nodes 42 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki == Physical Plan == Sort [a#5L ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54] +- InMemoryTableScan [a#5L] +- ...
  • 43. Lessons Learned ▪ SIMD Comb sort is faster than the current Tim sort ▪ Radix sort is smart – Its complexity is O(N), where N is the number of elements ▪ Sort operation involves other costly operations ▪ There is room to exploit platform-specific SIMD instructions in VectorAPI 43 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 44. Generated Code
  • 45. How Is a DataFrame Program Translated? 45 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki val N = 16384 val p = sparkContext.parallelize(1 to N, 1) val df = p.map(i => (i.toFloat, 2*i.toFloat)) .toDF("a", "b") df.cache df.count df.selectExpr("a+b", "a*b").noop() class … { … } DataFrame source program Generated Java code
  • 46. Catalyst Translates into Java Code 46 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki val N = 16384 val p = sparkContext.parallelize(1 to N, 1) val df = p.map(i => (i.toFloat, 2*i.toFloat)) .toDF("a", "b") df.cache df.count df.selectExpr("a+b", "a*b").noop() Create Logical Plans Optimize Logical Plans Create Physical Plans class … { … } DataFrame source program Select Physical Plans Generate Java code Catalyst Generated Java code
  • 47. Current Generated Code ▪ Read data in a vector style, but computation is executed row by row 47 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki class GeneratedCodeGenStage { void BatchRead() { if (iterator.hasNext()) { columnarBatch = iterator.next(); batchIdx = 0; ColumnVector colA = columnarBatch.column(0); ColumnVector colB = columnarBatch.column(1); } } void processNext() { if (columnarBatch == null) { BatchRead(); } float valA = colA.getFloat(batchIdx); float valB = colB.getFloat(batchIdx); float val0 = valA + valB; float val1 = valA * valB; appendRow(Row(val0, val1)); batchIdx++; } } Simplified generated code
  • 48. Computation is Inefficient in Current Code ▪ Reading data in a vector style is efficient 48 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki class GeneratedCodeGenStage { void BatchRead() { if (iterator.hasNext()) { columnarBatch = iterator.next(); batchIdx = 0; ColumnVector colA = columnarBatch.column(0); ColumnVector colB = columnarBatch.column(1); } } void processNext() { if (columnarBatch == null) { BatchRead(); } float valA = colA.getFloat(batchIdx); float valB = colB.getFloat(batchIdx); float val0 = valA + valB; float val1 = valA * valB; appendRow(Row(val0, val1)); batchIdx++; } } Read data in a vector style Compute data row by row Put data row by row
  • 49. Prototyped Generated Code ▪ Read and compute data in a vector style; putting data is still row by row 49 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki class GeneratedCodeGenStage { void BatchRead() { if (iterator.hasNext()) { columnarBatch = iterator.next(); batchIdx = 0; ColumnVector colA = columnarBatch.column(0); ColumnVector colB = columnarBatch.column(1); float va[] = colA.getFloats(), vb[] = colB.getFloats(); // compute data using VectorAPI for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) { FloatVector fva = FloatVector.fromArray(SPECIES, va, i); FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i); FloatVector v0 = fva.add(fvb); FloatVector v1 = fva.mul(fvb); v0.intoArray(va, i); v1.intoArray(vb, i); } } } void processNext() { if (columnarBatch == null) { BatchRead(); } appendRow(Row(va[batchIdx], vb[batchIdx])); batchIdx++; } } Read data in a vector style Compute data in a vector style Put data row by row
  • 50. Enhanced Code Generation in Catalyst 50 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki val N = 16384 val p = sparkContext.parallelize(1 to N, 1) val df = p.map(i => (i.toFloat, 2*i.toFloat)) .toDF("a", "b") df.cache df.count df.selectExpr("a+b", "a*b").noop() Create Logical Plans Optimize Logical Plans Create Physical Plans class … { … } DataFrame source program Select Physical Plans Generate Java code Catalyst Generated Java code with vectorized computation
  • 51. Prototyped Two Code Generations ▪ Perform computations using scalar variables ▪ Perform computations using VectorAPI 51 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 52. Using Scalar Variables ▪ Perform computation for multiple rows in a batch 52 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki class GeneratedCodeGenStage { float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE]; void BatchRead() { if (iterator.hasNext()) { columnarBatch = iterator.next(); batchIdx = 0; ColumnVector colA = columnarBatch.column(0); ColumnVector colB = columnarBatch.column(1); float va[] = colA.getFloats(), vb[] = colB.getFloats(); for (int i = 0; i < columnarBatch.size(); i += 1) { float valA = va[i]; float valB = vb[i]; col0[i] = valA + valB; col1[i] = valA * valB; } } } void processNext() { if (batchIdx == columnarBatch.size()) { BatchRead(); } appendRow(Row(col0[batchIdx], col1[batchIdx])); batchIdx++; } } Simplified generated code
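Distilled into a self-contained form (simplified names, no ColumnarBatch plumbing), the batched scalar pattern computes whole output columns for `a+b` and `a*b` before any row is emitted:

```java
// Batched scalar codegen pattern: compute entire output columns first,
// then rows are emitted one by one from the precomputed columns.
public class BatchCompute {
    public static void computeBatch(float[] a, float[] b,
                                    float[] col0, float[] col1) {
        for (int i = 0; i < a.length; i++) {
            col0[i] = a[i] + b[i]; // "a+b"
            col1[i] = a[i] * b[i]; // "a*b"
        }
    }
}
```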
  • 53. Using VectorAPI ▪ Perform computation for multiple rows using SIMD in a batch 53 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki class GeneratedCodeGenStage { float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE]; void BatchRead() { if (iterator.hasNext()) { columnarBatch = iterator.next(); batchIdx = 0; ColumnVector colA = columnarBatch.column(0); ColumnVector colB = columnarBatch.column(1); float va[] = colA.getFloats(), vb[] = colB.getFloats(); for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) { FloatVector fva = FloatVector.fromArray(SPECIES, va, i); FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i); FloatVector v0 = fva.add(fvb); FloatVector v1 = fva.mul(fvb); v0.intoArray(col0, i); v1.intoArray(col1, i); } } } void processNext() { if (batchIdx == columnarBatch.size()) { BatchRead(); } appendRow(Row(col0[batchIdx], col1[batchIdx])); batchIdx++; } } Simplified generated code
  • 54. Up to 1.7x Faster at Micro Benchmark ▪ The vectorized version achieves up to a 1.7x performance improvement ▪ The SIMD version achieves about a 1.2x improvement over the Vectorized (Scalar) version 54 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Current version Vectorized (Scalar) Vectorized (SIMD) 34.2ms 26.6ms 20.0ms OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64 Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz val N = 16384 val p = sparkContext.parallelize(1 to N, 1) val df = p.map(i => (i.toFloat, 2*i.toFloat)) .toDF("a", "b") df.cache df.count // start measuring time df.selectExpr("a+b", "a*b").noop() Shorter is better
  • 55. 2.8x Faster at Nano Benchmark ▪ Performs the same computation as the previous benchmark – Add and multiply operations on 16384 float elements 55 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki void scalar(float a[], float b[], float c[], float d[], int n) { for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; d[i] = a[i] * b[i]; } } void simd(float a[], float b[], float c[], float d[], int n) { for (int i = 0; i < n; i += SPECIES.length()) { FloatVector va = FloatVector .fromArray(SPECIES, a, i); FloatVector vb = FloatVector .fromArray(SPECIES, b, i); FloatVector vc = va.add(vb); FloatVector vd = va.mul(vb); vc.intoArray(c, i); vd.intoArray(d, i); } } Scalar version SIMD version 2.8x faster
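A minimal, self-contained harness for the scalar side of this comparison might look like the following. The SIMD variant additionally requires JDK 16+ with the `jdk.incubator.vector` module enabled, so only the scalar loop is shown here; the element count and iteration counts are illustrative:

```java
// Nano-benchmark sketch for the scalar loop from the slide above.
// The SIMD counterpart needs `--add-modules jdk.incubator.vector` on JDK 16+.
public class NanoBench {
    static final int N = 16384; // same element count as the slide's benchmark

    static void scalar(float[] a, float[] b, float[] c, float[] d, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
            d[i] = a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[N], b = new float[N];
        float[] c = new float[N], d = new float[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        // Warm up so the JIT compiles the loop before measurement.
        for (int w = 0; w < 1000; w++) scalar(a, b, c, d, N);

        long start = System.nanoTime();
        for (int r = 0; r < 1000; r++) scalar(a, b, c, d, N);
        long elapsed = System.nanoTime() - start;
        System.out.println("scalar: " + elapsed / 1000 + " ns/iteration");

        // Sanity check: c[1] = 1+2, d[1] = 1*2.
        System.out.println("check: " + c[1] + " " + d[1]);
    }
}
```

A proper measurement would use JMH rather than `System.nanoTime`, but this shows the shape of the comparison.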
  • 56. Now, Putting Data Is the Bottleneck ▪ Data is read and computed in a vector style, but put into output rows in a sequential style 56 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki class GeneratedCodeGenStage { void BatchRead() { if (iterator.hasNext()) { columnarBatch = iterator.next(); batchIdx = 0; ColumnVector colA = columnarBatch.column(0); ColumnVector colB = columnarBatch.column(1); float va[] = colA.getFloats(), vb[] = colB.getFloats(); // compute data using VectorAPI for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) { FloatVector fva = FloatVector.fromArray(SPECIES, va, i); FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i); fva.add(fvb).intoArray(col0, i); fva.mul(fvb).intoArray(col1, i); } } } void processNext() { if (batchIdx == columnarBatch.size()) { BatchRead(); } appendRow(Row(col0[batchIdx], col1[batchIdx])); batchIdx++; } } Read data in a vector style Compute data in a vector style Put data one row at a time
  • 57. Lessons Learned ▪ Vectorizing computation is effective ▪ Using SIMD is also effective, but not a huge improvement ▪ There is room to improve performance at the interface between the generated code and its successor unit 57 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 58. Takeaway ▪ How we can use SIMD instructions in Java ▪ Use SIMD in three areas – Good results for the matrix library (SPARK-33882 has been merged) ▪ Better than the Java implementation ▪ Better than the native implementation for small data – Room to improve the performance of the sort program ▪ VectorAPI implementation in the Java virtual machine ▪ Other parts to be improved in Apache Spark – Good results for Catalyst ▪ Vectorizing computation is effective ▪ The interface between computation units is important for performance • c.f. “Vectorized Query Execution in Apache Spark at Facebook”, 2019 58 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Visit https://www.slideshare.net/ishizaki if you are interested in this slide