SlideShare a Scribd company logo
1 of 48
Download to read offline
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia,
LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration
under contract DE-NA-0003525.
Kokkos:	The	C++	Performance	Portability	Programming	
Model
Christian	 Trott (crtrott@sandia.gov),	 Carter	Edwards
D.	Sunderland,	 N.	Ellingwood,	D.	Ibanez,	S.	Hammond,	S.	Rajamanickam,	K.	Kim,	M.	Deveci,	M.	Hoemmen,	G.		
Center	for	Computing	Research,	Sandia	National	Laboratories,	NM
SAND2017-4935	C
New	Programming	Models
§ HPC	is	at	a	Crossroads
§ Diversifying	Hardware	Architectures
§ More	parallelism	necessitates	paradigm	shift	from	MPI-only
§ Need	for	New	Programming	Models	
§ Performance	Portability:	OpenMP 4.5,	OpenACC,	Kokkos,	RAJA,	SyCL,	
C++20?,	…
§ Resilience	and	Load	Balancing:	Legion,	HPX,	UPC++,	...
§ Vendor	decoupling	drives	external	development
2
New	Programming	Models
§ HPC	is	at	a	Crossroads
§ Diversifying	Hardware	Architectures
§ More	parallelism	necessitates	paradigm	shift	from	MPI-only
§ Need	for	New	Programming	Models	
§ Performance	Portability:	OpenMP 4.5,	OpenACC,	Kokkos,	RAJA,	SyCL,	
C++20?,	…
§ Resilience	and	Load	Balancing:	Legion,	HPX,	UPC++,	...
§ Vendor	decoupling	drives	external	development
3
What	is	Kokkos?
What	is	new?
Why	should	you	trust	us?
Kokkos:	Performance,	Portability	and	Productivity
DDR#
HBM#
DDR#
HBM#
DDR#DDR#
DDR#
HBM#HBM#
Kokkos#
LAMMPS# Sierra# Albany#Trilinos#
https://github.com/kokkos
Performance	Portability	through	Abstraction
Kokkos
Execution Spaces (“Where”)
Execution Patterns (“What”)
Execution Policies (“How”)
- N-Level
- Support Heterogeneous Execution
- parallel_for/reduce/scan, task spawn
- Enable nesting
- Range, Team, Task-Dag
- Dynamic / Static Scheduling
- Support non-persistent scratch-pads
Memory Spaces (“Where”)
Memory Layouts (“How”)
Memory Traits
- Multiple-Levels
- Logical Space (think UVM vs explicit)
- Architecture dependent index-maps
- Also needed for subviews
- Access Intent: Stream, Random, …
- Access Behavior: Atomic
- Enables special load paths: i.e. texture
Parallel ExecutionData Structures
Separating of Concerns for Future Systems…
Capability	Matrix
6
Implementation
Technique
ParallelLoops
Parallel
Reduction
TightlyNested
Loops
Non-tightly
NestedLoops
TaskParallelism
DataAllocations
DataTransfers
AdvancedData
Abstractions
Kokkos C++ Abstraction X X X X X X X X
OpenMP Directives X X X X X X X -
OpenACC Directives X X X X - X X -
CUDA Extension (X) - (X) X - X X -
OpenCL Extension (X) - (X) X - X X -
C++AMP Extension X - X - - X X -
Raja C++ Abstraction X X X (X) - - - -
TBB C++ Abstraction X X X X X X - -
C++17 Language X - - - (X) X (X) (X)
Fortran2008 Language X - - - - X (X) -
Example:	Conjugent Gradient	Solver
§ Simple	Iterative	Linear	Solver
§ For	example	used	in	MiniFE
§ Uses	only	three	math	operations:
§ Vector	addition	(AXPBY)
§ Dot	product	(DOT)
§ Sparse	Matrix	Vector	multiply	(SPMV)
§ Data	management	with	Kokkos	Views:
7
View<double*,HostSpace,MemoryTraits<Unmanaged> >
h_x(x_in, nrows);
View<double*> x("x",nrows);
deep_copy(x,h_x);
CG	Solve:	The	AXPBY
8
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
CG	Solve:	The	AXPBY
9
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
Parallel Pattern: for loop
CG	Solve:	The	AXPBY
10
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
String Label: Profiling/Debugging
CG	Solve:	The	AXPBY
11
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
Execution Policy: do n iterations
CG	Solve:	The	AXPBY
12
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
Iteration handle: integer index
CG	Solve:	The	AXPBY
13
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
Loop Body
CG	Solve:	The	AXPBY
14
void axpby(int n, View<double*> z, double alpha, View<const double*> x,
double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
}
§ Simple	data	parallel	loop:	Kokkos::parallel_for
§ Easy	to	express	in	most	programming	models
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
CG	Solve:	The	Dot	Product
15
double dot(int n, View<const double*> x, View<const double*> y) {
double x_dot_y = 0.0;
parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];
}, x_dot_y);
return x_dot_y;
}
§ Simple	data	parallel	loop	with	reduction:	Kokkos::parallel_reduce
§ Non	trivial	in	CUDA	due	to	lack	of	built-in	reduction	support
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
}
CG	Solve:	The	Dot	Product
16
double dot(int n, View<const double*> x, View<const double*> y) {
double x_dot_y = 0.0;
parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];
}, x_dot_y);
return x_dot_y;
}
§ Simple	data	parallel	loop	with	reduction:	Kokkos::parallel_reduce
§ Non	trivial	in	CUDA	due	to	lack	of	built-in	reduction	support
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
} Parallel Pattern: loop with reduction
CG	Solve:	The	Dot	Product
17
double dot(int n, View<const double*> x, View<const double*> y) {
double x_dot_y = 0.0;
parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];
}, x_dot_y);
return x_dot_y;
}
§ Simple	data	parallel	loop	with	reduction:	Kokkos::parallel_reduce
§ Non	trivial	in	CUDA	due	to	lack	of	built-in	reduction	support
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
} Iteration Index + Thread-Local Red. Varible
CG	Solve:	The	Dot	Product
18
double dot(int n, View<const double*> x, View<const double*> y) {
double x_dot_y = 0.0;
parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];
}, x_dot_y);
return x_dot_y;
}
§ Simple	data	parallel	loop	with	reduction:	Kokkos::parallel_reduce
§ Non	trivial	in	CUDA	due	to	lack	of	built-in	reduction	support
§ Bandwidth	bound
§ Serial	Implementation:	
§ Kokkos	Implementation:
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
}
CG	Solve:	The	SPMV
§ Loop	over	rows
§ Dot	product	of	matrix	row	with	a	vector
§ Example	of	Non-Tightly	nested	loops
§ Random	access	on	the	vector	(Texture	fetch	on	GPUs)
19
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {
double sum = 0.0;
int row_start=A_row_offsets[row];
int row_end=A_row_offsets[row+1];
for(int i=row_start; i<row_end; ++i) {
sum += A_vals[i]*x[A_cols[i]];
}
y[row] = sum;
}
}
CG	Solve:	The	SPMV
§ Loop	over	rows
§ Dot	product	of	matrix	row	with	a	vector
§ Example	of	Non-Tightly	nested	loops
§ Random	access	on	the	vector	(Texture	fetch	on	GPUs)
20
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {
double sum = 0.0;
int row_start=A_row_offsets[row];
int row_end=A_row_offsets[row+1];
for(int i=row_start; i<row_end; ++i) {
sum += A_vals[i]*x[A_cols[i]];
}
y[row] = sum;
}
}
Outer loop over matrix rows
CG	Solve:	The	SPMV
§ Loop	over	rows
§ Dot	product	of	matrix	row	with	a	vector
§ Example	of	Non-Tightly	nested	loops
§ Random	access	on	the	vector	(Texture	fetch	on	GPUs)
21
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {
double sum = 0.0;
int row_start=A_row_offsets[row];
int row_end=A_row_offsets[row+1];
for(int i=row_start; i<row_end; ++i) {
sum += A_vals[i]*x[A_cols[i]];
}
y[row] = sum;
}
}
Inner dot product row x vector
CG	Solve:	The	SPMV
§ Loop	over	rows
§ Dot	product	of	matrix	row	with	a	vector
§ Example	of	Non-Tightly	nested	loops
§ Random	access	on	the	vector	(Texture	fetch	on	GPUs)
22
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {
double sum = 0.0;
int row_start=A_row_offsets[row];
int row_end=A_row_offsets[row+1];
for(int i=row_start; i<row_end; ++i) {
sum += A_vals[i]*x[A_cols[i]];
}
y[row] = sum;
}
}
CG	Solve:	The	SPMV
23
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
int rows_per_team = 64;int team_size = 64;
#else
int rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >
((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;
const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {
const int row_start=A_row_offsets[row];
const int row_length=A_row_offsets[row+1]-row_start;
double y_row;
parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {
sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);
y(row) = y_row;
});
});
}
CG	Solve:	The	SPMV
24
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
int rows_per_team = 64;int team_size = 64;
#else
int rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >
((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;
const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {
const int row_start=A_row_offsets[row];
const int row_length=A_row_offsets[row+1]-row_start;
double y_row;
parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {
sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);
y(row) = y_row;
});
});
}
Team Parallelism over Row Worksets
CG	Solve:	The	SPMV
25
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
int rows_per_team = 64;int team_size = 64;
#else
int rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >
((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;
const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {
const int row_start=A_row_offsets[row];
const int row_length=A_row_offsets[row+1]-row_start;
double y_row;
parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {
sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);
y(row) = y_row;
});
});
}
Distribute rows in workset over team-threads
CG	Solve:	The	SPMV
26
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
int rows_per_team = 64;int team_size = 64;
#else
int rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >
((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;
const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {
const int row_start=A_row_offsets[row];
const int row_length=A_row_offsets[row+1]-row_start;
double y_row;
parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {
sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);
y(row) = y_row;
});
});
} Row x Vector dot product
CG	Solve:	The	SPMV
27
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
int rows_per_team = 64;int team_size = 64;
#else
int rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >
((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;
const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {
const int row_start=A_row_offsets[row];
const int row_length=A_row_offsets[row+1]-row_start;
double y_row;
parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {
sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);
y(row) = y_row;
});
});
}
CG	Solve:	Performance
28
0
10
20
30
40
50
60
70
80
90
AXPBY-100 AXPBY-200 DOT-100 DOT-200 SPMV-100 SPMV-200
Performance	[Gflop/s]
NVIDIA	P100	/	IBM	Power8
OpenACC CUDA Kokkos OpenMP
0
10
20
30
40
50
60
AXPBY-100 AXPBY-200 DOT-100 DOT-200 SPMV-100 SPMV-200
Performance	[Gflop/s]
Intel	KNL	7250	
OpenACC Kokkos OpenMP TBB	(Flat) TBB	(Hierarchical)
§ Comparison	with	other	
Programming	Models	
§ Straight	forward	
implementation	of	
kernels
§ OpenMP	4.5	is	
immature	at	this	point
§ Two	problem	sizes:	
100x100x100	and	
200x200x200	elements
Custom	Reductions	With	Lambdas
§ Added	Capability	to	do	Custom	Reductions	with	Lambdas
§ Provide	built-in	reducers	for	common	operations
§ Add,Min,Max,Prod,MinLoc,MaxLoc,MinMaxLoc,And,Or,Xor,
§ Users	can	implement	their	own	reducers
§ Example	Max	reduction:
29
double result
parallel_reduce(N,
KOKKOS_LAMBDA(const int& i,
double& lmax) {
if(lmax < a(i)) lmax = a(i);
},Max<double>(result));
0
100
200
300
400
500
600
K80 P100 KNL
BandwidthinGB/s
Reduction Performance
Sum Max MaxLoc
New	Features:	MDRangePolicy
§ Many	people	perform	structured	grid	calculations	
§ Sandia’s	codes	are	predominantly	unstructured	though
§ MDRangePolicy introduced	for	tightly	nested	loops
§ Usecase corresponding	to	OpenMP	collapse	clause
§ Optionally	set	iteration	order	and	tiling:
30
void launch (int N0, int N1, [ARGS]) {
parallel_for(MDRangePolicy<Rank<3>>({0,0,0},{N0,N1,N2}),
KOKKOS_LAMBDA (int i0, int i1)
{/*...*/});
}
MDRangePolicy<Rank<3,Iterate::Right,Iterate::Left>>
({0,0,0},{N0,N1,N2},{T0,T1,T2})
0
100
200
300
400
500
600
KNL P100
Gflop/s
Albany Kernel
Naïve Best Raw MDRange
New	Features:	Task	Graphs
§ Task	dags are	an	important	class	of	algorithmic	structures
§ Used	for	algorithms	with	complex	data	dependencies
§ For	example	triangular	solve
§ Kokkos	tasking	is	on-node
§ Future	based,	not	explicit	data	centric	(as	for	example	Legion)
§ Tasks	return	futures
§ Tasks	can	depend	on	futures
§ Respawn	of	tasks	possible
§ Tasks	can	spawn	other	tasks
§ Tasks	can	have	data	parallel	loops	
§ I.e.	a	task	can	utilize	multiple	threads	like	the	hyper	threads	on	a	core	or	a	
CUDA	block
31
Carter Edwards S7253 “Task Data Parallelism” , Right after this talk.
New	Features:	Pascal	Support
§ Pascal	GPUs	Provide	a	set	of	new	Capabilities
§ Much	better	memory	subsystem
§ NVLink (next	slide)
§ Hardware	support	for	double	precision	atomic	add
§ Generally	significant	speedup	3-5x	over	K80	for	Sandia	Apps
32
0
20
40
60
80
100
120
140
160
4000 32000 256000 2048000
MillionAtomstepsperSecond
Number of Atoms
LammpsTersoffPotential
K40
K80
P100
K40 Atomic
K80 Atomic
P100 Atomic
0.001
0.01
0.1
1
10
100
1000
1 8 64 512 4096
BandwidthinGB/s
Update Array Size
Atomic Bandwidth
New	Features:	HBM	Support
§ New	architectures	with	HBM:	Intel	KNL,	NVIDIA	P100	
§ Generally	three	types	of	allocations:
§ Page	pinned	in	DDR
§ Page	pinned	in	HBM
§ Page	migratable by	OS	or	hardware	caching
§ Kokkos	supports	all	three	on	both	architectures
§ For	Cuda backend:	CudaHostPinnedSpace,CudaSpace and	CudaUVMSpace
§ E.g.:	Kokkos::View<double*,	CudaUVMSpace>	a(”A”,N);
§ P100	Bandwidth	with	and	without	data	Reuse
33
1
10
100
1000
0.125 0.25 0.5 1 2 4 8 16 32 64
BandwidthGB/s
Problem Size [GB]
HostPinned 1
HostPinned 32
UVM 1
UVM 32
Upcoming	Features
§ Support	for	OpenMP	4.5+	Target	Backend
§ Experimentally	available	on	github
§ CUDA	will	stay	preferred	backend
§ Maybe	support	for	FPGAs	in	the	future?
§ Help	maturing	OpenMP	4.5+	compilers
§ Support	for	AMD	ROCm Backend
§ Experimentally	available on	github
§ Mainly	developed	by	AMD
§ Support	for	APUs	and	discrete	GPUs
§ Expect	maturation	fall	2017
34
Beyond	Capabilities
§ Using	Kokkos	is	invasive,	you	can’t	just	swap	it	out
§ Significant	part	of	data	structures	need	to	be	taken	over
§ Function	markings	everywhere
§ It	is	not	enough	to	have	initial	Capabilities
§ Robustness	comes	from	utilizations	and	experience
§ Different	types	of	application	and	coding	styles	will	expose	different	
corner	cases
§ Applications	need	libraries
§ Interaction	with	TPLs	such	as	BLAS	must	work
§ Many	library	capabilities	must	be	reimplemented in	the	programming	
model
§ Applications	need	tools
§ Profiling	and	Debugging	capabilities	are	required	for	serious	work
35
Timeline
36
Initial Kokkos: Linear Algebra for Trilinos
Restart of Kokkos: Scope now Programming Model
Mantevo MiniApps: Compare Kokkos to other Models
LAMMPS: Demonstrate Legacy App Transition
Trilinos: Move Tpetra over to use Kokkos Views
Multiple Apps start exploring (Albany, Uintah, …)
Sandia Multiday Tutorial (~80 attendees)
Sandia Decision to prefer Kokkos over other models
Github Release of Kokkos 2.0
Kokkos-Kernels and Kokkos-Tools Release
DOE Exascale Computing Project starts
2008
2011
2013
2012
2014
2016
2015
2017
17"
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
FOM+(Z/s)+
LULESH+Figure+of+Merit+Results+(Problem+60)+
HSW"1x16" HSW"1x32" P8"1x40"XL" KNC"1x224" ARM64"1x8" NV"K40"Higher"
is"
Better"
Results"by"Dennis"Dinge,"Christian"Trott"and"Si"Hammond"
Initial	Demonstrations	(2012-2015)
37
§ Demonstrate	Feasibility	of	Performance	Portability
§ Development	of	a	number	of	MiniApps from	different	science	domains
§ Demonstrate	Low	Performance	Loss	versus	Native	Models
§ MiniApps are	implemented	in	various	programming	models
§ DOE	TriLab Collaboration
§ Show	Kokkos works	for	
other	labs	app
§ Note	this	is	historical	data:	
Improvements	were	
found,	RAJA	implemented	
similar	optimization	etc.
Training	the	User-Base
§ Typical	Legacy	Application	Developer
§ Science	Background
§ Mostly	Serial	Coding	(MPI	apps	usually	have	communication	layer	few	
people	touch)
§ Little	hardware	background,	little	parallel	programming	experience
§ Not	sufficient	to	teach	Programming	Model	Syntax
§ Need	training	in	parallel	programming	techniques
§ Teach	fundamental	hardware	knowledge	(how	does	CPU,	MIC	and	GPU	
differ,	and	what	does	it	mean	for	my	code)
§ Need	training	in	performance	profiling
§ Regular	Kokkos Tutorials
§ ~200	slides,	9	hands-on	exercises	to	teach	parallel	programming	
techniques,	performance	considerations	and	Kokkos
§ Now	dedicated	ECP	Kokkos	support	project:	develop	online	support	
community
§ ~200	HPC	developers	(mostly	from	DOE	labs)	had	Kokkos	training	so	far38
Keeping	Applications	Happy
§ Never	underestimate	developers	ability	to	find	new	corner	
cases!!
§ Having	a	Programming	Model	deployed	in	MiniApps or	a	single	big	app	is	
very	different	from	having	half	a	dozen	multi-million	line	code	customers.
§ 538	Issues	in	24	months
§ 28%	are	small	enhancements
§ 18%	bigger	feature	requests
§ 24%	are	bugs:	often	corner	cases
39
0
100
200
300
400
500
600
Issues
Issues since 2015
Enhancements
Feature Request
Bug
Compiler Issue
Questions
Other
§ Example:	Subviews
§ Initially	data	type	needed	to	match	
including	compile	time	dimensions
§ Allow	compile/runtime	conversion
§ Allow	Layout	conversion	if	possible
§ Automatically	find	best	layout
§ Add	subviewpatterns
Testing	and	Software	Quality
Develop
Release	Version
Compilers GCC	(4.8-6.3),	Clang (3.6-4.0),	
Intel	(15.0-18.0),	IBM	(13.1.5,	14.0),	
PGI	(17.3),	NVIDIA	(7.0-8.0)
Hardware Intel	Haswell,	Intel	KNL,	ARM	v8,
IBM	Power8,	NVIDIA	K80,	
NVIDIA	P100
Backends OpenMP,	Pthreads,	Serial,	CUDA
Warnings	as	Errors
Feature
Development/Tests
Review
Tests
Master
Integration
Trilinos,	LAMMPS,	…
Multi	config	integration	test
Nightly,	UnitTests
New	Features	are	developed	 on	forks,	and	branches.	
Limited	number	of	developers	can	push	to	develop	
branch.	Pull	requests	are	reviewed/tested.
Each	merge	into	master	is	minor	release.	
Extensive	integration	test	suite	ensures	
backward	compatibility,	and	catching	of	
unit-test	coverage	gaps.
Building	an	EcoSystem
41
Algorithms
(Random, Sort)
Containers
(Map, CrsGraph, Mem Pool)
Kokkos
(Parallel Execution, Data Allocation, Data Transfer)
Kokkos – Kernels
(Sparse/Dense BLAS, Graph Kernels, Tensor
Kernels)
Kokkos–Tools
(KokkosawareProfilingandDebuggingTools)
Trilinos
(Linear Solvers, Load Balancing,
Discretization, Distributed Linear
Algebra)
Kokkos–SupportCommunity
(ApplicationSupport,DeveloperTraining)
ApplicationsMiniApps
std::thread OpenMP CUDA ROCm
Kokkos	Tools
42
§ Utilities
§ KernelFilter: Enable/Disable	Profiling	for	a	selection	of	Kernels
§ Kernel	Inspection
§ KernelLogger:	Runtime	information	about	entering/leaving	Kernels	and	
Regions
§ KernelTimer:	Postprocessinginformation	about	Kernel	and	Region	Times
§ Memory	Analysis
§ MemoryHighWater: Maximum	Memory	Footprint	over	whole	run
§ MemoryUsage:	Per	Memory	Space	Utilization	Timeline
§ MemoryEvents:	Per	Memory	Space	Allocation	and	Deallocations
§ Third	Party	Connector
§ VTune Connector:	Mark	Kernels	as	Frames	inside	of	Vtune
§ VTune Focused	Connector:	Mark	Kernels	as	Frames	+	start/stop	profiling
https://github.com/kokkos/kokkos-tools
Kokkos-Tools:	Example	MemUsage
§ Tools	are	loaded	at	runtime
§ Profile	actual	release	builds	of	applications
§ Set	via:	export	KOKKOS_PROFILE_LIBRARY=[PATH_TO_PROFILING_LIB]
§ Output	depends	on	tool
§ Often	per	process	file
§ MemoryUsage provides	per	MemorySpace utilization	timelines
§ Time	starts	with	Kokkos::initialize
§ HOSTNAME-PROCESSID-CudaUVM.memspace_usage
43
# Space CudaUVM
# Time(s) Size(MB) HighWater(MB) HighWater-Process(MB)
0.317260 38.1 38.1 81.8
0.377285 0.0 38.1 158.1
0.384785 38.1 38.1 158.1
0.441988 0.0 38.1 158.1
Kokkos-Kernels
44
§ Provide	BLAS	(1,2,3);	Sparse;	Graph	and	Tensor	Kernels
§ No	required	dependencies	other	than	Kokkos
§ Local	kernels	(no	MPI)	
§ Hooks	in	TPLs	such	as	MKL	or	cuBLAS/cuSparse where	applicable
§ Provide	kernels	for	all	levels	of	hierarchical	parallelism:
§ Global	Kernels:	use	all	execution	resources	available
§ Team	Level	Kernels:	use	a	subset	of	threads	for	execution
§ Thread	Level	Kernels:	utilize	vectorization inside	the	kernel
§ Serial	Kernels:	provide	elemental	functions	(OpenMP declare	SIMD)
§ Work	started	based	on	customer	priorities;	expect	multi-year	effort	for	
broad	coverage
§ People:	Many	developers	from	Trilinos contribute
§ Consolidate	node	level	reusable	kernels	previously	distributed	 over	multiple	
packages
Kokkos-Kernels:	Dense	Blas	Example
45
0
5E+10
1E+11
1.5E+11
2E+11
2.5E+11
3 5 10 15
PerformanceGFlop/s
Matrix Block Size
Batched DGEMM 32k blocks
MKL @ KNL KK @ KNL
CUBLAS @ P100 KK @ P100
0
5E+10
1E+11
1.5E+11
2E+11
3 5 10 15
PerformanceGFlop/s
Matrix Block Size
BatchedTRSM 32k blocks
MKL @ KNL KK @ KNL
CUBLAS @ P100 KK @ P100
§ Batched	small	matrices	using	an	interleaved	memory	layout
§ Matrix	sizes	based	on	common	physics	problems:	3,5,10,15
§ 32k	small	matrices	
§ Vendor	libraries	get	better	for	more	and	larger	matrices
Kokkos	Users	Spread
46
§ Users	from	a	dozen	major	institutions
§ More	than	two	dozen	applications/libraries
§ Including	many	multi-million-line	projects
UsersUSERS
Backend	Optimizations
Further	Material
§ https://github.com/kokkosKokkos Github Organization
§ Kokkos: Core	library,	Containers,	Algorithms
§ Kokkos-Kernels: Sparse	and	Dense	BLAS,	Graph,	Tensor	(under	development)
§ Kokkos-Tools: Profiling	and	Debugging	
§ Kokkos-MiniApps: MiniApp repository	and	links
§ Kokkos-Tutorials: Extensive	Tutorials	with	Hands-On	Exercises
§ https://cs.sandia.gov Publications	(search	for	’Kokkos’)
§ Many	Presentations	on	Kokkos and	its	use	in	libraries	and	apps
§ Talks	at	this	GTC:
§ Carter	Edwards	S7253	“Task	Data	Parallelism”,	Today	10:00,	211B
§ Ramanan Sankaran,	S7561	“High	Pres.	Reacting	Flows”,	Today	1:30,	212B
§ Pierre	Kestener,	S7166	“High	Res.	Fluid	Dynamics”,	Today	14:30,	212B
§ Michael	Carilli,	S7148	“Liquid	Rocket	Simulations”,	Today	16:00,	212B
§ Panel,	S7564	“Accelerator	Programming	Ecosystems”,	Tuesday	16:00,	Ball3
§ Training	Lab,	L7107	“Kokkos,	Manycore PP”,	Wednesday	16:00,	LL21E 47
http://www.github.com/kokkos

More Related Content

What's hot

便利なNFC ~利用シーンと技術の動向~
便利なNFC  ~利用シーンと技術の動向~便利なNFC  ~利用シーンと技術の動向~
便利なNFC ~利用シーンと技術の動向~NFC Forum
 
犬でもわかる公開鍵暗号
犬でもわかる公開鍵暗号犬でもわかる公開鍵暗号
犬でもわかる公開鍵暗号akakou
 
10 Useful Asciidoctor Tips
10 Useful Asciidoctor Tips10 Useful Asciidoctor Tips
10 Useful Asciidoctor TipsAndres Almiray
 
OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜
OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜
OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜Masaru Kurahayashi
 
Unityネイティブプラグインの勧め
Unityネイティブプラグインの勧めUnityネイティブプラグインの勧め
Unityネイティブプラグインの勧めKLab Inc. / Tech
 
今なら間に合う分散型IDとEntra Verified ID
今なら間に合う分散型IDとEntra Verified ID今なら間に合う分散型IDとEntra Verified ID
今なら間に合う分散型IDとEntra Verified IDNaohiro Fujie
 
PHPで大規模ブラウザゲームを開発してわかったこと
PHPで大規模ブラウザゲームを開発してわかったことPHPで大規模ブラウザゲームを開発してわかったこと
PHPで大規模ブラウザゲームを開発してわかったことKentaro Matsui
 
T69 c++cli ネイティブライブラリラッピング入門
T69 c++cli ネイティブライブラリラッピング入門T69 c++cli ネイティブライブラリラッピング入門
T69 c++cli ネイティブライブラリラッピング入門伸男 伊藤
 
サンプルで学ぶAlloy
サンプルで学ぶAlloyサンプルで学ぶAlloy
サンプルで学ぶAlloyNSaitoNmiri
 
SSE4.2の文字列処理命令の紹介
SSE4.2の文字列処理命令の紹介SSE4.2の文字列処理命令の紹介
SSE4.2の文字列処理命令の紹介MITSUNARI Shigeo
 
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようPythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようShinya Takamaeda-Y
 
C++のビルド高速化について
C++のビルド高速化についてC++のビルド高速化について
C++のビルド高速化についてAimingStudy
 
アジャイルなオフショア開発(第51回オフショア開発勉強会)
アジャイルなオフショア開発(第51回オフショア開発勉強会)アジャイルなオフショア開発(第51回オフショア開発勉強会)
アジャイルなオフショア開発(第51回オフショア開発勉強会)Arata Fujimura
 
Vivliostyle.js における CSS Paged Media の実装
Vivliostyle.js における CSS Paged Media の実装Vivliostyle.js における CSS Paged Media の実装
Vivliostyle.js における CSS Paged Media の実装Shinyu Murakami
 
UIbuilderを使ったフロントエンド開発
UIbuilderを使ったフロントエンド開発UIbuilderを使ったフロントエンド開発
UIbuilderを使ったフロントエンド開発Atsushi Kojo
 
ドローンの仕組み( #ABC2015S )
ドローンの仕組み( #ABC2015S )ドローンの仕組み( #ABC2015S )
ドローンの仕組み( #ABC2015S )博宣 今村
 
Goで実装した UPSIDERの決済金額リミット機能
Goで実装した UPSIDERの決済金額リミット機能 Goで実装した UPSIDERの決済金額リミット機能
Goで実装した UPSIDERの決済金額リミット機能 Miki Masumoto
 
人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館
人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館
人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館Fujio Toriumi
 
KYC and identity on blockchain
KYC and identity on blockchainKYC and identity on blockchain
KYC and identity on blockchainmosa siru
 

What's hot (20)

便利なNFC ~利用シーンと技術の動向~
便利なNFC  ~利用シーンと技術の動向~便利なNFC  ~利用シーンと技術の動向~
便利なNFC ~利用シーンと技術の動向~
 
犬でもわかる公開鍵暗号
犬でもわかる公開鍵暗号犬でもわかる公開鍵暗号
犬でもわかる公開鍵暗号
 
10 Useful Asciidoctor Tips
10 Useful Asciidoctor Tips10 Useful Asciidoctor Tips
10 Useful Asciidoctor Tips
 
OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜
OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜
OpenID Connect 入門 〜コンシューマーにおけるID連携のトレンド〜
 
Unityネイティブプラグインの勧め
Unityネイティブプラグインの勧めUnityネイティブプラグインの勧め
Unityネイティブプラグインの勧め
 
今なら間に合う分散型IDとEntra Verified ID
今なら間に合う分散型IDとEntra Verified ID今なら間に合う分散型IDとEntra Verified ID
今なら間に合う分散型IDとEntra Verified ID
 
PHPで大規模ブラウザゲームを開発してわかったこと
PHPで大規模ブラウザゲームを開発してわかったことPHPで大規模ブラウザゲームを開発してわかったこと
PHPで大規模ブラウザゲームを開発してわかったこと
 
T69 c++cli ネイティブライブラリラッピング入門
T69 c++cli ネイティブライブラリラッピング入門T69 c++cli ネイティブライブラリラッピング入門
T69 c++cli ネイティブライブラリラッピング入門
 
サンプルで学ぶAlloy
サンプルで学ぶAlloyサンプルで学ぶAlloy
サンプルで学ぶAlloy
 
SSE4.2の文字列処理命令の紹介
SSE4.2の文字列処理命令の紹介SSE4.2の文字列処理命令の紹介
SSE4.2の文字列処理命令の紹介
 
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようPythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
 
C++のビルド高速化について
C++のビルド高速化についてC++のビルド高速化について
C++のビルド高速化について
 
アジャイルなオフショア開発(第51回オフショア開発勉強会)
アジャイルなオフショア開発(第51回オフショア開発勉強会)アジャイルなオフショア開発(第51回オフショア開発勉強会)
アジャイルなオフショア開発(第51回オフショア開発勉強会)
 
Vivliostyle.js における CSS Paged Media の実装
Vivliostyle.js における CSS Paged Media の実装Vivliostyle.js における CSS Paged Media の実装
Vivliostyle.js における CSS Paged Media の実装
 
UIbuilderを使ったフロントエンド開発
UIbuilderを使ったフロントエンド開発UIbuilderを使ったフロントエンド開発
UIbuilderを使ったフロントエンド開発
 
ドローンの仕組み( #ABC2015S )
ドローンの仕組み( #ABC2015S )ドローンの仕組み( #ABC2015S )
ドローンの仕組み( #ABC2015S )
 
Linked Data (再)入門
Linked Data (再)入門Linked Data (再)入門
Linked Data (再)入門
 
Goで実装した UPSIDERの決済金額リミット機能
Goで実装した UPSIDERの決済金額リミット機能 Goで実装した UPSIDERの決済金額リミット機能
Goで実装した UPSIDERの決済金額リミット機能
 
人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館
人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館
人工知能は人狼の夢を見るか-日本デジタルゲーム学会年次大会2013@函館
 
KYC and identity on blockchain
KYC and identity on blockchainKYC and identity on blockchain
KYC and identity on blockchain
 

Similar to The Kokkos C++ Performance Portability EcoSystem

Score (smart contract for icon)
Score (smart contract for icon) Score (smart contract for icon)
Score (smart contract for icon) Doyun Hwang
 
Applying Linear Optimization Using GLPK
Applying Linear Optimization Using GLPKApplying Linear Optimization Using GLPK
Applying Linear Optimization Using GLPKJeremy Chen
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
TDC2018SP | Trilha .Net - Novidades do C# 7 e 8
TDC2018SP | Trilha .Net - Novidades do C# 7 e 8TDC2018SP | Trilha .Net - Novidades do C# 7 e 8
TDC2018SP | Trilha .Net - Novidades do C# 7 e 8tdc-globalcode
 
Dzanan_Bajgoric_C2CUDA_MscThesis_Present
Dzanan_Bajgoric_C2CUDA_MscThesis_PresentDzanan_Bajgoric_C2CUDA_MscThesis_Present
Dzanan_Bajgoric_C2CUDA_MscThesis_PresentDžanan Bajgorić
 
C sharp 8.0 new features
C sharp 8.0 new featuresC sharp 8.0 new features
C sharp 8.0 new featuresMSDEVMTL
 
C sharp 8.0 new features
C sharp 8.0 new featuresC sharp 8.0 new features
C sharp 8.0 new featuresMiguel Bernard
 
What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++Microsoft
 
mloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentmloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentDavid Galeano
 
Design Patterns - Compiler Case Study - Hands-on Examples
Design Patterns - Compiler Case Study - Hands-on ExamplesDesign Patterns - Compiler Case Study - Hands-on Examples
Design Patterns - Compiler Case Study - Hands-on ExamplesGanesh Samarthyam
 
Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012aleks-f
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortStefan Marr
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Christian Peel
 
JVM code reading -- C2
JVM code reading -- C2JVM code reading -- C2
JVM code reading -- C2ytoshima
 
Why learn new programming languages
Why learn new programming languagesWhy learn new programming languages
Why learn new programming languagesJonas Follesø
 

Similar to The Kokkos C++ Performance Portability EcoSystem (20)

CppTutorial.ppt
CppTutorial.pptCppTutorial.ppt
CppTutorial.ppt
 
Cpp tutorial
Cpp tutorialCpp tutorial
Cpp tutorial
 
Score (smart contract for icon)
Score (smart contract for icon) Score (smart contract for icon)
Score (smart contract for icon)
 
Applying Linear Optimization Using GLPK
Applying Linear Optimization Using GLPKApplying Linear Optimization Using GLPK
Applying Linear Optimization Using GLPK
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
TDC2018SP | Trilha .Net - Novidades do C# 7 e 8
TDC2018SP | Trilha .Net - Novidades do C# 7 e 8TDC2018SP | Trilha .Net - Novidades do C# 7 e 8
TDC2018SP | Trilha .Net - Novidades do C# 7 e 8
 
Dzanan_Bajgoric_C2CUDA_MscThesis_Present
Dzanan_Bajgoric_C2CUDA_MscThesis_PresentDzanan_Bajgoric_C2CUDA_MscThesis_Present
Dzanan_Bajgoric_C2CUDA_MscThesis_Present
 
C sharp 8.0 new features
C sharp 8.0 new featuresC sharp 8.0 new features
C sharp 8.0 new features
 
C sharp 8.0 new features
C sharp 8.0 new featuresC sharp 8.0 new features
C sharp 8.0 new features
 
What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++
 
mloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentmloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game development
 
Einführung in MongoDB
Einführung in MongoDBEinführung in MongoDB
Einführung in MongoDB
 
Design Patterns - Compiler Case Study - Hands-on Examples
Design Patterns - Compiler Case Study - Hands-on ExamplesDesign Patterns - Compiler Case Study - Hands-on Examples
Design Patterns - Compiler Case Study - Hands-on Examples
 
Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
Novidades do c# 7 e 8
Novidades do c# 7 e 8Novidades do c# 7 e 8
Novidades do c# 7 e 8
 
JVM code reading -- C2
JVM code reading -- C2JVM code reading -- C2
JVM code reading -- C2
 
Why learn new programming languages
Why learn new programming languagesWhy learn new programming languages
Why learn new programming languages
 

More from inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

The Kokkos C++ Performance Portability EcoSystem

  • 1. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. Kokkos: The C++ Performance Portability Programming Model Christian Trott (crtrott@sandia.gov), Carter Edwards D. Sunderland, N. Ellingwood, D. Ibanez, S. Hammond, S. Rajamanickam, K. Kim, M. Deveci, M. Hoemmen, G. Center for Computing Research, Sandia National Laboratories, NM SAND2017-4935 C
  • 2. New Programming Models § HPC is at a Crossroads § Diversifying Hardware Architectures § More parallelism necessitates paradigm shift from MPI-only § Need for New Programming Models § Performance Portability: OpenMP 4.5, OpenACC, Kokkos, RAJA, SyCL, C++20?, … § Resilience and Load Balancing: Legion, HPX, UPC++, ... § Vendor decoupling drives external development 2
  • 3. New Programming Models § HPC is at a Crossroads § Diversifying Hardware Architectures § More parallelism necessitates paradigm shift from MPI-only § Need for New Programming Models § Performance Portability: OpenMP 4.5, OpenACC, Kokkos, RAJA, SyCL, C++20?, … § Resilience and Load Balancing: Legion, HPX, UPC++, ... § Vendor decoupling drives external development 3 What is Kokkos? What is new? Why should you trust us?
  • 5. Performance Portability through Abstraction Kokkos Execution Spaces (“Where”) Execution Patterns (“What”) Execution Policies (“How”) - N-Level - Support Heterogeneous Execution - parallel_for/reduce/scan, task spawn - Enable nesting - Range, Team, Task-Dag - Dynamic / Static Scheduling - Support non-persistent scratch-pads Memory Spaces (“Where”) Memory Layouts (“How”) Memory Traits - Multiple-Levels - Logical Space (think UVM vs explicit) - Architecture dependent index-maps - Also needed for subviews - Access Intent: Stream, Random, … - Access Behavior: Atomic - Enables special load paths: i.e. texture Parallel ExecutionData Structures Separating of Concerns for Future Systems…
  • 6. Capability Matrix 6 Implementation Technique ParallelLoops Parallel Reduction TightlyNested Loops Non-tightly NestedLoops TaskParallelism DataAllocations DataTransfers AdvancedData Abstractions Kokkos C++ Abstraction X X X X X X X X OpenMP Directives X X X X X X X - OpenACC Directives X X X X - X X - CUDA Extension (X) - (X) X - X X - OpenCL Extension (X) - (X) X - X X - C++AMP Extension X - X - - X X - Raja C++ Abstraction X X X (X) - - - - TBB C++ Abstraction X X X X X X - - C++17 Language X - - - (X) X (X) (X) Fortran2008 Language X - - - - X (X) -
  • 7. Example: Conjugent Gradient Solver § Simple Iterative Linear Solver § For example used in MiniFE § Uses only three math operations: § Vector addition (AXPBY) § Dot product (DOT) § Sparse Matrix Vector multiply (SPMV) § Data management with Kokkos Views: 7 View<double*,HostSpace,MemoryTraits<Unmanaged> > h_x(x_in, nrows); View<double*> x("x",nrows); deep_copy(x,h_x);
  • 8. CG Solve: The AXPBY 8 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; }
  • 9. CG Solve: The AXPBY 9 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; } Parallel Pattern: for loop
  • 10. CG Solve: The AXPBY 10 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; } String Label: Profiling/Debugging
  • 11. CG Solve: The AXPBY 11 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; } Execution Policy: do n iterations
  • 12. CG Solve: The AXPBY 12 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; } Iteration handle: integer index
  • 13. CG Solve: The AXPBY 13 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; } Loop Body
  • 14. CG Solve: The AXPBY 14 void axpby(int n, View<double*> z, double alpha, View<const double*> x, double beta, View<const double*> y) { parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) { z(i) = alpha*x(i) + beta*y(i); }); } § Simple data parallel loop: Kokkos::parallel_for § Easy to express in most programming models § Bandwidth bound § Serial Implementation: § Kokkos Implementation: void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) { for(int i=0; i<n; i++) z[i] = alpha*x[i] + beta*y[i]; }
  • 15. CG Solve: The Dot Product 15 double dot(int n, View<const double*> x, View<const double*> y) { double x_dot_y = 0.0; parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) { sum += x[i]*y[i]; }, x_dot_y); return x_dot_y; } § Simple data parallel loop with reduction: Kokkos::parallel_reduce § Non trivial in CUDA due to lack of built-in reduction support § Bandwidth bound § Serial Implementation: § Kokkos Implementation: double dot(int n, const double* x, const double* y) { double sum = 0.0; for(int i=0; i<n; i++) sum += x[i]*y[i]; return sum; }
  • 16. CG Solve: The Dot Product 16 double dot(int n, View<const double*> x, View<const double*> y) { double x_dot_y = 0.0; parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) { sum += x[i]*y[i]; }, x_dot_y); return x_dot_y; } § Simple data parallel loop with reduction: Kokkos::parallel_reduce § Non trivial in CUDA due to lack of built-in reduction support § Bandwidth bound § Serial Implementation: § Kokkos Implementation: double dot(int n, const double* x, const double* y) { double sum = 0.0; for(int i=0; i<n; i++) sum += x[i]*y[i]; return sum; } Parallel Pattern: loop with reduction
  • 17. CG Solve: The Dot Product 17 double dot(int n, View<const double*> x, View<const double*> y) { double x_dot_y = 0.0; parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) { sum += x[i]*y[i]; }, x_dot_y); return x_dot_y; } § Simple data parallel loop with reduction: Kokkos::parallel_reduce § Non trivial in CUDA due to lack of built-in reduction support § Bandwidth bound § Serial Implementation: § Kokkos Implementation: double dot(int n, const double* x, const double* y) { double sum = 0.0; for(int i=0; i<n; i++) sum += x[i]*y[i]; return sum; } Iteration Index + Thread-Local Red. Varible
  • 18. CG Solve: The Dot Product 18 double dot(int n, View<const double*> x, View<const double*> y) { double x_dot_y = 0.0; parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) { sum += x[i]*y[i]; }, x_dot_y); return x_dot_y; } § Simple data parallel loop with reduction: Kokkos::parallel_reduce § Non trivial in CUDA due to lack of built-in reduction support § Bandwidth bound § Serial Implementation: § Kokkos Implementation: double dot(int n, const double* x, const double* y) { double sum = 0.0; for(int i=0; i<n; i++) sum += x[i]*y[i]; return sum; }
  • 19. CG Solve: The SPMV § Loop over rows § Dot product of matrix row with a vector § Example of Non-Tightly nested loops § Random access on the vector (Texture fetch on GPUs) 19 void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) { for(int row=0; row<nrows; ++row) { double sum = 0.0; int row_start=A_row_offsets[row]; int row_end=A_row_offsets[row+1]; for(int i=row_start; i<row_end; ++i) { sum += A_vals[i]*x[A_cols[i]]; } y[row] = sum; } }
  • 20. CG Solve: The SPMV § Loop over rows § Dot product of matrix row with a vector § Example of Non-Tightly nested loops § Random access on the vector (Texture fetch on GPUs) 20 void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) { for(int row=0; row<nrows; ++row) { double sum = 0.0; int row_start=A_row_offsets[row]; int row_end=A_row_offsets[row+1]; for(int i=row_start; i<row_end; ++i) { sum += A_vals[i]*x[A_cols[i]]; } y[row] = sum; } } Outer loop over matrix rows
  • 21. CG Solve: The SPMV § Loop over rows § Dot product of matrix row with a vector § Example of Non-Tightly nested loops § Random access on the vector (Texture fetch on GPUs) 21 void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) { for(int row=0; row<nrows; ++row) { double sum = 0.0; int row_start=A_row_offsets[row]; int row_end=A_row_offsets[row+1]; for(int i=row_start; i<row_end; ++i) { sum += A_vals[i]*x[A_cols[i]]; } y[row] = sum; } } Inner dot product row x vector
  • 22. CG Solve: The SPMV § Loop over rows § Dot product of matrix row with a vector § Example of Non-Tightly nested loops § Random access on the vector (Texture fetch on GPUs) 22 void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) { for(int row=0; row<nrows; ++row) { double sum = 0.0; int row_start=A_row_offsets[row]; int row_end=A_row_offsets[row+1]; for(int i=row_start; i<row_end; ++i) { sum += A_vals[i]*x[A_cols[i]]; } y[row] = sum; } }
  • 23. CG Solve: The SPMV 23 void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y, View<const double*, MemoryTraits< RandomAccess>> x) { #ifdef KOKKOS_ENABLE_CUDA int rows_per_team = 64;int team_size = 64; #else int rows_per_team = 512;int team_size = 1; #endif parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > > ((nrows+rows_per_team-1)/rows_per_team,team_size,8), KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) { const int first_row = team.league_rank()*rows_per_team; const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows; parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) { const int row_start=A_row_offsets[row]; const int row_length=A_row_offsets[row+1]-row_start; double y_row; parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) { sum += A_vals(i+row_start)*x(A_cols(i+row_start)); } , y_row); y(row) = y_row; }); }); }
  • 24. CG Solve: The SPMV 24 void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y, View<const double*, MemoryTraits< RandomAccess>> x) { #ifdef KOKKOS_ENABLE_CUDA int rows_per_team = 64;int team_size = 64; #else int rows_per_team = 512;int team_size = 1; #endif parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > > ((nrows+rows_per_team-1)/rows_per_team,team_size,8), KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) { const int first_row = team.league_rank()*rows_per_team; const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows; parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) { const int row_start=A_row_offsets[row]; const int row_length=A_row_offsets[row+1]-row_start; double y_row; parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) { sum += A_vals(i+row_start)*x(A_cols(i+row_start)); } , y_row); y(row) = y_row; }); }); } Team Parallelism over Row Worksets
  • 25. CG Solve: The SPMV 25 void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y, View<const double*, MemoryTraits< RandomAccess>> x) { #ifdef KOKKOS_ENABLE_CUDA int rows_per_team = 64;int team_size = 64; #else int rows_per_team = 512;int team_size = 1; #endif parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > > ((nrows+rows_per_team-1)/rows_per_team,team_size,8), KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) { const int first_row = team.league_rank()*rows_per_team; const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows; parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) { const int row_start=A_row_offsets[row]; const int row_length=A_row_offsets[row+1]-row_start; double y_row; parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) { sum += A_vals(i+row_start)*x(A_cols(i+row_start)); } , y_row); y(row) = y_row; }); }); } Distribute rows in workset over team-threads
  • 26. CG Solve: The SPMV 26 void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y, View<const double*, MemoryTraits< RandomAccess>> x) { #ifdef KOKKOS_ENABLE_CUDA int rows_per_team = 64;int team_size = 64; #else int rows_per_team = 512;int team_size = 1; #endif parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > > ((nrows+rows_per_team-1)/rows_per_team,team_size,8), KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) { const int first_row = team.league_rank()*rows_per_team; const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows; parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) { const int row_start=A_row_offsets[row]; const int row_length=A_row_offsets[row+1]-row_start; double y_row; parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) { sum += A_vals(i+row_start)*x(A_cols(i+row_start)); } , y_row); y(row) = y_row; }); }); } Row x Vector dot product
  • 27. CG Solve: The SPMV 27 void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y, View<const double*, MemoryTraits< RandomAccess>> x) { #ifdef KOKKOS_ENABLE_CUDA int rows_per_team = 64;int team_size = 64; #else int rows_per_team = 512;int team_size = 1; #endif parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > > ((nrows+rows_per_team-1)/rows_per_team,team_size,8), KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) { const int first_row = team.league_rank()*rows_per_team; const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows; parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) { const int row_start=A_row_offsets[row]; const int row_length=A_row_offsets[row+1]-row_start; double y_row; parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) { sum += A_vals(i+row_start)*x(A_cols(i+row_start)); } , y_row); y(row) = y_row; }); }); }
  • 28. CG Solve: Performance 28 0 10 20 30 40 50 60 70 80 90 AXPBY-100 AXPBY-200 DOT-100 DOT-200 SPMV-100 SPMV-200 Performance [Gflop/s] NVIDIA P100 / IBM Power8 OpenACC CUDA Kokkos OpenMP 0 10 20 30 40 50 60 AXPBY-100 AXPBY-200 DOT-100 DOT-200 SPMV-100 SPMV-200 Performance [Gflop/s] Intel KNL 7250 OpenACC Kokkos OpenMP TBB (Flat) TBB (Hierarchical) § Comparison with other Programming Models § Straight forward implementation of kernels § OpenMP 4.5 is immature at this point § Two problem sizes: 100x100x100 and 200x200x200 elements
  • 29. Custom Reductions With Lambdas § Added Capability to do Custom Reductions with Lambdas § Provide built-in reducers for common operations § Add,Min,Max,Prod,MinLoc,MaxLoc,MinMaxLoc,And,Or,Xor, § Users can implement their own reducers § Example Max reduction: 29 double result parallel_reduce(N, KOKKOS_LAMBDA(const int& i, double& lmax) { if(lmax < a(i)) lmax = a(i); },Max<double>(result)); 0 100 200 300 400 500 600 K80 P100 KNL BandwidthinGB/s Reduction Performance Sum Max MaxLoc
  • 30. New Features: MDRangePolicy § Many people perform structured grid calculations § Sandia’s codes are predominantly unstructured though § MDRangePolicy introduced for tightly nested loops § Usecase corresponding to OpenMP collapse clause § Optionally set iteration order and tiling: 30 void launch (int N0, int N1, [ARGS]) { parallel_for(MDRangePolicy<Rank<3>>({0,0,0},{N0,N1,N2}), KOKKOS_LAMBDA (int i0, int i1) {/*...*/}); } MDRangePolicy<Rank<3,Iterate::Right,Iterate::Left>> ({0,0,0},{N0,N1,N2},{T0,T1,T2}) 0 100 200 300 400 500 600 KNL P100 Gflop/s Albany Kernel Naïve Best Raw MDRange
  • 31. New Features: Task Graphs § Task dags are an important class of algorithmic structures § Used for algorithms with complex data dependencies § For example triangular solve § Kokkos tasking is on-node § Future based, not explicit data centric (as for example Legion) § Tasks return futures § Tasks can depend on futures § Respawn of tasks possible § Tasks can spawn other tasks § Tasks can have data parallel loops § I.e. a task can utilize multiple threads like the hyper threads on a core or a CUDA block 31 Carter Edwards S7253 “Task Data Parallelism” , Right after this talk.
  • 32. New Features: Pascal Support § Pascal GPUs Provide a set of new Capabilities § Much better memory subsystem § NVLink (next slide) § Hardware support for double precision atomic add § Generally significant speedup 3-5x over K80 for Sandia Apps 32 0 20 40 60 80 100 120 140 160 4000 32000 256000 2048000 MillionAtomstepsperSecond Number of Atoms LammpsTersoffPotential K40 K80 P100 K40 Atomic K80 Atomic P100 Atomic 0.001 0.01 0.1 1 10 100 1000 1 8 64 512 4096 BandwidthinGB/s Update Array Size Atomic Bandwidth
  • 33. New Features: HBM Support § New architectures with HBM: Intel KNL, NVIDIA P100 § Generally three types of allocations: § Page pinned in DDR § Page pinned in HBM § Page migratable by OS or hardware caching § Kokkos supports all three on both architectures § For Cuda backend: CudaHostPinnedSpace,CudaSpace and CudaUVMSpace § E.g.: Kokkos::View<double*, CudaUVMSpace> a(”A”,N); § P100 Bandwidth with and without data Reuse 33 1 10 100 1000 0.125 0.25 0.5 1 2 4 8 16 32 64 BandwidthGB/s Problem Size [GB] HostPinned 1 HostPinned 32 UVM 1 UVM 32
  • 34. Upcoming Features § Support for OpenMP 4.5+ Target Backend § Experimentally available on github § CUDA will stay preferred backend § Maybe support for FPGAs in the future? § Help maturing OpenMP 4.5+ compilers § Support for AMD ROCm Backend § Experimentally available on github § Mainly developed by AMD § Support for APUs and discrete GPUs § Expect maturation fall 2017 34
  • 35. Beyond Capabilities § Using Kokkos is invasive, you can’t just swap it out § Significant part of data structures need to be taken over § Function markings everywhere § It is not enough to have initial Capabilities § Robustness comes from utilizations and experience § Different types of application and coding styles will expose different corner cases § Applications need libraries § Interaction with TPLs such as BLAS must work § Many library capabilities must be reimplemented in the programming model § Applications need tools § Profiling and Debugging capabilities are required for serious work 35
  • 36. Timeline 36 Initial Kokkos: Linear Algebra for Trilinos Restart of Kokkos: Scope now Programming Model Mantevo MiniApps: Compare Kokkos to other Models LAMMPS: Demonstrate Legacy App Transition Trilinos: Move Tpetra over to use Kokkos Views Multiple Apps start exploring (Albany, Uintah, …) Sandia Multiday Tutorial (~80 attendees) Sandia Decision to prefer Kokkos over other models Github Release of Kokkos 2.0 Kokkos-Kernels and Kokkos-Tools Release DOE Exascale Computing Project starts 2008 2011 2013 2012 2014 2016 2015 2017
  • 37. 17" 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" FOM+(Z/s)+ LULESH+Figure+of+Merit+Results+(Problem+60)+ HSW"1x16" HSW"1x32" P8"1x40"XL" KNC"1x224" ARM64"1x8" NV"K40"Higher" is" Better" Results"by"Dennis"Dinge,"Christian"Trott"and"Si"Hammond" Initial Demonstrations (2012-2015) 37 § Demonstrate Feasibility of Performance Portability § Development of a number of MiniApps from different science domains § Demonstrate Low Performance Loss versus Native Models § MiniApps are implemented in various programming models § DOE TriLab Collaboration § Show Kokkos works for other labs app § Note this is historical data: Improvements were found, RAJA implemented similar optimization etc.
  • 38. Training the User-Base § Typical Legacy Application Developer § Science Background § Mostly Serial Coding (MPI apps usually have communication layer few people touch) § Little hardware background, little parallel programming experience § Not sufficient to teach Programming Model Syntax § Need training in parallel programming techniques § Teach fundamental hardware knowledge (how does CPU, MIC and GPU differ, and what does it mean for my code) § Need training in performance profiling § Regular Kokkos Tutorials § ~200 slides, 9 hands-on exercises to teach parallel programming techniques, performance considerations and Kokkos § Now dedicated ECP Kokkos support project: develop online support community § ~200 HPC developers (mostly from DOE labs) had Kokkos training so far38
  • 39. Keeping Applications Happy § Never underestimate developers ability to find new corner cases!! § Having a Programming Model deployed in MiniApps or a single big app is very different from having half a dozen multi-million line code customers. § 538 Issues in 24 months § 28% are small enhancements § 18% bigger feature requests § 24% are bugs: often corner cases 39 0 100 200 300 400 500 600 Issues Issues since 2015 Enhancements Feature Request Bug Compiler Issue Questions Other § Example: Subviews § Initially data type needed to match including compile time dimensions § Allow compile/runtime conversion § Allow Layout conversion if possible § Automatically find best layout § Add subviewpatterns
  • 40. Testing and Software Quality Develop Release Version Compilers GCC (4.8-6.3), Clang (3.6-4.0), Intel (15.0-18.0), IBM (13.1.5, 14.0), PGI (17.3), NVIDIA (7.0-8.0) Hardware Intel Haswell, Intel KNL, ARM v8, IBM Power8, NVIDIA K80, NVIDIA P100 Backends OpenMP, Pthreads, Serial, CUDA Warnings as Errors Feature Development/Tests Review Tests Master Integration Trilinos, LAMMPS, … Multi config integration test Nightly, UnitTests New Features are developed on forks, and branches. Limited number of developers can push to develop branch. Pull requests are reviewed/tested. Each merge into master is minor release. Extensive integration test suite ensures backward compatibility, and catching of unit-test coverage gaps.
  • 41. Building an EcoSystem 41 Algorithms (Random, Sort) Containers (Map, CrsGraph, Mem Pool) Kokkos (Parallel Execution, Data Allocation, Data Transfer) Kokkos – Kernels (Sparse/Dense BLAS, Graph Kernels, Tensor Kernels) Kokkos–Tools (KokkosawareProfilingandDebuggingTools) Trilinos (Linear Solvers, Load Balancing, Discretization, Distributed Linear Algebra) Kokkos–SupportCommunity (ApplicationSupport,DeveloperTraining) ApplicationsMiniApps std::thread OpenMP CUDA ROCm
  • 42. Kokkos Tools 42 § Utilities § KernelFilter: Enable/Disable Profiling for a selection of Kernels § Kernel Inspection § KernelLogger: Runtime information about entering/leaving Kernels and Regions § KernelTimer: Postprocessinginformation about Kernel and Region Times § Memory Analysis § MemoryHighWater: Maximum Memory Footprint over whole run § MemoryUsage: Per Memory Space Utilization Timeline § MemoryEvents: Per Memory Space Allocation and Deallocations § Third Party Connector § VTune Connector: Mark Kernels as Frames inside of Vtune § VTune Focused Connector: Mark Kernels as Frames + start/stop profiling https://github.com/kokkos/kokkos-tools
  • 43. Kokkos-Tools: Example MemUsage § Tools are loaded at runtime § Profile actual release builds of applications § Set via: export KOKKOS_PROFILE_LIBRARY=[PATH_TO_PROFILING_LIB] § Output depends on tool § Often per process file § MemoryUsage provides per MemorySpace utilization timelines § Time starts with Kokkos::initialize § HOSTNAME-PROCESSID-CudaUVM.memspace_usage 43 # Space CudaUVM # Time(s) Size(MB) HighWater(MB) HighWater-Process(MB) 0.317260 38.1 38.1 81.8 0.377285 0.0 38.1 158.1 0.384785 38.1 38.1 158.1 0.441988 0.0 38.1 158.1
  • 44. Kokkos-Kernels 44 § Provide BLAS (1,2,3); Sparse; Graph and Tensor Kernels § No required dependencies other than Kokkos § Local kernels (no MPI) § Hooks in TPLs such as MKL or cuBLAS/cuSparse where applicable § Provide kernels for all levels of hierarchical parallelism: § Global Kernels: use all execution resources available § Team Level Kernels: use a subset of threads for execution § Thread Level Kernels: utilize vectorization inside the kernel § Serial Kernels: provide elemental functions (OpenMP declare SIMD) § Work started based on customer priorities; expect multi-year effort for broad coverage § People: Many developers from Trilinos contribute § Consolidate node level reusable kernels previously distributed over multiple packages
  • 45. Kokkos-Kernels: Dense Blas Example 45 0 5E+10 1E+11 1.5E+11 2E+11 2.5E+11 3 5 10 15 PerformanceGFlop/s Matrix Block Size Batched DGEMM 32k blocks MKL @ KNL KK @ KNL CUBLAS @ P100 KK @ P100 0 5E+10 1E+11 1.5E+11 2E+11 3 5 10 15 PerformanceGFlop/s Matrix Block Size BatchedTRSM 32k blocks MKL @ KNL KK @ KNL CUBLAS @ P100 KK @ P100 § Batched small matrices using an interleaved memory layout § Matrix sizes based on common physics problems: 3,5,10,15 § 32k small matrices § Vendor libraries get better for more and larger matrices
  • 46. Kokkos Users Spread 46 § Users from a dozen major institutions § More than two dozen applications/libraries § Including many multi-million-line projects UsersUSERS Backend Optimizations
  • 47. Further Material § https://github.com/kokkosKokkos Github Organization § Kokkos: Core library, Containers, Algorithms § Kokkos-Kernels: Sparse and Dense BLAS, Graph, Tensor (under development) § Kokkos-Tools: Profiling and Debugging § Kokkos-MiniApps: MiniApp repository and links § Kokkos-Tutorials: Extensive Tutorials with Hands-On Exercises § https://cs.sandia.gov Publications (search for ’Kokkos’) § Many Presentations on Kokkos and its use in libraries and apps § Talks at this GTC: § Carter Edwards S7253 “Task Data Parallelism”, Today 10:00, 211B § Ramanan Sankaran, S7561 “High Pres. Reacting Flows”, Today 1:30, 212B § Pierre Kestener, S7166 “High Res. Fluid Dynamics”, Today 14:30, 212B § Michael Carilli, S7148 “Liquid Rocket Simulations”, Today 16:00, 212B § Panel, S7564 “Accelerator Programming Ecosystems”, Tuesday 16:00, Ball3 § Training Lab, L7107 “Kokkos, Manycore PP”, Wednesday 16:00, LL21E 47