This issue is related to #84.
scf.forall Lowering Issue with Scalar
One of the issues is that scf.forall only takes tensors as operands.
In the fused GNN kernel, when we use alphabetical order for the index labels, two indices (h and i) will be fused and the intermediate tensor T will become a scalar.
/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];
void no_fusion_index_tree_alphabet_order()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}
void fusion_index_tree_scalar()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                t += B[i, k] * C[k, h];
            }
            for (j = 0 to NJ) {
                A[i, j] += t * D[h, j];
            }
            t = 0;
        }
    }
}
The current implementation lowers both indices h and i to parallel scf.forall loops. The scalar t is also an output of the parallel loop, which causes a problem during lowering.
An scf.forall example using tensor operands looks like this:
%32:2 = scf.forall (%arg2) in (%21) shared_outs(%arg3 = %arg0, %arg4 = %arg1) -> (tensor<?x4xf64>, tensor<4xf64>) {
  %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<?x4xf64> to tensor<1x4xf64>
  %extracted_slice_46 = tensor.extract_slice %arg4[%arg2] [1] [1] : tensor<4xf64> to tensor<1xf64>
  scf.forall.in_parallel { /// Terminator of scf.forall
    tensor.parallel_insert_slice %extracted_slice into %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<1x4xf64> into tensor<?x4xf64>
    tensor.parallel_insert_slice %extracted_slice_46 into %arg4[%arg2] [1] [1] : tensor<1xf64> into tensor<4xf64>
  }
}
In the fused case, %arg4 should be an f64, but scf.forall only takes tensors as operands, because scf.forall is designed to write different parts of a tensor in parallel. The lowering therefore fails with:
error: 'scf.forall' op operand #2 must be variadic of ranked tensor of any type values, but got 'f64'
A possible solution is to not make index h parallel but instead make k and j parallel; then we can use scf.parallel with a reduce operation for the scalar t, as sketched below.
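A rough sketch of what that could look like, written in the same informal pseudocode as the kernels above (the function name and the parallel for / reduce annotations are mine, not compiler output; i is shown serial here as well, since the proposal only says h should not be parallel):
/// Sketch of the proposed parallelization: h (and i) stay serial, the k loop is a
/// parallel reduction into the scalar t (which can lower to scf.parallel with a
/// reduce operation), and the j loop is an ordinary parallel loop over distinct A[i, j].
/// The "parallel for" / "reduce" annotations below are informal, not real syntax.
void fusion_index_tree_scalar_parallel_k_j()
{
    for (h = 0 to NH) {                              /// serial
        for (i = 0 to NI) {                          /// serial
            t = 0;
            parallel for (k = 0 to NK) reduce(+: t) {    /// scf.parallel + reduce
                t += B[i, k] * C[k, h];
            }
            parallel for (j = 0 to NJ) {                 /// each j writes its own A[i, j]
                A[i, j] += t * D[h, j];
            }
        }
    }
}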
Performance Issue in Parallel GNN
The other issue is that the allocation of private tensors might cause high overhead, since each thread allocates a new private tensor in each iteration.
/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];
void no_fusion_index_tree_canonical_order()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
    for (i = 0 to NI) {
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}
void fusion_index_tree_tensor()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[h] += B[i, k] * C[k, h];
            }
        }
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[h] * D[h, j];
            }
        }
        for (h = 0 to NH) {
            T[h] = 0;
        }
    }
}
In the current implementation of fusion_index_tree_tensor(), index i will be parallelized, and each new allocation of T brings overhead.
This allocation overhead could also happen for other parallel kernels with sparse output (e.g., SpGEMM and Triangle Counting).
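A sketch of where that overhead comes from, again in informal pseudocode (the alloc/free calls are only placeholders for whatever buffer allocation the lowering actually emits):
/// With i parallel, T is private to each i iteration, so every iteration (or
/// thread) pays for allocating and freeing its own NH-sized buffer.
parallel for (i = 0 to NI) {
    T_priv = alloc(NH);      /// per-iteration private allocation -> overhead
    /// ... body of fusion_index_tree_tensor(), with T replaced by T_priv ...
    free(T_priv);
}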
In this specific GNN case, one solution could be to make h and j parallel, but not i.
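A sketch of that alternative, in the same informal pseudocode (my reading of the suggestion: with i serial, a single T of size NH can be allocated once and reused across i iterations, and the parallel loops only ever write disjoint elements):
/// Alternative parallelization: i serial, h parallel in the first loop nest
/// (each h writes a distinct T[h]), j parallel in the second loop nest
/// (each j writes a distinct A[i, j]). T is allocated once, outside the i loop.
void fusion_index_tree_tensor_parallel_h_j()
{
    for (i = 0 to NI) {                      /// serial
        parallel for (h = 0 to NH) {
            for (k = 0 to NK) {
                T[h] += B[i, k] * C[k, h];
            }
        }
        for (h = 0 to NH) {                  /// serial reduction over h
            parallel for (j = 0 to NJ) {
                A[i, j] += T[h] * D[h, j];
            }
        }
        parallel for (h = 0 to NH) {
            T[h] = 0;
        }
    }
}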