This issue is related to #84.
scf.forall Lowering Issue with Scalar
One of the issues is that scf.forall only takes tensors as operands.
In the fused GNN kernel, when we use alphabetical order for the index labels, two indices (h and i) will be fused and the intermediate tensor T will become a scalar.
/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];
void no_fusion_index_tree_alphabet_order()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}
void fusion_index_tree_scalar()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                t += B[i, k] * C[k, h];
            }
            for (j = 0 to NJ) {
                A[i, j] += t * D[h, j];
            }
            t = 0;
        }
    }
}
The current implementation lowers both indices h and i to parallel scf.forall loops. The scalar t is also an output of the parallel loop, which causes a problem during lowering.
An scf.forall example using tensor operands looks like this:
%32:2 = scf.forall (%arg2) in (%21) shared_outs(%arg3 = %arg0, %arg4 = %arg1) -> (tensor<?x4xf64>, tensor<4xf64>) {
  %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<?x4xf64> to tensor<1x4xf64>
  %extracted_slice_46 = tensor.extract_slice %arg4[%arg2] [1] [1] : tensor<4xf64> to tensor<1xf64>
  scf.forall.in_parallel { /// Terminator of scf.forall
    tensor.parallel_insert_slice %extracted_slice into %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<1x4xf64> into tensor<?x4xf64>
    tensor.parallel_insert_slice %extracted_slice_46 into %arg4[%arg2] [1] [1] : tensor<1xf64> into tensor<4xf64>
  }
}
In the fused case, %arg4 should be an f64, but scf.forall only takes tensors as operands, because scf.forall is designed to write different parts of a tensor in parallel. The lowering therefore fails with:
error: 'scf.forall' op operand #2 must be variadic of ranked tensor of any type values, but got 'f64'
A possible solution is to not make index h parallel but instead make k and j parallel; then we can use scf.parallel with a reduce operation for the scalar t, as sketched below.
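A rough sketch of what that could look like, written in the same informal pseudocode as the kernels above (the function name and the parallel for / reduce annotations are mine, not compiler output; i is shown serial here as well, since the proposal only says h should not be parallel):
/// Sketch of the proposed parallelization: h (and i) stay serial, the k loop is a
/// parallel reduction into the scalar t (which can lower to scf.parallel with a
/// reduce operation), and the j loop is an ordinary parallel loop over distinct A[i, j].
/// The "parallel for" / "reduce" annotations below are informal, not real syntax.
void fusion_index_tree_scalar_parallel_k_j()
{
    for (h = 0 to NH) {                              /// serial
        for (i = 0 to NI) {                          /// serial
            t = 0;
            parallel for (k = 0 to NK) reduce(+: t) {    /// scf.parallel + reduce
                t += B[i, k] * C[k, h];
            }
            parallel for (j = 0 to NJ) {                 /// each j writes its own A[i, j]
                A[i, j] += t * D[h, j];
            }
        }
    }
}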
Performance Issue in Parallel GNN
The other issue is that the allocation of private tensors might cause high overhead, since each thread allocates a new private tensor in each iteration.
/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];
void no_fusion_index_tree_canonical_order()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
    for (i = 0 to NI) {
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}
void fusion_index_tree_tensor()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[h] += B[i, k] * C[k, h];
            }
        }
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[h] * D[h, j];
            }
        }
        for (h = 0 to NH) {
            T[h] = 0;
        }
    }
}
In the current implementation of fusion_index_tree_tensor(), index i will be parallelized, and each new allocation of T brings overhead.
This allocation overhead could also happen for other parallel kernels with sparse output (e.g., SpGEMM and Triangle Counting).
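A sketch of where that overhead comes from, again in informal pseudocode (the alloc/free calls are only placeholders for whatever buffer allocation the lowering actually emits):
/// With i parallel, T is private to each i iteration, so every iteration (or
/// thread) pays for allocating and freeing its own NH-sized buffer.
parallel for (i = 0 to NI) {
    T_priv = alloc(NH);      /// per-iteration private allocation -> overhead
    /// ... body of fusion_index_tree_tensor(), with T replaced by T_priv ...
    free(T_priv);
}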
In this specific GNN case, one solution could be to make h and j parallel, but not i.
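A sketch of that alternative, in the same informal pseudocode (my reading of the suggestion: with i serial, a single T of size NH can be allocated once and reused across i iterations, and the parallel loops only ever write disjoint elements):
/// Alternative parallelization: i serial, h parallel in the first loop nest
/// (each h writes a distinct T[h]), j parallel in the second loop nest
/// (each j writes a distinct A[i, j]). T is allocated once, outside the i loop.
void fusion_index_tree_tensor_parallel_h_j()
{
    for (i = 0 to NI) {                      /// serial
        parallel for (h = 0 to NH) {
            for (k = 0 to NK) {
                T[h] += B[i, k] * C[k, h];
            }
        }
        for (h = 0 to NH) {                  /// serial reduction over h
            parallel for (j = 0 to NJ) {
                A[i, j] += T[h] * D[h, j];
            }
        }
        parallel for (h = 0 to NH) {
            T[h] = 0;
        }
    }
}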