Double Occurrence of Parameters in Kernels Generated with CLI tool. #557

@ThrudPrimrose

If I generate a kernel with the following command in the CLI tool:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor

The launcher function compute in the generated code declares the parameter C twice, once as the first argument and once as the last, so the output does not compile as written. The generated code:


// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C, taco_tensor_t * __restrict__ w){
  int A2_dimension = (int)(A->dimensions[1]);
  int A3_dimension = (int)(A->dimensions[2]);
  float* __restrict__ A_vals = (float*)(A->vals);
  int B2_dimension = (int)(B->dimensions[1]);
  int B3_dimension = (int)(B->dimensions[2]);
  int B4_dimension = (int)(B->dimensions[3]);
  float* __restrict__ B_vals = (float*)(B->vals);
  int C1_dimension = (int)(C->dimensions[0]);
  int C2_dimension = (int)(C->dimensions[1]);
  int C3_dimension = (int)(C->dimensions[2]);
  float* __restrict__ C_vals = (float*)(C->vals);
  int w1_dimension = (int)(w->dimensions[0]);
  int w2_dimension = (int)(w->dimensions[1]);
  float* __restrict__ w_vals = (float*)(w->vals);

  int32_t i161 = blockIdx.x;
  int32_t i162 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }

  int32_t i = i161 * 256 + i162;
  if (i >= C1_dimension)
    return;

  for (int32_t j = 0; j < C2_dimension; j++) {
    int32_t jC = i * C2_dimension + j;
    for (int32_t b = 0; b < C3_dimension; b++) {
      int32_t bC = jC * C3_dimension + b;
      float tl_val = 0.0;
      for (int32_t l = 0; l < B3_dimension; l++) {
        int32_t jA = l * A2_dimension + j;
        int32_t bA = jA * A3_dimension + b;
        float tk_val = 0.0;
        for (int32_t k = 0; k < w1_dimension; k++) {
          int32_t kB = i * B2_dimension + k;
          int32_t lB = kB * B3_dimension + l;
          int32_t bB = lB * B4_dimension + b;
          int32_t bw = k * w2_dimension + b;
          tk_val = tk_val + (A_vals[bA] * B_vals[bB]) * w_vals[bw];
        }
        tl_val = tl_val + tk_val;
      }
      C_vals[bC] = C_vals[bC] + tl_val;
    }
  }
}

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w, taco_tensor_t *C) {
  int C1_dimension = (int)(C->dimensions[0]);

  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}
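To check the kernel's output once the duplicated parameter is removed by hand, a plain CPU reference of the same contraction can help. The following is a minimal sketch, not taco output: the function name contract_reference and its signature are hypothetical, and it assumes all tensors are dense and row-major, using the same flattened index arithmetic as the generated kernel above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for
//   C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)
// Assumes dense, row-major storage, mirroring the flattened indices
// (bA, bB, bw, bC) computed in the generated kernel. Not part of taco.
void contract_reference(std::vector<float>& C, const std::vector<float>& A,
                        const std::vector<float>& B, const std::vector<float>& w,
                        std::size_t I, std::size_t J, std::size_t K,
                        std::size_t L, std::size_t Bn) {
  for (std::size_t i = 0; i < I; ++i) {
    for (std::size_t j = 0; j < J; ++j) {
      for (std::size_t b = 0; b < Bn; ++b) {
        float acc = 0.0f;  // accumulates the reduction over l and k
        for (std::size_t l = 0; l < L; ++l) {
          for (std::size_t k = 0; k < K; ++k) {
            const std::size_t a_idx = (l * J + j) * Bn + b;            // A(l, j, b)
            const std::size_t b_idx = ((i * K + k) * L + l) * Bn + b;  // B(i, k, l, b)
            const std::size_t w_idx = k * Bn + b;                      // w(k, b)
            acc += A[a_idx] * B[b_idx] * w[w_idx];
          }
        }
        C[(i * J + j) * Bn + b] += acc;  // C(i, j, b)
      }
    }
  }
}
```

For the shapes in the command above, this would be called with I = J = K = L = 32 and Bn = 25866. Results can differ from the GPU kernel in the last bits because the summation order is not identical.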

I built taco from source with gcc-11 (optimization flags -fPIC -O3), with pybind11, and with the CUDA toolkit bundled with nvhpc 23.9. The commit hash, taken from the output of git log -n 1, is:
git log -n 1
commit 2b8ece4

One more question: I wanted to use the index b to express a batched tensor contraction. Since b is by far the largest dimension (25866, which I also passed on the command line), I hoped the kernel would distribute the workload over it, that is, over C->dimensions[2] (C3_dimension in the generated code). Instead, the generated kernel parallelizes over i and loops over b sequentially inside each thread. What am I doing wrong? Should I provide a schedule, or is there another way to achieve this?

I tried adding a schedule that parallelizes b over GPU blocks, which fails with a compiler bug report:

taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="parallelize(b, GPUBlock, NoRaces)"
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

terminate called after throwing an instance of 'taco::TacoException'
  what():  Compiler bug at /home/primrose/Installed/taco/src/codegen/codegen_cuda.cpp:374 in visit
Please report it to developers
 Condition failed: blockIDVars.size() == threadIDVars.size()
 No matching GPUThread parallelize 
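The failed condition suggests that the CUDA code generator expects every GPUBlock variable to be paired with a GPUThread variable, the way i161 and i162 are paired in the kernel above. A minimal host-side sketch of that pairing for the b index follows. It assumes a block size of 256, taken from the generated kernel, and the names kThreadsPerBlock, num_blocks and cover_batch are hypothetical. It only illustrates the index decomposition; whether a schedule that produces this decomposition avoids the error is not confirmed here.

```cpp
#include <cstddef>
#include <vector>

// Assumed block size; 256 is the value used in the generated kernel.
constexpr std::size_t kThreadsPerBlock = 256;

// Number of blocks needed to cover `extent` values of b (ceiling division),
// analogous to (C1_dimension + 255) / 256 in the generated launcher.
std::size_t num_blocks(std::size_t extent) {
  return (extent + kThreadsPerBlock - 1) / kThreadsPerBlock;
}

// Emulates on the host which b each (block, thread) pair would handle and
// counts how many times each b is visited. block plays the role of
// blockIdx.x and thread the role of threadIdx.x.
std::vector<int> cover_batch(std::size_t extent) {
  std::vector<int> hits(extent, 0);
  for (std::size_t block = 0; block < num_blocks(extent); ++block) {
    for (std::size_t thread = 0; thread < kThreadsPerBlock; ++thread) {
      const std::size_t b = block * kThreadsPerBlock + thread;
      if (b >= extent) continue;  // guard for the partially filled last block
      ++hits[b];
    }
  }
  return hits;
}
```

With this mapping each value of b is handled by exactly one (block, thread) pair, and the loops over i, j, l and k would run inside each thread.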
