mirror of git://gcc.gnu.org/git/gcc.git
When using a compiler build with:
...
+#define PTX_DEFAULT_VECTOR_LENGTH PTX_CTA_SIZE
...
consider a test-case:
...
int
main (void)
{
#pragma acc parallel vector_length (64)
#pragma acc loop worker
for (unsigned int i = 0; i < 32; i++)
#pragma acc loop vector
for (unsigned int j = 0; j < 64; j++)
;
return 0;
}
...
If num_workers is 16, either because:
- we add a "num_workers (16)" clause on the parallel directive, or
- we set "GOMP_OPENACC_DIM=:16:", or
- the libgomp plugin chooses 16 num_workers
we run into an illegal instruction at runtime, because a bar.sync instruction
tries to use a barrier 16. The instruction is illegal, because ptx supports
only 16 barriers per CTA, and the valid range is 0..15.
The problem is that with a warp-multiple vector length, we use a code generation
scheme with a per-worker barrier. And because barrier zero is reserved for
per-cta barrier, only the remaining 15 barriers can be used as per-worker
barrier, and consequently we can't use num_workers larger than 15.
This problem occurs only for vector_length 64. For vector_length 32, we use a
different code generation scheme, and for vector_length >= 96, the maximum
num_workers is not big enough not to trigger this problem.
Also, this problem only occurs for num_workers 16. As explained above,
num_workers 15 is safe to use, and 16 is already the maximum num_workers for
vector_length 64.
This patch fixes the problem in both the compiler (handling "num_workers (16)")
and in the libgomp nvptx plugin (with and without "GOMP_OPENACC_DIM=:16:").
2019-01-11 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.c (PTX_CTA_NUM_BARRIERS, PTX_PER_CTA_BARRIER)
(PTX_NUM_PER_CTA_BARRIER, PTX_FIRST_PER_WORKER_BARRIER)
(PTX_NUM_PER_WORKER_BARRIERS): Define.
(nvptx_apply_dim_limits): Prevent vector_length 64 and
num_workers 16.
* plugin/plugin-nvptx.c (nvptx_exec): Prevent vector_length 64 and
num_workers 16.
From-SVN: r267838
|
||
|---|---|---|
| .. | ||
| config | ||
| plugin | ||
| testsuite | ||
| ChangeLog | ||
| ChangeLog.graphite | ||
| Makefile.am | ||
| Makefile.in | ||
| acinclude.m4 | ||
| aclocal.m4 | ||
| affinity-fmt.c | ||
| affinity.c | ||
| alloc.c | ||
| atomic.c | ||
| barrier.c | ||
| config.h.in | ||
| configure | ||
| configure.ac | ||
| configure.tgt | ||
| critical.c | ||
| env.c | ||
| error.c | ||
| fortran.c | ||
| hashtab.h | ||
| icv-device.c | ||
| icv.c | ||
| iter.c | ||
| iter_ull.c | ||
| libgomp-plugin.c | ||
| libgomp-plugin.h | ||
| libgomp.h | ||
| libgomp.map | ||
| libgomp.spec.in | ||
| libgomp.texi | ||
| libgomp_f.h.in | ||
| libgomp_g.h | ||
| lock.c | ||
| loop.c | ||
| loop_ull.c | ||
| oacc-async.c | ||
| oacc-cuda.c | ||
| oacc-host.c | ||
| oacc-init.c | ||
| oacc-int.h | ||
| oacc-mem.c | ||
| oacc-parallel.c | ||
| oacc-plugin.c | ||
| oacc-plugin.h | ||
| omp.h.in | ||
| omp_lib.f90.in | ||
| omp_lib.h.in | ||
| openacc.f90 | ||
| openacc.h | ||
| openacc_lib.h | ||
| ordered.c | ||
| parallel.c | ||
| priority_queue.c | ||
| priority_queue.h | ||
| sections.c | ||
| secure_getenv.h | ||
| single.c | ||
| splay-tree.c | ||
| splay-tree.h | ||
| target.c | ||
| task.c | ||
| taskloop.c | ||
| team.c | ||
| teams.c | ||
| work.c | ||