Modern accelerator programming models, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP), recently proposed by AMD, is a language extension that enables applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to guarantee consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing).

We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue developed by AMD. We have also formalised a GPU implementation of RSP proposed by AMD. The formalisation process revealed errors and omissions in AMD's description that could cause well-synchronised programs to exhibit memory inconsistencies. We present and prove correct a new implementation of RSP that fixes these issues and requires less non-standard hardware than the original.

Our work, a collaboration between academia and industry, demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can prevent errors such as those we have identified from reaching silicon.