
Get started with GPU Compute on the web

This post explores the experimental WebGPU API through examples and helps
you get started with performing data-parallel computations using the GPU.


Background #

As you may already know, the Graphic Processing Unit (GPU) is an electronic subsystem within a computer that was originally specialized for processing graphics. However, in the past 10 years, it has evolved towards a more flexible architecture allowing developers to implement many types of algorithms, not just render 3D graphics, while taking advantage of the unique architecture of the GPU. These capabilities are referred to as GPU Compute, and using a GPU as a coprocessor for general-purpose scientific computing is called general-purpose GPU (GPGPU) programming.

GPU Compute has contributed significantly to the recent machine learning boom, as convolution neural networks and other models can take advantage of the architecture to run more efficiently on GPUs. With the current Web Platform lacking in GPU Compute capabilities, the W3C's "GPU for the Web" Community Group is designing an API to expose the modern GPU APIs that are available on most current devices. This API is called WebGPU.

WebGPU is a low-level API, like WebGL. It is very powerful and quite verbose, as you'll see. But that's OK. What we're looking for is performance.

In this article, I'm going to focus on the GPU Compute part of WebGPU and, to be honest, I'm just scratching the surface, so that you can start playing on your own. I will be diving deeper and covering WebGPU rendering (canvas, texture, etc.) in forthcoming articles.

Access the GPU #

Accessing the GPU is easy in WebGPU. Calling navigator.gpu.requestAdapter() returns a JavaScript promise that will asynchronously resolve with a GPU adapter. Think of this adapter as the graphics card. It can either be integrated (on the same chip as the CPU) or discrete (usually a PCIe card that is more performant but uses more power).

Once you have the GPU adapter, call adapter.requestDevice() to get a promise that will resolve with a GPU device you'll use to do some GPU computation.

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) { return; }
const device = await adapter.requestDevice();

Both functions take options that allow you to be specific about the kind of adapter (power preference) and device (extensions, limits) you want. For the sake of simplicity, we'll use the default options in this article.

Write buffer memory #

Let's see how to use JavaScript to write data to memory for the GPU. This process isn't straightforward because of the sandboxing model used in modern web browsers.

The example below shows you how to write four bytes to buffer memory accessible from the GPU. It calls device.createBuffer() which takes the size of the buffer and its usage. Even though the usage flag GPUBufferUsage.MAP_WRITE is not required for this specific call, let's be explicit that we want to write to this buffer. It results in a GPU buffer object mapped at creation thanks to mappedAtCreation set to true. Then the associated raw binary data buffer can be retrieved by calling the GPU buffer method getMappedRange().

Writing bytes is familiar if you've already played with ArrayBuffer; use a TypedArray and copy the values into it.

// Get a GPU buffer in a mapped state and an arrayBuffer for writing.
const gpuBuffer = device.createBuffer({
  mappedAtCreation: true,
  size: 4,
  usage: GPUBufferUsage.MAP_WRITE
});
const arrayBuffer = gpuBuffer.getMappedRange();

// Write bytes to buffer.
new Uint8Array(arrayBuffer).set([0, 1, 2, 3]);

At this point, the GPU buffer is mapped, meaning it is owned by the CPU, and it's accessible in read/write from JavaScript. So that the GPU can access it, it has to be unmapped, which is as simple as calling gpuBuffer.unmap().

The concept of mapped/unmapped is needed to prevent race conditions where GPU and CPU access memory at the same time.

Read buffer memory #

Now let's see how to copy a GPU buffer to another GPU buffer and read it back.

Since we're writing in the first GPU buffer and we want to copy it to a second GPU buffer, a new usage flag GPUBufferUsage.COPY_SRC is required. The second GPU buffer is created in an unmapped state this time with device.createBuffer(). Its usage flag is GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ as it will be used as the destination of the first GPU buffer and read in JavaScript once GPU copy commands have been executed.

// Get a GPU buffer in a mapped state and an arrayBuffer for writing.
const gpuWriteBuffer = device.createBuffer({
  mappedAtCreation: true,
  size: 4,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC
});
const arrayBuffer = gpuWriteBuffer.getMappedRange();

// Write bytes to buffer.
new Uint8Array(arrayBuffer).set([0, 1, 2, 3]);

// Unmap buffer so that it can be used later for copy.
gpuWriteBuffer.unmap();

// Get a GPU buffer for reading in an unmapped state.
const gpuReadBuffer = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

Because the GPU is an independent coprocessor, all GPU commands are executed asynchronously. This is why there is a list of GPU commands built up and sent in batches when needed. In WebGPU, the GPU command encoder returned by device.createCommandEncoder() is the JavaScript object that builds a batch of "buffered" commands that will be sent to the GPU at some point. The methods on GPUBuffer, on the other hand, are "unbuffered", meaning they execute atomically at the time they are called.

Once you have the GPU command encoder, call copyEncoder.copyBufferToBuffer() as shown below to add this command to the command queue for later execution. Finally, finish encoding commands by calling copyEncoder.finish() and submit those to the GPU device command queue. The queue is responsible for handling submissions done via device.queue.submit() with the GPU commands as arguments. This will atomically execute all the commands stored in the array in order.

// Encode commands for copying buffer to buffer.
const copyEncoder = device.createCommandEncoder();
copyEncoder.copyBufferToBuffer(
  gpuWriteBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
  4 /* size */
);

// Submit copy commands.
const copyCommands = copyEncoder.finish();
device.queue.submit([copyCommands]);

At this point, GPU queue commands have been sent, but not necessarily executed. To read the second GPU buffer, call gpuReadBuffer.mapAsync() with GPUMapMode.READ. It returns a promise that will resolve when the GPU buffer is mapped. Then get the mapped range with gpuReadBuffer.getMappedRange() that contains the same values as the first GPU buffer once all queued GPU commands have been executed.

// Read buffer.
await gpuReadBuffer.mapAsync(GPUMapMode.READ);
const copyArrayBuffer = gpuReadBuffer.getMappedRange();
console.log(new Uint8Array(copyArrayBuffer));

You can try out this sample.

In short, here's what you need to remember regarding buffer memory operations:

  • GPU buffers have to be unmapped to be used in device queue submission.
  • When mapped, GPU buffers can be read and written in JavaScript.
  • GPU buffers are mapped when mapAsync() and createBuffer() with mappedAtCreation set to true are called.

Shader programming #

Programs running on the GPU that only perform computations (and don't draw triangles) are called compute shaders. They are executed in parallel by hundreds of GPU cores (which are smaller than CPU cores) that operate together to crunch data. Their input and output are buffers in WebGPU.

To illustrate the use of compute shaders in WebGPU, we'll play with matrix multiplication, a common algorithm in machine learning illustrated below.

Matrix multiplication diagram

In short, here's what we're going to do:

  1. Create three GPU buffers (two for the matrices to multiply and one for the result matrix)
  2. Describe input and output for the compute shader
  3. Compile the compute shader code
  4. Set up a compute pipeline
  5. Submit in batch the encoded commands to the GPU
  6. Read the result matrix GPU buffer

GPU buffers creation #

For the sake of simplicity, matrices will be represented as a list of floating point numbers. The first element is the number of rows, the second element the number of columns, and the rest is the actual numbers of the matrix.

Simple representation of a matrix in JavaScript and its equivalent in mathematical notation
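If this flat layout feels abstract, here is a quick plain-JavaScript sketch that builds the representation and reads a cell back. The helper names are mine, not part of the sample:

```javascript
// Encode a 2D array as [rows, columns, ...values], the flat layout used in this article.
function toFlatMatrix(rows2d) {
  const rows = rows2d.length;
  const columns = rows2d[0].length;
  return new Float32Array([rows, columns, ...rows2d.flat()]);
}

// Read cell (row, col) back: values start at offset 2, stored row by row.
function cellAt(flat, row, col) {
  const columns = flat[1];
  return flat[2 + row * columns + col];
}

const flat = toFlatMatrix([[1, 2, 3, 4], [5, 6, 7, 8]]);
console.log(flat);              // [2, 4, 1, 2, 3, 4, 5, 6, 7, 8]
console.log(cellAt(flat, 1, 2)); // 7
```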

The three GPU buffers are storage buffers as we need to store and retrieve data in the compute shader. This explains why the GPU buffer usage flags include GPUBufferUsage.STORAGE for all of them. The result matrix usage flag also has GPUBufferUsage.COPY_SRC because it will be copied to another buffer for reading once all GPU queue commands have been executed.

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) { return; }
const device = await adapter.requestDevice();

// First Matrix

const firstMatrix = new Float32Array([
  2 /* rows */, 4 /* columns */,
  1, 2, 3, 4,
  5, 6, 7, 8
]);

const gpuBufferFirstMatrix = device.createBuffer({
  mappedAtCreation: true,
  size: firstMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
const arrayBufferFirstMatrix = gpuBufferFirstMatrix.getMappedRange();
new Float32Array(arrayBufferFirstMatrix).set(firstMatrix);
gpuBufferFirstMatrix.unmap();

// Second Matrix

const secondMatrix = new Float32Array([
  4 /* rows */, 2 /* columns */,
  1, 2,
  3, 4,
  5, 6,
  7, 8
]);

const gpuBufferSecondMatrix = device.createBuffer({
  mappedAtCreation: true,
  size: secondMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
const arrayBufferSecondMatrix = gpuBufferSecondMatrix.getMappedRange();
new Float32Array(arrayBufferSecondMatrix).set(secondMatrix);
gpuBufferSecondMatrix.unmap();

// Result Matrix

const resultMatrixBufferSize = Float32Array.BYTES_PER_ELEMENT * (2 + firstMatrix[0] * secondMatrix[1]);
const resultMatrixBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});
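As a sanity check on that size computation: multiplying a 2x4 matrix by a 4x2 matrix yields a 2x2 result, so the buffer needs the two header floats plus four values, each 4 bytes. A quick back-of-the-envelope version:

```javascript
// rows comes from firstMatrix[0], columns from secondMatrix[1] in the sample above.
const rows = 2;
const columns = 2;

// 2 header floats (rows, columns) + rows * columns values, 4 bytes each.
const resultMatrixBufferSize = Float32Array.BYTES_PER_ELEMENT * (2 + rows * columns);
console.log(resultMatrixBufferSize); // 24
```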

Bind group layout and bind group #

Concepts of bind group layout and bind group are specific to WebGPU. A bind group layout defines the input/output interface expected by a shader, while a bind group represents the actual input/output data for a shader.

In the example below, the bind group layout expects two readonly storage buffers at numbered entry bindings 0 and 1, and a storage buffer at 2 for the compute shader. The bind group, on the other hand, defined for this bind group layout, associates GPU buffers to the entries: gpuBufferFirstMatrix to the binding 0, gpuBufferSecondMatrix to the binding 1, and resultMatrixBuffer to the binding 2.

const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    {
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      buffer: {
        type: "read-only-storage"
      }
    },
    {
      binding: 1,
      visibility: GPUShaderStage.COMPUTE,
      buffer: {
        type: "read-only-storage"
      }
    },
    {
      binding: 2,
      visibility: GPUShaderStage.COMPUTE,
      buffer: {
        type: "storage"
      }
    }
  ]
});

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    {
      binding: 0,
      resource: {
        buffer: gpuBufferFirstMatrix
      }
    },
    {
      binding: 1,
      resource: {
        buffer: gpuBufferSecondMatrix
      }
    },
    {
      binding: 2,
      resource: {
        buffer: resultMatrixBuffer
      }
    }
  ]
});

Compute shader code #

The compute shader code for multiplying matrices is written in WGSL, the WebGPU Shader Language, which is trivially translatable to SPIR-V. Without going into detail, you should find below the three storage buffers identified with var<storage>. The program will use firstMatrix and secondMatrix as inputs and resultMatrix as its output.

Note that each storage buffer has a binding decoration that corresponds to the same index defined in the bind group layout and bind group declared above.

const shaderModule = device.createShaderModule({
  code: `
    [[block]] struct Matrix {
      size : vec2<f32>;
      numbers: array<f32>;
    };

    [[group(0), binding(0)]] var<storage, read> firstMatrix : Matrix;
    [[group(0), binding(1)]] var<storage, read> secondMatrix : Matrix;
    [[group(0), binding(2)]] var<storage, write> resultMatrix : Matrix;

    [[stage(compute), workgroup_size(8, 8)]]
    fn main([[builtin(global_invocation_id)]] global_id : vec3<u32>) {
      // Guard against out-of-bounds work group sizes
      if (global_id.x >= u32(firstMatrix.size.x) || global_id.y >= u32(secondMatrix.size.y)) {
        return;
      }

      resultMatrix.size = vec2<f32>(firstMatrix.size.x, secondMatrix.size.y);

      let resultCell = vec2<u32>(global_id.x, global_id.y);
      var result = 0.0;
      for (var i = 0u; i < u32(firstMatrix.size.y); i = i + 1u) {
        let a = i + resultCell.x * u32(firstMatrix.size.y);
        let b = resultCell.y + i * u32(secondMatrix.size.y);
        result = result + firstMatrix.numbers[a] * secondMatrix.numbers[b];
      }

      let index = resultCell.y + resultCell.x * u32(secondMatrix.size.y);
      resultMatrix.numbers[index] = result;
    }
  `
});
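If the index arithmetic in that loop is hard to follow, it can help to port the kernel to plain JavaScript and run it on the CPU. The sketch below is my own stand-in, not part of the sample; it uses the same a and b formulas as the shader, with an explicit +2 offset because here the two header floats and the numbers live in one flat array:

```javascript
// CPU port of the WGSL kernel: compute every cell of firstMatrix x secondMatrix.
// Matrices use the flat layout from this article: [rows, columns, ...values].
function multiplyOnCpu(firstMatrix, secondMatrix) {
  const resultMatrix = new Float32Array(2 + firstMatrix[0] * secondMatrix[1]);
  resultMatrix[0] = firstMatrix[0];  // rows of the result
  resultMatrix[1] = secondMatrix[1]; // columns of the result
  for (let x = 0; x < firstMatrix[0]; x++) {    // plays the role of global_id.x
    for (let y = 0; y < secondMatrix[1]; y++) { // plays the role of global_id.y
      let result = 0;
      for (let i = 0; i < firstMatrix[1]; i++) {
        const a = 2 + i + x * firstMatrix[1];   // same as `a` in WGSL, +2 for the header
        const b = 2 + y + i * secondMatrix[1];  // same as `b` in WGSL, +2 for the header
        result += firstMatrix[a] * secondMatrix[b];
      }
      resultMatrix[2 + y + x * secondMatrix[1]] = result;
    }
  }
  return resultMatrix;
}

const firstMatrix = new Float32Array([2, 4, 1, 2, 3, 4, 5, 6, 7, 8]);
const secondMatrix = new Float32Array([4, 2, 1, 2, 3, 4, 5, 6, 7, 8]);
console.log(multiplyOnCpu(firstMatrix, secondMatrix));
// -> 2, 2, 50, 60, 114, 140
```

The output matches the values the GPU version logs at the end of this article, which is a handy way to convince yourself the index formulas are right.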

Pipeline setup #

The compute pipeline is the object that actually describes the compute operation we're going to perform. Create it by calling device.createComputePipeline(). It takes two arguments: the bind group layout we created earlier, and a compute stage defining the entry point of our compute shader (the main WGSL function) and the actual compute shader module created with device.createShaderModule().

const computePipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout]
  }),
  compute: {
    module: shaderModule,
    entryPoint: "main"
  }
});

Commands submission #

After instantiating a bind group with our three GPU buffers and a compute pipeline with a bind group layout, it is time to use them.

Let's start a programmable compute pass encoder with commandEncoder.beginComputePass(). We'll use this to encode GPU commands that will perform the matrix multiplication. Set its pipeline with passEncoder.setPipeline(computePipeline) and its bind group at index 0 with passEncoder.setBindGroup(0, bindGroup). The index 0 corresponds to the group(0) decoration in the WGSL code.

Now, let's talk about how this compute shader is going to run on the GPU. Our goal is to execute this program in parallel for each cell of the result matrix, step by step. For a result matrix of size 2 by 4, for example, we'd call passEncoder.dispatch(2, 4) to encode the command of execution. The first argument "x" is the first dimension, the second one "y" is the second dimension, and the last one "z" is the third dimension, which defaults to 1 as we don't need it here. In the GPU compute world, encoding a command to execute a kernel function on a set of data is called dispatching.

Execution in parallel for each result matrix cell

The size of the workgroup grid for our compute shader is (8, 8) in our WGSL code. Because of that, "x" and "y", which are respectively the number of rows of the first matrix and the number of columns of the second matrix, will be divided by 8. With that, we can now dispatch a compute call with passEncoder.dispatch(firstMatrix[0] / 8, secondMatrix[1] / 8). The number of workgroup grids to run are the dispatch() arguments.
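Note that the division has to round up, otherwise small matrices like ours would dispatch zero workgroups. A tiny sketch of the arithmetic (the helper name is mine):

```javascript
// With workgroup_size(8, 8), each workgroup covers an 8x8 tile of result cells,
// so the dispatch arguments are the number of tiles per dimension, rounded up.
const workgroupCount = (cells, workgroupSize) => Math.ceil(cells / workgroupSize);

console.log(workgroupCount(2, 8));   // 1 -- a 2-row result still needs one workgroup
console.log(workgroupCount(300, 8)); // 38 -- 37 full tiles plus one partial tile
```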

As seen in the drawing above, each shader will have access to a unique builtin(global_invocation_id) object that will be used to know which result matrix cell to compute.

const commandEncoder = device.createCommandEncoder();

const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, bindGroup);
const x = Math.ceil(firstMatrix[0] / 8); // X dimension of the grid of workgroups to dispatch.
const y = Math.ceil(secondMatrix[1] / 8); // Y dimension of the grid of workgroups to dispatch.
passEncoder.dispatch(x, y);
passEncoder.endPass();

To end the compute pass encoder, call passEncoder.endPass(). Then, create a GPU buffer to use as a destination to copy the result matrix buffer with copyBufferToBuffer. Finally, finish encoding commands with commandEncoder.finish() and submit those to the GPU device queue by calling device.queue.submit() with the GPU commands.

// Get a GPU buffer for reading in an unmapped state.
const gpuReadBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

// Encode commands for copying buffer to buffer.
commandEncoder.copyBufferToBuffer(
  resultMatrixBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
  resultMatrixBufferSize /* size */
);

// Submit GPU commands.
const gpuCommands = commandEncoder.finish();
device.queue.submit([gpuCommands]);

Read result matrix #

Reading the result matrix is as easy as calling gpuReadBuffer.mapAsync() with GPUMapMode.READ and waiting for the returned promise to resolve, which indicates the GPU buffer is now mapped. At this point, it is possible to get the mapped range with gpuReadBuffer.getMappedRange().

Matrix multiplication result

In our code, the result logged in the DevTools JavaScript console is "2, 2, 50, 60, 114, 140".

// Read buffer.
await gpuReadBuffer.mapAsync(GPUMapMode.READ);
const arrayBuffer = gpuReadBuffer.getMappedRange();
console.log(new Float32Array(arrayBuffer));

Congratulations! You made it. You can play with the sample.

One last trick #

One way of making your code easier to read is to use the handy getBindGroupLayout method of the compute pipeline to infer the bind group layout from the shader module. This trick removes the need for creating a custom bind group layout and specifying a pipeline layout in your compute pipeline, as you can see below.

An illustration of getBindGroupLayout for the previous sample is available.

 const computePipeline = device.createComputePipeline({
-  layout: device.createPipelineLayout({
-    bindGroupLayouts: [bindGroupLayout]
-  }),
   compute: {

-// Bind group layout and bind group
-const bindGroupLayout = device.createBindGroupLayout({
-  entries: [
-    {
-      binding: 0,
-      visibility: GPUShaderStage.COMPUTE,
-      buffer: {
-        type: "read-only-storage"
-      }
-    },
-    {
-      binding: 1,
-      visibility: GPUShaderStage.COMPUTE,
-      buffer: {
-        type: "read-only-storage"
-      }
-    },
-    {
-      binding: 2,
-      visibility: GPUShaderStage.COMPUTE,
-      buffer: {
-        type: "storage"
-      }
-    }
-  ]
-});
+// Bind group
 const bindGroup = device.createBindGroup({
-  layout: bindGroupLayout,
+  layout: computePipeline.getBindGroupLayout(0 /* index */),
   entries: [

Performance findings #

So how does running matrix multiplication on a GPU compare to running it on a CPU? To find out, I wrote the program just described for a CPU. And as you can see in the graph below, using the full power of the GPU seems like an obvious choice when the size of the matrices is greater than 256 by 256.
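The CPU program itself isn't listed in this article; as a rough stand-in (my own sketch, not the exact benchmark code), a naive triple-loop multiplication over growing sizes makes the O(n³) cost easy to feel:

```javascript
// Naive CPU matrix multiplication over plain nested arrays.
function multiply(a, b, n) {
  const c = Array.from({ length: n }, () => new Float32Array(n));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let sum = 0;
      for (let k = 0; k < n; k++) sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
  return c;
}

// Time a few sizes; each doubling of n roughly multiplies the work by 8.
for (const n of [64, 128, 256]) {
  const a = Array.from({ length: n }, () => Float32Array.from({ length: n }, Math.random));
  const start = Date.now();
  multiply(a, a, n);
  console.log(`${n}x${n}: ${Date.now() - start} ms`);
}
```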

GPU vs CPU benchmark

This article was just the beginning of my journey exploring WebGPU. Expect more articles soon featuring more deep dives into GPU Compute and on how rendering (canvas, texture, sampler) works in WebGPU.

Source: https://web.dev/gpu-compute/
