From Parallel and High-Performance Computing by Robert Robey and Yuliana Zamora

This article takes a deep dive into GPUs.


The CPU-GPU system as an accelerated computational platform

GPUs are everywhere. They can be found in cell phones, tablets, personal computers, consumer-grade workstations, gaming consoles, high performance computing centers, and cloud computing platforms. GPUs provide additional compute power on most modern hardware and accelerate many operations you may not even be aware of. As the name suggests, GPUs were designed for graphics-related computations. Consequently, GPU design is focused on processing large blocks of data (triangles or polygons) in parallel, a requirement for graphics applications. When compared to CPUs, which can handle tens of parallel threads or processes in a clock cycle, GPUs are capable of processing thousands of parallel threads simultaneously. Because of this design, GPUs offer a considerably higher theoretical peak performance that can potentially reduce the time to solution and the energy footprint of an application, relative to a CPU-only implementation.

Computational scientists, always on the lookout for computational horsepower, were attracted to using GPUs to perform more general-purpose computing tasks. Because GPUs were designed for graphics, the languages originally developed to program them, like OpenGL, focused on graphics operations. To implement algorithms on GPUs, programmers had to reframe their algorithms in terms of these graphics operations, which was time-consuming and error-prone. Extending the use of the graphics processor to non-graphics workloads became known as General Purpose GPU (GPGPU) computing.

The continued interest and success of GPGPU computing led to the introduction of a flurry of GPGPU languages. The first to gain wide adoption was the Compute Unified Device Architecture (CUDA) programming language for Nvidia GPUs, which was first introduced in 2007.

The dominant open standard GPGPU computing language is OpenCL (Open Computing Language), developed by a group of vendors led by Apple and released in 2009.

Despite the continual introduction of GPGPU languages, or maybe because of it, many computational scientists found the original, “native”, GPGPU languages difficult to use. As a result, higher level approaches using directive-based APIs have gained a large following and corresponding development effort by vendors.

We have to summarize the new directive-based GPGPU languages, OpenACC and OpenMP, as an unqualified success. These languages and APIs have allowed programmers to focus more on developing their applications, rather than expressing their algorithm in terms of graphics operations. The end result has often been tremendous speedups in scientific and data science applications.

GPUs are best described as accelerators. Accelerators have long been used in the computing world. First let’s define what we mean by an accelerator.

DEFINITION:   Accelerator (Hardware) — a special-purpose device that supplements the main general-purpose CPU to speed up certain operations.

A classic example of an accelerator is the original PC that came with the 8088 CPU, but had the option and a socket for the 8087 “coprocessor” that did floating-point operations in hardware rather than software. Today, the most common hardware accelerator is the graphics processor, which can be either a separate hardware component or integrated on the main processor. The distinction of being called an accelerator is that it’s a special-purpose rather than a general-purpose device, but the difference isn’t always clear-cut.

A GPU is an additional hardware component that can perform operations alongside a CPU. GPUs come in two flavors:

  1. Integrated GPUs – a graphics processor engine which is contained on the CPU.
  2. Dedicated GPUs – a GPU on a separate peripheral card.

Integrated GPUs are either integrated directly into the CPU chip or on a motherboard. Integrated GPUs share RAM resources with the CPU. Dedicated GPUs are attached to the motherboard via a Peripheral Component Interconnect (PCI) slot. The PCI slot is a physical component that allows data to be transmitted between the CPU and GPU. It’s commonly referred to as the PCI Bus.

Integrated GPUs: an underused option on commodity-based systems

Intel has long included an integrated graphical processing unit with their CPUs for the lower-price market. They fully expected that users wanting real performance would buy a discrete GPU. The Intel integrated GPUs have historically been relatively weak in comparison to the integrated version from AMD. This has changed recently with Intel claiming that the integrated graphics on their Ice Lake processor are on a par with AMD integrated GPUs.

The AMD integrated GPUs are called Accelerated Processing Units (APUs), which are a tightly-coupled combination of a CPU and a GPU. The GPU design originally came from AMD’s purchase of the ATI graphics card company in 2006. In the AMD APU, the CPU and GPU share the same processor memory. The integrated GPU is smaller than a discrete GPU, but it still delivers graphics (and compute) performance in proportion to its size. AMD’s real target for APUs is to provide a more cost-effective yet still performant system for the mass market. The shared memory is also attractive because it eliminates the data transfer over the PCI Bus, which is often a serious performance bottleneck.

The ubiquitous nature of the integrated graphics processor is important. For us, it means that many commodity desktops and laptops have some compute acceleration capability. The goal on these systems is a relatively modest performance boost and perhaps reducing the energy cost or improving battery life. For extreme performance, the discrete GPUs are still the undisputed performance champions.

Dedicated GPUs: the workhorse option

In this article, we focus primarily on GPU-accelerated platforms with dedicated GPUs, also called discrete GPUs. Dedicated GPUs generally offer more compute power than integrated GPUs. Additionally, they can be isolated to execute general-purpose computing tasks. Figure 1 conceptually illustrates a CPU-GPU system with a dedicated GPU. The CPU has access to its own memory space (CPU RAM) and is connected to the GPU via a PCI Bus. The CPU sends data and instructions over the PCI Bus for the GPU to work with. The GPU has its own memory space, separate from the CPU memory space.

Figure 1. Block diagram of GPU accelerated system using a dedicated GPU. The CPU and GPU each have their own memory. The CPU and GPU communicate over a PCI Bus.

In order for work to be executed on the GPU, at some point, data must be transferred from the CPU to the GPU. When the work is complete and the results are to be written to a file, the GPU must send data back to the CPU. The instructions the GPU must execute are also sent from the CPU to the GPU. Each of these transactions is mediated by the PCI Bus. Although we won’t discuss how to make these actions happen in this article, we do discuss the hardware performance limitations of the PCI Bus. Because of these limitations, a GPU-accelerated code can potentially perform worse than a CPU-only code. We also discuss the internal architecture of the GPU and its performance with regard to memory operations and floating-point operations.
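To see why the bus can erase the GPU's advantage, here is a minimal sketch of a cost model. It assumes that the transfers and the GPU computation do not overlap, and every number in it (bus bandwidth, data size, run times) is hypothetical and chosen only for illustration. The function names `offload_time` and `offload_pays_off` are ours, not part of any library.

```python
def offload_time(bytes_moved, bus_bandwidth, gpu_compute_time):
    """Estimated wall time when work is offloaded to a dedicated GPU.

    bytes_moved:      bytes sent to the GPU plus bytes returned to the CPU
    bus_bandwidth:    sustained bus bandwidth in bytes per second
    gpu_compute_time: time the GPU spends computing, in seconds
    """
    transfer_time = bytes_moved / bus_bandwidth
    return transfer_time + gpu_compute_time


def offload_pays_off(cpu_time, bytes_moved, bus_bandwidth, gpu_compute_time):
    """True when transfer time plus GPU compute time beats the CPU-only time."""
    return offload_time(bytes_moved, bus_bandwidth, gpu_compute_time) < cpu_time


# Hypothetical numbers, for illustration only.
GB = 1e9
bus = 10 * GB   # assumed sustained bus bandwidth in bytes per second
data = 8 * GB   # 4 GB sent to the GPU and 4 GB returned

# The GPU computes 10x faster than the CPU, yet the transfer dominates.
print(offload_pays_off(cpu_time=0.5, bytes_moved=data,
                       bus_bandwidth=bus, gpu_compute_time=0.05))  # False

# With more arithmetic per byte moved, the same transfer is amortized.
print(offload_pays_off(cpu_time=5.0, bytes_moved=data,
                       bus_bandwidth=bus, gpu_compute_time=0.5))   # True
```

The model is deliberately crude, but it captures the point of this section: the less computation you do per byte that crosses the bus, the harder it is for the GPU to come out ahead.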

The GPU and the thread engine

For those of us who have done thread programming over the years on a CPU, the graphics processor is like the ideal thread engine. The components of this thread engine are

  • a seemingly infinite number of threads,
  • zero-time cost for switching or starting threads, and
  • latency hiding of memory accesses through automatic switching between work groups.

Let’s take a look at the hardware architecture of a GPU to get an idea of how it performs this magic. To show this conceptual model of a GPU, we abstract the common elements from different GPU vendors and even between design variations from the same vendor.

We must remind you that there are hardware variations which aren’t captured by these abstract models. Adding to this the polyglot of terminology currently in use, it’s unsurprising that it’s difficult for a newcomer to the field to understand the GPU hardware and programming languages. Still, this terminology is relatively sane compared to the graphics world with vertex shaders, texture mapping units and fragment generators.

Table 1 summarizes the rough equivalence of terminology, but beware that because the hardware architectures aren’t exactly the same, the correspondence in terminology varies depending on the context and user.

Table 1. Hardware terminology: a rough translation

| Host | OpenCL | AMD GPU | Nvidia/CUDA | Intel Gen11 |
|---|---|---|---|---|
| | Compute device | | | |
| | Compute Unit (CU) | Compute Unit (CU) | Streaming Multiprocessor (SM) | Subslice |
| Processing Core or Core for short | Processing Element (PE) | Processing Element (PE) | Compute Cores or CUDA Cores | Execution Units (EU) |
| | Work Item | Work Item | | |
| Vector or SIMD | | | Emulated with SIMT Warp | |

The last row in table 1 is for a hardware layer that implements a single instruction on multiple data, commonly referred to as SIMD. Strictly speaking, the Nvidia hardware doesn’t have vector hardware (SIMD), but emulates it through a collection of threads in what it calls a warp in a single instruction, multiple thread (SIMT) model. Other GPUs can also perform SIMT operations on what OpenCL and AMD call subgroups that are equivalent to the Nvidia warps. This article focuses on the GPU hardware, its architecture and concepts.
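To make the SIMT idea a little more concrete, here is a conceptual sketch in Python. It is not how the hardware is programmed; it only illustrates that threads are grouped into warps and that every thread in a warp applies the same instruction to its own data item. The helper names `warps_needed` and `run_simt` are hypothetical, and the warp size of 32 is an assumption that matches Nvidia GPUs (other vendors use different subgroup sizes). The Python loop runs sequentially, whereas the hardware executes the lanes of a warp together.

```python
import math

WARP_SIZE = 32  # assumption: Nvidia warp size; subgroup sizes differ by vendor


def warps_needed(num_work_items, warp_size=WARP_SIZE):
    """Number of warps required to cover all work items."""
    return math.ceil(num_work_items / warp_size)


def run_simt(operation, data, warp_size=WARP_SIZE):
    """Process data one warp at a time, applying the same operation in every lane."""
    results = []
    for start in range(0, len(data), warp_size):
        warp = data[start:start + warp_size]        # the last warp may be partly filled
        results.extend(operation(x) for x in warp)  # one instruction, many threads
    return results


doubled = run_simt(lambda x: 2 * x, list(range(70)))
print(warps_needed(70))                 # 3 warps: 32 + 32 + 6 work items
print(warps_needed(70) * WARP_SIZE - 70)  # 26 lanes sit idle in the last warp
```

The idle lanes in the last warp hint at why problem sizes that are multiples of the warp or subgroup size tend to use the hardware more fully.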

Often, GPUs also have hardware blocks of replication, some of which are listed in table 2, to simplify the scaling of their hardware designs to more units. These units of replication are a manufacturing convenience, but they often show up in the specification lists and discussions.

Table 2. GPU hardware replication units by vendor

| AMD | Nvidia | Intel Gen11 |
|---|---|---|
| Shader Engine (SE) | Graphics Processing Cluster | |

Figure 2 depicts a simplified block diagram of a single node system with a single multiprocessor CPU and two GPUs. A single node can have a wide variety of configurations composed of one or more multiprocessor CPUs with an integrated GPU and from one to six discrete GPUs. In OpenCL nomenclature, each GPU is a compute device, but compute devices can also be a CPU.

Figure 2. A simplified block diagram of a GPU system showing two compute devices, each of them a separate GPU, GPU memory, and multiple Compute Units (CUs) on each compute device. The Nvidia CUDA terminology refers to compute units as Streaming Multiprocessors (SMs).

DEFINITION:  Compute device (OpenCL) – any computational hardware that can perform computation and supports OpenCL. This can include GPUs, CPUs, or even more exotic hardware such as embedded processors or FPGAs.

The simplified diagram in figure 2 is our model for describing the components of a GPU and it’s useful for understanding how a GPU processes data. A GPU is composed of:

  • GPU RAM, also known as “global memory”
  • Workload distributor
  • Compute Units (CUs), called Streaming Multiprocessors (SMs) in CUDA

These Compute Units have their own internal architecture, often referred to as the microarchitecture. Instructions and data received from the CPU are processed by the workload distributor. This distributor coordinates instruction execution and data movement onto and off the compute units. The achievable performance of a GPU depends on

  • Global memory bandwidth
  • Compute Unit Bandwidth
  • The number of compute units

In this section, we explore each of the components in our model of a GPU. With each component, we discuss models for theoretical peak bandwidth. Additionally, we show how to use micro-benchmark tools.

The compute unit is the streaming multiprocessor

A GPU compute device has multiple compute units. Compute units (CUs) is the term agreed to by the community for the OpenCL standard. Nvidia calls these streaming multiprocessors (SMs) and Intel refers to them as subslices.

Processing elements are the individual processors

Each compute unit contains multiple graphics processors called Processing Elements (PEs) in OpenCL terminology. Nvidia calls them CUDA cores or Compute Cores. Intel refers to them as Execution Units (EUs). The graphics community calls them shader processors. Figure 3 shows a simplified conceptual diagram of a processing element. These processors aren’t equivalent to a CPU processor; they are simpler designs needed to perform graphics operations. But the operations needed for graphics include nearly all the arithmetic operations that a programmer is used to on a regular processor.

Figure 3. Simplified block diagram of a Compute Unit (CU) with a large number of Processing Elements.

Multiple data operations by each processing element

Within each processing element, it may be possible for an operation to be performed on more than one data item. Depending on the details of the GPU microprocessor architecture and the GPU vendor, these are referred to as SIMT, SIMD, or Vector operations. A similar type of functionality may be provided by ganging processing elements together.

Calculating the peak theoretical flops for some leading GPUs

With an understanding of the GPU hardware, we can now calculate the peak theoretical flops for some recent GPUs. These include the Nvidia V100 and A100, the AMD Vega 20, and the integrated Gen11 GPU on the Intel Ice Lake CPU. The specifications for these GPUs are listed in table 3. We’ll use these specifications to calculate the theoretical performance of each device. Knowing the theoretical performance, you can make comparisons on how each may perform. This might help you with purchasing decisions or with estimating how much faster or slower another GPU might be on your calculations. Hardware specifications for many GPU cards can be found at TechPowerUp.

For Nvidia and AMD, the GPUs targeted to the HPC market have the hardware cores to perform one double precision operation for every two single precision operations. This relative flop capability can be expressed as a ratio of 1:2 of double precision to single precision on top-end GPUs. The importance of this ratio is that it tells you that you can roughly double your performance by reducing your precision requirements from double precision to single. For many GPUs, half precision has a ratio of 2:1 to single precision, or double the flop capability. The Intel integrated GPU has a 1:4 ratio of double precision to single precision, and some commodity GPUs have 1:8 ratios. GPUs with these lower ratios of double precision are targeted at the graphics market or machine learning. To get these ratios, take the FP64 row and divide by the FP32 row.
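As a small illustration of that division, the sketch below computes the double-to-single precision ratio for two of the GPUs. The core counts (64 FP32 and 32 FP64 cores per compute unit) are the ones used in the peak-flops example in this article; the function name `precision_ratio` is ours and purely illustrative.

```python
from fractions import Fraction

# FP32 and FP64 cores per compute unit, as used in this article's
# peak-flops example for the V100 and the Vega 20 (MI50).
cores_per_cu = {
    "Nvidia V100": {"fp32": 64, "fp64": 32},
    "AMD Vega 20 (MI50)": {"fp32": 64, "fp64": 32},
}


def precision_ratio(fp64_cores, fp32_cores):
    """Ratio of double precision to single precision capability, in lowest terms."""
    return Fraction(fp64_cores, fp32_cores)


for name, cores in cores_per_cu.items():
    ratio = precision_ratio(cores["fp64"], cores["fp32"])
    print(f"{name}: {ratio.numerator}:{ratio.denominator}")
```

Both devices come out at 1:2, which is the ratio quoted above for GPUs aimed at the HPC market.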

Table 3. Specifications for recent discrete GPUs from Nvidia and AMD and an integrated GPU from Intel

| | Nvidia V100 (Volta) | Nvidia A100 | AMD Vega 20 (MI50) | Intel Gen11 Integrated |
|---|---|---|---|---|
| Compute Units (CU) | 80 | 108 | 60 | 64 |
| FP32 Cores/CU | 64 | 64 | 64 | 8 |
| FP64 Cores/CU | 32 | 32 | 32 | |
| GPU Clock Nominal/Boost | 1290/1530 MHz | ?/1410 MHz | 1200/1746 MHz | 400/1000 MHz |
| Subgroup or warp size | | | | |
| Memory clock | 876 MHz | 1215 MHz | 1000 MHz | shared memory |
| Memory type | HBM2 (32 GB) | HBM2 (40 GB) | | |
| Memory data width | 4096 bits | 5120 bits | 4096 bits | 384 bits |
| Memory bus type | NVLink or PCIe 3.0×16 | NVLink or PCIe Gen 4 | Infinity Fabric or PCIe 4.0×16 | shared memory |
| Design Power | 300 watts | 400 watts | 300 watts | 28 watts |

The peak theoretical flops can be calculated by taking the clock rate times the number of processors times the number of floating-point operations per cycle.

The flops per cycle accounts for the fused-multiply add (FMA) which does two operations in one cycle.

Example: Peak theoretical flops for some leading GPUs

Theoretical Peak Flops for Nvidia V100

2 x 1530 x 80 x 64 /10^6 = 15.7 TFlops (single precision)

2 x 1530 x 80 x 32 /10^6 = 7.8 TFlops (double precision)

Theoretical Peak Flops for Nvidia A100 (Ampere)

2 x 1410 x 108 x 64 /10^6 = 19.5 TFlops (single precision)

2 x 1410 x 108 x 32 /10^6 = 9.7 TFlops (double precision)

Theoretical Peak Flops for AMD Vega 20 (MI50)

2 x 1746 x 60 x 64 /10^6 = 13.4 TFlops (single precision)

2 x 1746 x 60 x 32 /10^6 = 6.7 TFlops (double precision)

Theoretical Peak Flops for Intel Integrated Gen 11 on Ice Lake

2 x 1000 x 64 x 8 /10^6 = 1.0 TFlops (single precision)
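The same arithmetic can be written as a short script, which makes it easy to try other GPUs. This is a minimal sketch: the function name `peak_tflops` is ours, the inputs are the boost clocks and core counts from the calculations above, and the factor of 2 is the two operations per cycle from the fused multiply-add.

```python
def peak_tflops(clock_mhz, compute_units, cores_per_cu, flops_per_cycle=2):
    """Peak theoretical TFlops: flops per cycle x clock x compute units x cores per unit."""
    mflops = flops_per_cycle * clock_mhz * compute_units * cores_per_cu
    return mflops / 1e6  # a clock in MHz gives MFlops; 10^6 MFlops is 1 TFlops


# (boost clock in MHz, compute units, FP32 cores/CU, FP64 cores/CU or None)
gpus = {
    "Nvidia V100": (1530, 80, 64, 32),
    "Nvidia A100": (1410, 108, 64, 32),
    "AMD Vega 20 (MI50)": (1746, 60, 64, 32),
    "Intel Gen11": (1000, 64, 8, None),  # the example gives no FP64 figure
}

for name, (clock, cus, fp32, fp64) in gpus.items():
    line = f"{name}: {peak_tflops(clock, cus, fp32):.2f} TFlops single"
    if fp64 is not None:
        line += f", {peak_tflops(clock, cus, fp64):.2f} TFlops double"
    print(line)
```

Keep in mind that these are theoretical peaks at the boost clock. Real applications rarely sustain them, so treat the numbers as an upper bound for comparison rather than as a prediction.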

Both the Nvidia V100 and the AMD Vega 20 give impressive floating-point peak performance. The Ampere shows some additional improvement in floating-point performance, but it’s the memory performance that promises greater increases. The Intel integrated GPU is also quite impressive given that it’s limited by the available silicon space and the lower nominal design power of a CPU. With Intel announcing plans for a discrete graphics card in 2020, there’ll be more GPU options in the future.
