With the recent announcement of AMD Smart Access Memory it seemed to be the right time to write about the different types of memories available to be used by applications targeting dedicated GPUs. This article aims to provide an introduction to different memory pools within such a system, their access characteristics, and why enabling access to the entire VRAM through the PCI-Express bus could be a game changer.
Discrete GPUs are characterized by having their own dedicated memory, the VRAM. The memory itself and its connection to the GPU on the graphics board are both optimized for maximum bandwidth, to be able to feed the many GPU cores with data at a sufficient rate. This is primarily why discrete GPUs usually have significantly higher performance than integrated GPUs, or rather, this additional available bandwidth is what enables them to scale to higher core counts. This is a fundamental difference compared to CPUs, where a relatively small number of cores hammers on memory and thus low latency is a higher priority than pure bandwidth.
| Memory Type | Transactions / Pin | Bus Width | Bandwidth |
|---|---|---|---|
| DDR4 | ~ 3.2 GT/s | 64-bit (single-channel) | ~ 25.6 GB/s |
| | | 128-bit (dual-channel) | ~ 51.2 GB/s |
| GDDR5 | ~ 8 GT/s | 128-bit | ~ 128 GB/s |
| | | 256-bit | ~ 256 GB/s |
| | | 384-bit | ~ 384 GB/s |
| GDDR6 | ~ 16 GT/s | 128-bit | ~ 256 GB/s |
| | | 192-bit | ~ 384 GB/s |
| | | 256-bit | ~ 512 GB/s |
| HBM | ~ 1 GT/s | 2048-bit | ~ 256 GB/s |
| | | 3072-bit | ~ 384 GB/s |
| | | 4096-bit | ~ 512 GB/s |
| HBM2 | ~ 2 GT/s | 2048-bit | ~ 512 GB/s |
| | | 3072-bit | ~ 768 GB/s |
| | | 4096-bit | ~ 1024 GB/s |
The table above showcases well this difference between system memory and VRAM, as it can be seen that GPUs use much wider buses and bandwidth optimized memory chips to deliver effective bandwidths about an order of magnitude larger than the ones used for system memory.
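These peak figures follow directly from the transfer rate and the bus width: bandwidth equals transactions per second per pin times the bus width in bytes. A minimal Python sanity check (the specific part configurations are illustrative):

```python
def peak_bandwidth_gbs(transfers_gt_s, bus_width_bits):
    """Peak theoretical bandwidth in GB/s: per-pin transfer rate (GT/s)
    multiplied by the bus width converted to bytes."""
    return transfers_gt_s * bus_width_bits / 8

print(peak_bandwidth_gbs(3.2, 64))    # DDR4-3200, single channel -> ~25.6 GB/s
print(peak_bandwidth_gbs(16, 256))    # GDDR6 on a 256-bit bus    -> 512.0 GB/s
print(peak_bandwidth_gbs(2, 2048))    # HBM2, two stacks          -> 512.0 GB/s
```

Note that these are theoretical peaks; sustained bandwidth is lower in practice due to refresh cycles, command overhead, and access patterns.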
Nonetheless, these two memory pools don’t work in isolation, as the two processors need a way to share data with one another. This requires at least one of the GPU or CPU to be able to access the memory of the other. In practice this is possible in both directions, which leads us to the following diagram depicting the various types of memories available on such systems:
As can be seen above, the RAM accessible to the GPU can be split into two main categories:
- Local Memory, i.e. VRAM – accessible by the GPU through its local, wide memory bus
- Remote Memory, i.e. GPU-visible system RAM – accessible by the GPU through the PCIe bus
From the CPU’s perspective, it’s also worth noting that there is a CPU-visible portion of VRAM that is accessible by the CPU through the PCIe bus. This is where AMD’s Smart Access Memory technology comes into the picture, as we’ll see later.
We’ve already covered the scale of bandwidth that may be provided by the memory chip types themselves, but we didn’t talk yet about the PCI-Express bus bandwidth, which is going to be the limiting factor for inter-device data sharing.
The GPU is usually connected to the motherboard through a PCIe x16 connector. However, that’s just the form factor: the effective bus type (i.e. PCIe version) and the number of PCIe lanes available for communication between the graphics card and the motherboard will be the common denominator of the values supported by the particular CPU, motherboard, PCIe slot, and GPU. For example, the latest GPUs usually support PCIe 4.0 x16, but their lower-performance variants may only support 8 lanes, and many motherboards support 16 lanes only on their primary PCIe x16 connector, with the second (if available) often limited to 4 lanes.
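This “common denominator” behavior can be illustrated with a toy sketch (the component values are hypothetical; real PCIe link training negotiates speed and width through a hardware protocol):

```python
def negotiated_link(*components):
    """Each component advertises (pcie_version, lane_count); the link ends up
    running at the lowest version and lane count supported by all of them."""
    version = min(v for v, _ in components)
    lanes = min(l for _, l in components)
    return version, lanes

# Hypothetical system: PCIe 4.0 CPU and GPU, but the secondary
# motherboard slot only wires up 4 lanes of PCIe 3.0.
cpu, slot, gpu = (4.0, 16), (3.0, 4), (4.0, 16)
print(negotiated_link(cpu, slot, gpu))  # (3.0, 4)
```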
| Bus Type | Transactions / Lane | Number of Lanes | Bandwidth¹ ² |
|---|---|---|---|
| PCIe 3.0 | 8 GT/s | x4 | ~ 4 GB/s |
| | | x8 | ~ 8 GB/s |
| | | x16 | ~ 16 GB/s |
| PCIe 4.0 | 16 GT/s | x4 | ~ 8 GB/s |
| | | x8 | ~ 16 GB/s |
| | | x16 | ~ 32 GB/s |

¹ In each direction, as each lane allows two-way communication
² Effective data rate is lower due to 128b/130b coding
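For PCIe 3.0 and newer, the effective per-direction bandwidth can be estimated from the transfer rate, lane count, and the 128b/130b encoding overhead; a small illustrative calculation:

```python
def pcie_bandwidth_gbs(gt_per_s, lanes):
    """Effective per-direction bandwidth in GB/s for PCIe 3.0 and newer,
    accounting for 128b/130b line encoding (128 payload bits carried
    per 130 bits on the wire)."""
    return gt_per_s * lanes * (128 / 130) / 8

print(round(pcie_bandwidth_gbs(8, 16), 2))   # PCIe 3.0 x16 -> ~15.75 GB/s
print(round(pcie_bandwidth_gbs(16, 16), 2))  # PCIe 4.0 x16 -> ~31.51 GB/s
print(round(pcie_bandwidth_gbs(16, 4), 2))   # PCIe 4.0 x4  -> ~7.88 GB/s
```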
Looking at the numbers in the table above makes it clear that even though the PCIe bus enables both processors to access the memory of the other, the available bandwidth is much lower than what each gets from its own local memory, especially in the case of the GPU. As such, the decision of where certain data is placed can have a significant impact on the performance of workloads using that data.
So far we’ve only talked about bandwidth, but the other important factor when it comes to efficiently feeding data to the processors is access latency. Before going there, though, we should first say at least a few words about caches.
We won’t go into too much detail about the cache hierarchies of modern CPUs and GPUs here, as that alone deserves a much deeper discussion. Nonetheless, it’s worth noting that both exist in some form to improve the performance of spatially local memory reads (e.g. reads from nearby memory addresses) and multiple accesses to the same data. One key thing to note is that GPUs follow more relaxed cache coherency rules compared to CPUs, primarily in pursuit of higher performance, which means that the GPU usually caches all memory accesses, while that’s often not the case for the CPU, especially when it comes to accessing GPU memory data.
In particular, in the context of the remote memory heap, which is the portion of system memory usable as graphics memory (i.e. accessible by the GPU), individual memory allocations may be marked as cached or uncached, controlling whether the CPU is allowed to cache data from the underlying memory or not. We will discuss some of the implications of that distinction later in this article, but for now the key takeaway is that this leaves us with the following four graphics memory types:
- Local Visible Memory – VRAM directly accessible by the CPU
- Local Invisible Memory – VRAM not directly accessible by the CPU
- Remote Cached Memory – system memory accessible by the GPU, and cached by the CPU
- Remote Uncached Memory – system memory accessible by the GPU, but not cached by the CPU
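For Vulkan users, these four categories roughly correspond to combinations of VkMemoryPropertyFlagBits. The sketch below uses the flag values defined by the Vulkan specification, though the classification itself is a simplification, as implementations may expose further flag combinations:

```python
# VkMemoryPropertyFlagBits values as defined by the Vulkan specification
DEVICE_LOCAL  = 0x1  # lives in VRAM
HOST_VISIBLE  = 0x2  # mappable by the CPU
HOST_COHERENT = 0x4  # no explicit flush/invalidate needed
HOST_CACHED   = 0x8  # CPU accesses go through the CPU caches

def classify(flags):
    """Map a Vulkan memory type's property flags to one of the four
    graphics memory categories described above (simplified)."""
    if flags & DEVICE_LOCAL:
        return "local visible" if flags & HOST_VISIBLE else "local invisible"
    return "remote cached" if flags & HOST_CACHED else "remote uncached"

print(classify(DEVICE_LOCAL))                                 # local invisible
print(classify(DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT))  # local visible
print(classify(HOST_VISIBLE | HOST_COHERENT | HOST_CACHED))   # remote cached
print(classify(HOST_VISIBLE | HOST_COHERENT))                 # remote uncached
```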
In summary, all graphics memory accesses are cached by the GPU, but none is cached by the CPU with the exception of remote cached memory. This is important to note at least for the following reasons:
- CPU reads from uncached memory will usually take significantly more time
- Coherency of CPU cached memory with respect to the GPU may need CPU cache flushes and invalidations at the appropriate time, or some snooping protocols in place
But let’s focus on memory access latencies first, ignoring caching for now. The table below lists measured memory read latencies from the CPU and GPU perspective, respectively, when accessing memories of different types. Because the only difference between local visible and invisible memory is whether the corresponding portion of VRAM is visible to the CPU, their access characteristics are identical, so a single set of measurements is shown in the table, taken using local visible memory.
| Memory Type | CPU read latency | GPU read latency |
|---|---|---|
| GPU local (VRAM) | ~ 1200 ns¹ | ~ 340 ns |
| Remote cached | ~ 100 ns² | ~ 1100 ns³ |
| Remote uncached | ~ 115 ns² | ~ 1100 ns³ |

¹ Depends on GPU clock rates and may go up to 1900 ns
² Latency measured in case of a CPU cache miss; ~ 1/3/9 ns in case of an L1/L2/L3 cache hit
³ Latency measured in case of a GPU cache miss; ~ 150 ns on an L2 cache hit
The values tell us two stories. First, they display well that VRAM has a higher latency than system memory, which is in line with what we’ve discussed already about the GPU memory system being optimized for throughput even at the cost of added latency. Second, it’s clear from the data that going through the PCIe bus for memory access significantly increases the latency of the operation.
Resizable BAR Support
Now that we have the most important information on the table about different memory types and their performance characteristics, let’s talk about resizable BAR support, which is at the heart of AMD’s Smart Access Memory technology.
Before a PCIe device can be addressed, it must be enabled by mapping it into the system’s I/O port or memory address space, so that device drivers are able to communicate with the underlying hardware, whether through direct I/O or memory-mapped I/O operations sent to the device, or through access to the physical memory of the device, if any. This is done by programming the Base Address Registers (or BARs). The amount of available local visible memory is determined by this BAR configuration.
Traditionally, the GPU memory BAR has been limited to 256MB in size, meaning that no matter how much VRAM is on the graphics card, the CPU could only see a 256MB window of it. Even that small window wasn’t always at the disposal of application developers. In the earlier days of traditional APIs like OpenGL, this access was further limited by the fact that the APIs provided no way for the application to explicitly select a particular memory type; applications were at the mercy of the graphics driver to select an appropriate memory type based on the usage intent provided by the application.
This changed with new APIs like Vulkan that expose the individual memory types explicitly and give the application developer full control over selecting where to place their memory allocations and resources. Though even there the availability of local visible memory was not universal for a long time, with AMD exposing it from the beginning and NVIDIA following suit only earlier this year.
Still, until recently, local visible memory was limited to a 256MB window, even though the capability for resizable BARs has been part of the PCIe specification since 2008. Immediate adoption probably wasn’t an option, though, as at that time 32-bit operating systems were still widespread and it would have been impractical to dedicate more than 256MB to video memory from the already small 4GB address space available on such systems. That being said, the main reason why we had to wait so long for this feature to appear is likely the lack of appropriate support in certain layers of the hardware/firmware/software stack, e.g. Windows support seems to have arrived only relatively recently.
AMD is the first to enable resizable BAR support, even if it is currently limited to AMD’s latest graphics card series. Nonetheless, we can soon expect a proliferation of resizable BAR support across all vendors and supported products, as there is nothing fundamentally proprietary in the technology; in fact, it mostly just leverages existing capabilities by enabling support in various parts of the stack, be that the BIOS, the operating system, or the device driver. It’s thus not surprising that NVIDIA is interested in introducing their own version of SAM and that AMD is willing to help its competitors enable the feature on other system configurations. It will be especially interesting to see how far back vendors will (and can) port the feature in order to speed up adoption.
This means that the days of local invisible memory are numbered, and we may soon see the entire VRAM exposed as local visible memory in APIs like Vulkan on discrete GPUs of all vendors, and on all platforms.
It is clear from the bandwidth and latency figures that keeping all data that may need to be accessed by the GPU in VRAM provides the best performance, but it is less obvious how enabling CPU access to the entire VRAM can provide the performance benefits that have been demonstrated in the press. To better understand the reason, we need to acknowledge that a notable amount of input data is often dynamically generated on the CPU by the application each frame, be that per-frame and per-object constant data (e.g. transforms), other dynamic geometric data, streamed resources, or anything else. These are all operations that need to write to memory that is subsequently accessible by the GPU.
The obvious target choice for these is buffer memory in a local visible memory allocation, as we don’t really care about latency for write operations and the data will be available in the most performant memory from the perspective of subsequent GPU reads by its many cores. However, as the traditional 256MB local visible memory window is very easily exhausted, applications often have to resort to one of the following fall-backs once that happens:
- Use remote memory instead (typically remote uncached)
- Write to remote memory first then upload to local invisible memory using a GPU copy
Both approaches have their drawbacks compared to being able to directly write to local visible memory: (1) will result in the GPU being limited by the bandwidth and latency penalty of going through the PCIe bus with each request, while (2) requires an additional copy which, even if performed using the GPU’s built-in DMA engine and thus not taking any GPU core processing time, still takes time to complete and may require additional synchronization operations, let alone the fact that it at least temporarily doubles the overall memory requirements.
For small amounts of data shared across many GPU work items (threads), like per-frame or per-draw constants, going with fall-back (1) may be acceptable, as the relatively low bandwidth of the PCIe bus is not going to be a limiting factor, and if the data is sufficiently small and used often enough, it will likely stay cached on the GPU, so the latency cost of going through the PCIe bus can be amortized across the many threads reading it. Nonetheless, the added latency compared to accessing VRAM will still have an impact on performance, as on the first access it’s quite possible for thousands of threads to be blocked waiting for the data to arrive, with no or an insufficient amount of independent work ready to run to allow the GPU to hide that latency, and depending on the memory access pattern, the data could also be evicted from the GPU’s caches multiple times in the course of processing the frame or draw call.
For larger data sets, no matter if they are read as a stream or accessed “randomly”, fall-back (2) is preferred over (1), as beyond the latency penalty, it is also significantly more likely for the GPU to struggle to feed its processing cores fast enough with data due to the limited bandwidth of the PCIe bus. Even though the DMA transfer from remote to local memory is also limited by the PCIe connection bandwidth, it may take better advantage of the available bandwidth by reading a continuous stream, compared to the more irregular read bursts that would otherwise come from the GPU in response to in-thread reads, let alone the fact that in the meantime the dependent threads aren’t occupying precious register space on the GPU cores, allowing more non-dependent work to execute.
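The decision process described above can be condensed into a toy placement policy (all names and the size threshold are made up for illustration; real allocators are considerably more sophisticated):

```python
def place_dynamic_data(size, visible_vram_left, large_threshold=64 * 1024):
    """Toy placement policy for CPU-written, GPU-read data, following the
    strategies described above. The 64KB threshold separating "small" from
    "large" data is an arbitrary illustrative choice."""
    if size <= visible_vram_left:
        return "local visible"                   # ideal: write straight into VRAM
    if size < large_threshold:
        return "remote uncached"                 # fall-back (1): GPU reads over PCIe
    return "remote staging + DMA copy to local"  # fall-back (2): upload via GPU copy

print(place_dynamic_data(256, visible_vram_left=1024))  # local visible
```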
These are the main reasons why CPU access to the entire VRAM, thanks to resizable BAR support, can have a very direct effect on overall performance. But so far we’ve only talked about graphics memory allocated directly by the application, while graphics drivers themselves have their own resources that need to be stored in GPU-accessible memory. These include traditional resources used for internal purposes, descriptor tables/pools that are often sourced directly from graphics memory by the GPU, and the elephant in the room: command buffer memory. If the driver uses local visible memory for these, then the application has less available for its own use, and if the 256MB window is depleted, the driver has to use similar fall-backs as the application. Thus the same principles outlined above apply to these driver-internal resources as well.
We’ve seen that local memory is generally the best place to store GPU data. That doesn’t mean, however, that all data can and should be stored in local memory. For very large data sets it’s simply impossible to fit everything in VRAM, thus remote memory often needs to be used as backing storage, or better, VRAM should be used as a cache for the active (or most performance-critical) data set. In addition, there are specific cases where remote memory is a better choice for storing certain data.
One example is the staging buffer used for texture uploads. As texture data is typically available in a device-independent format on the application side, which in any case needs to be converted to the optimal tiling format supported by the target GPU, the pitch-linear staging texture data should preferably be written by the CPU to remote uncached memory and then uploaded into VRAM with a so-called linear-to-tiled copy using the graphics card’s DMA engine. The data has to go through the PCIe bus once either way, and the DMA engine may be able to do that more efficiently, while also not consuming any CPU or GPU processing core time.
Also, as local memory is not cached by the CPU, any non-trivial data read-back, let that be texture or buffer data, can benefit from using remote cached memory instead, as taking advantage of the CPU caches during read-back can be several orders of magnitude faster. The “non-trivial” part is important though as for reading back small pieces of data, let that be a counter value or a query result, using uncached memory is still preferred as it can often eliminate the need to invalidate CPU caches.
We managed to cover the key topics related to memory types available on a system with a discrete GPU. We’ve seen whether and how the CPU and GPU can access data in these memory types, and what the bandwidth and latency characteristics of those accesses are.
The news focus of this article has been the recent arrival of AMD’s Smart Access Memory technology and the implications of its introduction. The feature itself is a great example of why the technology world should never settle for the status quo, as it enables fairly simple functionality with great benefits that could have been available much earlier, had it not been held back by small gaps in the supporting hardware/firmware/software stack.
As part of that discussion, we’ve also seen when and how CPU access to the entire VRAM can improve overall application performance, and covered the main use cases for the different types of graphics memory in general.
While resizable BAR support is not the holy grail of discrete GPU technology, it is definitely a game changer for application developers who now have much greater freedom in placing their data in the appropriate memory pool while eliminating unnecessary copies that may have been needed previously to achieve that.