Introduction
On 64-bit Arm machines running Linux today, PCIe device address ranges (BARs) are mapped as normal memory instead of device memory.
What does the Arm BSA say?
E.2 PCI Express Memory Space
I It is system specific whether a system supports mapping PCI Express memory space as cacheable.
R PCI_MM_01 All systems must support mapping PCI Express memory space as device memory.
DEN0094C – Arm Base System Architecture
R PCI_MM_02 All systems must support mapping PCI Express memory space as non-cacheable memory.
R PCI_MM_03 When PCI Express memory space is mapped as normal memory, the system must support unaligned accesses to that region.
The Arm BSA says that support for mapping PCIe BARs as normal non-cacheable memory is mandatory. Whether normal cacheable mappings are supported is left system specific.
What do other OSes do?
Windows does not support the normal non-cacheable memory type. PCIe BARs are mapped as device memory on Windows.
On Apple Silicon, PCIe BARs are only accessible as device memory. Otherwise, an SError is raised by the SoC. As such, Apple OSes map PCIe BARs as device memory.
Wait, does this mean that Linux doesn’t work properly on some hardware?
Yes. Many SoCs do not provide the expected normal memory semantics as defined by the Arm BSA when mapping PCIe BARs as normal memory.
Currently, a part of the Linux community handwaves this away instead of dealing with the situation properly.
The affected SoCs include:
- Ampere Altra systems
- Honeycomb LX2K and a number of other Cortex-A72 based systems
- Apple Silicon systems (for which an exception was made, as an SError is raised on mapping PCIe BARs as normal memory)
- Raspberry Pi systems (but those don’t even implement device memory semantics properly, as 64-bit wide MMIO accesses are broken there)
- and a lot of others.
Why even want to map the BARs as normal memory? What are they even used for anyway?
The PCIe BARs (for data) are mapped as normal memory under Linux to allow for unaligned memory accesses.
For access to device register ranges, allowing unaligned accesses does not make sense. An unaligned access to device registers has very unclear semantics and can cause unintended read-modify-write sequences. Such accesses are to be strictly avoided. Currently, however, this scenario might not raise an exception on Linux systems if pgprot_device is not explicitly used. Note that the vast majority of drivers handle that part right today.
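To make the alignment point concrete, here is a minimal user-space sketch of a register accessor that refuses unaligned or out-of-range offsets instead of relying on the mapping type to catch them. The name reg_read32 and its signature are hypothetical and exist only for this illustration; it is not a kernel API (in the kernel, accessors such as readl() fill this role). The sketch assumes the base of the register window is itself at least 4-byte aligned, which holds for a page-aligned BAR mapping.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read one 32-bit register at byte offset `off`
 * inside a register window of `size` bytes starting at `base`.
 * Assumption: `base` is at least 4-byte aligned.
 * Returns 0 on success, -1 if the access would be unaligned or would
 * fall outside the window.
 */
static int reg_read32(const volatile void *base, size_t size, size_t off,
                      uint32_t *out)
{
    /* Reject offsets that are not naturally aligned for a 32-bit access. */
    if (off % sizeof(uint32_t) != 0)
        return -1;

    /* Reject accesses that do not fit entirely inside the window. */
    if (off > size || size - off < sizeof(uint32_t))
        return -1;

    /* A single, naturally aligned, volatile 32-bit load. */
    *out = *(const volatile uint32_t *)((const volatile uint8_t *)base + off);
    return 0;
}
```

With a plain uint32_t array standing in for the register window, offset 4 returns the second word, while offset 2 is rejected before any access is issued.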
And then there are GPUs and other accelerators where data can be mapped over PCIe BARs. Access to uncached memory is quite expensive and has quite varied semantics for atomics depending on SoC. As such, this is generally used only for memcpy() to a buffer in host memory for further processing. You might also consider using PCIe DMA when available instead of accessing data over a PCIe BAR.
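The copy-then-process pattern described above can be sketched as follows. This is an illustration under stated assumptions, not a driver implementation: bar_copy_to_host and sum_bytes are hypothetical names, the device window is simulated by an ordinary array, and the copy is restricted to 4-byte aligned sources and lengths that are a multiple of 4 so that every device-side access is a naturally aligned 32-bit load. In the kernel, memcpy_fromio() is the usual tool for this job.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy `len` bytes from a device window into host
 * memory using only naturally aligned 32-bit loads on the device side.
 * Assumptions: `src` is 4-byte aligned and `len` is a multiple of 4.
 * Returns 0 on success, -1 if either assumption is violated.
 */
static int bar_copy_to_host(void *dst, const volatile uint32_t *src, size_t len)
{
    uint8_t *out = dst;
    size_t i;

    if ((uintptr_t)src % sizeof(uint32_t) != 0 || len % sizeof(uint32_t) != 0)
        return -1;

    for (i = 0; i < len / sizeof(uint32_t); i++) {
        uint32_t word = src[i];  /* one aligned 32-bit device read */
        /* The store goes to host memory, so any alignment of dst is fine. */
        memcpy(out + i * sizeof(word), &word, sizeof(word));
    }
    return 0;
}

/* All further processing touches only the host copy, never the window. */
static uint32_t sum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

The design choice is that the device window is read exactly once, in aligned units, and every byte-granular or unaligned operation happens on the host buffer afterwards.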
OK, what should be done for GPUs?
For GPUs, the right thing to do here is to not use PCIe BARs to map data to user-space. A unified memory mechanism with page migration on kernel launch or use should be used instead. This also allows the user applications to operate on that allocation directly without tanking performance.
If the affected data is accessed often from the host but rarely from the device, you might also consider a pinned memory allocation. Such an allocation is device accessible but always resident in host memory.
Please never directly operate on data mapped as PCIe BARs – outside of a memory copy. This is a big performance trap to be strictly avoided. If you want to operate directly on remote device memory, please migrate to a coherent interconnect such as CXL for your own sanity.
What is the right thing to do?
Map PCIe BARs as device memory on all arm64 Linux systems to provide the expected semantics to everyone.
update: clarify that pgprot_device is used for device register access.