That FPGA was attached to the CPU over PCIe. Digging through the disassembly showed one change: the transfer loop had been autovectorised into 64-bit stores of value pairs, replacing the previous 32-bit store instructions.
The stores were being executed, and there were no bus errors, but it turned out the vendor’s PCIe endpoint in the FPGA only implemented handling for 32-bit TLPs. It latched in the larger packets OK, then silently discarded them. Root cause: somebody forgot a volatile, many years prior.
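For illustration, here is a minimal sketch of what the fixed transfer loop might look like. It is not the original code: the name `copy_to_device` and its parameters are made up for this example. The point is only that a `volatile`-qualified destination pointer makes each 32-bit store an access the compiler has to perform as written, and in practice GCC and Clang will not fuse or vectorise such stores into wider ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (assumed name and signature, not the original code):
 * copy n 32-bit words into a memory-mapped device window.
 * Because dst points to volatile data, each assignment below is a separate
 * volatile access that the compiler must not drop, and that mainstream
 * compilers do not widen or merge with its neighbours. */
static void copy_to_device(volatile uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];  /* one 32-bit store per word */
    }
}
```

Without the qualifier, the optimiser is free to treat the destination as ordinary memory and pair up adjacent stores, which is harmless for RAM but not for an endpoint that only understands 32-bit writes. Checking the disassembly after a compiler upgrade is still worthwhile, since `volatile` constrains the compiler and says nothing about what the bus or the endpoint does with the access.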