This commit ensures that the host gpu is constantly fed with commands to
work with, while the guest gpu keeps producing the rest of the commands.
This reduces syncing time between host and guest gpu.
Instead of holding a reference that will get invalidated by
dma_pushbuffer.pop(), hold it as a copy. This doesn't have any
performance cost since CommandListHeader is 8 bytes long.
As fetching command list headers and and the list of command headers is a fixed 1:1 relation now, they can be implemented within a single call.
This cleans up the Step() logic quite a bit.
Fetching every u32 from memory leads to a big overhead. So let's fetch all of them as a block if possible.
This reduces the Memory::* calls by the dma_pusher by a factor of 10.