It's also worth noting that modern CPUs pull regularly used data into L1 cache, no matter where it lives in main memory. This significantly reduces the computation time of loops over small data sets, even when the data is not held in registers.
Register access typically costs 1 clock cycle, or effectively 0 when the instruction can be paired with others. L1 cache access on modern Intel CPUs is about 4 clock cycles. On AMD it's about the same, though the exact figure varies by microarchitecture on both.
While registers are undoubtedly faster, depending on the computation there would still be spilling and filling of the available registers. In 32-bit compiled code there are only a few registers available for integer processing: eax, ebx, ecx, edx, esi, edi, and possibly ebp, though ebp is typically tied to the stack for parameters and local/temp variable storage.
If you could guarantee that you always use ecx for your loop counter, eax for your accumulator, esi and edi for pointers to the source and destination of input and output data, and ebp for local data, then you'd only have ebx and edx left for general processing (such as performing some computation) before you need to spill and fill. And if you lose a few clock cycles here and there to spilling and filling (which require memory accesses themselves), then you're not that much better off than if you had used memory references throughout your code.
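To make the register-versus-memory comparison concrete, here is a minimal C sketch. The function names are hypothetical, and `volatile` is only a stand-in I'm assuming here to force the accumulator through memory on every iteration; in real code the compiler decides register allocation, and nothing guarantees the first version stays in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulator is an ordinary local: an optimizing compiler is free to
   keep it in a register (e.g. eax) for the whole loop. */
long sum_in_register(const int *src, size_t n)
{
    assert(src != NULL || n == 0);
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

/* Accumulator is volatile: every iteration must load it from memory and
   store it back, roughly what a spill/fill on each step would cost.
   On repeated iterations that memory is expected to sit in L1 cache. */
long sum_in_memory(const int *src, size_t n)
{
    assert(src != NULL || n == 0);
    volatile long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = acc + src[i];
    return acc;
}
```

Both functions return the same result; only where the running total lives differs. Timing them on your own hardware is the only way to see the actual gap, since it depends on the CPU and compiler.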
In the end, such instances might execute more slowly, but only until they're pulled into L1 cache, which means on repeated loops it wouldn't be that much slower. Maybe 4x slower in total, but when you're talking about the speed of register-only loops ... even only 4x slower is blazingly fast. At 3.0 GHz, that could be several hundred million computations per second per core.
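The arithmetic behind that figure can be sketched in a few lines of C. This is a back-of-the-envelope bound only: the function name is hypothetical, and it assumes each computation waits on exactly one access of the given latency, ignoring pipelining, out-of-order execution, and cache misses.

```c
#include <assert.h>

/* Rough bound on dependent operations per second when each operation
   waits on one access costing `latency_cycles` clock cycles. */
double ops_per_second(double clock_hz, double latency_cycles)
{
    assert(latency_cycles > 0.0);
    return clock_hz / latency_cycles;
}
```

With a 3.0 GHz clock and the roughly 4-cycle L1 latency mentioned above, `ops_per_second(3.0e9, 4.0)` gives 750 million per second, against 3 billion for a 1-cycle register access.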
The bytecode solution is ingenious for general-purpose code because it allows code to be written once and run anywhere. Where people make it hard is when they try to squeeze out as much performance as possible, or optimize for this, that, or the other thing, making the entire engine more complex than it needs to be for a marginal increase in performance.
We'll see though. In time.
Best regards,
Rick C. Hodgin