Assembly Performance Tuning
I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example, I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).
I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:
    #include "stdafx.h"
    #include <intrin.h>
    #include <iostream>
    using namespace std;

    int _tmain(int argc, _TCHAR* argv[])
    {
        __int64 startval;
        __int64 stopval;
        unsigned int value; // Keep the value so it is not optimized out

        startval = __rdtsc(); // Get the CPU tick counter using the RDTSC opcode

        // Simple math: a = (a << 3) + 0x0054E9
        _asm {
            mov ebx, 0x1E532 // Seed
            shl ebx, 3
            add ebx, 0x0054E9
            mov value, ebx
        }

        stopval = __rdtsc();
        __int64 val = (stopval - startval);
        cout << "Result: " << value << " -> " << val << endl;

        int i;
        cin >> i;
        return 0;
    }
I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference, but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples, the number is still impossibly varied.
I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.
Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles it would take to execute the code -- can that not be done with our newfangled modern processors?
EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.
Answer by Bo Persson for Assembly Performance Tuning
The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of the instructions. That made it a lot easier to calculate the number of clocks per instruction.
On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!
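To make the encoding point concrete, here is a small sketch of the bytes involved, using the add from the question plus the xor/mov pair the question asks about. The array names are made up for illustration; the bytes are the standard IA-32 encodings in 32-bit code.

```cpp
#include <array>
#include <cstdint>

// add eax, 0x000054E9 : short accumulator form, opcode 05 followed by imm32
constexpr std::array<std::uint8_t, 5> add_eax_imm32{0x05, 0xE9, 0x54, 0x00, 0x00};

// add ebx, 0x000054E9 : general form, opcode 81 /0, ModRM byte C3, then imm32
constexpr std::array<std::uint8_t, 6> add_ebx_imm32{0x81, 0xC3, 0xE9, 0x54, 0x00, 0x00};

// xor eax, eax : opcode 31, ModRM byte C0
constexpr std::array<std::uint8_t, 2> xor_eax_eax{0x31, 0xC0};

// mov eax, 0 : opcode B8 followed by imm32
constexpr std::array<std::uint8_t, 5> mov_eax_zero{0xB8, 0x00, 0x00, 0x00, 0x00};

// The EAX form of the add is exactly one byte shorter than the EBX form.
static_assert(add_ebx_imm32.size() - add_eax_imm32.size() == 1,
              "accumulator short form saves one byte");

// Zeroing with xor is three bytes shorter than loading an immediate zero.
static_assert(mov_eax_zero.size() - xor_eax_eax.size() == 3,
              "xor zeroing saves three bytes");
```

The difference is code size only; nothing here says anything about how many cycles either form takes.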
Answer by ybungalobill for Assembly Performance Tuning
Download the Intel 64 and IA-32 Architectures Optimization Reference Manual from Intel's website.
There are many myths. I think the EAX claim is one of them.
Also note that you can no longer talk about 'which instruction is faster'. On today's hardware there is no one-to-one relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.
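A rough illustration of why dependencies matter more than the individual instruction: the two functions below (hypothetical names, not from any library) perform the same additions, but the second splits them into four independent chains that an out-of-order CPU can overlap. How much that helps, if at all, depends on the CPU and on what the compiler does with the loops, so treat it as a sketch rather than a benchmark.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each addition has to wait for the previous one to finish.
inline double sum_single_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four accumulators: four independent dependency chains over the same data.
inline double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements when n is not a multiple of 4
    return (s0 + s1) + (s2 + s3);
}
```

Both versions execute the same kind of add instruction; any difference in run time comes from how the additions depend on each other. Because floating-point addition is not associative, the two results can differ in the last bits for general input.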
Answer by 500 - Internal Server Error for Assembly Performance Tuning
I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.
Answer by Bjarke H. Roune for Assembly Performance Tuning
Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.
There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost, where it will dynamically adjust its speed based on things like the temperature of your CPU. This is more likely to be the case on a laptop. If you've got that, then you will have to turn it off for any benchmark results to be reliable.
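A minimal sketch of the loop idea, assuming portable C++ with std::chrono::steady_clock standing in for RDTSC. The names step, Timing and best_ns_per_iteration are invented for this example. It repeats the whole loop several times and keeps the fastest trial, on the assumption that interruptions and frequency changes only ever make a trial slower.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// The operation from the question: a = (a << 3) + 0x0054E9.
inline std::uint32_t step(std::uint32_t a) { return (a << 3) + 0x0054E9u; }

struct Timing {
    double ns_per_iteration;  // fastest trial, in nanoseconds per iteration
    std::uint32_t result;     // final value, returned so the work is not dead code
};

// Time `iterations` dependent steps, repeat that `trials` times, keep the minimum.
inline Timing best_ns_per_iteration(std::uint32_t seed, int iterations, int trials) {
    using Clock = std::chrono::steady_clock;
    Timing best{std::numeric_limits<double>::max(), 0};
    for (int t = 0; t < trials; ++t) {
        std::uint32_t a = seed;
        const auto start = Clock::now();
        for (int i = 0; i < iterations; ++i) a = step(a);
        const auto stop = Clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
        if (ns < best.ns_per_iteration) best.ns_per_iteration = ns;
        best.result = a;
    }
    return best;
}
```

Something like best_ns_per_iteration(0x1E532, 1000000, 20) gives one number per run of the program. An optimizer may still simplify the loop, so it is worth checking the generated assembly before trusting the figure.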
Answer by ninjalj for Assembly Performance Tuning
I think what the article tries to say about the EAX register is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.
Answer by Jerry Coffin for Assembly Performance Tuning
To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.
You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.
Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.
As such, the normal timing sequence for your code would be something like this:
    XOR EAX, EAX
    CPUID
    XOR EAX, EAX
    CPUID
    XOR EAX, EAX
    CPUID               ; Intel says by the third execution, the timing will be stable.
    RDTSC               ; read the clock
    push eax            ; save the start time
    push edx

    mov ebx, 0x1E532    ; Seed
                        ; execute test sequence
    shl ebx, 3
    add ebx, 0x0054E9
    mov value, ebx

    XOR EAX, EAX        ; serialize
    CPUID
    rdtsc               ; get end time
    pop ecx             ; get start time back (high half)
    pop esi             ; low half; esi rather than ebp keeps the frame pointer intact
    sub eax, esi        ; find end-start
    sbb edx, ecx
We're starting to get close, but there's one last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.
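For anyone who wants to try the serialize / RDTSC / code / serialize / RDTSC pattern described above without hand-written inline assembly, here is a hedged sketch using compiler intrinsics. It assumes an x86 or x86-64 target; serialize and serialized_cycles are hypothetical helper names, not part of any library. The compiler is still free to arrange the code inside the measured body, so check the generated assembly.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid, __rdtsc
#else
#include <x86intrin.h>  // __rdtsc (GCC/Clang, x86 targets only)
#endif

// Execute CPUID leaf 0 purely for its serializing effect; the outputs are discarded.
inline void serialize() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 0);
#else
    unsigned eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
#endif
}

// Returns the time-stamp counter difference across one execution of `body`.
template <typename Body>
std::uint64_t serialized_cycles(Body&& body) {
    serialize();
    serialize();
    serialize();  // three warm-up executions, as in the sequence above
    const std::uint64_t start = __rdtsc();
    body();
    serialize();  // make sure the body has finished before reading the clock again
    const std::uint64_t stop = __rdtsc();
    return stop - start;
}
```

The returned count includes the cost of the trailing CPUID, so one option is to time an empty lambda the same way, subtract that, and keep the minimum over many runs.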
Answer by zwol for Assembly Performance Tuning
You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.
Answer by Matthew Slattery for Assembly Performance Tuning
I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.