Assembly Performance Tuning ~ Discussion of Coding

Assembly Performance Tuning

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).

I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:

#include "stdafx.h"  #include   #include     using namespace std;    int _tmain(int argc, _TCHAR* argv[])  {      __int64 startval;      __int64 stopval;      unsigned int value; // Keep the value to keep from it being optomized out        startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode        // Simple Math: a = (a << 3) + 0x0054E9      _asm {          mov ebx, 0x1E532 // Seed          shl ebx, 3          add ebx, 0x0054E9          mov value, ebx      }        stopval = __rdtsc();      __int64 val = (stopval - startval);      cout << "Result: " << value << " -> " << val << endl;        int i;      cin >> i;        return 0;  }

I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.

I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.

Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles to execute the code -- can that not be done with our new-fangeled modern processors?

EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.

Answer by Bo Persson for Assembly Performance Tuning

The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of the instructions. That made it a lot easier to calculate to number of clocks per instruction.

On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!

Answer by ybungalobill for Assembly Performance Tuning

Go here and download the Architectures Optimization Reference Manual.

There are many myths. I think the EAX claim is one of them.

Also note that you can't talk anymore about 'which instruction is faster'. On today's hardware there are no 1 to 1 relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.

Answer by 500 - Internal Server Error for Assembly Performance Tuning

I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.

Answer by Bjarke H. Roune for Assembly Performance Tuning

Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.

There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost where it will dynamically adjust it's speed based on things like the temperature of your CPU. This is more likely to be the case on a laptop. If you've got that, then you will have to turn it off for any benchmark results to be reliable.

Answer by ninjalj for Assembly Performance Tuning

I think what the article tries to say about the EAX register, is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.

Answer by Jerry Coffin for Assembly Performance Tuning

To even have a hope of repeatable, determinstic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.

You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.

Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.

As such, the normal timing sequence for your code would be something like this:

XOR EAX, EAX  CPUID  XOR EAX, EAX  CPUID  XOR EAX, EAX  CPUID            ; Intel says by the third execution, the timing will be stable.  RDTSC            ; read the clock  push eax         ; save the start time  push edx        mov ebx, 0x1E532 // Seed // execute test sequence      shl ebx, 3      add ebx, 0x0054E9      mov value, ebx    XOR EAX, EAX      ; serialize  CPUID     rdtsc             ; get end time  pop ecx           ; get start time back  pop ebp  sub eax, ebp      ; find end-start  sbb edx, ecx

We're starting to get close, but there's on last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.

Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.

Answer by zwol for Assembly Performance Tuning

You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.

Answer by Matthew Slattery for Assembly Performance Tuning

I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.

Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72

Discussion of Coding

Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Sunday, December 11, 2016