7/21/2010

07-21-10 - x86

x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms. (addendum : I don't really mean x86 the ISA, I mean x86 as shorthand for the family of modern processors that run x86; in particular P-Pro through Core i7).

It's not just that the chips are super fast and easy to use. It's that they actually encourage good software engineering. To get decent speed out of the in-order PPC chips you have to abandon everything you've learned about good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals, every function has to be inline.

1. Complex addressing.

This is way more huge than I ever expected. There are two important subaspects here :

1.A. being able to do addition and simple muls in the addressing, e.g. [eax + ecx*2]

1.B. being able to use memory locations directly in instructions instead of moving through registers.

Together these make it so that on x86 you don't have to fuck around with loading shit out to temporaries. Working on variables in structs is almost exactly the same speed as working on variables in locals.

e.g.


  x = y;

 is

  mov eax, ecx


while

  x = s.array[i];

 is

  mov eax, [eax + ecx*4 + 48h]

and those run at almost the same speed !

This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!
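e.g. something like this (a sketch - the register and field offset here are made up for illustration) :

  ; int Obj::GetCount() { return this->m_count; }
  ; compiler keeps "this" in esi, m_count at offset 0Ch :

  mov eax, [esi + 0Ch]

One load, no separate address computation.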

ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad; it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it. Instead on the PPC you have to worry about things like the fact that indexing byte arrays is faster than any other size, and if you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.
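For illustration, here's the kind of contortion that pushes you into (a sketch in C; the struct and names are made up, and whether the pointer version actually wins depends on the compiler and chip) :

  struct S { int array[256]; };

  // natural style : index off i and let complex addressing eat the *4
  int sum_indexed( const struct S * s, int count )
  {
      int sum = 0;
      for ( int i = 0; i < count; i++ )
          sum += s->array[i];   // address = base + i*4 + offset ; free on x86
      return sum;
  }

  // PPC-appeasing style : walk a pointer so there's no index math to stall on
  int sum_pointer( const struct S * s, int count )
  {
      const int * p = s->array;
      const int * end = p + count;
      int sum = 0;
      for ( ; p != end; p++ )
          sum += *p;
      return sum;
  }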

2. Out of order execution.

Most people think of OOE as just making things faster and letting you be a bit lazy about code gen and stalls and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.

In particular, the entire method of referencing things through pointers is impossible in even moderately performant code without OOE.

The nasty thing that C++ (or any modern language really - Java, C#, etc. are actually much much worse in this regard) does is make you use a lot of pointers; because your data types may be polymorphic or indeterminate, it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86 it actually doesn't hurt much. Why?

Typical C++ code may be something like :


.. stuff ..
this->m_obj->func();
.. stuff ..

this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it can get rearranged to :

..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
vtable->func();
.. stuff ..

And as long as you have enough stuff to do in between it's no problem. Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).
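e.g. the cheap, common case looks something like this (hypothetical types, just to show the shape of the pattern) :

  struct Object { float x, vx; unsigned flags; };

  void Tick( Object ** table, int id, float dt )
  {
      Object * obj = table[id];   // one dependent fetch ; OOE hides it behind other work
      obj->x += obj->vx * dt;     // then a pile of work against the same pointer -
      obj->flags |= 1;            // all [obj + offset] accesses, no more chasing
  }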

ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing. But your code path is not static, and the caches have basically random delays because other threads might be using them too, so no static schedule can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling for these things. For example, to schedule as well as OOE can, you would have to do things like speculatively read ptr and *ptr even if they might only be needed when a certain branch is taken (because if you don't do the prefetching the stall will be horrific), etc. Furthermore, static scheduling can only really compete when all your functions are inline; OOE sort of inlines your functions for you since it can schedule across the call. etc. etc.

ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you when you do a switch() and also when you make virtual calls. The PPC can only handle branch prediction for static branches; there's no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.
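For concreteness, both of these compile to a "jump to variable" - load an address, then branch through a register (mtctr/bctr on PPC) - which is exactly the case these chips can't predict. A hypothetical little example :

  enum NodeType { ADD, NEG, LEAF };
  struct Node { NodeType type; const Node *left, *right; int value; };

  int Eval( const Node * n )
  {
      switch ( n->type )   // a big switch becomes a jump table = indirect branch
      {
      case ADD : return Eval(n->left) + Eval(n->right);
      case NEG : return - Eval(n->left);
      default  : return n->value;   // LEAF
      }
  }

  // a virtual call is the same thing with more loads :
  // load vtable, load function pointer, branch through the register.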

Final addendum :

Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone knows that and it's old news and boring.

What I have realized and what I'm trying to say is that these bad old chips are not only slow - it's much worse than that! They cause a regression in software engineering practice, back to the bad old days when you had to worry about shit like whether you pre-increment or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. Since I started coding on Amigas and 286's, where we had to worry about this shit, we made 20 years of progress and moved into an enlightened age where algorithms were more important than micro bullshit - and now we have regressed.


At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about the PC/x86 performance at all, just 100% write your code for the PPC.

There are lots of little places where they differ - for example on x86 you should write code to take advantage of complex addressing: you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this might hurt a lot. Similarly there are quirks about how you organize your branches, what data types you use (PPC is very sensitive to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.
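e.g. the x86-friendly base-variable style looks like this (a sketch; the struct is made up) :

  struct Particle { float x, y, z, vx, vy, vz; };

  void Advance( Particle * particles, int count, float dt )
  {
      for ( int i = 0; i < count; i++ )
      {
          Particle * p = particles + i;   // one base register...
          p->x += p->vx * dt;             // ...everything else is [p + constant offset]
          p->y += p->vy * dt;
          p->z += p->vz * dt;
      }
  }

On PPC you might instead increment p directly and spread the loads apart to hide the address-to-load latency.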

Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway.

Some other PPC notes :

1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS conflict detection). In particular, it can easily come up when you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.
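A sketch of the workaround (the skew amount is a made-up number; the right value depends on the actual conflict granularity) :

  #include <stdlib.h>
  #include <string.h>

  enum { kSkew = 128 };   // hypothetical stagger ; just has to differ mod 4k

  void copy_without_false_lhs( size_t size )
  {
      // over-allocate and offset one block so src and dst don't land on the
      // same address modulo 4096 :
      char * rawSrc = (char *) malloc( size );
      char * rawDst = (char *) malloc( size + kSkew );
      char * src = rawSrc;
      char * dst = rawDst + kSkew;

      memcpy( dst, src, size );   // no longer aliases the 4k-aligned src addresses

      free( rawSrc );
      free( rawDst );
  }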

2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables. For example, say you write a function like this :


{
  int x = 0;

  for( ... one million .. )
  {
    .. do lots of stuff using x ..
    x = blah;
  }

  external_func(&x);
}

the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right before the function call :

{
  //int x; // x = r7
  r7 = 0;

  for( ... one million .. )
  {
    .. do lots of stuff using r7 ..
    r7 = blah;
  }

  stack_x = r7;
  external_func(&stack_x);
}

Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :


{
  //int x; // x = r7
  r7 = 0;
  stack_x = r7;

  for( ... one million .. )
  {
    .. do lots of stuff using r7 - stores to stack_x every time ! ..
    r7 = blah;
    stack_x = r7;
  }

  external_func(&stack_x);
}

The conclusion is the same one I came to last time :

When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything you need as a function argument so you never have to load from globals or constants (even loading static constants seems to be compiled very badly, so you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you never take the address of). Never call external functions.
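In code, that style looks something like this (a sketch; the names are made up) :

  // everything the hot loop needs comes in as an argument ; locals only,
  // no globals, no external calls, nothing whose address escapes :
  static void Kernel( int * dst, const int * src, int count, int scale, int bias )
  {
      for ( int i = 0; i < count; i++ )
          dst[i] = src[i] * scale + bias;   // scale and bias stay in registers
  }

  // the caller pays for the global/constant loads exactly once :
  // Kernel( g_dst, g_src, g_count, g_scale, g_bias );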

10 comments:

Tom Forsyth said...

Load-op on x86 is frequently not free. An instruction like "add rax, [rbx]" is usually broken into two uops - "mov temp, [rbx]" and "add rax, temp". These are scheduled separately, they execute in different units, and they take register file bandwidth just the same as if they were two real instructions. The two advantages over just using two real instructions are (1) you don't have to use an x86 register for the temporary - this used to be exciting, but in 64-bit we have 16 integer registers and it's not a big deal any more; and (2) the combined instruction is slightly shorter to decode than the two separate ones, so it saves a bit of I$ space.

That's on the current out-of-order cores. On the in-order cores (Atom, Larrabee, probably Rock Creek) they're almost always a bad idea and are slower than breaking the instructions apart.

It gets even more complex with instructions like "add [rbx], rax", i.e. with the memory as destination. These can cause a lot of problems for OOO cores, because even though it's broken into three uops, it still has to behave as if it is a single one. There are subtleties there to do with fault handling and how many times you're allowed to read or write memory.

As for OOO vs in-order, you've made a fairly standard mistake of thinking the PowerPC cores are shit because they're in-order. No, they're shit because they're shit. They're just badly made. They built them for a clock speed that was too high, so their pipelines have way too many stages, they don't have enough bypasses to handle that many stages, and so they stall like crazy. If they'd designed them to run at half the clock speed the pipelines would be far shorter, much simpler, and they'd run a lot better.

cbloom said...

"As for OOO vs in-order, you've made a fairly standard mistake of thinking the PowerPC cores are shit because they're in-order. No, they're shit because they're shit"

Sigh. No, that's not my point at all. My point is not that they are slow, but how they affect coding practice.

In particular, the issue is that on x86, complex C++ is maybe only 25-50% slower than simple C code (no structs, only locals). On in-order PPC the C++ might be 100-200% slower than simplified C.

Certainly the reason this is so important at the moment is because these PPC chips really suck. But even if everything was faster it would still favor the old plain-C style of coding.

Basically it gives credibility to all the curmudgeons and dinosaurs who think that allocations during your frame and polymorphism and such like are terrible. On x86 you can just roll your eyes and get on with your C++, but on in-order PPC you have to grudgingly admit that they are in fact correct.

Now an actual interesting topic is how a tiny bit of out-of-orderness might fix this, which I guess is what they're doing in Atom2. But I don't know the details about that.

Tom Forsyth said...

It looks like most of your gripes are to do with the compiler, not with the low-level ASM architecture.

In practice, good code for x86 looks very similar to good code for PPC. Yes, the PPC kicks you in the nuts much much harder when you (or rather, the compiler) screw up - that's because it's a shitty design.

None of your points are anything to do with the perpetual in-order vs slightly-OOO vs fully-OOO religious wars. They're everything to do with how awful C is to compile.

Brian said...

I do have a gripe about x86. Why aren't the instructions that do moves (rep movs*) the fastest way to copy memory? Instead you have to do crazy stuff with mmx/sse instructions to reach the peak bandwidth....

  said...

yup, PPC is a different beast which requires dumb coding styles. It's also very hard to convince some people of this who have typically only worked with x86 processors in the past.

cbloom said...

"None of your points are anything to do with the perpetual in-order vs slightly-OOO vs fully-OOO religious wars. They're everything to do with how awful C is to compile."

I don't know if you're just being difficult or if you have a point that I just can't see. Maybe you should write an article about how to run C++ fast on an in order core?

The issue with OOE is memory and branch stalls.

If your memory stalls always took exactly the same amount of time and you always took the branches the same way, then yes a compiler could statically schedule the code perfectly. (* I would also argue that while this scheduling is theoretically possible it is neither realistic nor even desirable in practice, because it would have to be at link time and be very slow).

But more importantly, the stalls are *not* predictable, so no in-order execution can ever keep up with an OOE unit which is running ahead on whatever it can manage.

For example, say you execute a loop of complicated C++ code 3 times. The first time there will be branch misses and memory stalls. The OOE core will run as much code as it can while it waits for the stalls. The next time through it will run full speed without stalls.

You just can't beat dynamic scheduling for typical C++ish workloads.

(and recall I'm not talking about tight inner loops and super optimized stuff; I'm talking about the 90% of game code that you want to be decently fast but you don't want to have to hand massage).

Tom Forsyth said...

Your opening statement is that x86 is greater than any other ISA. And then the reasons you give are largely misunderstandings of what happens to x86 when it is actually executed.

But now you've switched to the completely orthogonal OOO vs in-order argument. Of course OOO gives higher performance per clock cycle (higher perf per square mm or per watt is a much more complex argument). But that's got nothing to do with the ISA used.

The actual reason x86 is better than PPC is the fairly subtle one that for rarely-executed code, the CISC encoding means you get fewer I$ misses. And that's pretty much it.

vv said...

"PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway"

PPU is 1/3 of K8 per cycle on branch- and microcode-heavy code. This is comparable to the issue width difference. That means "x86s" are not as efficient as you're trying to imagine. Xenon/PPU performance is mediocre even on good code. But that's not a big deal. They're good enough.
You failed to realize that fast code will be fast on either architecture, MUCH faster than typical C++ shit-code.

"Basically it gives credibility to all the curmudgeons and dinosaurs who think that allocations during your frame and polymorphism and such like are terrible"

On a system with limited memory, going crazy with dynamic heap allocations (no guarantees, fragmentation) is a terrible sin. I have to deal with a legacy PC codebase with many places like that, which caused major headaches and crashes. And I really hate it.
Allocation is evil everywhere, even in Java where it is cheap.
You can't be fast until you place your dynamic data in pools and ring-buffers.

cbloom said...

"You failed to realize, what fast code will be fast on either architecture, MUCH faster than typical C++ shit-code."

Oh yeah it never occurred to me that fast code would be fast.

And it's just not true. Super-optimized very low level tweaked C on the in-order-PPC is significantly slower than incredibly high level elegant simple C++ on the x86/Core chips.

"On a system with limited memory,
going crazy with dynamic heap allocations (no guarantees, fragmentation) is a terrible sin. ...
You can't be fast until you place your dynamic data in pools and ring-buffers."

This is just not true. Yes, it might be easier to manage your limited memory if you use fixed-size pools, but that is not a more efficient use of the memory, nor is it necessarily faster.

vv said...

"Super-optimized very low level tweaked C on the in-order-PPC is significantly slower than incredibly high level elegant simple C++ on the x86/Core chips."

That doesn't mean the high level "elegant" C++ is actually fast.
I'd say C++ is not a high level language at all. Have fun with boost http://yfrog.com/htboosterrorp

By your logic, engineers are wasting their time trying to SIMD-ize and low-level tune libraries like MKL etc. They might as well just use high-level C++.

"but that is not a more efficient use of the memory, nor is it necessarily faster."

So you claim that pieces of data scattered through memory are faster?

Maybe it's just me who thinks in terms of 128B-aligned DMA blocks?
