10-09-13 - Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with less threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're alread holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark up to what extend it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are gauranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)


10-03-13 - SetLastError(0)

Public reminder to myself about something I discovered a while ago.

If you want to do IO really robustly in Windows, you can't just assume that your ReadFile / WriteFile will succeed under normal usage. There are lots of nasty cases where you need to retry (perhaps with smaller IO sizes, or just after waiting a bit).

In particular you can see these errors in normal runs :

ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending

ERROR_NOT_ENOUGH_QUOTA  = single IO call too large
    not enough process space pages available
    -> SetProcessWorkingSetSize

    failure to alloc pages in the kernel address space for the IO
    try again with smaller IOs  

    normal async IO return value

    sometimes normal EOF return value

anyway, this post is not about the specifics of IO errors. (random aside : I believe that some of these annoying errors were much more common in 32-bit windows; the failure to get address space to map IO pages was a bigger problem in 32-bit (I saw it most often when running with the /3GB option which makes the kernel page space a scarce commodity), I don't think I've seen it in the field in 64-bit windows)

I discovered a while ago that ReadFile and WriteFile can fail (return false) but not set last error to anything. That is, you have something like :

SetLastError(77); // something bogus

if ( ! ReadFile(....) )
    // failure, get code :
    DWORD new_error = GetLastError();

    // new_error should be the error info about ReadFile failing
    // but sometimes it's still 77

and *sometimes* new_error is still 77 (or whatever; that is, it wasn't actually set when ReadFile failed).

I have no idea exactly what situations make the error get set or not. I have no idea how many other Win32 APIs are affected by this flaw, I only have empirical proof of ReadFile and WriteFile.

Anyhoo, the conclusion is that best practice on Win32 is to call SetLastError(0) before invoking any API where you need to know for sure that the error code you get was in fact set by that call. eg.

if ( ! SomeWin32API(...) )
    DWORD hey_I_know_this_error_is_legit = GetLastError();


That is, Win32 APIs returning failure does *not* guarantee that they set LastError.

ADD : while I'm at it :

in the VC watch window is pretty sweet.

GetLastError is :

*(DWORD *)($tib+0x34)

or *(DWORD *)(FS:[0x34]) on x86

old rants