
03-30-13 - Error codes

Some random rambling on the topic of returning error codes.

Recently I've been fixing up a bunch of code that does things like

void  MutexLock( Mutex * m )
{
    if ( ! m ) return;
    ...

yikes. Invalid argument and you just silently do nothing. No thank you.

We should all know that silently nopping in failure cases is pretty horrible. But I'm also dealing with a lot of error code returns, and it occurs to me that returning an error code in that situation is not much better.

Personally I want unexpected or unhandleable errors to just blow up my app. In my own code I would just assert; unfortunately that's not viable in OS code or perhaps even in a library.

The classic example is malloc. I hate mallocs that return null. If I run out of memory, there's no way I'm handling it cleanly and reducing my footprint and carrying on. Just blow up my app. Personally, whenever I implement an allocator, if it can't get memory from the OS it just prints a message and exits (*).

(* = aside : even better is "functions that don't fail" which I might write more about later; basically the idea is the function tries to handle the failure case itself and never returns it out to the larger app. So in the case of malloc it might print a message like "tried to alloc N bytes; (a)bort/(r)etry/return (n)ull?". Another common case is when you try to open a file for write and it fails for whatever reason, it should just handle that at the low level and say "couldn't open X for write; (a)bort/(r)etry/change (n)ame?" )
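
Something in the spirit of that, as a rough sketch (hypothetical names, console-app assumption; not code from any real allocator) :


#include <stdio.h>
#include <stdlib.h>

void * xmalloc( size_t bytes )
{
    for(;;)
    {
        void * p = malloc(bytes);
        if ( p ) return p;

        // the failure is handled right here instead of being returned to the caller :
        fprintf(stderr,"tried to alloc %zu bytes; (a)bort/(r)etry/return (n)ull? ",bytes);
        int c = getchar();
        if ( c == 'r' ) continue;    // maybe something freed memory in the meantime
        if ( c == 'n' ) return NULL; // caller explicitly asked for null
        abort();                     // default : blow up loudly
    }
}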

I think error code returns are okay for *expected* and *recoverable* errors.

Functions that you realistically expect to always succeed, and whose return values you will never check, shouldn't return error codes at all. I wrote recently about wrapping system APIs for portable code ; an example of the style of level 2 wrapping that I like is to "fix" the error returns.

(obviously this is not something the OS can do - it just has to return every error; it requires app-specific knowledge about which errors your app can encounter and successfully recover from and continue, vs. ones that just mean you have a catastrophic unexpected bug)

For example, functions that lock & unlock a mutex shouldn't fail (in my code). 99% of the user code in the world that locks and unlocks mutexes doesn't check the return value; it just calls lock and then proceeds assuming the lock succeeded - so don't return it :


void mypthread_mutex_lock(mypthread_mutex_t *mutex)
{
    int ret = pthread_mutex_lock(mutex);
    if ( ret != 0 )
        CB_FAIL("pthread_mutex_lock",ret); // unexpected error : blow up right at the call site
}

When you get a crazy unexpected error like that, the app should just blow up right at the call site (rather than silently failing and then blowing up somewhere weird later on because the mutex wasn't actually locked).

In other cases there are a mix of expected failures and unexpected ones, and the level-2 wrapper should differentiate between them :


bool mysem_trywait(mysem * sem)
{
    for(;;)
    {
        int res = sem_trywait( sem );
        if ( res == 0 ) return true; // got it

        int err = errno;
        if ( err == EINTR )
        {
            // UNIX is such balls
            continue;
        }
        else if ( err == EAGAIN )
        {
            // expected failure, no count in sem to dec :
            return false;
        }
        else
        {
            // crazy failure; blow up :
            CB_FAIL("sem_trywait",err);
        }
    }
}

(BTW best practice these days is always to copy "errno" out to an int, because errno may actually be #defined to a function call in the multithreaded world)

And since I just stumbled into it by accident, I may as well talk about EINTR. Now I understand that there may be legitimate reasons why you *want* an OS API that's interrupted by signals - we're going to ignore that, because that's not what the EINTR debate is about. So for purposes of discussion pretend that you never have a use case where you want EINTR and it's just a question of whether the API should put that trouble on the user or not.

I ranted about EINTR at RAD a while ago and was informed (reminded) this was an ancient argument that I was on the wrong side of.

Mmm. One thing certainly is true : if you want to write an operating system (or any piece of software) such that it is easy to port to lots of platforms and maintain for a long time, then it should be absolutely as simple as possible (meaning simple to implement, not simple in the API or simple to use), even at the cost of "rightness" and pain to the user. That I certainly agree with; UNIX has succeeded at being easy to port (and also succeeded at being a pain to the user).

But most people who argue on the pro-EINTR side of the argument are just wrong; they are confused about what the advantage of the pro-EINTR argument is (for example Jeff Atwood takes off on a general rant against complexity ; I think we all should know by now that huge complex APIs are bad; that's not interesting, and that's not what "Worse is Better" is about; or Jeff's example of INI files vs the registry - INI files are just massively better in every way, it's not related at all, there's no pro-con there).

(to be clear and simple : the pro-EINTR argument is entirely about simplicity of implementation and porting of the API; it's about requiring the minimum from the system)

The EINTR-returning API is not simpler (than one that doesn't force you to loop). Consider an API like this :


U64 system( U64 code );

doc :

if the top 32 bits of code are 77 this is a file open and the bottom 32 bits specify a device; the
return values then are 0 = call the same function again with the first 8 chars of the file name ...
if it returns 7 then you must sleep at least 1 milli and then call again with code = 44 ...
etc.. docs for 100 pages ...

what you should now realize is that *the docs are part of the API*. (that is not a "simple" API)

An API that requires you to carefully read about the weird special cases and understand what is going on inside the system is NOT a simple API. It might look simple, but it's in disguise. A simple API does what you expect it to. You should be able to just look at the function signature and guess what it does and be right 99% of the time.

Aside from the issue of simplicity, any API that requires you to write the exact same boiler-plate every time you use it is just a broken fucking API.

Also, I strongly believe that any API which returns error codes should be usable if you don't check the error code at all. Yeah yeah in real production code of course you check the error code, but for little test apps you should be able to do :


char buf[65536];
int fd = open("blah",O_RDONLY);

read(fd,buf,sizeof(buf));

close(fd);

and that should work okay in my hack test app. Nope, not in UNIX it doesn't. Thanks to its wonderful "simplicity" you have to call "read" in a loop because it might decide to return before the whole read is done.
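
The boiler-plate you're forced to write is something like this (the standard idiom, sketched here, not code from any particular library) :


#include <errno.h>
#include <unistd.h>

// the loop POSIX makes you write around read() ; a level-2 wrapper hides it :
ssize_t read_full(int fd, void * buf, size_t count)
{
    size_t done = 0;
    while ( done < count )
    {
        ssize_t got = read(fd, (char *)buf + done, count - done);
        if ( got > 0 ) { done += (size_t)got; continue; }
        if ( got == 0 ) break;          // EOF
        if ( errno == EINTR ) continue; // interrupted before reading anything; just retry
        return -1;                      // real error
    }
    return (ssize_t)done;
}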

Another example that occurs to me is the reuse of keywords and syntax in C. Things like making "static" mean something completely different depending on how you use it make the number of special keywords smaller. But I believe it actually makes the "API" of the language much *more* complex. Instead of intuitive, clear, separate keywords for each meaning that you could perhaps figure out just by looking at them, you have to read a bunch of docs and have very technical knowledge of the internals of what the keyword means in each usage. (there are legitimate advantages to minimizing the number of keywords, of course, like leaving as many names available to users as possible). Knowledge required to use an API is part of the API. Simplicity is determined by the amount of knowledge required to do things correctly.
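
For example, the exact same keyword in plain standard C :


static int s_call_count;     // at file scope : internal linkage, hidden from other translation units

static int next_id(void)     // on a function : again internal linkage
{
    static int id = 0;       // inside a function : static storage duration, one instance that persists across calls
    return ++id;
}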

25 comments:

won3d said...

First of all, I've always felt that "worse is better" was more of an apology than a design principle. I think the root issue is that the benefits of formalism often don't pass any Pepsi-challenges (aka, look how terse my Hello World program is!).

Fortunately, we still have people around with formalism hard-ons like the type theory nerds who explore this space and espouse the benefits of things like algebraic data types, monoids, and the zoo of monad utility (zippers, iteratee, etc). Of course, the formal versions of these tend to be discovered long after the hackers find the specialized or shitty versions. In other words, these techniques are outbred a la Idiocracy.

Tangential to the main point, but still perhaps useful to you, GCC (and probably clang) have the function attribute "warn_unused_result" that complains when you silently let a return value slip by.

As for your malloc example, a somewhat useful wrapper is a NOTNULL(ptr) thing that panic-fails if ptr == NULL and returns ptr otherwise (or the macro equivalent).
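
(Roughly, for the C version - the attribute syntax is GCC/clang's, the NOTNULL helper is just a sketch of the idea :)


#include <stdio.h>
#include <stdlib.h>

// warns at the call site if the caller ignores the return value :
__attribute__((warn_unused_result)) int try_something(void) { return -1; }

// panic-fail if p is NULL, pass it through otherwise :
static void * notnull_check( void * p, const char * expr )
{
    if ( ! p )
    {
        fprintf(stderr,"NOTNULL failed: %s\n",expr);
        abort();
    }
    return p;
}
#define NOTNULL(p)  notnull_check((p),#p)

// usage : char * buf = NOTNULL( malloc(100) );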

johnb said...

The EINTR thing (sort of) gets more complicated, because different systems and different syscalls do different things. On some platforms, with some sigaction and signal handling configurations, for some syscalls, the system will do the right thing and restart automatically instead of sending EINTR. (c.f. glibc's signal docs and the SA_RESTART sigaction flag)

Although I suppose it doesn't matter; you just always write the EINTR handling loop and if it doesn't loop then you haven't lost anything significant.
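
(The SA_RESTART setup, for reference - standard sigaction usage, sketched, not code from anywhere in particular :)


#include <signal.h>
#include <string.h>

static void on_signal( int sig ) { (void)sig; /* ... */ }

// install a handler with SA_RESTART so that (for the syscalls that support it)
// the kernel restarts the call instead of failing it with EINTR :
void install_handler( void )
{
    struct sigaction sa;
    memset(&sa,0,sizeof(sa));
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1,&sa,NULL);
}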

Memory allocation failure (system resource allocation failure in general) is a PITA. You should really have fixed memory budgets for everything and just allocate everything at the start, and then you don't have to worry about it.

Of course some programs can't do that because they have a dataset of unknown size and can't process data as a stream with a fixed-size working buffer. In that case you should compute an upper bound on memory use as early as possible and allocate it, so that you've only got a handful of places where you have to cope with memory allocation failure.

But some programs can't even do *that* because they're doing something with no useful way of putting a bound on memory requirements, and then you get the worst case and have to write all of the code so that you can always get back to a stable internal state, and you can always gracefully save your file and close, all without any allocation (so having fixed, pre-allocated memory to handle all of that stuff is still essential).

Unfortunately, because our languages and tools in general don't help with that way of writing code, the effort required to work out a (safe) memory budget and allocate everything up front is massive and so it's overall cheaper to just assume you'll have enough memory and blow up when you find out you're wrong.

cbloom said...

"First of all, I've always felt that "worse is better" was more of an apology than a design principle."

I can't speak to how the author intended it (the tone is pretty hard for me to parse in that piece), but since then people have definitely taken it as an argument for simplicity of the implementation of the OS/library over all else.

cbloom said...

"Tangential to the main point, but still perhaps useful to you, GCC (and probably clang) have the function attribute "warn_unused_result" that complains when you silently let a return value slip by."

Yeah I spotted that in one of my recent gcc ports.

If all your functions that shouldn't really ever fail are constantly spewing return values at you, then I think that warning is more of an annoyance than a help.

But if you did as I advise and made most errors be either explosions or handled inside the function, and only return error codes where it is likely and reasonable that the caller actually will write error checking code - then I think this warning is pretty cool and would be a good robustinator.

cbloom said...

BTW adding on to that comment, it gave me a moment of clarity :

Any time you are writing a function, ask yourself :

Will a caller of this function actually realistically write code that takes the error value I return and checks it and does something useful with that information?

(eg. will the client write code like

int err = whatever(..)
if ( err == Eblah )
do X
else if ( err == Eshite )
do Y
etc.

)

If so, then yes, sure return an error.

But if you are honest with yourself and most callers of the function will just want to do :

whatever();

or

int err = whatever();
ASSERT_RELEASE( err == EOK );

then don't return an error code.

Unknown said...

"look how terse my Hello World program is! ...we still have people around with formalism hard-ons like the type theory nerds who explore this space and espouse the benefits of things like algebraic data types, monoids..."

Programming Languages as an academic discipline is so dull... inventing new ways to solve problems that already have solutions... most of the interesting/useful stuff has been known for three decades or more. Functional programming languages become ever more esoteric. To me, a programming language serves a purpose not unlike spoken language. Yeah, English has flaws, and you could invent a better language, according to some definition of better, maybe something along the lines of Esperanto. But the point is what you're trying to say and that you can make yourself understood, not that you're speaking the perfect ideal language -- whatever that would mean.

"Unfortunately, because our languages and tools in general don't help with that way of writing code, the effort required to work out a (safe) memory budget and allocate everything up front is massive and so it's overall cheaper to just assume you'll have enough memory and blow up when you find out you're wrong."

I don't think that's an unreasonable way to go. If you're targeting a 32-bit architecture, then you might need to care a lot about virtual address space, but physical memory is a fluid thing.

Here's some nitty-gritty on memory allocation in Linux:

https://news.ycombinator.com/item?id=2544387
http://www.etalabs.net/overcommit.html

I was going to make the point that your app won't know the system has run out of memory until a page fault - but the second link calls that a "myth," so now I'd feel like a liar if I said it. Obviously it's complicated and depends a lot on the OS you happen to be running on.

Unknown said...

"I can't speak to how the author intended it (the tone is pretty hard for me to parse in that piece), but since then people have definitely taken it as an argument for simplicity of the implementation of the OS/library over all else."

It's a very old document, in computer terms, and I think it can't be completely separated from its historical context. It came out of the peculiar Lisp culture that was in decline at that time. Although I agree with it, I think it's ironic that Lisp is held up as "better" -- Lisp was hardly designed as a "diamond-like jewel." Common Lisp and Scheme were late implementations and benefited from a long evolution.

Unknown said...

Isn't EINTR something you have to deal with only if you're writing your own signal handlers? It seems like a simplifying assumption if, in your signal handler, you don't have to worry that another part of the process is blocked on a system call.

cbloom said...

WRST the original Worse is Better I found this to be a pretty decent discussion of the issue of complexity :

http://www.faqs.org/docs/artu/complexitychapter.html

As for why EINTR is not useless in reality, see :

http://en.wikipedia.org/wiki/PCLSRing

http://www.250bpm.com/blog:12

http://marc.info/?l=linux-kernel&m=97328060332308&w=2

https://issues.apache.org/bugzilla/show_bug.cgi?id=41975

cbloom said...

"You should really have fixed memory budgets for everything and just allocate everything at the start, and then you don't have to worry about it."

Total disagree. The range of applications where this is appropriate is extremely narrow, perhaps only games (and not casual games that the user might play while doing other things).

Apps should consume my system resources proportionally to how much I use them.

In fact one of the massive fails of the modern era is the way everyone using the standard STLs never returns any memory to the OS, because STLs all come with a freelist node allocator that never gives memory back, so memory usage never goes down from its peak.

Allocation strategies all have different tradeoffs.

johnb said...

"Apps should consume my system resources proportionally to how much I use them".

Right, but 1) I already said that you can't allocate absolutely everything up front if your dataset size is unknown, and 2) sharing memory between programs is what the virtual memory system is for.

Individual programs should still reserve what they need as soon as they can work out how much they need; that way they fail early or not at all, instead of failing in the middle of their work and having to either blow up completely or pay the development and maintenance cost of having zero-allocation graceful clean-up code absolutely everywhere.

Cyan said...

Java seems to have a better mechanism to handle errors; it's called "exceptions".

My understanding is that Java's exceptions force the programmer to actually do something about the exception, if he wants the program to go on.
No more "silent error" because the programmer did not check whether the return code was an error, something which is much too easy to do in C.

cbloom said...

@jb - yeah okay I basically agree with that. A more general principle is that if your app might fail anywhere, try to make it fail as soon as possible. Don't run a long compute and then fail a malloc after half an hour - try to front-load all your resource acquisition.

I thought you were talking about the more game-style approach of allocating yourself a 500M chunk at startup (regardless of whether you will actually need that much in this session) and then allocating out of that.

johnb said...

"I thought you were talking more about the more game-style allocate yourself a 500M chunk at startup"

I partly was, but you're right that's an unusual special case.

Unknown said...

"My understanding is that Java's exceptions force the programmer to actually do something about the exception"

The programming language can't force the programmer to do anything, save add a bunch of boilerplate to satisfy the compiler. Java's checked exceptions are widely considered to have been a bad idea. Googling "checked exceptions" turns up mostly complaints.

Unknown said...

cbloom --

Nice link on complexity. This gets to the issue:

"Early distributed-hypertext projects such as NLS and Xanadu were severely constrained by the MIT-philosophy assumption that dangling links were an unacceptable breakdown in the user interface; this constrained the systems to either browsing only a controlled, closed set of documents (such as on a single CD-ROM) or implementing various increasingly elaborate replication, caching, and indexing methods in an attempt to prevent documents from randomly disappearing. Tim Berners-Lee cut through this Gordian knot by punting the problem in classic New Jersey style."

So it's not only true that "worse" can be better, "better" is sometimes just worse masquerading as better.

"Better" is stubborn. It assumes that it knows which problems are important. Dangling links turned out to be an acceptable trade-off.

Anonymous said...

"Yeah yeah in real production code of course you check the error code, but for little test apps you should be able to do: [..]
And you can. At least on Linux, EINTR only occurs in programs that ask for it (i.e. hook signal handlers and don't specify SA_RESTART). See man 7 signal for details, specifically the "Interruption of system calls and library functions by signal handlers" section.

cbloom said...

Oh yeah, you're right. I was thinking that's terrible practice, but of course it's totally fine for test apps.

Anonymous said...

FWIW, I rewrote the error handling in Iggy a couple weeks ago. Now detailed error codes are only returned if it seems they can usefully be handled; boolean success/fail is returned if that can usefully be handled; otherwise void.

All the old error cases are still detected and an error message is sent to a user-installed callback function which is intended to print to debug console.

My slogan-y way of thinking about this is along the lines of "if the program can fix it return an error code; if the programmer must fix it print a message".
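
(Shape-wise it's roughly like this - names invented for the sketch, not the real Iggy API :)


// hypothetical sketch of that scheme, not the actual Iggy interface :
typedef void (*error_print_fn)( const char * message );

static error_print_fn s_error_print = 0;

void set_error_callback( error_print_fn fn ) { s_error_print = fn; }

static void error_print( const char * msg )
{
    if ( s_error_print ) s_error_print(msg); // e.g. routed to the debug console
}

// "programmer must fix it" : report the message, return void :
void do_unfailable_thing( void )
{
    // ... on an unexpected internal failure :
    error_print("do_unfailable_thing: unexpected failure");
}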

cbloom said...

@Fabian again - I just remembered that I specifically used "read" in that example, because on UNIX you have to call read() in a loop even without the EINTR issue. read() is allowed to return before the full read is done, even when there's no EOF and no error.

(presumably that doesn't actually happen in practice on normal files, only on network ports and other weird things, but because of the API spec if you don't call it in a loop you could have broken code)

cbloom said...

Or something. Bleh, whatever. I used to think it was neat to learn all the quirks and tricks about each OS and API, but now I just find it ever so tedious.

Anonymous said...

read() will return with less bytes read than specified in the exact same cases as Windows ReadFile would (end of file, reading from a pipe/socket that doesn't have any extra data available, reading from a console buffer that has less than the specified amount of bytes you can read).

The only extra complication is signals, but as said if you use SA_RESTART (which is the default for all signal handlers) read will always complete on Linux (unless it hits one of the other conditions I just mentioned), again see man 7 signal.

cbloom said...

"read() will return with less bytes read than specified in the exact same cases as Windows ReadFile would"

Where does this information come from? That's not what the docs say.

I don't see anywhere in the Win32 docs that say ReadFile will return less than the # requested except in error cases.

Conversely, the POSIX docs specifically say read() can return less than the # requested and do not guarantee under what conditions that is done.

Anonymous said...

Just found this thread again and realized I never replied. :)

In addition to the cases mentioned in the ReadFile docs (pipes, consoles/serial devices) I've also seen it return partial reads on network shares, even when there was no error. In particular, I've seen this on large-block (in the ballpark of 32MB) reads via SMB on Windows XP (a couple years ago). No idea whether that still happens, but I wouldn't be surprised.

cbloom said...

Good to know.

Making Win32's ReadFile actually robust is *crazy* complicated.

I should probably post a snippet that does it some day.

I've also never gotten comfortable with Win32's networked file performance.
