12/05/2008

12-05-08 - lrotl

Well I found one x86 ASM widget. I've always known you could do nice fast barrel rotations on x86 but thought they were inaccessible from C. Huzzah! Stdlib has a function "_lrotl()" which is exactly what you want, and happily it is one of the magic functions the MSVC recognizes in their compiler and turns into assembly with all goodness. (They also have custom handling for strcmp, memcpy, etc.)

Oh, I noticed lrotl in OpenSSL which seems to have a ton of good code for different hashes/checksums/digests/whatever-the-fuck-you-call-them's.

As a test I tried it on Sean's hash, which is quite good and fast for C strings :


RADINLINE U32 stb_hash(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = (hash << 7) + (hash >> 25) + *str++;
   }
   return hash;
}

RADINLINE U32 stb_hash_rot(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash : 6.43 clocks per char
stb_hash_rot : 3.24 clocks per char

Woot! Then I also remembered something neat I saw today at Paul Hsieh's Assembly Lab . A quick check for whether a 32 bit word has any null byte in it :

#define has_nullbyte(x) (((x) - 0x01010101) & ( ~(x) & 0x80808080 ) )

Which can of course be used to make an unrolled stb_hash :

RADINLINE U32 stb_hash_rot_dw(const char *str)
{
   U32 hash = 0;
   
   while ( ! has_nullbyte( *((U32 *)str) ) )
   {
      hash = _lrotl(hash,7) + str[0];
      hash = _lrotl(hash,7) + str[1];
      hash = _lrotl(hash,7) + str[2];
      hash = _lrotl(hash,7) + str[3];
      str += 4;
   }
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash_rot_dw : 2.50 clocks

So anyway, I'm getting distracted by pointless nonsense, but it's nice to know lrotl works. (and yes, yes, you could be faster still by switching the hash algorithm to something that works directly on dwords)

2 comments:

won3d said...

So, GCC manages to do some nice things. "(x<25) + (x>>7)" automatically turns into rotl, and accessing the high 32-bits of the 64-bit multiplication result doesn't have any ridiculousness. Maybe some extra MOVs.

Anyway, combining chocolate/peanut butter, using all 64 bits of the MUL seems like it makes a pretty good/fast hash. I'm still trying to figure out a good way to handle the final mix with MUL, which seems expensive there.

won3d said...

OK I think I figured out a good way to do the last, NUL-containing word. The trick is to overlap the multiply with the search for the NUL. What seems to work pretty well is to first check if the first byte is NUL, then kick off the multiply, then find the NUL byte to compute a mask you use on the eventual product. The mask should include the valid and NUL bytes, but zero-out the following garbage bytes.

Also, it is probably a good idea to incorporate the length of the string in the hash somehow, although less important for NUL-terminated strings.

old rants