2018-03-03 09:35:28 +00:00
|
|
|
|
---
|
|
|
|
|
created_at: '2016-07-03T17:42:37.000Z'
|
|
|
|
|
title: Bit Twiddling Hacks (2005)
|
|
|
|
|
url: https://graphics.stanford.edu/~seander/bithacks.html
|
|
|
|
|
author: e-sushi
|
|
|
|
|
points: 127
|
|
|
|
|
story_text:
|
|
|
|
|
comment_text:
|
|
|
|
|
num_comments: 24
|
|
|
|
|
story_id:
|
|
|
|
|
story_title:
|
|
|
|
|
story_url:
|
|
|
|
|
parent_id:
|
|
|
|
|
created_at_i: 1467567757
|
|
|
|
|
_tags:
|
|
|
|
|
- story
|
|
|
|
|
- author_e-sushi
|
|
|
|
|
- story_12026879
|
|
|
|
|
objectID: '12026879'
|
2018-06-08 12:05:27 +00:00
|
|
|
|
year: 2005
|
2018-03-03 09:35:28 +00:00
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
## Bit Twiddling Hacks
|
|
|
|
|
|
|
|
|
|
### By Sean Eron Anderson
|
|
|
|
|
seander@cs. stanford.edu
|
|
|
|
|
|
|
|
|
|
Individually, the **code snippets here are in the public domain**
|
|
|
|
|
(unless otherwise noted) — feel free to use them however you please. The
|
|
|
|
|
aggregate collection and descriptions are © 1997-2005 Sean Eron
|
|
|
|
|
Anderson. The code and descriptions are distributed in the hope that
|
|
|
|
|
they will be useful, but **WITHOUT ANY WARRANTY** and without even the
|
|
|
|
|
implied warranty of merchantability or fitness for a particular purpose.
|
|
|
|
|
As of May 5, 2005, all the code has been tested thoroughly. Thousands of
|
|
|
|
|
people have read it. Moreover, [Professor Randal
|
|
|
|
|
Bryant](http://www-2.cs.cmu.edu/~bryant/), the Dean of Computer Science
|
|
|
|
|
at Carnegie Mellon University, has personally tested almost everything
|
|
|
|
|
with his [Uclid code verification
|
|
|
|
|
system](http://www-2.cs.cmu.edu/~uclid/). What he hasn't tested, I have
|
|
|
|
|
checked against all possible inputs on a 32-bit machine. **To the first
|
|
|
|
|
person to inform me of a legitimate bug in the code, I'll pay a bounty
|
|
|
|
|
of US$10 (by check or Paypal)**. If directed to a charity, I'll pay
|
|
|
|
|
US$20.
|
|
|
|
|
|
|
|
|
|
### Contents
|
|
|
|
|
|
|
|
|
|
When totaling the number of operations for algorithms here, any C
|
|
|
|
|
operator is counted as one operation. Intermediate assignments, which
|
|
|
|
|
need not be written to RAM, are not counted. Of course, this operation
|
|
|
|
|
counting approach only serves as an approximation of the actual number
|
|
|
|
|
of machine instructions and CPU time. All operations are assumed to take
|
|
|
|
|
the same amount of time, which is not true in reality, but CPUs have
|
|
|
|
|
been heading increasingly in this direction over time. There are many
|
|
|
|
|
nuances that determine how fast a system will run a given sample of
|
|
|
|
|
code, such as cache sizes, memory bandwidths, instruction sets, etc. In
|
|
|
|
|
the end, benchmarking is the best way to determine whether one method is
|
|
|
|
|
really faster than another, so consider the techniques below as
|
|
|
|
|
possibilities to test on your target architecture.
|
|
|
|
|
|
|
|
|
|
int v; // we want to find the sign of v
|
|
|
|
|
int sign; // the result goes here
|
|
|
|
|
|
|
|
|
|
// CHAR_BIT is the number of bits per byte (normally 8).
|
|
|
|
|
sign = -(v < 0); // if v < 0 then -1, else 0.
|
|
|
|
|
// or, to avoid branching on CPUs with flag registers (IA32):
|
|
|
|
|
sign = -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
|
|
|
|
|
// or, for one less instruction (but not portable):
|
|
|
|
|
sign = v >> (sizeof(int) * CHAR_BIT - 1);
|
|
|
|
|
|
|
|
|
|
The last expression above evaluates to sign = v \>\> 31 for 32-bit
|
|
|
|
|
integers. This is one operation faster than the obvious way, sign = -(v
|
|
|
|
|
\< 0). This trick works because when signed integers are shifted right,
|
|
|
|
|
the value of the far left bit is copied to the other bits. The far left
|
|
|
|
|
bit is 1 when the value is negative and 0 otherwise; all 1 bits gives
|
|
|
|
|
-1. Unfortunately, this behavior is architecture-specific.
|
|
|
|
|
|
|
|
|
|
Alternatively, if you prefer the result be either -1 or +1, then
|
|
|
|
|
use:
|
|
|
|
|
|
|
|
|
|
sign = +1 | (v >> (sizeof(int) * CHAR_BIT - 1)); // if v < 0 then -1, else +1
|
|
|
|
|
|
|
|
|
|
On the other hand, if you prefer the result be either -1, 0, or +1, then
|
|
|
|
|
use:
|
|
|
|
|
|
|
|
|
|
sign = (v != 0) | -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
|
|
|
|
|
// Or, for more speed but less portability:
|
|
|
|
|
sign = (v != 0) | (v >> (sizeof(int) * CHAR_BIT - 1)); // -1, 0, or +1
|
|
|
|
|
// Or, for portability, brevity, and (perhaps) speed:
|
|
|
|
|
sign = (v > 0) - (v < 0); // -1, 0, or +1
|
|
|
|
|
|
|
|
|
|
If instead you want to know if something is non-negative, resulting in
|
|
|
|
|
+1 or else 0, then
|
|
|
|
|
use:
|
|
|
|
|
|
|
|
|
|
sign = 1 ^ ((unsigned int)v >> (sizeof(int) * CHAR_BIT - 1)); // if v < 0 then 0, else 1
|
|
|
|
|
|
|
|
|
|
Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C
|
|
|
|
|
specification leaves the result of signed right-shift
|
|
|
|
|
implementation-defined, so on some systems this hack might not work. For
|
|
|
|
|
greater portability, Toby Speight suggested on September 28, 2005 that
|
|
|
|
|
CHAR\_BIT be used here and throughout rather than assuming bytes were 8
|
|
|
|
|
bits long. Angus recommended the more portable versions above, involving
|
|
|
|
|
casting on March 4, 2006. [Rohit Garg](http://rpg-314.blogspot.com/)
|
|
|
|
|
suggested the version for non-negative integers on September 12, 2009.
|
|
|
|
|
|
|
|
|
|
int x, y; // input values to compare signs
|
|
|
|
|
|
|
|
|
|
bool f = ((x ^ y) < 0); // true iff x and y have opposite signs
|
|
|
|
|
|
|
|
|
|
Manfred Weis suggested I add this entry on November 26, 2009.
|
|
|
|
|
|
|
|
|
|
int v; // we want to find the absolute value of v
|
|
|
|
|
unsigned int r; // the result goes here
|
|
|
|
|
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
|
|
|
|
|
|
|
|
|
|
r = (v + mask) ^ mask;
|
|
|
|
|
|
|
|
|
|
Patented variation:
|
|
|
|
|
|
|
|
|
|
r = (v ^ mask) - mask;
|
|
|
|
|
|
|
|
|
|
Some CPUs don't have an integer absolute value instruction (or the
|
|
|
|
|
compiler fails to use them). On machines where branching is expensive,
|
|
|
|
|
the above expression can be faster than the obvious approach, r = (v \<
|
|
|
|
|
0) ? -(unsigned)v : v, even though the number of operations is the same.
|
|
|
|
|
|
|
|
|
|
On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C
|
|
|
|
|
specification leaves the result of signed right-shift
|
|
|
|
|
implementation-defined, so on some systems this hack might not work.
|
|
|
|
|
I've read that ANSI C does not require values to be represented as two's
|
|
|
|
|
complement, so it may not work for that reason as well (on a
|
|
|
|
|
diminishingly small number of old machines that still use one's
|
|
|
|
|
complement). On March 14, 2004, Keith H. Duggar sent me the patented
|
|
|
|
|
variation above; it is superior to the one I initially came up with,
|
|
|
|
|
`r=(+1|(v>>(sizeof(int)*CHAR_BIT-1)))*v`, because a multiply is not
|
|
|
|
|
used. Unfortunately, this method has been
|
|
|
|
|
[patented](http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150)
|
|
|
|
|
in the USA on June 6, 2000 by Vladimir Yu Volkonsky and assigned to [Sun
|
|
|
|
|
Microsystems](http://www.sun.com/). On August 13, 2006, Yuriy Kaminskiy
|
|
|
|
|
told me that the patent is likely invalid because the method was
|
|
|
|
|
published well before the patent was even filed, such as in [How to
|
|
|
|
|
Optimize for the Pentium
|
|
|
|
|
Processor](http://www.goof.com/pcg/doc/pentopt.txt) by Agner Fog, dated
|
|
|
|
|
November, 9, 1996. Yuriy also mentioned that this document was
|
|
|
|
|
translated to Russian in 1997, which Vladimir could have read. Moreover,
|
|
|
|
|
the Internet Archive also has an old
|
|
|
|
|
[link](http://web.archive.org/web/19961201174141/www.x86.org/ftp/articles/pentopt/PENTOPT.TXT)
|
|
|
|
|
to it. On January 30, 2007, Peter Kankowski shared with me an [abs
|
|
|
|
|
version](http://smallcode.weblogs.us/2007/01/31/microsoft-probably-uses-the-abs-function-patented-by-sun/)
|
|
|
|
|
he discovered that was inspired by Microsoft's Visual C++ compiler
|
|
|
|
|
output. It is featured here as the primary solution. On December 6,
|
|
|
|
|
2007, Hai Jin complained that the result was signed, so when computing
|
|
|
|
|
the abs of the most negative value, it was still negative. On April 15,
|
|
|
|
|
2008 Andrew Shapira pointed out that the obvious approach could
|
|
|
|
|
overflow, as it lacked an (unsigned) cast then; for maximum portability
|
|
|
|
|
he suggested `(v < 0) ? (1 + ((unsigned)(-1-v))) : (unsigned)v`. But
|
|
|
|
|
citing the ISO C99 spec on July 9, 2008, Vincent Lefèvre convinced me to
|
|
|
|
|
remove it becasue even on non-2s-complement machines -(unsigned)v will
|
|
|
|
|
do the right thing. The evaluation of -(unsigned)v first converts the
|
|
|
|
|
negative value of v to an unsigned by adding 2\*\*N, yielding a 2s
|
|
|
|
|
complement representation of v's value that I'll call U. Then, U is
|
|
|
|
|
negated, giving the desired result, -U = 0 - U = 2\*\*N - U = 2\*\*N -
|
|
|
|
|
(v+2\*\*N) = -v = abs(v).
|
|
|
|
|
|
|
|
|
|
int x; // we want to find the minimum of x and y
|
|
|
|
|
int y;
|
|
|
|
|
int r; // the result goes here
|
|
|
|
|
|
|
|
|
|
r = y ^ ((x ^ y) & -(x < y)); // min(x, y)
|
|
|
|
|
|
|
|
|
|
On some rare machines where branching is very expensive and no condition
|
|
|
|
|
move instructions exist, the above expression might be faster than the
|
|
|
|
|
obvious approach, r = (x \< y) ? x : y, even though it involves two more
|
|
|
|
|
instructions. (Typically, the obvious approach is best, though.) It
|
|
|
|
|
works because if x \< y, then -(x To find the maximum, use:
|
|
|
|
|
|
|
|
|
|
r = x ^ ((x ^ y) & -(x < y)); // max(x, y)
|
|
|
|
|
|
|
|
|
|
#### Quick and dirty versions:
|
|
|
|
|
|
|
|
|
|
If you know that INT\_MIN \<= x - y \<= INT\_MAX, then you can use the
|
|
|
|
|
following, which are faster because (x - y) only needs to be evaluated
|
|
|
|
|
once.
|
|
|
|
|
|
|
|
|
|
r = y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
|
|
|
|
|
r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
|
|
|
|
|
|
|
|
|
|
Note that the 1989 ANSI C specification doesn't specify the result of
|
|
|
|
|
signed right-shift, so these aren't portable. If exceptions are thrown
|
|
|
|
|
on overflows, then the values of x and y should be unsigned or cast to
|
|
|
|
|
unsigned for the subtractions to avoid unnecessarily throwing an
|
|
|
|
|
exception, however the right-shift needs a signed operand to produce all
|
|
|
|
|
one bits when negative, so cast to signed there.
|
|
|
|
|
|
|
|
|
|
On March 7, 2003, Angus Duggan pointed out the right-shift portability
|
|
|
|
|
issue. On May 3, 2005, Randal E. Bryant alerted me to the need for the
|
|
|
|
|
precondition, INT\_MIN \<= x - y \<= INT\_MAX, and suggested the
|
|
|
|
|
non-quick and dirty version as a fix. Both of these issues concern only
|
|
|
|
|
the quick and dirty version. Nigel Horspoon observed on July 6, 2005
|
|
|
|
|
that gcc produced the same code on a Pentium as the obvious solution
|
|
|
|
|
because of how it evaluates (x \< y). On July 9, 2008 Vincent Lefèvre
|
|
|
|
|
pointed out the potential for overflow exceptions with subtractions in r
|
|
|
|
|
= y + ((x - y) & -(x \< y)), which was the previous version. Timothy B.
|
|
|
|
|
Terriberry suggested using xor rather than add and subract to avoid
|
|
|
|
|
casting and the risk of overflows on June 2, 2009.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // we want to see if v is a power of 2
|
|
|
|
|
bool f; // the result goes here
|
|
|
|
|
|
|
|
|
|
f = (v & (v - 1)) == 0;
|
|
|
|
|
|
|
|
|
|
Note that 0 is incorrectly considered a power of 2 here. To remedy this,
|
|
|
|
|
use:
|
|
|
|
|
|
|
|
|
|
f = v && !(v & (v - 1));
|
|
|
|
|
|
|
|
|
|
Sign extension is automatic for built-in types, such as chars and ints.
|
|
|
|
|
But suppose you have a signed two's complement number, x, that is stored
|
|
|
|
|
using only b bits. Moreover, suppose you want to convert x to an int,
|
|
|
|
|
which has more than b bits. A simple copy will work if x is positive,
|
|
|
|
|
but if negative, the sign must be extended. For example, if we have only
|
|
|
|
|
4 bits to store a number, then -3 is represented as 1101 in binary. If
|
|
|
|
|
we have 8 bits, then -3 is 11111101. The most-significant bit of the
|
|
|
|
|
4-bit representation is replicated sinistrally to fill in the
|
|
|
|
|
destination when we convert to a representation with more bits; this is
|
|
|
|
|
sign extending. In C, sign extension from a constant bit-width is
|
|
|
|
|
trivial, since bit fields may be specified in structs or unions. For
|
|
|
|
|
example, to convert from 5 bits to an full integer:
|
|
|
|
|
|
|
|
|
|
int x; // convert this from using 5 bits to a full int
|
|
|
|
|
int r; // resulting sign extended number goes here
|
|
|
|
|
struct {signed int x:5;} s;
|
|
|
|
|
r = s.x = x;
|
|
|
|
|
|
|
|
|
|
The following is a C++ template function that uses the same language
|
|
|
|
|
feature to convert from B bits in one operation (though the compiler is
|
|
|
|
|
generating more, of course).
|
|
|
|
|
|
|
|
|
|
template <typename T, unsigned B>
|
|
|
|
|
inline T signextend(const T x)
|
|
|
|
|
{
|
|
|
|
|
struct {T x:B;} s;
|
|
|
|
|
return s.x = x;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
int r = signextend<signed int,5>(x); // sign extend 5 bit number x to r
|
|
|
|
|
|
|
|
|
|
John Byrd caught a typo in the code (attributed to html formatting) on
|
|
|
|
|
May 2, 2005. On March 4, 2006, Pat Wood pointed out that the ANSI C
|
|
|
|
|
standard requires that the bitfield have the keyword "signed" to be
|
|
|
|
|
signed; otherwise, the sign is undefined.
|
|
|
|
|
|
|
|
|
|
Sometimes we need to extend the sign of a number but we don't know a
|
|
|
|
|
priori the number of bits, b, in which it is represented. (Or we could
|
|
|
|
|
be programming in a language like Java, which lacks bitfields.)
|
|
|
|
|
|
|
|
|
|
unsigned b; // number of bits representing the number in x
|
|
|
|
|
int x; // sign extend this b-bit number to r
|
|
|
|
|
int r; // resulting sign-extended number
|
|
|
|
|
int const m = 1U << (b - 1); // mask can be pre-computed if b is fixed
|
|
|
|
|
|
|
|
|
|
x = x & ((1U << b) - 1); // (Skip this if bits in x above position b are already zero.)
|
|
|
|
|
r = (x ^ m) - m;
|
|
|
|
|
|
|
|
|
|
The code above requires four operations, but when the bitwidth is a
|
|
|
|
|
constant rather than variable, it requires only two fast operations,
|
|
|
|
|
assuming the upper bits are already zeroes.
|
|
|
|
|
|
|
|
|
|
A slightly faster but less portable method that doesn't depend on the
|
|
|
|
|
bits in x above position b being zero is:
|
|
|
|
|
|
|
|
|
|
int const m = CHAR_BIT * sizeof(x) - b;
|
|
|
|
|
r = (x << m) >> m;
|
|
|
|
|
|
|
|
|
|
Sean A. Irvine suggested that I add sign extension methods to this page
|
|
|
|
|
on June 13, 2004, and he provided `m = (1 << (b - 1)) - 1; r = -(x & ~m)
|
|
|
|
|
| x;` as a starting point from which I optimized to get m = 1U \<\< (b -
|
|
|
|
|
1); r = -(x & m) | x. But then on May 11, 2007, Shay Green suggested the
|
|
|
|
|
version above, which requires one less operation than mine. Vipin Sharma
|
|
|
|
|
suggested I add a step to deal with situations where x had possible ones
|
|
|
|
|
in bits other than the b bits we wanted to sign-extend on Oct. 15, 2008.
|
|
|
|
|
On December 31, 2009 Chris Pirazzi suggested I add the faster version,
|
|
|
|
|
which requires two operations for constant bit-widths and three for
|
|
|
|
|
variable widths.
|
|
|
|
|
|
|
|
|
|
The following may be slow on some machines, due to the effort required
|
|
|
|
|
for multiplication and division. This version is 4 operations. If you
|
|
|
|
|
know that your initial bit-width, b, is greater than 1, you might do
|
|
|
|
|
this type of sign extension in 3 operations by using r = (x \*
|
|
|
|
|
multipliers\[b\]) / multipliers\[b\], which requires only one array
|
|
|
|
|
lookup.
|
|
|
|
|
|
|
|
|
|
unsigned b; // number of bits representing the number in x
|
|
|
|
|
int x; // sign extend this b-bit number to r
|
|
|
|
|
int r; // resulting sign-extended number
|
|
|
|
|
#define M(B) (1U << ((sizeof(x) * CHAR_BIT) - B)) // CHAR_BIT=bits/byte
|
|
|
|
|
static int const multipliers[] =
|
|
|
|
|
{
|
|
|
|
|
0, M(1), M(2), M(3), M(4), M(5), M(6), M(7),
|
|
|
|
|
M(8), M(9), M(10), M(11), M(12), M(13), M(14), M(15),
|
|
|
|
|
M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
|
|
|
|
|
M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),
|
|
|
|
|
M(32)
|
|
|
|
|
}; // (add more if using more than 64 bits)
|
|
|
|
|
static int const divisors[] =
|
|
|
|
|
{
|
|
|
|
|
1, ~M(1), M(2), M(3), M(4), M(5), M(6), M(7),
|
|
|
|
|
M(8), M(9), M(10), M(11), M(12), M(13), M(14), M(15),
|
|
|
|
|
M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
|
|
|
|
|
M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),
|
|
|
|
|
M(32)
|
|
|
|
|
}; // (add more for 64 bits)
|
|
|
|
|
#undef M
|
|
|
|
|
r = (x * multipliers[b]) / divisors[b];
|
|
|
|
|
|
|
|
|
|
The following variation is not portable, but on architectures that
|
|
|
|
|
employ an arithmetic right-shift, maintaining the sign, it should be
|
|
|
|
|
fast.
|
|
|
|
|
|
|
|
|
|
const int s = -b; // OR: sizeof(x) * CHAR_BIT - b;
|
|
|
|
|
r = (x << s) >> s;
|
|
|
|
|
|
|
|
|
|
Randal E. Bryant pointed out a bug on May 3, 2005 in an earlier version
|
|
|
|
|
(that used multipliers\[\] for divisors\[\]), where it failed on the
|
|
|
|
|
case of x=1 and b=1.
|
|
|
|
|
|
|
|
|
|
bool f; // conditional flag
|
|
|
|
|
unsigned int m; // the bit mask
|
|
|
|
|
unsigned int w; // the word to modify: if (f) w |= m; else w &= ~m;
|
|
|
|
|
|
|
|
|
|
w ^= (-f ^ w) & m;
|
|
|
|
|
|
|
|
|
|
// OR, for superscalar CPUs:
|
|
|
|
|
w = (w & ~m) | (-f & m);
|
|
|
|
|
|
|
|
|
|
On some architectures, the lack of branching can more than make up for
|
|
|
|
|
what appears to be twice as many operations. For instance, informal
|
|
|
|
|
speed tests on an AMD Athlon™ XP 2100+ indicated it was 5-10% faster. An
|
|
|
|
|
Intel Core 2 Duo ran the superscalar version about 16% faster than the
|
|
|
|
|
first. Glenn Slayden informed me of the first expression on December 11,
|
|
|
|
|
2003. Marco Yu shared the superscalar version with me on April 3, 2007
|
|
|
|
|
and alerted me to a typo 2 days later.
|
|
|
|
|
|
|
|
|
|
If you need to negate only when a flag is false, then use the following
|
|
|
|
|
to avoid branching:
|
|
|
|
|
|
|
|
|
|
bool fDontNegate; // Flag indicating we should not negate v.
|
|
|
|
|
int v; // Input value to negate if fDontNegate is false.
|
|
|
|
|
int r; // result = fDontNegate ? v : -v;
|
|
|
|
|
|
|
|
|
|
r = (fDontNegate ^ (fDontNegate - 1)) * v;
|
|
|
|
|
|
|
|
|
|
If you need to negate only when a flag is true, then use this:
|
|
|
|
|
|
|
|
|
|
bool fNegate; // Flag indicating if we should negate v.
|
|
|
|
|
int v; // Input value to negate if fNegate is true.
|
|
|
|
|
int r; // result = fNegate ? -v : v;
|
|
|
|
|
|
|
|
|
|
r = (v ^ -fNegate) + fNegate;
|
|
|
|
|
|
|
|
|
|
Avraham Plotnitzky suggested I add the first version on June 2, 2009.
|
|
|
|
|
Motivated to avoid the multiply, I came up with the second version on
|
|
|
|
|
June 8, 2009. Alfonso De Gregorio pointed out that some parens were
|
|
|
|
|
missing on November 26, 2009, and received a bug bounty.
|
|
|
|
|
|
|
|
|
|
unsigned int a; // value to merge in non-masked bits
|
|
|
|
|
unsigned int b; // value to merge in masked bits
|
|
|
|
|
unsigned int mask; // 1 where bits from b should be selected; 0 where from a.
|
|
|
|
|
unsigned int r; // result of (a & ~mask) | (b & mask) goes here
|
|
|
|
|
|
|
|
|
|
r = a ^ ((a ^ b) & mask);
|
|
|
|
|
|
|
|
|
|
This shaves one operation from the obvious way of combining two sets of
|
|
|
|
|
bits according to a bit mask. If the mask is a constant, then there may
|
|
|
|
|
be no advantage.
|
|
|
|
|
|
|
|
|
|
Ron Jeffery sent this to me on February 9, 2006.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // count the number of bits set in v
|
|
|
|
|
unsigned int c; // c accumulates the total bits set in v
|
|
|
|
|
|
|
|
|
|
for (c = 0; v; v >>= 1)
|
|
|
|
|
{
|
|
|
|
|
c += v & 1;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The naive approach requires one iteration per bit, until no more bits
|
|
|
|
|
are set. So on a 32-bit word with only the high set, it will go through
|
|
|
|
|
32 iterations.
|
|
|
|
|
|
|
|
|
|
static const unsigned char BitsSetTable256[256] =
|
|
|
|
|
{
|
|
|
|
|
# define B2(n) n, n+1, n+1, n+2
|
|
|
|
|
# define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
|
|
|
|
|
# define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
|
|
|
|
|
B6(0), B6(1), B6(1), B6(2)
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
unsigned int v; // count the number of bits set in 32-bit value v
|
|
|
|
|
unsigned int c; // c is the total bits set in v
|
|
|
|
|
|
|
|
|
|
// Option 1:
|
|
|
|
|
c = BitsSetTable256[v & 0xff] +
|
|
|
|
|
BitsSetTable256[(v >> 8) & 0xff] +
|
|
|
|
|
BitsSetTable256[(v >> 16) & 0xff] +
|
|
|
|
|
BitsSetTable256[v >> 24];
|
|
|
|
|
|
|
|
|
|
// Option 2:
|
|
|
|
|
unsigned char * p = (unsigned char *) &v;
|
|
|
|
|
c = BitsSetTable256[p[0]] +
|
|
|
|
|
BitsSetTable256[p[1]] +
|
|
|
|
|
BitsSetTable256[p[2]] +
|
|
|
|
|
BitsSetTable256[p[3]];
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
// To initially generate the table algorithmically:
|
|
|
|
|
BitsSetTable256[0] = 0;
|
|
|
|
|
for (int i = 0; i < 256; i++)
|
|
|
|
|
{
|
|
|
|
|
BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
On July 14, 2009 Hallvard Furuseth suggested the macro compacted table.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // count the number of bits set in v
|
|
|
|
|
unsigned int c; // c accumulates the total bits set in v
|
|
|
|
|
for (c = 0; v; c++)
|
|
|
|
|
{
|
|
|
|
|
v &= v - 1; // clear the least significant bit set
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
Brian Kernighan's method goes through as many iterations as there are
|
|
|
|
|
set bits. So if we have a 32-bit word with only the high bit set, then
|
|
|
|
|
it will only go once through the loop.
|
|
|
|
|
|
|
|
|
|
Published in 1988, the C Programming Language 2nd Ed. (by Brian W.
|
|
|
|
|
Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April
|
|
|
|
|
19, 2006 Don Knuth pointed out to me that this method "was first
|
|
|
|
|
published by Peter Wegner in CACM 3 (1960), 322. (Also discovered
|
|
|
|
|
independently by Derrick Lehmer and published in 1964 in a book edited
|
|
|
|
|
by Beckenbach.)"
|
|
|
|
|
|
|
|
|
|
unsigned int v; // count the number of bits set in v
|
|
|
|
|
unsigned int c; // c accumulates the total bits set in v
|
|
|
|
|
|
|
|
|
|
// option 1, for at most 14-bit values in v:
|
|
|
|
|
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
|
|
|
|
|
|
|
|
|
|
// option 2, for at most 24-bit values in v:
|
|
|
|
|
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
|
|
|
|
|
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
|
|
|
|
|
% 0x1f;
|
|
|
|
|
|
|
|
|
|
// option 3, for at most 32-bit values in v:
|
|
|
|
|
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
|
|
|
|
|
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) %
|
|
|
|
|
0x1f;
|
|
|
|
|
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
|
|
|
|
|
|
|
|
|
|
This method requires a 64-bit CPU with fast modulus division to be
|
|
|
|
|
efficient. The first option takes only 3 operations; the second option
|
|
|
|
|
takes 10; and the third option takes 15.
|
|
|
|
|
|
|
|
|
|
Rich Schroeppel originally created a 9-bit version, similiar to option
|
|
|
|
|
1; see the Programming Hacks section of [Beeler, M., Gosper, R. W., and
|
|
|
|
|
Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29,
|
|
|
|
|
1972.](http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html) His method
|
|
|
|
|
was the inspiration for the variants above, devised by Sean Anderson.
|
|
|
|
|
Randal E. Bryant offered a couple bug fixes on May 3, 2005. Bruce Dawson
|
|
|
|
|
tweaked what had been a 12-bit version and made it suitable for 14 bits
|
|
|
|
|
using the same number of operations on Feburary 1, 2007.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // count bits set in this (32-bit value)
|
|
|
|
|
unsigned int c; // store the total here
|
|
|
|
|
static const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers
|
|
|
|
|
static const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
|
|
|
|
|
|
|
|
|
|
c = v - ((v >> 1) & B[0]);
|
|
|
|
|
c = ((c >> S[1]) & B[1]) + (c & B[1]);
|
|
|
|
|
c = ((c >> S[2]) + c) & B[2];
|
|
|
|
|
c = ((c >> S[3]) + c) & B[3];
|
|
|
|
|
c = ((c >> S[4]) + c) & B[4];
|
|
|
|
|
|
|
|
|
|
The B array, expressed as binary, is:
|
|
|
|
|
|
|
|
|
|
B[0] = 0x55555555 = 01010101 01010101 01010101 01010101
|
|
|
|
|
B[1] = 0x33333333 = 00110011 00110011 00110011 00110011
|
|
|
|
|
B[2] = 0x0F0F0F0F = 00001111 00001111 00001111 00001111
|
|
|
|
|
B[3] = 0x00FF00FF = 00000000 11111111 00000000 11111111
|
|
|
|
|
B[4] = 0x0000FFFF = 00000000 00000000 11111111 11111111
|
|
|
|
|
|
|
|
|
|
We can adjust the method for larger integer sizes by continuing with the
|
|
|
|
|
patterns for the Binary Magic Numbers, B and S. If there are k bits,
|
|
|
|
|
then we need the arrays S and B to be ceil(lg(k)) elements long, and we
|
|
|
|
|
must compute the same number of expressions for c as S or B are long.
|
|
|
|
|
For a 32-bit v, 16 operations are used.
|
|
|
|
|
|
|
|
|
|
The best method for counting bits in a 32-bit integer v is the
|
|
|
|
|
following:
|
|
|
|
|
|
|
|
|
|
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
|
|
|
|
|
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
|
|
|
|
|
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
|
|
|
|
|
|
|
|
|
|
The best bit counting method takes only 12 operations, which is the same
|
|
|
|
|
as the lookup-table method, but avoids the memory and potential cache
|
|
|
|
|
misses of a table. It is a hybrid between the purely parallel method
|
|
|
|
|
above and the earlier methods using multiplies (in the section on
|
|
|
|
|
counting bits with 64-bit instructions), though it doesn't use 64-bit
|
|
|
|
|
instructions. The counts of bits set in the bytes is done in parallel,
|
|
|
|
|
and the sum total of the bits set in the bytes is computed by
|
|
|
|
|
multiplying by 0x1010101 and shifting right 24 bits.
|
|
|
|
|
|
|
|
|
|
A generalization of the best bit counting method to integers of
|
|
|
|
|
bit-widths upto 128 (parameterized by type T) is this:
|
|
|
|
|
|
|
|
|
|
v = v - ((v >> 1) & (T)~(T)0/3); // temp
|
|
|
|
|
v = (v & (T)~(T)0/15*3) + ((v >> 2) & (T)~(T)0/15*3); // temp
|
|
|
|
|
v = (v + (v >> 4)) & (T)~(T)0/255*15; // temp
|
|
|
|
|
c = (T)(v * ((T)~(T)0/255)) >> (sizeof(T) - 1) * CHAR_BIT; // count
|
|
|
|
|
|
|
|
|
|
See [Ian Ashdown's nice newsgroup
|
|
|
|
|
post](http://groups.google.com/groups?q=reverse+bits&num=100&hl=en&group=comp.graphics.algorithms&imgsafe=off&safe=off&rnum=2&ic=1&selm=4fulhm%248dn%40atlas.uniserve.com)
|
|
|
|
|
for more information on counting the number of bits set (also known as
|
|
|
|
|
sideways addition). The best bit counting method was brought to my
|
|
|
|
|
attention on October 5, 2005 by [Andrew Shapira](http://onezero.org); he
|
|
|
|
|
found it in pages 187-188 of [Software Optimization Guide for AMD
|
|
|
|
|
Athlon™ 64 and Opteron™
|
|
|
|
|
Processors](http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF).
|
|
|
|
|
Charlie Gordon suggested a way to shave off one operation from the
|
|
|
|
|
purely parallel version on December 14, 2005, and Don Clugston trimmed
|
|
|
|
|
three more from it on December 30, 2005. I made a typo with Don's
|
|
|
|
|
suggestion that Eric Cole spotted on January 8, 2006. Eric later
|
|
|
|
|
suggested the arbitrary bit-width generalization to the best method on
|
|
|
|
|
November 17, 2006. On April 5, 2007, Al Williams observed that I had a
|
|
|
|
|
line of dead code at the top of the first method.
|
|
|
|
|
|
|
|
|
|
The following finds the the rank of a bit, meaning it returns the sum of
|
|
|
|
|
bits that are set to 1 from the most-signficant bit downto the bit at
|
|
|
|
|
the given
|
|
|
|
|
position.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
uint64_t v; // Compute the rank (bits set) in v from the MSB to pos.
|
|
|
|
|
unsigned int pos; // Bit position to count bits upto.
|
|
|
|
|
uint64_t r; // Resulting rank of bit at pos goes here.
|
|
|
|
|
|
|
|
|
|
// Shift out bits after given position.
|
|
|
|
|
r = v >> (sizeof(v) * CHAR_BIT - pos);
|
|
|
|
|
// Count set bits in parallel.
|
|
|
|
|
// r = (r & 0x5555...) + ((r >> 1) & 0x5555...);
|
|
|
|
|
r = r - ((r >> 1) & ~0UL/3);
|
|
|
|
|
// r = (r & 0x3333...) + ((r >> 2) & 0x3333...);
|
|
|
|
|
r = (r & ~0UL/5) + ((r >> 2) & ~0UL/5);
|
|
|
|
|
// r = (r & 0x0f0f...) + ((r >> 4) & 0x0f0f...);
|
|
|
|
|
r = (r + (r >> 4)) & ~0UL/17;
|
|
|
|
|
// r = r % 255;
|
|
|
|
|
r = (r * (~0UL/255)) >> ((sizeof(v) - 1) * CHAR_BIT);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Juha Järvi sent this to me on November 21, 2009 as an inverse operation
|
|
|
|
|
to the computing the bit position with the given rank, which follows.
|
|
|
|
|
|
|
|
|
|
The following 64-bit code selects the position of the rth 1 bit when
|
|
|
|
|
counting from the left. In other words if we start at the most
|
|
|
|
|
significant bit and proceed to the right, counting the number of bits
|
|
|
|
|
set to 1 until we reach the desired rank, r, then the position where we
|
|
|
|
|
stop is returned. If the rank requested exceeds the count of bits set,
|
|
|
|
|
then 64 is returned. The code may be modified for 32-bit or counting
|
|
|
|
|
from the right.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
uint64_t v; // Input value to find position with rank r.
|
|
|
|
|
unsigned int r; // Input: bit's desired rank [1-64].
|
|
|
|
|
unsigned int s; // Output: Resulting position of bit with rank r [1-64]
|
|
|
|
|
uint64_t a, b, c, d; // Intermediate temporaries for bit count.
|
|
|
|
|
unsigned int t; // Bit count temporary.
|
|
|
|
|
|
|
|
|
|
// Do a normal parallel bit count for a 64-bit integer,
|
|
|
|
|
// but store all intermediate steps.
|
|
|
|
|
// a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
|
|
|
|
|
a = v - ((v >> 1) & ~0UL/3);
|
|
|
|
|
// b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
|
|
|
|
|
b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
|
|
|
|
|
// c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
|
|
|
|
|
c = (b + (b >> 4)) & ~0UL/0x11;
|
|
|
|
|
// d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
|
|
|
|
|
d = (c + (c >> 8)) & ~0UL/0x101;
|
|
|
|
|
t = (d >> 32) + (d >> 48);
|
|
|
|
|
// Now do branchless select!
|
|
|
|
|
s = 64;
|
|
|
|
|
// if (r > t) {s -= 32; r -= t;}
|
|
|
|
|
s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
|
|
|
|
|
t = (d >> (s - 16)) & 0xff;
|
|
|
|
|
// if (r > t) {s -= 16; r -= t;}
|
|
|
|
|
s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
|
|
|
|
|
t = (c >> (s - 8)) & 0xf;
|
|
|
|
|
// if (r > t) {s -= 8; r -= t;}
|
|
|
|
|
s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
|
|
|
|
|
t = (b >> (s - 4)) & 0x7;
|
|
|
|
|
// if (r > t) {s -= 4; r -= t;}
|
|
|
|
|
s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
|
|
|
|
|
t = (a >> (s - 2)) & 0x3;
|
|
|
|
|
// if (r > t) {s -= 2; r -= t;}
|
|
|
|
|
s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
|
|
|
|
|
t = (v >> (s - 1)) & 0x1;
|
|
|
|
|
// if (r > t) s--;
|
|
|
|
|
s -= ((t - r) & 256) >> 8;
|
|
|
|
|
s = 65 - s;
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
If branching is fast on your target CPU, consider uncommenting the
|
|
|
|
|
if-statements and commenting the lines that follow them.
|
|
|
|
|
|
|
|
|
|
Juha Järvi sent this to me on November 21, 2009.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // word value to compute the parity of
|
|
|
|
|
bool parity = false; // parity will be the parity of v
|
|
|
|
|
|
|
|
|
|
while (v)
|
|
|
|
|
{
|
|
|
|
|
parity = !parity;
|
|
|
|
|
v = v & (v - 1);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The above code uses an approach like Brian Kernigan's bit counting,
|
|
|
|
|
above. The time it takes is proportional to the number of bits set.
|
|
|
|
|
|
|
|
|
|
static const bool ParityTable256[256] =
|
|
|
|
|
{
|
|
|
|
|
# define P2(n) n, n^1, n^1, n
|
|
|
|
|
# define P4(n) P2(n), P2(n^1), P2(n^1), P2(n)
|
|
|
|
|
# define P6(n) P4(n), P4(n^1), P4(n^1), P4(n)
|
|
|
|
|
P6(0), P6(1), P6(1), P6(0)
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
unsigned char b; // byte value to compute the parity of
|
|
|
|
|
bool parity = ParityTable256[b];
|
|
|
|
|
|
|
|
|
|
// OR, for 32-bit words:
|
|
|
|
|
unsigned int v;
|
|
|
|
|
v ^= v >> 16;
|
|
|
|
|
v ^= v >> 8;
|
|
|
|
|
bool parity = ParityTable256[v & 0xff];
|
|
|
|
|
|
|
|
|
|
// Variation:
|
|
|
|
|
unsigned char * p = (unsigned char *) &v;
|
|
|
|
|
parity = ParityTable256[p[0] ^ p[1] ^ p[2] ^ p[3]];
|
|
|
|
|
|
|
|
|
|
Randal E. Bryant encouraged the addition of the (admittedly) obvious
|
|
|
|
|
last variation with variable p on May 3, 2005. Bruce Rawles found a typo
|
|
|
|
|
in an instance of the table variable's name on September 27, 2005, and
|
|
|
|
|
he received a $10 bug bounty. On October 9, 2006, Fabrice Bellard
|
|
|
|
|
suggested the 32-bit variations above, which require only one table
|
|
|
|
|
lookup; the previous version had four lookups (one per byte) and were
|
|
|
|
|
slower. On July 14, 2009 Hallvard Furuseth suggested the macro compacted
|
|
|
|
|
table.
|
|
|
|
|
|
|
|
|
|
unsigned char b; // byte value to compute the parity of
|
|
|
|
|
bool parity =
|
|
|
|
|
(((b * 0x0101010101010101ULL) & 0x8040201008040201ULL) % 0x1FF) & 1;
|
|
|
|
|
|
|
|
|
|
The method above takes around 4 operations, but only works on bytes.
|
|
|
|
|
|
|
|
|
|
The following method computes the parity of the 32-bit value in only 8
|
|
|
|
|
operations using a multiply.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
unsigned int v; // 32-bit word
|
|
|
|
|
v ^= v >> 1;
|
|
|
|
|
v ^= v >> 2;
|
|
|
|
|
v = (v & 0x11111111U) * 0x11111111U;
|
|
|
|
|
return (v >> 28) & 1;
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Also for 64-bits, 8 operations are still enough.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
unsigned long long v; // 64-bit word
|
|
|
|
|
v ^= v >> 1;
|
|
|
|
|
v ^= v >> 2;
|
|
|
|
|
v = (v & 0x1111111111111111UL) * 0x1111111111111111UL;
|
|
|
|
|
return (v >> 60) & 1;
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Andrew Shapira came up with this and sent it to me on Sept. 2, 2007.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // word value to compute the parity of
|
|
|
|
|
v ^= v >> 16;
|
|
|
|
|
v ^= v >> 8;
|
|
|
|
|
v ^= v >> 4;
|
|
|
|
|
v &= 0xf;
|
|
|
|
|
return (0x6996 >> v) & 1;
|
|
|
|
|
|
|
|
|
|
The method above takes around 9 operations, and works for 32-bit words.
|
|
|
|
|
It may be optimized to work just on bytes in 5 operations by removing
|
|
|
|
|
the two lines immediately following "unsigned int v;". The method first
|
|
|
|
|
shifts and XORs the eight nibbles of the 32-bit value together, leaving
|
|
|
|
|
the result in the lowest nibble of v. Next, the binary number 0110 1001
|
|
|
|
|
1001 0110 (0x6996 in hex) is shifted to the right by the value
|
|
|
|
|
represented in the lowest nibble of v. This number is like a miniature
|
|
|
|
|
16-bit parity-table indexed by the low four bits in v. The result has
|
|
|
|
|
the parity of v in bit 1, which is masked and returned.
|
|
|
|
|
|
|
|
|
|
Thanks to Mathew Hendry for pointing out the shift-lookup idea at the
|
|
|
|
|
end on Dec. 15, 2002. That optimization shaves two operations off using
|
|
|
|
|
only shifting and XORing to find the parity.
|
|
|
|
|
|
|
|
|
|
#define SWAP(a, b) ((&(a) == &(b)) || \
|
|
|
|
|
(((a) -= (b)), ((b) += (a)), ((a) = (b) - (a))))
|
|
|
|
|
|
|
|
|
|
This swaps the values of a and b without using a temporary variable. The
|
|
|
|
|
initial check for a and b being the same location in memory may be
|
|
|
|
|
omitted when you know this can't happen. (The compiler may omit it
|
|
|
|
|
anyway as an optimization.) If you enable overflows exceptions, then
|
|
|
|
|
pass unsigned values so an exception isn't thrown. The XOR method that
|
|
|
|
|
follows may be slightly faster on some machines. Don't use this with
|
|
|
|
|
floating-point numbers (unless you operate on their raw integer
|
|
|
|
|
representations).
|
|
|
|
|
|
|
|
|
|
Sanjeev Sivasankaran suggested I add this on June 12, 2007. Vincent
|
|
|
|
|
Lefèvre pointed out the potential for overflow exceptions on July 9,
|
|
|
|
|
2008
|
|
|
|
|
|
|
|
|
|
#define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))
|
|
|
|
|
|
|
|
|
|
This is an old trick to exchange the values of the variables a and b
|
|
|
|
|
without using extra space for a temporary variable.
|
|
|
|
|
|
|
|
|
|
On January 20, 2005, Iain A. Fleming pointed out that the macro above
|
|
|
|
|
doesn't work when you swap with the same memory location, such as
|
|
|
|
|
SWAP(a\[i\], a\[j\]) with i == j. So if that may occur, consider
|
|
|
|
|
defining the macro as (((a) == (b)) || (((a) ^= (b)), ((b) ^= (a)), ((a)
|
|
|
|
|
^= (b)))). On July 14, 2009, Hallvard Furuseth suggested that on some
|
|
|
|
|
machines, (((a) ^ (b)) && ((b) ^= (a) ^= (b), (a) ^= (b))) might be
|
|
|
|
|
faster, since the (a) ^ (b) expression is reused.
|
|
|
|
|
|
|
|
|
|
unsigned int i, j; // positions of bit sequences to swap
|
|
|
|
|
unsigned int n; // number of consecutive bits in each sequence
|
|
|
|
|
unsigned int b; // bits to swap reside in b
|
|
|
|
|
unsigned int r; // bit-swapped result goes here
|
|
|
|
|
|
|
|
|
|
unsigned int x = ((b >> i) ^ (b >> j)) & ((1U << n) - 1); // XOR temporary
|
|
|
|
|
r = b ^ ((x << i) | (x << j));
|
|
|
|
|
|
|
|
|
|
As an example of swapping ranges of bits suppose we have have b =
|
|
|
|
|
**001**0**111**1 (expressed in binary) and we want to swap the n = 3
|
|
|
|
|
consecutive bits starting at i = 1 (the second bit from the right) with
|
|
|
|
|
the 3 consecutive bits starting at j = 5; the result would be r =
|
|
|
|
|
**111**0**001**1 (binary).
|
|
|
|
|
|
|
|
|
|
This method of swapping is similar to the general purpose XOR swap
|
|
|
|
|
trick, but intended for operating on individual bits. The variable x
|
|
|
|
|
stores the result of XORing the pairs of bit values we want to swap, and
|
|
|
|
|
then the bits are set to the result of themselves XORed with x. Of
|
|
|
|
|
course, the result is undefined if the sequences overlap.
|
|
|
|
|
|
|
|
|
|
On July 14, 2009 Hallvard Furuseth suggested that I change the 1 \<\< n
|
|
|
|
|
to 1U \<\< n because the value was being assigned to an unsigned and to
|
|
|
|
|
avoid shifting into a sign bit.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // input bits to be reversed
|
|
|
|
|
unsigned int r = v; // r will be reversed bits of v; first get LSB of v
|
|
|
|
|
int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end
|
|
|
|
|
|
|
|
|
|
for (v >>= 1; v; v >>= 1)
|
|
|
|
|
{
|
|
|
|
|
r <<= 1;
|
|
|
|
|
r |= v & 1;
|
|
|
|
|
s--;
|
|
|
|
|
}
|
|
|
|
|
r <<= s; // shift when v's highest bits are zero
|
|
|
|
|
|
|
|
|
|
On October 15, 2004, Michael Hoisie pointed out a bug in the original
|
|
|
|
|
version. Randal E. Bryant suggested removing an extra operation on May
|
|
|
|
|
3, 2005. Behdad Esfabod suggested a slight change that eliminated one
|
|
|
|
|
iteration of the loop on May 18, 2005. Then, on February 6, 2007, Liyong
|
|
|
|
|
Zhou suggested a better version that loops while v is not 0, so rather
|
|
|
|
|
than iterating over all bits it stops early.
|
|
|
|
|
|
|
|
|
|
static const unsigned char BitReverseTable256[256] =
|
|
|
|
|
{
|
|
|
|
|
# define R2(n) n, n + 2*64, n + 1*64, n + 3*64
|
|
|
|
|
# define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)
|
|
|
|
|
# define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )
|
|
|
|
|
R6(0), R6(2), R6(1), R6(3)
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
unsigned int v; // reverse 32-bit value, 8 bits at time
|
|
|
|
|
unsigned int c; // c will get v reversed
|
|
|
|
|
|
|
|
|
|
// Option 1:
|
|
|
|
|
c = (BitReverseTable256[v & 0xff] << 24) |
|
|
|
|
|
(BitReverseTable256[(v >> 8) & 0xff] << 16) |
|
|
|
|
|
(BitReverseTable256[(v >> 16) & 0xff] << 8) |
|
|
|
|
|
(BitReverseTable256[(v >> 24) & 0xff]);
|
|
|
|
|
|
|
|
|
|
// Option 2:
|
|
|
|
|
unsigned char * p = (unsigned char *) &v;
|
|
|
|
|
unsigned char * q = (unsigned char *) &c;
|
|
|
|
|
q[3] = BitReverseTable256[p[0]];
|
|
|
|
|
q[2] = BitReverseTable256[p[1]];
|
|
|
|
|
q[1] = BitReverseTable256[p[2]];
|
|
|
|
|
q[0] = BitReverseTable256[p[3]];
|
|
|
|
|
|
|
|
|
|
The first method takes about 17 operations, and the second takes about
|
|
|
|
|
12, assuming your CPU can load and store bytes easily.
|
|
|
|
|
|
|
|
|
|
On July 14, 2009 Hallvard Furuseth suggested the macro compacted table.
|
|
|
|
|
|
|
|
|
|
unsigned char b; // reverse this (8-bit) byte
|
|
|
|
|
|
|
|
|
|
b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
|
|
|
|
|
|
|
|
|
|
The multiply operation creates five separate copies of the 8-bit byte
|
|
|
|
|
pattern to fan-out into a 64-bit value. The AND operation selects the
|
|
|
|
|
bits that are in the correct (reversed) positions, relative to each
|
|
|
|
|
10-bit groups of bits. The multiply and the AND operations copy the bits
|
|
|
|
|
from the original byte so they each appear in only one of the 10-bit
|
|
|
|
|
sets. The reversed positions of the bits from the original byte coincide
|
|
|
|
|
with their relative positions within any 10-bit set. The last step,
|
|
|
|
|
which involves modulus division by 2^10 - 1, has the effect of merging
|
|
|
|
|
together each set of 10 bits (from positions 0-9, 10-19, 20-29, ...) in
|
|
|
|
|
the 64-bit value. They do not overlap, so the addition steps underlying
|
|
|
|
|
the modulus division behave like or operations.
|
|
|
|
|
|
|
|
|
|
This method was attributed to Rich Schroeppel in the Programming Hacks
|
|
|
|
|
section of [Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI
|
|
|
|
|
Memo 239, Feb. 29,
|
|
|
|
|
1972.](http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html)
|
|
|
|
|
|
|
|
|
|
unsigned char b; // reverse this byte
|
|
|
|
|
|
|
|
|
|
b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
|
|
|
|
|
|
|
|
|
|
The following shows the flow of the bit values with the boolean
|
|
|
|
|
variables `a, b, c, d, e, f, g,` and `h`, which comprise an 8-bit byte.
|
|
|
|
|
Notice how the first multiply fans out the bit pattern to multiple
|
|
|
|
|
copies, while the last multiply combines them in the fifth byte from the
|
|
|
|
|
right.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
abcd efgh (-> hgfe dcba)
|
|
|
|
|
* 1000 0000 0010 0000 0000 1000 0000 0010 (0x80200802)
|
|
|
|
|
-------------------------------------------------------------------------------------------------
|
|
|
|
|
0abc defg h00a bcde fgh0 0abc defg h00a bcde fgh0
|
|
|
|
|
& 0000 1000 1000 0100 0100 0010 0010 0001 0001 0000 (0x0884422110)
|
|
|
|
|
-------------------------------------------------------------------------------------------------
|
|
|
|
|
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
|
|
|
|
|
* 0000 0001 0000 0001 0000 0001 0000 0001 0000 0001 (0x0101010101)
|
|
|
|
|
-------------------------------------------------------------------------------------------------
|
|
|
|
|
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
|
|
|
|
|
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
|
|
|
|
|
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
|
|
|
|
|
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
|
|
|
|
|
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
|
|
|
|
|
-------------------------------------------------------------------------------------------------
|
|
|
|
|
0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba hgfe 0cba 0gfe 00ba 00fe 000a 000e 0000
|
|
|
|
|
>> 32
|
|
|
|
|
-------------------------------------------------------------------------------------------------
|
|
|
|
|
0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba
|
|
|
|
|
& 1111 1111
|
|
|
|
|
-------------------------------------------------------------------------------------------------
|
|
|
|
|
hgfe dcba
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Note that the last two steps can be combined on some processors because
|
|
|
|
|
the registers can be accessed as bytes; just multiply so that a register
|
|
|
|
|
stores the upper 32 bits of the result and the take the low byte. Thus,
|
|
|
|
|
it may take only 6 operations.
|
|
|
|
|
|
|
|
|
|
Devised by Sean Anderson, July 13,
|
|
|
|
|
2001.
|
|
|
|
|
|
|
|
|
|
b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;
|
|
|
|
|
|
|
|
|
|
Make sure you assign or cast the result to an unsigned char to remove
|
|
|
|
|
garbage in the higher bits. Devised by Sean Anderson, July 13, 2001.
|
|
|
|
|
Typo spotted and correction supplied by Mike Keith, January 3, 2002.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit word to reverse bit order
|
|
|
|
|
|
|
|
|
|
// swap odd and even bits
|
|
|
|
|
v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1);
|
|
|
|
|
// swap consecutive pairs
|
|
|
|
|
v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2);
|
|
|
|
|
// swap nibbles ...
|
|
|
|
|
v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4);
|
|
|
|
|
// swap bytes
|
|
|
|
|
v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8);
|
|
|
|
|
// swap 2-byte long pairs
|
|
|
|
|
v = ( v >> 16 ) | ( v << 16);
|
|
|
|
|
|
|
|
|
|
The following variation is also O(lg(N)), however it requires more
|
|
|
|
|
operations to reverse v. Its virtue is in taking less slightly memory by
|
|
|
|
|
computing the constants on the
|
|
|
|
|
fly.
|
|
|
|
|
|
|
|
|
|
unsigned int s = sizeof(v) * CHAR_BIT; // bit size; must be power of 2
|
|
|
|
|
unsigned int mask = ~0;
|
|
|
|
|
while ((s >>= 1) > 0)
|
|
|
|
|
{
|
|
|
|
|
mask ^= (mask << s);
|
|
|
|
|
v = ((v >> s) & mask) | ((v << s) & ~mask);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
These methods above are best suited to situations where N is large. If
|
|
|
|
|
you use the above with 64-bit ints (or larger), then you need to add
|
|
|
|
|
more lines (following the pattern); otherwise only the lower 32 bits
|
|
|
|
|
will be reversed and the result will be in the lower 32 bits.
|
|
|
|
|
|
|
|
|
|
See Dr. Dobb's Journal 1983, Edwin Freed's article on Binary Magic
|
|
|
|
|
Numbers for more information. The second variation was suggested by Ken
|
|
|
|
|
Raeburn on September 13, 2005. Veldmeijer mentioned that the first
|
|
|
|
|
version could do without ANDS in the last line on March 19, 2006.
|
|
|
|
|
|
|
|
|
|
const unsigned int n; // numerator
|
|
|
|
|
const unsigned int s;
|
|
|
|
|
const unsigned int d = 1U << s; // So d will be one of: 1, 2, 4, 8, 16, 32, ...
|
|
|
|
|
unsigned int m; // m will be n % d
|
|
|
|
|
m = n & (d - 1);
|
|
|
|
|
|
|
|
|
|
Most programmers learn this trick early, but it was included for the
|
|
|
|
|
sake of completeness.
|
|
|
|
|
|
|
|
|
|
unsigned int n; // numerator
|
|
|
|
|
const unsigned int s; // s > 0
|
|
|
|
|
const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).
|
|
|
|
|
unsigned int m; // n % d goes here.
|
|
|
|
|
|
|
|
|
|
for (m = n; n > d; n = m)
|
|
|
|
|
{
|
|
|
|
|
for (m = 0; n; n >>= s)
|
|
|
|
|
{
|
|
|
|
|
m += n & d;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
// Now m is a value from 0 to d, but since with modulus division
|
|
|
|
|
// we want m to be 0 when it is d.
|
|
|
|
|
m = m == d ? 0 : m;
|
|
|
|
|
|
|
|
|
|
This method of modulus division by an integer that is one less than a
|
|
|
|
|
power of 2 takes at most 5 + (4 + 5 \* ceil(N / s)) \* ceil(lg(N / s))
|
|
|
|
|
operations, where N is the number of bits in the numerator. In other
|
|
|
|
|
words, it takes at most O(N \* lg(N)) time.
|
|
|
|
|
|
|
|
|
|
Devised by Sean Anderson, August 15, 2001. Before Sean A. Irvine
|
|
|
|
|
corrected me on June 17, 2004, I mistakenly commented that we could
|
|
|
|
|
alternatively assign `m = ((m + 1) & d) - 1;` at the end. Michael Miller
|
|
|
|
|
spotted a typo in the code April 25, 2005.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
// The following is for a word size of 32 bits!
|
|
|
|
|
|
|
|
|
|
static const unsigned int M[] =
|
|
|
|
|
{
|
|
|
|
|
0x00000000, 0x55555555, 0x33333333, 0xc71c71c7,
|
|
|
|
|
0x0f0f0f0f, 0xc1f07c1f, 0x3f03f03f, 0xf01fc07f,
|
|
|
|
|
0x00ff00ff, 0x07fc01ff, 0x3ff003ff, 0xffc007ff,
|
|
|
|
|
0xff000fff, 0xfc001fff, 0xf0003fff, 0xc0007fff,
|
|
|
|
|
0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff,
|
|
|
|
|
0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff,
|
|
|
|
|
0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff,
|
|
|
|
|
0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
static const unsigned int Q[][6] =
|
|
|
|
|
{
|
|
|
|
|
{ 0, 0, 0, 0, 0, 0}, {16, 8, 4, 2, 1, 1}, {16, 8, 4, 2, 2, 2},
|
|
|
|
|
{15, 6, 3, 3, 3, 3}, {16, 8, 4, 4, 4, 4}, {15, 5, 5, 5, 5, 5},
|
|
|
|
|
{12, 6, 6, 6 , 6, 6}, {14, 7, 7, 7, 7, 7}, {16, 8, 8, 8, 8, 8},
|
|
|
|
|
{ 9, 9, 9, 9, 9, 9}, {10, 10, 10, 10, 10, 10}, {11, 11, 11, 11, 11, 11},
|
|
|
|
|
{12, 12, 12, 12, 12, 12}, {13, 13, 13, 13, 13, 13}, {14, 14, 14, 14, 14, 14},
|
|
|
|
|
{15, 15, 15, 15, 15, 15}, {16, 16, 16, 16, 16, 16}, {17, 17, 17, 17, 17, 17},
|
|
|
|
|
{18, 18, 18, 18, 18, 18}, {19, 19, 19, 19, 19, 19}, {20, 20, 20, 20, 20, 20},
|
|
|
|
|
{21, 21, 21, 21, 21, 21}, {22, 22, 22, 22, 22, 22}, {23, 23, 23, 23, 23, 23},
|
|
|
|
|
{24, 24, 24, 24, 24, 24}, {25, 25, 25, 25, 25, 25}, {26, 26, 26, 26, 26, 26},
|
|
|
|
|
{27, 27, 27, 27, 27, 27}, {28, 28, 28, 28, 28, 28}, {29, 29, 29, 29, 29, 29},
|
|
|
|
|
{30, 30, 30, 30, 30, 30}, {31, 31, 31, 31, 31, 31}
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
static const unsigned int R[][6] =
|
|
|
|
|
{
|
|
|
|
|
{0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000},
|
|
|
|
|
{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000001, 0x00000001},
|
|
|
|
|
{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000003, 0x00000003},
|
|
|
|
|
{0x00007fff, 0x0000003f, 0x00000007, 0x00000007, 0x00000007, 0x00000007},
|
|
|
|
|
{0x0000ffff, 0x000000ff, 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f},
|
|
|
|
|
{0x00007fff, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f},
|
|
|
|
|
{0x00000fff, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f},
|
|
|
|
|
{0x00003fff, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f},
|
|
|
|
|
{0x0000ffff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff},
|
|
|
|
|
{0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff},
|
|
|
|
|
{0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff},
|
|
|
|
|
{0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff},
|
|
|
|
|
{0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff},
|
|
|
|
|
{0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff},
|
|
|
|
|
{0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff},
|
|
|
|
|
{0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff},
|
|
|
|
|
{0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff},
|
|
|
|
|
{0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff},
|
|
|
|
|
{0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff},
|
|
|
|
|
{0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff},
|
|
|
|
|
{0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff},
|
|
|
|
|
{0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff},
|
|
|
|
|
{0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff},
|
|
|
|
|
{0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff},
|
|
|
|
|
{0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff},
|
|
|
|
|
{0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff},
|
|
|
|
|
{0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff},
|
|
|
|
|
{0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff},
|
|
|
|
|
{0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff},
|
|
|
|
|
{0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff},
|
|
|
|
|
{0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff},
|
|
|
|
|
{0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff}
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
unsigned int n; // numerator
|
|
|
|
|
const unsigned int s; // s > 0
|
|
|
|
|
const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).
|
|
|
|
|
unsigned int m; // n % d goes here.
|
|
|
|
|
|
|
|
|
|
m = (n & M[s]) + ((n >> s) & M[s]);
|
|
|
|
|
|
|
|
|
|
for (const unsigned int * q = &Q[s][0], * r = &R[s][0]; m > d; q++, r++)
|
|
|
|
|
{
|
|
|
|
|
m = (m >> *q) + (m & *r);
|
|
|
|
|
}
|
|
|
|
|
m = m == d ? 0 : m; // OR, less portably: m = m & -((signed)(m - d) >> s);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
This method of finding modulus division by an integer that is one less
|
|
|
|
|
than a power of 2 takes at most O(lg(N)) time, where N is the number of
|
|
|
|
|
bits in the numerator (32 bits, for the code above). The number of
|
|
|
|
|
operations is at most 12 + 9 \* ceil(lg(N)). The tables may be removed
|
|
|
|
|
if you know the denominator at compile time; just extract the few
|
|
|
|
|
relevent entries and unroll the loop. It may be easily extended to more
|
|
|
|
|
bits.
|
|
|
|
|
|
|
|
|
|
It finds the result by summing the values in base (1 \<\< s) in
|
|
|
|
|
parallel. First every other base (1 \<\< s) value is added to the
|
|
|
|
|
previous one. Imagine that the result is written on a piece of paper.
|
|
|
|
|
Cut the paper in half, so that half the values are on each cut piece.
|
|
|
|
|
Align the values and sum them onto a new piece of paper. Repeat by
|
|
|
|
|
cutting this paper in half (which will be a quarter of the size of the
|
|
|
|
|
previous one) and summing, until you cannot cut further. After
|
|
|
|
|
performing lg(N/s/2) cuts, we cut no more; just continue to add the
|
|
|
|
|
values and put the result onto a new piece of paper as before, while
|
|
|
|
|
there are at least two s-bit values.
|
|
|
|
|
|
|
|
|
|
Devised by Sean Anderson, August 20, 2001. A typo was spotted by Randy
|
|
|
|
|
E. Bryant on May 3, 2005 (after pasting the code, I had later added
|
|
|
|
|
"unsinged" to a variable declaration). As in the previous hack, I
|
|
|
|
|
mistakenly commented that we could alternatively assign `m = ((m + 1) &
|
|
|
|
|
d) - 1;` at the end, and Don Knuth corrected me on April 19, 2006 and
|
|
|
|
|
suggested `m = m & -((signed)(m - d) >> s)`. On June 18, 2009 Sean
|
|
|
|
|
Irvine proposed a change that used `((n >> s) & M[s])` instead of `((n &
|
|
|
|
|
~M[s]) >> s)`, which typically requires fewer operations because the
|
|
|
|
|
M\[s\] constant is already loaded.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit word to find the log base 2 of
|
|
|
|
|
unsigned int r = 0; // r will be lg(v)
|
|
|
|
|
|
|
|
|
|
while (v >>= 1) // unroll for more speed...
|
|
|
|
|
{
|
|
|
|
|
r++;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The log base 2 of an integer is the same as the position of the highest
|
|
|
|
|
bit set (or most significant bit set, MSB). The following log base 2
|
|
|
|
|
methods are faster than this one.
|
|
|
|
|
|
|
|
|
|
int v; // 32-bit integer to find the log base 2 of
|
|
|
|
|
int r; // result of log_2(v) goes here
|
|
|
|
|
union { unsigned int u[2]; double d; } t; // temp
|
|
|
|
|
|
|
|
|
|
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
|
|
|
|
|
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
|
|
|
|
|
t.d -= 4503599627370496.0;
|
|
|
|
|
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
|
|
|
|
|
|
|
|
|
|
The code above loads a 64-bit (IEEE-754 floating-point) double with a
|
|
|
|
|
32-bit integer (with no paddding bits) by storing the integer in the
|
|
|
|
|
mantissa while the exponent is set to 252. From this newly minted
|
|
|
|
|
double, 252 (expressed as a double) is subtracted, which sets the
|
|
|
|
|
resulting exponent to the log base 2 of the input value, v. All that is
|
|
|
|
|
left is shifting the exponent bits into position (20 bits right) and
|
|
|
|
|
subtracting the bias, 0x3FF (which is 1023 decimal). This technique only
|
|
|
|
|
takes 5 operations, but many CPUs are slow at manipulating doubles, and
|
|
|
|
|
the endianess of the architecture must be accommodated.
|
|
|
|
|
|
|
|
|
|
Eric Cole sent me this on January 15, 2006. Evan Felix pointed out a
|
|
|
|
|
typo on April 4, 2006. Vincent Lefèvre told me on July 9, 2008 to change
|
|
|
|
|
the endian check to use the float's endian, which could differ from the
|
|
|
|
|
integer's endian.
|
|
|
|
|
|
|
|
|
|
static const char LogTable256[256] =
|
|
|
|
|
{
|
|
|
|
|
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
|
|
|
|
|
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
|
|
|
LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
|
|
|
|
|
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit word to find the log of
|
|
|
|
|
unsigned r; // r will be lg(v)
|
|
|
|
|
register unsigned int t, tt; // temporaries
|
|
|
|
|
|
|
|
|
|
if (tt = v >> 16)
|
|
|
|
|
{
|
|
|
|
|
r = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{
|
|
|
|
|
r = (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The lookup table method takes only about 7 operations to find the log of
|
|
|
|
|
a 32-bit value. If extended for 64-bit quantities, it would take roughly
|
|
|
|
|
9 operations. Another operation can be trimmed off by using four tables,
|
|
|
|
|
with the possible additions incorporated into each. Using int table
|
|
|
|
|
elements may be faster, depending on your architecture.
|
|
|
|
|
|
|
|
|
|
The code above is tuned to uniformly distributed output values. If your
|
|
|
|
|
inputs are evenly distributed across all 32-bit values, then consider
|
|
|
|
|
using the following:
|
|
|
|
|
|
|
|
|
|
if (tt = v >> 24)
|
|
|
|
|
{
|
|
|
|
|
r = 24 + LogTable256[tt];
|
|
|
|
|
}
|
|
|
|
|
else if (tt = v >> 16)
|
|
|
|
|
{
|
|
|
|
|
r = 16 + LogTable256[tt];
|
|
|
|
|
}
|
|
|
|
|
else if (tt = v >> 8)
|
|
|
|
|
{
|
|
|
|
|
r = 8 + LogTable256[tt];
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{
|
|
|
|
|
r = LogTable256[v];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
To initially generate the log table algorithmically:
|
|
|
|
|
|
|
|
|
|
LogTable256[0] = LogTable256[1] = 0;
|
|
|
|
|
for (int i = 2; i < 256; i++)
|
|
|
|
|
{
|
|
|
|
|
LogTable256[i] = 1 + LogTable256[i / 2];
|
|
|
|
|
}
|
|
|
|
|
LogTable256[0] = -1; // if you want log(0) to return -1
|
|
|
|
|
|
|
|
|
|
Behdad Esfahbod and I shaved off a fraction of an operation (on average)
|
|
|
|
|
on May 18, 2005. Yet another fraction of an operation was removed on
|
|
|
|
|
November 14, 2006 by Emanuel Hoogeveen. The variation that is tuned to
|
|
|
|
|
evenly distributed input values was suggested by David A. Butterfield on
|
|
|
|
|
September 19, 2008. Venkat Reddy told me on January 5, 2009 that log(0)
|
|
|
|
|
should return -1 to indicate an error, so I changed the first entry in
|
|
|
|
|
the table to that.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit value to find the log2 of
|
|
|
|
|
const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};
|
|
|
|
|
const unsigned int S[] = {1, 2, 4, 8, 16};
|
|
|
|
|
int i;
|
|
|
|
|
|
|
|
|
|
register unsigned int r = 0; // result of log2(v) will go here
|
|
|
|
|
for (i = 4; i >= 0; i--) // unroll for speed...
|
|
|
|
|
{
|
|
|
|
|
if (v & b[i])
|
|
|
|
|
{
|
|
|
|
|
v >>= S[i];
|
|
|
|
|
r |= S[i];
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
// OR (IF YOUR CPU BRANCHES SLOWLY):
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit value to find the log2 of
|
|
|
|
|
register unsigned int r; // result of log2(v) will go here
|
|
|
|
|
register unsigned int shift;
|
|
|
|
|
|
|
|
|
|
r = (v > 0xFFFF) << 4; v >>= r;
|
|
|
|
|
shift = (v > 0xFF ) << 3; v >>= shift; r |= shift;
|
|
|
|
|
shift = (v > 0xF ) << 2; v >>= shift; r |= shift;
|
|
|
|
|
shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;
|
|
|
|
|
r |= (v >> 1);
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
// OR (IF YOU KNOW v IS A POWER OF 2):
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit value to find the log2 of
|
|
|
|
|
static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
|
|
|
|
|
0xFF00FF00, 0xFFFF0000};
|
|
|
|
|
register unsigned int r = (v & b[0]) != 0;
|
|
|
|
|
for (i = 4; i > 0; i--) // unroll for speed...
|
|
|
|
|
{
|
|
|
|
|
r |= ((v & b[i]) != 0) << i;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
Of course, to extend the code to find the log of a 33- to 64-bit number,
|
|
|
|
|
we would append another element, 0xFFFFFFFF00000000, to b, append 32 to
|
|
|
|
|
S, and loop from 5 to 0. This method is much slower than the earlier
|
|
|
|
|
table-lookup version, but if you don't want big table or your
|
|
|
|
|
architecture is slow to access memory, it's a good choice. The second
|
|
|
|
|
variation involves slightly more operations, but it may be faster on
|
|
|
|
|
machines with high branch costs (e.g. PowerPC).
|
|
|
|
|
|
|
|
|
|
The second version was sent to me by [Eric
|
|
|
|
|
Cole](http://www.balance-software.com/ec/) on January 7, 2006. Andrew
|
|
|
|
|
Shapira subsequently trimmed a few operations off of it and sent me his
|
|
|
|
|
variation (above) on Sept. 1, 2007. The third variation was suggested to
|
|
|
|
|
me by [John Owens](http://www.ece.ucdavis.edu/~jowens/) on April 24,
|
|
|
|
|
2002; it's faster, but it is only suitable when the input is known to be
|
|
|
|
|
a power of 2. On May 25, 2003, Ken Raeburn suggested improving the
|
|
|
|
|
general case by using smaller numbers for b\[\], which load faster on
|
|
|
|
|
some architectures (for instance if the word size is 16 bits, then only
|
|
|
|
|
one load instruction may be needed). These values work for the general
|
|
|
|
|
version, but not for the special-case version below it, where v is a
|
|
|
|
|
power of 2; Glenn Slayden brought this oversight to my attention on
|
|
|
|
|
December 12, 2003.
|
|
|
|
|
|
|
|
|
|
uint32_t v; // find the log base 2 of 32-bit v
|
|
|
|
|
int r; // result goes here
|
|
|
|
|
|
|
|
|
|
static const int MultiplyDeBruijnBitPosition[32] =
|
|
|
|
|
{
|
|
|
|
|
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
|
|
|
|
|
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
v |= v >> 1; // first round down to one less than a power of 2
|
|
|
|
|
v |= v >> 2;
|
|
|
|
|
v |= v >> 4;
|
|
|
|
|
v |= v >> 8;
|
|
|
|
|
v |= v >> 16;
|
|
|
|
|
|
|
|
|
|
r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
|
|
|
|
|
|
|
|
|
|
The code above computes the log base 2 of a 32-bit integer with a small
|
|
|
|
|
table lookup and multiply. It requires only 13 operations, compared to
|
|
|
|
|
(up to) 20 for the previous method. The purely table-based method
|
|
|
|
|
requires the fewest operations, but this offers a reasonable compromise
|
|
|
|
|
between table size and speed.
|
|
|
|
|
|
|
|
|
|
If you know that v is a power of 2, then you only need the following:
|
|
|
|
|
|
|
|
|
|
static const int MultiplyDeBruijnBitPosition2[32] =
|
|
|
|
|
{
|
|
|
|
|
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
|
|
|
|
|
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
|
|
|
|
|
};
|
|
|
|
|
r = MultiplyDeBruijnBitPosition2[(uint32_t)(v * 0x077CB531U) >> 27];
|
|
|
|
|
|
|
|
|
|
Eric Cole devised this January 8, 2006 after reading about the entry
|
|
|
|
|
below to [round up to a power of 2](#RoundUpPowerOf2) and the method
|
|
|
|
|
below for [computing the number of trailing bits with a multiply and
|
|
|
|
|
lookup](#ZerosOnRightMultLookup) using a DeBruijn sequence. On December
|
|
|
|
|
10, 2009, Mark Dickinson shaved off a couple operations by requiring v
|
|
|
|
|
be rounded up to one less than the next power of 2 rather than the power
|
|
|
|
|
of
|
|
|
|
|
2.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of
|
|
|
|
|
int r; // result goes here
|
|
|
|
|
int t; // temporary
|
|
|
|
|
|
|
|
|
|
static unsigned int const PowersOf10[] =
|
|
|
|
|
{1, 10, 100, 1000, 10000, 100000,
|
|
|
|
|
1000000, 10000000, 100000000, 1000000000};
|
|
|
|
|
|
|
|
|
|
t = (IntegerLogBase2(v) + 1) * 1233 >> 12; // (use a lg2 method from above)
|
|
|
|
|
r = t - (v < PowersOf10[t]);
|
|
|
|
|
|
|
|
|
|
The integer log base 10 is computed by first using one of the techniques
|
|
|
|
|
above for finding the log base 2. By the relationship log10(v) = log2(v)
|
|
|
|
|
/ log2(10), we need to multiply it by 1/log2(10), which is approximately
|
|
|
|
|
1233/4096, or 1233 followed by a right shift of 12. Adding one is needed
|
|
|
|
|
because the IntegerLogBase2 rounds down. Finally, since the value t is
|
|
|
|
|
only an approximation that may be off by one, the exact value is found
|
|
|
|
|
by subtracting the result of v \< PowersOf10\[t\].
|
|
|
|
|
|
|
|
|
|
This method takes 6 more operations than IntegerLogBase2. It may be sped
|
|
|
|
|
up (on machines with fast memory access) by modifying the log base 2
|
|
|
|
|
table-lookup method above so that the entries hold what is computed for
|
|
|
|
|
t (that is, pre-add, -mulitply, and -shift). Doing so would require a
|
|
|
|
|
total of only 9 operations to find the log base 10, assuming 4 tables
|
|
|
|
|
were used (one for each byte of v).
|
|
|
|
|
|
|
|
|
|
Eric Cole suggested I add a version of this on January 7,
|
|
|
|
|
2006.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of
|
|
|
|
|
int r; // result goes here
|
|
|
|
|
|
|
|
|
|
r = (v >= 1000000000) ? 9 : (v >= 100000000) ? 8 : (v >= 10000000) ? 7 :
|
|
|
|
|
(v >= 1000000) ? 6 : (v >= 100000) ? 5 : (v >= 10000) ? 4 :
|
|
|
|
|
(v >= 1000) ? 3 : (v >= 100) ? 2 : (v >= 10) ? 1 : 0;
|
|
|
|
|
|
|
|
|
|
This method works well when the input is uniformly distributed over
|
|
|
|
|
32-bit values because 76% of the inputs are caught by the first compare,
|
|
|
|
|
21% are caught by the second compare, 2% are caught by the third, and so
|
|
|
|
|
on (chopping the remaining down by 90% with each comparision). As a
|
|
|
|
|
result, less than 2.6 operations are needed on average.
|
|
|
|
|
|
|
|
|
|
On April 18, 2007, Emanuel Hoogeveen suggested a variation on this where
|
|
|
|
|
the conditions used divisions, which were not as fast as simple
|
|
|
|
|
comparisons.
|
|
|
|
|
|
|
|
|
|
const float v; // find int(log2(v)), where v > 0.0 && finite(v) && isnormal(v)
|
|
|
|
|
int c; // 32-bit int c gets the result;
|
|
|
|
|
|
|
|
|
|
c = *(const int *) &v; // OR, for portability: memcpy(&c, &v, sizeof c);
|
|
|
|
|
c = (c >> 23) - 127;
|
|
|
|
|
|
|
|
|
|
The above is fast, but IEEE 754-compliant architectures utilize
|
|
|
|
|
subnormal (also called denormal) floating point numbers. These have the
|
|
|
|
|
exponent bits set to zero (signifying pow(2,-127)), and the mantissa is
|
|
|
|
|
not normalized, so it contains leading zeros and thus the log2 must be
|
|
|
|
|
computed from the mantissa. To accomodate for subnormal numbers, use the
|
|
|
|
|
following:
|
|
|
|
|
|
|
|
|
|
const float v; // find int(log2(v)), where v > 0.0 && finite(v)
|
|
|
|
|
int c; // 32-bit int c gets the result;
|
|
|
|
|
int x = *(const int *) &v; // OR, for portability: memcpy(&x, &v, sizeof x);
|
|
|
|
|
|
|
|
|
|
c = x >> 23;
|
|
|
|
|
|
|
|
|
|
if (c)
|
|
|
|
|
{
|
|
|
|
|
c -= 127;
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{ // subnormal, so recompute using mantissa: c = intlog2(x) - 149;
|
|
|
|
|
register unsigned int t; // temporary
|
|
|
|
|
// Note that LogTable256 was defined earlier
|
|
|
|
|
if (t = x >> 16)
|
|
|
|
|
{
|
|
|
|
|
c = LogTable256[t] - 133;
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{
|
|
|
|
|
c = (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
On June 20, 2004, Sean A. Irvine suggested that I include code to handle
|
|
|
|
|
subnormal numbers. On June 11, 2005, Falk Hüffner pointed out that ISO
|
|
|
|
|
C99 6.5/7 specified undefined behavior for the common type punning idiom
|
|
|
|
|
\*(int \*)&, though it has worked on 99.9% of C compilers. He proposed
|
|
|
|
|
using memcpy for maximum portability or a union with a float and an int
|
|
|
|
|
for better code generation than memcpy on some compilers.
|
|
|
|
|
|
|
|
|
|
const int r;
|
|
|
|
|
const float v; // find int(log2(pow((double) v, 1. / pow(2, r)))),
|
|
|
|
|
// where isnormal(v) and v > 0
|
|
|
|
|
int c; // 32-bit int c gets the result;
|
|
|
|
|
|
|
|
|
|
c = *(const int *) &v; // OR, for portability: memcpy(&c, &v, sizeof c);
|
|
|
|
|
c = ((((c - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127;
|
|
|
|
|
|
|
|
|
|
So, if r is 0, for example, we have c = int(log2((double) v)). If r is
|
|
|
|
|
1, then we have c = int(log2(sqrt((double) v))). If r is 2, then we have
|
|
|
|
|
c = int(log2(pow((double) v, 1./4))).
|
|
|
|
|
|
|
|
|
|
On June 11, 2005, Falk Hüffner pointed out that ISO C99 6.5/7 left the
|
|
|
|
|
type punning idiom \*(int \*)& undefined, and he suggested using memcpy.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // input to count trailing zero bits
|
|
|
|
|
int c; // output: c will count v's trailing zero bits,
|
|
|
|
|
// so if v is 1101000 (base 2), then c will be 3
|
|
|
|
|
if (v)
|
|
|
|
|
{
|
|
|
|
|
v = (v ^ (v - 1)) >> 1; // Set v's trailing 0s to 1s and zero rest
|
|
|
|
|
for (c = 0; v; c++)
|
|
|
|
|
{
|
|
|
|
|
v >>= 1;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{
|
|
|
|
|
c = CHAR_BIT * sizeof(v);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The average number of trailing zero bits in a (uniformly distributed)
|
|
|
|
|
random binary number is one, so this O(trailing zeros) solution isn't
|
|
|
|
|
that bad compared to the faster methods below.
|
|
|
|
|
|
|
|
|
|
Jim Cole suggested I add a linear-time method for counting the trailing
|
|
|
|
|
zeros on August 15, 2007. On October 22, 2007, Jason Cunningham pointed
|
|
|
|
|
out that I had neglected to paste the unsigned modifier for
|
|
|
|
|
v.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // 32-bit word input to count zero bits on right
|
|
|
|
|
unsigned int c = 32; // c will be the number of zero bits on the right
|
|
|
|
|
v &= -signed(v);
|
|
|
|
|
if (v) c--;
|
|
|
|
|
if (v & 0x0000FFFF) c -= 16;
|
|
|
|
|
if (v & 0x00FF00FF) c -= 8;
|
|
|
|
|
if (v & 0x0F0F0F0F) c -= 4;
|
|
|
|
|
if (v & 0x33333333) c -= 2;
|
|
|
|
|
if (v & 0x55555555) c -= 1;
|
|
|
|
|
|
|
|
|
|
Here, we are basically doing the same operations as finding the log base
|
|
|
|
|
2 in parallel, but we first isolate the lowest 1 bit, and then proceed
|
|
|
|
|
with c starting at the maximum and decreasing. The number of operations
|
|
|
|
|
is at most 3 \* lg(N) + 4, roughly, for N bit words.
|
|
|
|
|
|
|
|
|
|
Bill Burdick suggested an optimization, reducing the time from 4 \*
|
|
|
|
|
lg(N) on February 4, 2011.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
unsigned int v; // 32-bit word input to count zero bits on right
|
|
|
|
|
unsigned int c; // c will be the number of zero bits on the right,
|
|
|
|
|
// so if v is 1101000 (base 2), then c will be 3
|
|
|
|
|
// NOTE: if 0 == v, then c = 31.
|
|
|
|
|
if (v & 0x1)
|
|
|
|
|
{
|
|
|
|
|
// special case for odd v (assumed to happen half of the time)
|
|
|
|
|
c = 0;
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{
|
|
|
|
|
c = 1;
|
|
|
|
|
if ((v & 0xffff) == 0)
|
|
|
|
|
{
|
|
|
|
|
v >>= 16;
|
|
|
|
|
c += 16;
|
|
|
|
|
}
|
|
|
|
|
if ((v & 0xff) == 0)
|
|
|
|
|
{
|
|
|
|
|
v >>= 8;
|
|
|
|
|
c += 8;
|
|
|
|
|
}
|
|
|
|
|
if ((v & 0xf) == 0)
|
|
|
|
|
{
|
|
|
|
|
v >>= 4;
|
|
|
|
|
c += 4;
|
|
|
|
|
}
|
|
|
|
|
if ((v & 0x3) == 0)
|
|
|
|
|
{
|
|
|
|
|
v >>= 2;
|
|
|
|
|
c += 2;
|
|
|
|
|
}
|
|
|
|
|
c -= v & 0x1;
|
|
|
|
|
}
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
The code above is similar to the previous method, but it computes the
|
|
|
|
|
number of trailing zeros by accumulating c in a manner akin to binary
|
|
|
|
|
search. In the first step, it checks if the bottom 16 bits of v are
|
|
|
|
|
zeros, and if so, shifts v right 16 bits and adds 16 to c, which reduces
|
|
|
|
|
the number of bits in v to consider by half. Each of the subsequent
|
|
|
|
|
conditional steps likewise halves the number of bits until there is only
|
|
|
|
|
1. This method is faster than the last one (by about 33%) because the
|
|
|
|
|
bodies of the if statements are executed less often.
|
|
|
|
|
|
|
|
|
|
Matt Whitlock suggested this on January 25, 2006. Andrew Shapira shaved
|
|
|
|
|
a couple operations off on Sept. 5, 2007 (by setting c=1 and
|
|
|
|
|
unconditionally subtracting at the end).
|
|
|
|
|
|
|
|
|
|
unsigned int v; // find the number of trailing zeros in v
|
|
|
|
|
int r; // the result goes here
|
|
|
|
|
float f = (float)(v & -v); // cast the least significant bit in v to a float
|
|
|
|
|
r = (*(uint32_t *)&f >> 23) - 0x7f;
|
|
|
|
|
|
|
|
|
|
Although this only takes about 6 operations, the time to convert an
|
|
|
|
|
integer to a float can be high on some machines. The exponent of the
|
|
|
|
|
32-bit IEEE floating point representation is shifted down, and the bias
|
|
|
|
|
is subtracted to give the position of the least significant 1 bit set in
|
|
|
|
|
v. If v is zero, then the result is -127.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // find the number of trailing zeros in v
|
|
|
|
|
int r; // put the result in r
|
|
|
|
|
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
|
|
|
|
|
{
|
|
|
|
|
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
|
|
|
|
|
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
|
|
|
|
|
20, 8, 19, 18
|
|
|
|
|
};
|
|
|
|
|
r = Mod37BitPosition[(-v & v) % 37];
|
|
|
|
|
|
|
|
|
|
The code above finds the number of zeros that are trailing on the right,
|
|
|
|
|
so binary 0100 would produce 2. It makes use of the fact that the first
|
|
|
|
|
32 bit position values are relatively prime with 37, so performing a
|
|
|
|
|
modulus division with 37 gives a unique number from 0 to 36 for each.
|
|
|
|
|
These numbers may then be mapped to the number of zeros using a small
|
|
|
|
|
lookup table. It uses only 4 operations, however indexing into a table
|
|
|
|
|
and performing modulus division may make it unsuitable for some
|
|
|
|
|
situations. I came up with this independently and then searched for a
|
|
|
|
|
subsequence of the table values, and found it was invented earlier by
|
|
|
|
|
Reiser, according to [Hacker's
|
|
|
|
|
Delight](http://www.hackersdelight.org/HDcode/ntz.c.txt).
|
|
|
|
|
|
|
|
|
|
unsigned int v; // find the number of trailing zeros in 32-bit v
|
|
|
|
|
int r; // result goes here
|
|
|
|
|
static const int MultiplyDeBruijnBitPosition[32] =
|
|
|
|
|
{
|
|
|
|
|
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
|
|
|
|
|
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
|
|
|
|
|
};
|
|
|
|
|
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
|
|
|
|
|
|
|
|
|
|
Converting bit vectors to indices of set bits is an example use for
|
|
|
|
|
this. It requires one more operation than the earlier one involving
|
|
|
|
|
modulus division, but the multiply may be faster. The expression (v &
|
|
|
|
|
-v) extracts the least significant 1 bit from v. The constant
|
|
|
|
|
0x077CB531UL is a de Bruijn sequence, which produces a unique pattern of
|
|
|
|
|
bits into the high 5 bits for each possible bit position that it is
|
|
|
|
|
multiplied against. When there are no bits set, it returns 0. More
|
|
|
|
|
information can be found by reading the paper [Using de Bruijn Sequences
|
|
|
|
|
to Index 1 in a Computer
|
|
|
|
|
Word](http://citeseer.ist.psu.edu/leiserson98using.html) by Charles E.
|
|
|
|
|
Leiserson, Harald Prokof, and Keith H. Randall.
|
|
|
|
|
|
|
|
|
|
On October 8, 2005 [Andrew Shapira](http://onezero.org) suggested I add
|
|
|
|
|
this. Dustin Spicuzza asked me on April 14, 2009 to cast the result of
|
|
|
|
|
the multiply to a 32-bit type so it would work when compiled with 64-bit
|
|
|
|
|
ints.
|
|
|
|
|
|
|
|
|
|
unsigned int const v; // Round this 32-bit value to the next highest power of 2
|
|
|
|
|
unsigned int r; // Put the result here. (So v=3 -> r=4; v=8 -> r=8)
|
|
|
|
|
|
|
|
|
|
if (v > 1)
|
|
|
|
|
{
|
|
|
|
|
float f = (float)v;
|
|
|
|
|
unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
|
|
|
|
|
r = t << (t < v);
|
|
|
|
|
}
|
|
|
|
|
else
|
|
|
|
|
{
|
|
|
|
|
r = 1;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The code above uses 8 operations, but works on all v \<= (1\<\<31).
|
|
|
|
|
|
|
|
|
|
Quick and dirty version, for domain of 1 \< v \< (1\<\<25):
|
|
|
|
|
|
|
|
|
|
float f = (float)(v - 1);
|
|
|
|
|
r = 1U << ((*(unsigned int*)(&f) >> 23) - 126);
|
|
|
|
|
|
|
|
|
|
Although the quick and dirty version only uses around 6 operations, it
|
|
|
|
|
is roughly three times slower than the [technique
|
|
|
|
|
below](#RoundUpPowerOf2) (which involves 12 operations) when benchmarked
|
|
|
|
|
on an Athlon™ XP 2100+ CPU. Some CPUs will fare better with it, though.
|
|
|
|
|
|
|
|
|
|
On September 27, 2005 Andi Smithers suggested I include a technique for
|
|
|
|
|
casting to floats to find the lg of a number for rounding up to a power
|
|
|
|
|
of 2. Similar to the quick and dirty version here, his version worked
|
|
|
|
|
with values less than (1\<\<25), due to mantissa rounding, but it used
|
|
|
|
|
one more operation.
|
|
|
|
|
|
|
|
|
|
unsigned int v; // compute the next highest power of 2 of 32-bit v
|
|
|
|
|
|
|
|
|
|
v--;
|
|
|
|
|
v |= v >> 1;
|
|
|
|
|
v |= v >> 2;
|
|
|
|
|
v |= v >> 4;
|
|
|
|
|
v |= v >> 8;
|
|
|
|
|
v |= v >> 16;
|
|
|
|
|
v++;
|
|
|
|
|
|
|
|
|
|
In 12 operations, this code computes the next highest power of 2 for a
|
|
|
|
|
32-bit integer. The result may be expressed by the formula 1U \<\< (lg(v
|
|
|
|
|
- 1) + 1). Note that in the edge case where v is 0, it returns 0, which
|
|
|
|
|
isn't a power of 2; you might append the expression v += (v == 0) to
|
|
|
|
|
remedy this if it matters. It would be faster by 2 operations to use the
|
|
|
|
|
formula and the log base 2 method that uses a lookup table, but in some
|
|
|
|
|
situations, lookup tables are not suitable, so the above code may be
|
|
|
|
|
best. (On a Athlon™ XP 2100+ I've found the above shift-left and then OR
|
|
|
|
|
code is as fast as using a single BSR assembly language instruction,
|
|
|
|
|
which scans in reverse to find the highest set bit.) It works by copying
|
|
|
|
|
the highest set bit to all of the lower bits, and then adding one, which
|
|
|
|
|
results in carries that set all of the lower bits to 0 and one bit
|
|
|
|
|
beyond the highest set bit to 1. If the original number was a power of
|
|
|
|
|
2, then the decrement will reduce it to one less, so that we round up to
|
|
|
|
|
the same original value.
|
|
|
|
|
|
|
|
|
|
You might alternatively compute the next higher power of 2 in only 8 or
|
|
|
|
|
9 operations using a lookup table for floor(lg(v)) and then evaluating
|
|
|
|
|
1\<\<(1+floor(lg(v))); Atul Divekar suggested I mention this on
|
|
|
|
|
September 5, 2010.
|
|
|
|
|
|
|
|
|
|
Devised by Sean Anderson, Sepember 14, 2001. Pete Hart pointed me to [a
|
|
|
|
|
couple newsgroup
|
|
|
|
|
posts](http://groups.google.com/group/comp.lang.python/browse_thread/thread/c4d3aae0df917df5/6fdae3872f9de79d?lnk=st&q=comp.lang.python+zeddy&rnum=6#6fdae3872f9de79d)
|
|
|
|
|
by him and William Lewis in February of 1997, where they arrive at the
|
|
|
|
|
same
|
|
|
|
|
algorithm.
|
|
|
|
|
|
|
|
|
|
unsigned short x; // Interleave bits of x and y, so that all of the
|
|
|
|
|
unsigned short y; // bits of x are in the even positions and y in the odd;
|
|
|
|
|
unsigned int z = 0; // z gets the resulting Morton Number.
|
|
|
|
|
|
|
|
|
|
for (int i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed...
|
|
|
|
|
{
|
|
|
|
|
z |= (x & 1U << i) << i | (y & 1U << i) << (i + 1);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
Interleaved bits (aka Morton numbers) are useful for linearizing 2D
|
|
|
|
|
integer coordinates, so x and y are combined into a single number that
|
|
|
|
|
can be compared easily and has the property that a number is usually
|
|
|
|
|
close to another if their x and y values are close.
|
|
|
|
|
|
|
|
|
|
static const unsigned short MortonTable256[256] =
|
|
|
|
|
{
|
|
|
|
|
0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
|
|
|
|
|
0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
|
|
|
|
|
0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
|
|
|
|
|
0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155,
|
|
|
|
|
0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415,
|
|
|
|
|
0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455,
|
|
|
|
|
0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515,
|
|
|
|
|
0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555,
|
|
|
|
|
0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015,
|
|
|
|
|
0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055,
|
|
|
|
|
0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115,
|
|
|
|
|
0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155,
|
|
|
|
|
0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415,
|
|
|
|
|
0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455,
|
|
|
|
|
0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515,
|
|
|
|
|
0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555,
|
|
|
|
|
0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015,
|
|
|
|
|
0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055,
|
|
|
|
|
0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115,
|
|
|
|
|
0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155,
|
|
|
|
|
0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415,
|
|
|
|
|
0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455,
|
|
|
|
|
0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515,
|
|
|
|
|
0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555,
|
|
|
|
|
0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015,
|
|
|
|
|
0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055,
|
|
|
|
|
0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115,
|
|
|
|
|
0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155,
|
|
|
|
|
0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415,
|
|
|
|
|
0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455,
|
|
|
|
|
0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515,
|
|
|
|
|
0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
unsigned short x; // Interleave bits of x and y, so that all of the
|
|
|
|
|
unsigned short y; // bits of x are in the even positions and y in the odd;
|
|
|
|
|
unsigned int z; // z gets the resulting 32-bit Morton Number.
|
|
|
|
|
|
|
|
|
|
z = MortonTable256[y >> 8] << 17 |
|
|
|
|
|
MortonTable256[x >> 8] << 16 |
|
|
|
|
|
MortonTable256[y & 0xFF] << 1 |
|
|
|
|
|
MortonTable256[x & 0xFF];
|
|
|
|
|
|
|
|
|
|
For more speed, use an additional table with values that are
|
|
|
|
|
MortonTable256 pre-shifted one bit to the left. This second table could
|
|
|
|
|
then be used for the y lookups, thus reducing the operations by two, but
|
|
|
|
|
almost doubling the memory required. Extending this same idea, four
|
|
|
|
|
tables could be used, with two of them pre-shifted by 16 to the left of
|
|
|
|
|
the previous two, so that we would only need 11 operations total. In 11
|
|
|
|
|
operations, this version interleaves bits of two bytes (rather than
|
|
|
|
|
shorts, as in the other versions), but many of the operations are 64-bit
|
|
|
|
|
multiplies so it isn't appropriate for all machines. The input
|
|
|
|
|
parameters, x and y, should be less than
|
|
|
|
|
256.
|
|
|
|
|
|
|
|
|
|
unsigned char x; // Interleave bits of (8-bit) x and y, so that all of the
|
|
|
|
|
unsigned char y; // bits of x are in the even positions and y in the odd;
|
|
|
|
|
unsigned short z; // z gets the resulting 16-bit Morton Number.
|
|
|
|
|
|
|
|
|
|
z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *
|
|
|
|
|
0x0102040810204081ULL >> 49) & 0x5555 |
|
|
|
|
|
((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *
|
|
|
|
|
0x0102040810204081ULL >> 48) & 0xAAAA;
|
|
|
|
|
|
|
|
|
|
Holger Bettag was inspired to suggest this technique on October 10, 2004
|
|
|
|
|
after reading the multiply-based bit reversals
|
|
|
|
|
here.
|
|
|
|
|
|
|
|
|
|
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};
|
|
|
|
|
static const unsigned int S[] = {1, 2, 4, 8};
|
|
|
|
|
|
|
|
|
|
unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x
|
|
|
|
|
unsigned int y; // are in the even positions and bits from y in the odd;
|
|
|
|
|
unsigned int z; // z gets the resulting 32-bit Morton Number.
|
|
|
|
|
// x and y must initially be less than 65536.
|
|
|
|
|
|
|
|
|
|
x = (x | (x << S[3])) & B[3];
|
|
|
|
|
x = (x | (x << S[2])) & B[2];
|
|
|
|
|
x = (x | (x << S[1])) & B[1];
|
|
|
|
|
x = (x | (x << S[0])) & B[0];
|
|
|
|
|
|
|
|
|
|
y = (y | (y << S[3])) & B[3];
|
|
|
|
|
y = (y | (y << S[2])) & B[2];
|
|
|
|
|
y = (y | (y << S[1])) & B[1];
|
|
|
|
|
y = (y | (y << S[0])) & B[0];
|
|
|
|
|
|
|
|
|
|
z = x | (y << 1);
|
|
|
|
|
|
|
|
|
|
// Fewer operations:
|
|
|
|
|
unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0
|
|
|
|
|
bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
|
|
|
|
|
|
|
|
|
|
The code above may be useful when doing a fast string copy in which a
|
|
|
|
|
word is copied at a time; it uses 5 operations. On the other hand,
|
|
|
|
|
testing for a null byte in the obvious ways (which follow) have at least
|
|
|
|
|
7 operations (when counted in the most sparing way), and at most 12.
|
|
|
|
|
|
|
|
|
|
// More operations:
|
|
|
|
|
bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000))
|
|
|
|
|
// OR:
|
|
|
|
|
unsigned char * p = (unsigned char *) &v;
|
|
|
|
|
bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);
|
|
|
|
|
|
|
|
|
|
The code at the beginning of this section (labeled "Fewer operations")
|
|
|
|
|
works by first zeroing the high bits of the 4 bytes in the word.
|
|
|
|
|
Subsequently, it adds a number that will result in an overflow to the
|
|
|
|
|
high bit of a byte if any of the low bits were initialy set. Next the
|
|
|
|
|
high bits of the original word are ORed with these values; thus, the
|
|
|
|
|
high bit of a byte is set iff any bit in the byte was set. Finally, we
|
|
|
|
|
determine if any of these high bits are zero by ORing with ones
|
|
|
|
|
everywhere except the high bits and inverting the result. Extending to
|
|
|
|
|
64 bits is trivial; simply increase the constants to be
|
|
|
|
|
0x7F7F7F7F7F7F7F7F.
|
|
|
|
|
|
|
|
|
|
For an additional improvement, a fast pretest that requires only 4
|
|
|
|
|
operations may be performed to determine if the word may have a zero
|
|
|
|
|
byte. The test also returns true if the high byte is 0x80, so there are
|
|
|
|
|
occasional false positives, but the slower and more reliable version
|
|
|
|
|
above may then be used on candidates for an overall increase in speed
|
|
|
|
|
with correct output.
|
|
|
|
|
|
|
|
|
|
bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
|
|
|
|
|
if (hasZeroByte) // or may just have 0x80 in the high byte
|
|
|
|
|
{
|
|
|
|
|
hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
There is yet a faster method — use [`hasless`](#HasLessInWord)(v, 1),
|
|
|
|
|
which is defined below; it works in 4 operations and requires no
|
|
|
|
|
subsquent verification. It simplifies to
|
|
|
|
|
|
|
|
|
|
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
|
|
|
|
|
|
|
|
|
|
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any
|
|
|
|
|
byte whenever the corresponding byte in v is zero or greater than 0x80.
|
|
|
|
|
The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes
|
|
|
|
|
where the byte of v doesn't have its high bit set (so the byte was less
|
|
|
|
|
than 0x80). Finally, by ANDing these two sub-expressions the result is
|
|
|
|
|
the high bits set where the bytes in v were zero, since the high bits
|
|
|
|
|
set due to a value greater than 0x80 in the first sub-expression are
|
|
|
|
|
masked off by the second.
|
|
|
|
|
|
|
|
|
|
Paul Messmer suggested the fast pretest improvement on October 2, 2004.
|
|
|
|
|
Juha Järvi later suggested `hasless(v, 1)` on April 6, 2005, which he
|
|
|
|
|
found on [Paul Hsieh's Assembly
|
|
|
|
|
Lab](http://www.azillionmonkeys.com/qed/asmexample.html); previously it
|
|
|
|
|
was written in a newsgroup post on April 27, 1987 by Alan Mycroft.
|
|
|
|
|
|
|
|
|
|
We may want to know if any byte in a word has a specific value. To do
|
|
|
|
|
so, we can XOR the value to test with a word that has been filled with
|
|
|
|
|
the byte values in which we're interested. Because XORing a value with
|
|
|
|
|
itself results in a zero byte and nonzero otherwise, we can pass the
|
|
|
|
|
result to `haszero`.
|
|
|
|
|
|
|
|
|
|
#define hasvalue(x,n) \
|
|
|
|
|
(haszero((x) ^ (~0UL/255 * (n))))
|
|
|
|
|
|
|
|
|
|
Stephen M Bennet suggested this on December 13, 2009 after reading the
|
|
|
|
|
entry for `haszero`.
|
|
|
|
|
|
|
|
|
|
Test if a word x contains an unsigned byte with value \< n.
|
|
|
|
|
Specifically for n=1, it can be used to find a 0-byte by examining one
|
|
|
|
|
long at a time, or any byte by XORing x with a mask first. Uses 4
|
|
|
|
|
arithmetic/logical operations when n is constant.
|
|
|
|
|
|
|
|
|
|
Requirements: x\>=0; 0\<=n\<=128
|
|
|
|
|
|
|
|
|
|
#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)
|
|
|
|
|
|
|
|
|
|
To count the number of bytes in x that are less than n in 7 operations,
|
|
|
|
|
use
|
|
|
|
|
|
|
|
|
|
#define countless(x,n) \
|
|
|
|
|
(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)
|
|
|
|
|
|
|
|
|
|
Juha Järvi sent this clever technique to me on April 6, 2005. The
|
|
|
|
|
`countless` macro was added by Sean Anderson on April 10, 2005, inspired
|
|
|
|
|
by Juha's `countmore`, below.
|
|
|
|
|
|
|
|
|
|
Test if a word x contains an unsigned byte with value \> n. Uses 3
|
|
|
|
|
arithmetic/logical operations when n is constant.
|
|
|
|
|
|
|
|
|
|
Requirements: x\>=0; 0\<=n\<=127
|
|
|
|
|
|
|
|
|
|
#define hasmore(x,n) (((x)+~0UL/255*(127-(n))|(x))&~0UL/255*128)
|
|
|
|
|
|
|
|
|
|
To count the number of bytes in x that are more than n in 6 operations,
|
|
|
|
|
use:
|
|
|
|
|
|
|
|
|
|
#define countmore(x,n) \
|
|
|
|
|
(((((x)&~0UL/255*127)+~0UL/255*(127-(n))|(x))&~0UL/255*128)/128%255)
|
|
|
|
|
|
|
|
|
|
The macro `hasmore` was suggested by Juha Järvi on April 6, 2005, and he
|
|
|
|
|
added `countmore` on April 8, 2005.
|
|
|
|
|
|
|
|
|
|
When m \< n, this technique tests if a word x contains an unsigned byte
|
|
|
|
|
value, such that m \< value \< n. It uses 7 arithmetic/logical
|
|
|
|
|
operations when n and m are constant.
|
|
|
|
|
|
|
|
|
|
Note: Bytes that equal n can be reported by `likelyhasbetween` as false
|
|
|
|
|
positives, so this should be checked by character if a certain result is
|
|
|
|
|
needed.
|
|
|
|
|
|
|
|
|
|
Requirements: x\>=0; 0\<=m\<=127; 0\<=n\<=128
|
|
|
|
|
|
|
|
|
|
#define likelyhasbetween(x,m,n) \
|
|
|
|
|
((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
|
|
|
|
|
|
|
|
|
|
This technique would be suitable for a fast pretest. A variation that
|
|
|
|
|
takes one more operation (8 total for constant m and n) but provides the
|
|
|
|
|
exact answer is:
|
|
|
|
|
|
|
|
|
|
#define hasbetween(x,m,n) \
|
|
|
|
|
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
|
|
|
|
|
|
|
|
|
|
To count the number of bytes in x that are between m and n (exclusive)
|
|
|
|
|
in 10 operations, use:
|
|
|
|
|
|
|
|
|
|
#define countbetween(x,m,n) (hasbetween(x,m,n)/128%255)
|
|
|
|
|
|
|
|
|
|
Juha Järvi suggested `likelyhasbetween` on April 6, 2005. From there,
|
|
|
|
|
Sean Anderson created `hasbetween` and `countbetween` on April 10, 2005.
|
|
|
|
|
|
|
|
|
|
Suppose we have a pattern of N bits set to 1 in an integer and we want
|
|
|
|
|
the next permutation of N 1 bits in a lexicographical sense. For
|
|
|
|
|
example, if N is 3 and the bit pattern is 00010011, the next patterns
|
|
|
|
|
would be 00010101, 00010110, 00011001,00011010, 00011100, 00100011, and
|
|
|
|
|
so forth. The following is a fast way to compute the next permutation.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
unsigned int v; // current permutation of bits
|
|
|
|
|
unsigned int w; // next permutation of bits
|
|
|
|
|
|
|
|
|
|
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
|
|
|
|
|
// Next set to 1 the most significant bit to change,
|
|
|
|
|
// set to 0 the least significant ones, and add the necessary 1 bits.
|
|
|
|
|
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
The \_\_builtin\_ctz(v) GNU C compiler intrinsic for x86 CPUs returns
|
|
|
|
|
the number of trailing zeros. If you are using Microsoft compilers for
|
|
|
|
|
x86, the intrinsic is \_BitScanForward. These both emit a bsf
|
|
|
|
|
instruction, but equivalents may be available for other architectures.
|
|
|
|
|
If not, then consider using one of the methods for counting the
|
|
|
|
|
consecutive zero bits mentioned earlier.
|
|
|
|
|
|
|
|
|
|
Here is another version that tends to be slower because of its division
|
|
|
|
|
operator, but it does not require counting the trailing zeros.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
unsigned int t = (v | (v - 1)) + 1;
|
|
|
|
|
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Thanks to Dario Sneidermanis of Argentina, who provided this on November
|
|
|
|
|
28, 2009.
|
|
|
|
|
|
|
|
|
|
[A Belorussian
|
|
|
|
|
translation](http://webhostingrating.com/libs/bithacks-be) (provided by
|
|
|
|
|
[Webhostingrating](http://webhostingrating.com/)) is available.
|