Getting to the Square Root of this Function
February 8, 2008 10:21 AM

Getting to the source of 5 beautiful lines of Quake 3. Rys Sommefeldt traces the history of a very quick (and now infamous) inverse square-root function used in Quake 3. (via)

The function uses a clever first approximation to seed the Newton-Raphson method, and although Rys doesn't find a definitive source, the path he traces should be interesting to programming-inclined Mefites.
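
For reference, here is the function in question, as quoted in the article (and endlessly around the web):

float InvSqrt(float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f-xhalf*x*x);
    return x;
}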
posted by spiderskull (58 comments total) 26 users marked this as a favorite
 
...you get gems like the fast invsqrt function to show you that it's not all about the 3D hardware, and that software is arguably even more of a factor when analysing 3D performance.

This may be my software bias showing, but duh. Hardware improvements are O(n) (until quantum computers get here) while software improvements can be of any order.
posted by DU at 10:34 AM on February 8, 2008


What I've never seen explained is why the first approximation is so clever.
posted by unSane at 10:38 AM on February 8, 2008


Infamous? Why is it infamous?
posted by nzero at 10:44 AM on February 8, 2008


Okay, I'm dumb. What is an inverse square root? I know what a square root is, and I know what squaring something is. Really, if someone would just explain what this is, I will be able to figure this out.
posted by Deathalicious at 10:44 AM on February 8, 2008


1/sqrt(x) I think.
posted by DU at 10:45 AM on February 8, 2008


Inverse square root is just 1/sqrt(x). You end up calculating that a lot in graphics programs.
posted by hattifattener at 10:46 AM on February 8, 2008


Oh, okay. I kept thinking, "Um, wouldn't the inverse of the square root just be squaring the thing?"

See? I am dumb.
posted by Deathalicious at 10:52 AM on February 8, 2008 [2 favorites]


A much clearer explanation is available here.
posted by Blazecock Pileon at 10:58 AM on February 8, 2008 [2 favorites]


You end up calculating that a lot because you need to calculate a square root and you need to divide some number X by that square root. It takes about the same time to calculate a square root as to calculate the inverse, but it's faster to multiply X*(invsqrt) than to divide X/(sqrt).

But why is 0x5F3759DF used?
posted by MtDewd at 10:59 AM on February 8, 2008
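
To make that concrete, the classic use is normalizing a vector; a minimal sketch in C, assuming the InvSqrt quoted in the post:

/* Scale v to unit length by multiplying each component by
   1/sqrt(x^2 + y^2 + z^2), rather than dividing by the sqrt. */
void normalize(float v[3])
{
    float s = InvSqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    v[0] *= s;
    v[1] *= s;
    v[2] *= s;
}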


That Carmack is a clever bastard.
posted by Pope Guilty at 11:03 AM on February 8, 2008


See? I am dumb.

Nah, it's just an ambiguity in the mathematical terminology being used here.

Think about this: If I give you cos^2(x), it means cosine squared, but cos^-1(x) means arccos(x), not sec(x). In this context, you would refer to arccos(x) as the inverse, while sec(x) would be the reciprocal, because the context is the cosine function. In the case here, the function given is the square root, so the context seems to indicate, as you assumed, that the inverse is the square. In truth, the "inverse" meant here is the reciprocal, that is, the multiplicative inverse! So it's an inverse, but not the most logical one to call "inverse" here. The guy could have made it much more clear by calling it RecSqrt, eh?
posted by nzero at 11:04 AM on February 8, 2008 [3 favorites]


Some architectures (like the ppc) have an inverse-sqrt-approximation instruction built in. It probably does more or less what the first few lines of this function do: you can approximate a square root by dividing the exponent by two (e.g. with a right shift). It can treat the fields of the floating point number individually, though, so it doesn't have to be quite as clever as this code.
posted by hattifattener at 11:06 AM on February 8, 2008
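
A sketch of that exponent-halving idea on IEEE 754 single precision (the masks are just the standard float layout; the function itself is mine, not any particular chip's estimate instruction):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

/* Crude square root: unbias the 8-bit exponent, halve it, rebias.
   The mantissa is untouched, so this only gets the order of magnitude
   right; hardware estimate instructions also correct a few mantissa
   bits via lookup. */
float crude_sqrt(float x)
{
    uint32_t b;
    memcpy(&b, &x, sizeof b);              /* aliasing-safe bit copy */
    int e = (int)((b >> 23) & 0xFF) - 127; /* unbiased exponent */
    b = (b & 0x807FFFFFu) | ((uint32_t)(e/2 + 127) << 23);
    memcpy(&x, &b, sizeof x);
    return x;
}

int main(void)
{
    printf("crude_sqrt(100) = %g (true %g)\n",
           crude_sqrt(100.0f), sqrtf(100.0f));  /* 12.5 vs 10 */
    return 0;
}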


But why is 0x5F3759DF used?

This paper explains the derivation of the magic number 0x5F3759DF.
posted by Blazecock Pileon at 11:07 AM on February 8, 2008 [1 favorite]


int i = *(int*)&x;

And they say perl is line noise.
posted by smackfu at 11:07 AM on February 8, 2008 [2 favorites]


Whoa, that betterexplained.com site is awesome.
posted by DU at 11:08 AM on February 8, 2008


I think "inverse square root" is confusing. "Reciprocal square root" seems less ambiguous to me. Of course, there is precedent. After all, "inverse square law" is common terminology. I suppose the "inverse" refers to the multiplicative inverse. I still find it confusing in this context, fwiw.

On preview, what nzero said.
posted by Crabby Appleton at 11:13 AM on February 8, 2008


Thanks, BP.
I had just found that from your previous link.
It will take a while to understand it, though.
posted by MtDewd at 11:19 AM on February 8, 2008


The best way to calculate an inverse square root is to just call pow(x,-2). Then it's the operating system's problem to make it fast. And if they don't, you have someone to sue.
posted by DU at 11:19 AM on February 8, 2008


This is why we can get away with calling it computer science. I both love and hate this sort of thing—love it because it's awesome, what with the gorgeous low-level hack on the floating point storage and the magic constant correction (the betterexplained.com writeup BP linked is indeed very readable); hate it because I just can't get my brain to do good work in that area, much as I enjoy programming and much as some other areas really click for me.

Nothing is quite so viscerally dismaying as 3D code that doesn't work. Abstract logical bugs are hair-tearing, sure, but when your triangle normal calculations are effed to the point where your crude Batman model has a spike stretching from his forehead to render infinity, you and everyone within visual range knows your shit is fucked in a very, very direct way.
posted by cortex at 11:31 AM on February 8, 2008 [12 favorites]


Clicking on this thread was like stepping into the deep end at the swimming pool. It's over my head, and I don't know how to swim.
posted by blue_beetle at 11:32 AM on February 8, 2008


I don't even have those little inflatable arm things. I'm drowning here.
posted by TwelveTwo at 11:39 AM on February 8, 2008 [1 favorite]


This method is as old as the hills and most likely was derived by one of the old Fortran optimisation hackers in the 70s. Numerical analysis was all the rage back then. These things were mostly team efforts anyway, and trying to find some "hero" misses the point entirely. Try asking this guy what textbook he found it in.
posted by DirtyCreature at 11:52 AM on February 8, 2008 [1 favorite]


You remember when you were young, and you caught Akira on Sci-Fi, or saw Star Wars for the first time, and thought they were pretty keen? You consume some more of each and decide you like it just fine. When someone asked, you'd say that you liked Anime and Star Wars. You consider yourself an "Anime Fan" or a "Star Wars Fan."

Then in your teens or after, you meet some Anime fans, or Star Wars fans... and while you still like what you like, you come to realize there is a wide chasm between you enjoying Star Wars films and being a Star Wars Fan.

Well, I'll level with you, MetaFilter, that's a bit how I feel reading these links and this thread. You see, up until now, I was an "IT Professional."
posted by Uther Bentrazor at 12:01 PM on February 8, 2008 [23 favorites]


If they really wanted to make it fast it would be an in-line define, without the subroutine overhead.
posted by StickyCarpet at 12:35 PM on February 8, 2008


It's not "beautiful" code, it's hacky but fast.
posted by Foosnark at 1:11 PM on February 8, 2008


Nothing is quite so viscerally dismaying as 3D code that doesn't work.

I don't know about that. Making a motor bash expensive things into each other is a little better.
posted by mkb at 1:20 PM on February 8, 2008


DU, I think you mean pow(x, -0.5). :)
posted by apathy0o0 at 1:33 PM on February 8, 2008 [1 favorite]


If they really wanted to make it fast it would be an in-line define, without the subroutine overhead.

Hey man, easy on the icache.
posted by bigschmoove at 1:39 PM on February 8, 2008


DU: the reason you don't just ask the operating system for it is that the function name is slightly wrong. Instead of inverse_square_root, it probably should be close_enough_to_inverse_square_root_for_our_purposes_but_is_much_faster. The linked function doesn't give you exact answers, but any wrongness is small enough that it doesn't affect the layout of pixels on the screen.
posted by cschneid at 1:48 PM on February 8, 2008 [1 favorite]


It's not "beautiful" code, it's hacky but fast.

Sufficiently clever optimization in the face of a bottleneck is one measure of beauty; there is beauty in the understanding that makes the hack possible.
posted by cortex at 1:57 PM on February 8, 2008 [4 favorites]


If they really wanted to make it fast it would be an in-line define, without the subroutine overhead.

No duh.
posted by docpops at 2:01 PM on February 8, 2008


This may be my software bias showing, but duh. Hardware improvements are O(n) (until quantum computers get here) while software improvements can be of any order.

Can you explain that? One of the first problems in Feynman's CS lecture series is to show how multiprocessor hardware can perform comparisons and additions in ln(n) time. As far as I know, software running on a single processor can't do that.
posted by kid ichorous at 2:09 PM on February 8, 2008


kid ichorous: Oh god, I didn't even know there was a Feynman Lectures on Computation. Do you know if there's a video or audio version? I must have it.
posted by jmhodges at 2:22 PM on February 8, 2008


I'd still love if someone could take a stab at explaining why the "magic" constant that was chosen works better than any other random number, like 42 or 17, and what sort of genius process was used to find it.
posted by straight at 2:28 PM on February 8, 2008


kid ichorous, I'm guessing the intention is that bare-metal improvements to hardware are limited to incremental improvements in speed—you can improve a processor design to make it clock twice as fast, and you've produced a two-fold improvement; but you can find a more efficient algorithm for solving a problem and produce an arbitrarily large improvement free of charge.

In that sense, it's right; if the hardware is already doing what it does in the most efficient manner possible—the low-level gate logic and memory management and so on—there's no room to move on reducing complexity. We're already logically pegged, even if we can refine some of the sticking points and improve the meta game a bit.

On the other hand, you can make a chip faster; but a calculation is a calculation, and once you've got it as good as you can make it, you're never going to get anywhere except with better hardware. And better hardware over the last sixty years has been a pretty goddam significant factor in the march of computer science itself and more significantly all the new information technology that's grown out of it. So hardware and software kinda need to make with the hugs and not get all catty.
posted by cortex at 2:34 PM on February 8, 2008


Infamous? Why is it infamous?

Yeah, I dunno. I guess because it's well-known, and while it's appreciated, it's also still confusing to a lot of people who encounter it. It's a poor choice of words on my part, no doubt.

I'd still love if someone could take a stab at explaining why the "magic" constant that was chosen works better than any other random number, like 42 or 17, and what sort of genius process was used to find it.

I recommend checking out the links above (namely the one Blazecock Pileon links to).
posted by spiderskull at 3:15 PM on February 8, 2008


I'd still love if someone could take a stab at explaining why the "magic" constant that was chosen works better than any other random number, like 42 or 17, and what sort of genius process was used to find it.

If you're willing to read (or at least skim) some genuine math, this paper by Chris Lomont (linked to from the main link above, and elsewhere in the thread) walks through the whole thing. There are some CS 101 ideas in there that aren't really too hard to approach but which he (along with every other writeup we've seen in this thread) pretty much takes as a given and doesn't explain; I'll take a shot at a rough overview below, but barring that I'd recommend a spirited slog through the paper.

So, the short (ha!) and simplified and don't-quote-me-on-this explanation:

A computer typically stores integer numbers and floating point numbers in two distinctly different ways. Integers have just a sign bit (1 = negative, 0 = positive) with the rest of the bits representing the actual number; floating points have a sign bit, an exponent, and a mantissa (where each of the latter is a sort of miniature integer inside a single string of bits) representing a scientific-notation value like m * 2^e. (Real floating point formats also bias the exponent and assume an implicit leading 1 on the mantissa, but we can ignore that here.) The three parts of the floating point number occupy specific parts of the string of bits—say, in a 16-bit float value, four bits for the exponent and eleven bits for the mantissa (with one left over for the sign bit to tell us if it's positive or negative).

So if you take a given string of bits, say "0110100100011011", and interpret that as an integer, you get a very, very different value than if you interpret it as a floating point number, even though it's the same string of bits in both cases. Generally speaking, mistaking a floating point value (a "float") for an integer (an "int"), or vice versa, is a really, really bad thing to do; it's the kind of eye-gouging bug that people hate to write, and which people write all the time anyway.

Visually, here's how the two interpretations might be handled by the computer:

Integer:
0 110100100011011
- 0 is the sign bit: this is a positive number.
- The rest is a binary representation of the number 26,907.
- Ergo, the value is +26907.

Floating point value:
0 1101 00100011011
- Again, that first 0 is the sign bit. Positive number.
- The second string is our four-digit exponent, binary representation of the number 13.
- The third string is our 11-digit mantissa, binary for 283.
- Ergo, the value is +283 * 2^13, or +2,318,336.

So you can see how much of an interpretive difference int vs. float makes.
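
In real C, with real 32-bit floats, you can watch the two readings of a single bit pattern directly. A tiny sketch, using the same cast trick as the Quake code (see the aliasing caveats near the end of the thread):

#include <stdio.h>

int main(void)
{
    float f = 0.15625f;   /* bit pattern 0x3E200000 */
    int i = *(int*)&f;    /* same 32 bits, read as an integer */
    printf("as float: %f\n", f);   /* prints 0.156250 */
    printf("as int:   %d\n", i);   /* prints 1042284544 */
    return 0;
}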

But this method depends on a very clever way to intentionally take a float, treat it like an int, do some computationally fast operations to the faux integer, then turn it back into a float again and have an answer very close to what a normal (and computationally much slower) inverse-square-root function would return. It's not a perfect calculation, but it's so close that it doesn't cause problems: this is a video game, not rocket science.

Part of the clever method involves literally shifting the whole string to the right: each digit moves to where the digit to its right used to be, and one digit gets dropped off the far right, and a new digit gets added on the left. Before and after for a right-shift of our made-up string:

0110100100011011
0011010010001101

Doing this to integers is pretty normal stuff: it's actually (with a couple of caveats) a very fast way to divide by two:

0 110100100011011 = 26907, as above, becomes
0 011010010001101 = 13453. (Rounding error is one of those caveats).

But it's a strange thing to do to a floating point value, because the exponent and the mantissa occupy fixed positions in the bit string, and all of a sudden you're shoving a digit that used to belong to the exponent into the mantissa:

0 1101 00100011011 = 283 * 2^13 = 2,318,336, becomes
0 0110 10010001101 = 1165 * 2^6 = 74,560.

There's no obvious mathematical relationship between the shifted values—certainly nothing as simple as "divide by two" as in the case of integers.

So that magic constant! Where does it come in, and why did they choose the one they chose? The trick is that it's a very carefully engineered value that makes the bizarre float-as-int calculation work: it's chosen so that when you take a floating point number, shift its bits right, and subtract them from the constant, the resulting bit string, read back as a float, lands very close to the true inverse square root.

I found the actual math behind determining the value a bit dizzying, but the conclusion in Lomont's paper is that the specific value, 0x5F3759DF (which is just a hexadecimal string; the decimal value is 1,597,463,007 and the unsigned binary value is 1011111001101110101100111011111) is one of a number of values that could be used to make this work. The neighborhood of constants that would work—that meet the weird constraints the float-as-int trick introduces—is pretty small, so the alternative constants would look pretty similar to this one.

Lomont in fact finds one that works a bit better than this mysterious value, and others that work about as well with minor variations in certain contexts. It's not a unique single constant, just one in a family.
posted by cortex at 3:16 PM on February 8, 2008 [35 favorites]
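
If you want a feel for just how not-rocket-science the error is, here's a quick harness (my sketch; the InvSqrt body is the one quoted at the top of the thread, repeated so this compiles on its own) comparing the hack against the real thing. The relative error stays within a few tenths of a percent:

#include <stdio.h>
#include <math.h>

float InvSqrt(float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;           /* float bits, read as an int */
    i = 0x5f3759df - (i>>1);     /* shift, subtract from the magic constant */
    x = *(float*)&i;             /* back to float: the clever first guess */
    return x*(1.5f - xhalf*x*x); /* one Newton-Raphson step to polish it */
}

int main(void)
{
    for (float x = 0.01f; x < 10000.0f; x *= 7.3f) {
        float a = InvSqrt(x), b = 1.0f/sqrtf(x);
        printf("x=%-10g approx=%-12g true=%-12g rel err=%.4f%%\n",
               x, a, b, 100.0*fabs(a - b)/b);
    }
    return 0;
}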


jmhodges, here is a link to the Feynman computation lectures at Google Books. It looks like a full scan of the same edition I own.
posted by kid ichorous at 3:28 PM on February 8, 2008 [4 favorites]


I doubt the constant was originally derived by the method described in the Chris Lomont paper. Here's my guess as to how it was originally chosen.

Hmmm, we need to find a good approximation for the inverse square root of the mantissa, 1/sqrt(1+M), with M between 0 and 1. Let's look at the curve. Gee, wouldn't it be nice if we could fit a straight line with neat binary arithmetic properties to the curve. Gee, that function *does* look like it would nicely fit a line with a gradient of -1/4. Ok, let's assume we have a straight line with that gradient and work out the y intercept of the straight line with the minimum total area between it and the inverse square root function (using numerical integration).

Then perhaps by doing this we get 0.966215 - M/4 which just happens to be the function that the code represents (after appropriate adjustment).

This is more like how numerical analysts would have thought.
posted by DirtyCreature at 4:20 PM on February 8, 2008 [1 favorite]
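
A quick numerical check of that hunch (my sketch; the 0.966215 intercept is DirtyCreature's figure from above):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* compare the line 0.966215 - M/4 against 1/sqrt(1+M) over [0,1] */
    double worst = 0.0, at = 0.0;
    for (double M = 0.0; M <= 1.0; M += 1.0/4096) {
        double err = fabs((0.966215 - M/4) - 1.0/sqrt(1.0 + M));
        if (err > worst) { worst = err; at = M; }
    }
    printf("max |error| = %.4f at M = %.3f\n", worst, at); /* ~0.034, at M = 0 */
    return 0;
}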


How do you get 0.966215 - M/4 DirtyCreature?
posted by snoktruix at 5:03 PM on February 8, 2008


(I mean how do you figure that the code represents that 'after appropriate adjustment').
posted by snoktruix at 5:04 PM on February 8, 2008


snoktruix: "How do you get 0.966215 - M/4 DirtyCreature?"

Let's check it. For values of M between 0 and 1, this function has values less than 1, so let's get them back to being greater than 1 for our new mantissa representation. Instead write the function as ( 1.93243 - M/2 ) / 2 . The floating point representation of 1.93243 is 0x3ff759df. Subtracting 1 from the exponent (as per the divide by 2 in this approximation) gives 0x3f7759df, the bits of 0.966215; and since the code computes 0x5f3759df - (i>>1), add back the shifted bits of 1.0 (0x3f800000 >> 1 = 0x1fc00000) and you get ......OMFG....... 0x5f3759df !

I haven't done the curve fitting myself but just by inspection, that line above looks like a pretty good absolute area-minimising fit.
posted by DirtyCreature at 5:22 PM on February 8, 2008


This is stupidly clever code. It's not guaranteed to work on every compiler -- the C++ standard doesn't guarantee that your floating point representation won't move between 80 bits and 64 bits as the compiler sees fit to get the best performance, and if that happens somewhere around those casts, then you are sunk.

There are two rules for optimization:

Rule 1: don't.
Rule 2 (for experts only): don't, yet.
posted by lupus_yonderboy at 5:42 PM on February 8, 2008


(not that I don't love this sort of thing. and not that this might not have made a difference in the actual speed of the code...)
posted by lupus_yonderboy at 5:44 PM on February 8, 2008


Great little article, thanks spiderskull!
posted by intermod at 6:27 PM on February 8, 2008


snoktruix: "(I mean how do you figure that the code represents that 'after appropriate adjustment')."

Actually, playing with the graphs a bit more, it doesn't look like the chosen line minimises error according to absolute area difference. It looks like it minimises the maximum absolute error in the interval for any M, which is perhaps even more sensible. To convince yourself, go here and look at what happens to the maximum difference as you change the y intercept.

I'm fairly convinced now that this is how it was derived.
posted by DirtyCreature at 6:53 PM on February 8, 2008


There are two rules for optimization...

That's fine if the unoptimized code is fast enough. But if it isn't, then fuck the rules. Nobody wants your elegant mpeg codec that only runs at 29 fps.
posted by ryanrs at 7:57 PM on February 8, 2008


It's not guaranteed to work on every compiler

This is commercial code. It only has to work on their compiler, since they only sold it as binaries. They just open-sourced it for the heck of it, and I bet a lot of this optimization gets pulled out nowadays, since it's not needed.

Speaking of optimization, I remember the days when floating point was just crazy slower than integer math. Too slow to use, unless you had the math coprocessor. Lots of fun techniques to deal with that problem, like getting one decimal point by multiplying by 10 then dividing it out later. Or using precalced arrays to get sin/cos. Not as fun now that everything just runs fast, or it runs slow but it's in someone else's code so you can't help it.
posted by smackfu at 10:19 PM on February 8, 2008
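
For the record, the precalculated-table trick looks something like this sketch (the 256-entry table and the scale-by-256 fixed-point format are arbitrary choices of mine):

#include <stdio.h>
#include <math.h>

#define STEPS 256          /* one full turn = 256 table entries */
static short sins[STEPS];  /* sin values scaled by 256, i.e. 8.8 fixed point */

static void init_sins(void)
{
    for (int i = 0; i < STEPS; i++)
        sins[i] = (short)(sin(2.0 * 3.14159265358979 * i / STEPS) * 256.0);
}

int main(void)
{
    init_sins();
    /* inner loops index the table and shift, never touching a float */
    int angle = 32;  /* 32/256 of a turn = 45 degrees */
    printf("sin(45 deg) ~= %d/256 = %f\n", sins[angle], sins[angle]/256.0);
    return 0;
}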


I also had to walk to school. Kids today!
posted by smackfu at 10:20 PM on February 8, 2008


Speaking of optimization, I remember the days when floating point was just crazy slower than integer math. Too slow to use, unless you had the math coprocessor.

The 8087 coprocessor was kick-ass. Totally elegant stack-based instructions--like a tiny rpn calculator. In fact, the instructions were more comprehensive than the ones offered by the C math library, so I wound up writing composable wrappers for the assembly code so I wouldn't need to pull the numbers off the stack registers.

Lots of fun techniques to deal with that problem, like getting one decimal point by multiplying by 10 then dividing it out later. Or using precalced arrays to get sin/cos. Not as fun now that everything just runs fast, or it runs slow but it's in someone else's code so you can't help it.

Some of us, at least the ones doing fancy math on tiny embedded processors, still do this crazy shit.

Worship the fixed point math!
posted by cytherea at 2:34 AM on February 9, 2008


In homage to cortex's excellent explanation above, I hereby provide the OED's etymology of the excellent word mantissa:
< classical Latin mantissa, mantīsa, of doubtful meaning and unknown origin; said by the Latin grammarian Festus (as preserved in a later epitome) to be an Etruscan word meaning ‘makeweight’, elsewhere in classical Latin app. meaning ‘sauce’, and ‘fuss or profit’.
posted by languagehat at 8:25 AM on February 9, 2008 [1 favorite]


Ah, so this is all about the special sauce.
posted by alloneword at 11:37 AM on February 9, 2008 [1 favorite]


I remember Quake Three. That was Arena, wasn't it? I used to log in to that; called myself "Idiealot" cuz, well, I did die a lot. I don't understand about 90% of this thread, but for the rest of my life I'm gonna say this bit of code you guys are talking about is the real reason why I was no good at DeathMatch. That piece of code just had it in for me.
posted by ZachsMind at 1:11 PM on February 9, 2008


Heck, I still do the fixed-point stuff on desktop machines — not for speed, nowadays, but for numerical sanity. Floating point is good enough most of the time, but there are still a ton of cases where it'll just get you into trouble.

brandishes cane, chases floating-points off his lawn
posted by hattifattener at 2:49 PM on February 9, 2008


cortex: 6.02 x 10^23 thanks! That was *exactly* as much detail as I wanted / could handle.
posted by straight at 9:44 PM on February 9, 2008


I love that about mantissa. A little bit of Etruscan surviving despite everything.

Oh, I got a good chunk of the math too. But the float/int/float transform blows my mind.
posted by dhartung at 9:56 PM on February 9, 2008


This information is really old and obsolete. Modern Intel and AMD processors have a single instruction to do this properly now (it's in the SSE instruction set). All PowerPC processors since the G4 also have such an instruction.
posted by w0mbat at 10:17 PM on February 9, 2008
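
The modern one-instruction route w0mbat mentions, via the standard SSE intrinsic (x86 only, and still an approximation, good to about 12 bits):

#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    __m128 x = _mm_set_ss(4.0f);
    __m128 r = _mm_rsqrt_ss(x);  /* approximate 1/sqrt in one instruction */
    float out;
    _mm_store_ss(&out, r);
    printf("rsqrtss(4.0) ~= %f\n", out);  /* ~0.5 */
    return 0;
}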


For the pedantic, note that this (extremely clever!) code isn't legal ANSI C as written, and might not operate correctly with modern compilers: the problem is the line "x = *(float*)&i;", which breaks strict aliasing rules. See, e.g., here for more info. (The "legal" way to perform the type-punned dereference would be via a union of int/float types.)
posted by puppygalore at 3:32 PM on February 10, 2008
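
A sketch of the union rewrite puppygalore suggests, leaving the rest of the function intact:

float InvSqrt_union(float x)
{
    union { float f; int i; } u;   /* type punning via a union: legal in C */
    float xhalf = 0.5f*x;
    u.f = x;
    u.i = 0x5f3759df - (u.i >> 1); /* same bit-level step, no pointer casts */
    x = u.f;
    return x*(1.5f - xhalf*x*x);
}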



