“An Efficient Representation for Sparse Sets”
March 15, 2008 8:09 PM

An Efficient Representation for Sparse Sets. Or, Using Uninitialized Memory for Fun and Profit

Yes, this may be of interest to only a subset of Mefites, but so are posts about niche bands. And even non-programmers should be able to appreciate how clever this is.
posted by orthogonality (82 comments total) 35 users marked this as a favorite
 
A bit of background for non-programmers: a computer's memory is reused, over and over. To save time, when that memory is (re-)used, any previous value at that memory location is not changed.

So generally, before a program uses memory, it should "initialize" that memory, setting it to some known value. Usually, that's not a problem.

But if the memory is going to be used to record values over time, the values to set will not be known at initialization time. And if the amount of memory used to record those values is very large, it can be time-expensive to set all those values to, say, zero.

The article addresses a special case where not only is the memory area large, but also most of the values will never be set to something other than zero. An example might be to note which people out of 100,000 are left-handed red head Zoroastrian midget lion-tamers. (Or which Mefites find this article interesting?)

The article shows a clever way to avoid setting 100,000 "NOT a left-handed red head Zoroastrian midget lion-tamer" flags, while still correctly noting the few people who are left-handed red head Zoroastrian midget lion-tamers.
posted by orthogonality at 8:11 PM on March 15, 2008
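
For anyone who wants to see the trick concretely, here is a minimal C sketch of the structure the article describes. The names (ss_new, ss_member, ss_insert) and the allocation details are invented here, not taken from the article, and note that reading indeterminate malloc'd memory is formally undefined behavior in modern C, which is part of what makes the trick cheeky:

#include <stdlib.h>

/* The universe is {0, ..., max-1}; callers guarantee v < max. */
struct sparse_set {
    unsigned *sparse;  /* sparse[v] = where v sits in dense; never zeroed */
    unsigned *dense;   /* dense[0..n-1] = the members, in insertion order */
    unsigned n;        /* member count: the ONLY thing we initialize */
    unsigned max;
};

struct sparse_set *ss_new(unsigned max)
{
    struct sparse_set *s = malloc(sizeof *s);
    s->sparse = malloc(max * sizeof *s->sparse);  /* deliberately not zeroed */
    s->dense  = malloc(max * sizeof *s->dense);
    s->n = 0;
    s->max = max;
    return s;
}

int ss_member(const struct sparse_set *s, unsigned v)
{
    /* sparse[v] may be garbage: the first test keeps garbage from indexing
       past the live part of dense, the second confirms that dense really
       points back at v. */
    return s->sparse[v] < s->n && s->dense[s->sparse[v]] == v;
}

void ss_insert(struct sparse_set *s, unsigned v)
{
    if (!ss_member(s, v)) {
        s->dense[s->n] = v;   /* record the member... */
        s->sparse[v] = s->n;  /* ...and point sparse back at its slot */
        s->n++;
    }
}

The only initialization, ever, is n = 0; everything else stays garbage until the two-way handshake between sparse and dense vouches for it.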


*immediate coma*
posted by geodave at 8:25 PM on March 15, 2008 [3 favorites]


Too many arrows.
posted by Jeff_Larson at 8:30 PM on March 15, 2008


I like how you lead into the FPP with a defense of its validity and an assurance that some MeFite, somewhere, will enjoy this.
posted by wfrgms at 8:34 PM on March 15, 2008 [1 favorite]


well I liked it.
posted by cowbellemoo at 8:37 PM on March 15, 2008


Russ Cox Translated:

1. Russ Cox, Using Uninitialized Memory for Fun and Profit, and The Oppressed: Re-marking Univocal Community

2. The Silence of Capitalism and the Phallic in Russ Cox's Using Uninitialized Memory for Fun and Profit

3. The Labial (Re)producing The Other: Russ Cox, Using Uninitialized Memory for Fun and Profit and Collusion

4. Voicing the Deviant Legacy in Russ Cox: Using Uninitialized Memory for Fun and Profit and Subtext

5. Theory and Authority in Using Uninitialized Memory for Fun and Profit: Russ Cox Be-guiling Alien Materialism

6. Gendering, Objectifying, Seducing: Mythos in Russ Cox and the Lesbian Degeneration of Illness in Using Uninitialized Memory for Fun and Profit

7. Reclamation and Inscription in Using Uninitialized Memory for Fun and Profit: Russ Cox Mourning Homosocial Gentility

8. Complicating Community: Resistant Addiction in Russ Cox's Using Uninitialized Memory for Fun and Profit

9. Hybridizing the Convicted Opposition in Russ Cox: Using Uninitialized Memory for Fun and Profit and Corporeality

10. Representing the Exploited Rage in Russ Cox: Using Uninitialized Memory for Fun and Profit and Corporeality


(tomorrow...orthogonality)
posted by mrmojoflying at 8:38 PM on March 15, 2008 [12 favorites]


Something we don't think about much, now that computer programming is an "engineering" discipline, is how amazingly complex computers are. It's only been the steady accretion of years and years of research and work at a very low level that makes *all this* possible. There's layer upon layer of old code and ideas between my keyboard and your screen. The pieces that modern programmers fit together came from somewhere. When the wizards of the mid-20th century were inventing the digital computer, it wasn't just a technological advance. A science was born: not quite math or logic, not quite linguistics or philosophy, but something new and exciting. And also confusing as hell.
posted by tew at 8:46 PM on March 15, 2008


You can never have too many arrows!

I'm a category theoretician, so maybe I'm biased.
posted by ErWenn at 9:00 PM on March 15, 2008 [2 favorites]


The memory use is prohibitive.
posted by tkolar at 9:05 PM on March 15, 2008


(although I should admit that I find it a cute trick)
posted by tkolar at 9:23 PM on March 15, 2008


I wish I could use that uninitialized memory to cope pipe without a jig.
posted by mendel at 9:24 PM on March 15, 2008 [5 favorites]


I dove into this assuming I'd love it, but within three paragraphs I went blind.

I am getting too old to think
posted by davejay at 9:30 PM on March 15, 2008 [1 favorite]


tkolar writes "The memory use is prohibitive."

It's explicitly trading space for time. But yeah, it's 2 * n * sizeof(int*) vs. n / CHAR_BIT for a bitvector.
posted by orthogonality at 9:31 PM on March 15, 2008
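
To put rough numbers on that (my arithmetic, assuming 100,000 items and 32-bit indices): the two arrays cost 2 * 100,000 * 4 bytes = 800 KB, while the bitvector needs 100,000 / 8 = 12,500 bytes, a factor of 64. Pointer-sized indices on a 64-bit machine would double the gap.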


Okay, that is a very cute solution. tew's right; this is good, old-fashioned, pre-"process" CS. It's also a great example of why programmers should internalize time analysis; if you don't, you'll never really understand your constraints or be able to select an appropriate solution.
posted by phooky at 9:35 PM on March 15, 2008


Hmmmm. Yes, yes I... understand. Completely. I'm not pretending or anything.
posted by Justinian at 9:39 PM on March 15, 2008


Layman's attempt to digest: isn't this just a description of indices?
posted by msalt at 9:44 PM on March 15, 2008


this is so not for mefi.
posted by bonaldi at 9:51 PM on March 15, 2008


This entire thread already existed when mathowie bought the server... all we needed was an internal link to it.
posted by chips ahoy at 9:54 PM on March 15, 2008 [9 favorites]


Layman's attempt to digest: isn't this just a description of indices?

The unobvious / unusual aspect of the technique is that you don't need to bother initializing the values in the sparse array.
posted by Armitage Shanks at 9:56 PM on March 15, 2008


I've wondered for a while how this works, but haven't had a reason to look it up. Spiffy.

This trick only trades space for time if you expect to populate most of the data set. Imagine you have some giant 10^6 x 10^6 matrix where most rows have three or five or a dozen nonzero columns. With this trick you only need to allocate space for 10^7 or 10^8 data points, rather than 10^12. That might make the difference between a tractable and an intractable problem.
posted by fantabulous timewaster at 9:57 PM on March 15, 2008 [1 favorite]


msalt writes "Layman's attempt to digest: isn't this just a description of indices?"

No, I don't think so. In an index (which would generally be some sort of tree structure, allowing a fast binary search), all the index nodes would point to actual data. To discover that something isn't in the index, we'd go to where the pointer should be, and find no pointer to it. No pointer, no data.

Here we're actually pointing (in the "sparse" array) to something. But since sparse is uninitialized, it could be pointing anywhere in the "dense" array. We go to that anywhere, and look to see what is there in the "dense" array. If what's there is not the value we expect, it's not "there". A pointer, but the wrong (inconsistent) data.

Again, a very specialized structure for a very specialized need: we need very fast speed and no initialization, and we're willing to pay, say, 16-64 times the space to get that speed.
posted by orthogonality at 10:02 PM on March 15, 2008


This is cool. I see what they do there. Now I have to go back and finish writing this class that will use 90% of your processor and 2 gigabytes of ram to show your favorite youtube videos.
posted by Dr. Curare at 10:02 PM on March 15, 2008


But since sparse is uninitialized, it could be pointing anywhere in the "dense" array. We go to that anywhere, and look to see what is there in the "dense" array. If what's there is not the value we expect, it's not "there". A pointer, but the wrong (inconsistent) data.

Sort of. There's also the nifty 'size' indicator so it doesn't actually have to go to memory if the sparse array is pointing somewhere strange.

I'm a graduate student in computational science and a big thing we deal with is compressed sparse matrices, which are a little different from sparse sets. This seems like another neat tool to have in the box though and I hadn't heard about it until I saw it linked in reddit the other day.

Good post orthogonality! Now go write me an SQL query!
posted by onalark at 10:18 PM on March 15, 2008 [1 favorite]
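
For contrast, compressed sparse row (CSR) is probably the sort of format onalark means. A sketch with invented names; unlike the article's trick, every array here must be fully initialized:

/* An m x n matrix with nnz nonzero entries. */
struct csr {
    int m, n, nnz;
    int    *row_start;  /* length m+1: row i's entries sit in [row_start[i], row_start[i+1]) */
    int    *col;        /* length nnz: column index of each nonzero */
    double *val;        /* length nnz: the nonzero values */
};

/* y = A*x, the workhorse operation CSR is laid out for */
void csr_matvec(const struct csr *A, const double *x, double *y)
{
    for (int i = 0; i < A->m; i++) {
        y[i] = 0.0;
        for (int k = A->row_start[i]; k < A->row_start[i + 1]; k++)
            y[i] += A->val[k] * x[A->col[k]];
    }
}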


Now go write me an SQL query!

SELECT * FROM my_ass WHERE 'justification' LIKE 'plausible';
posted by eriko at 10:21 PM on March 15, 2008


Now go write me an SQL query!

SELECT * FROM my_ass WHERE 'justification' LIKE 'plausible';
posted by eriko at 10:21 PM on March 15, 2008


Oops. Should have used a UNIQUE there.
posted by eriko at 10:22 PM on March 15, 2008 [13 favorites]


This trick only trades space for time if you expect to populate most of the data set. Imagine you have some giant 10^6 x 10^6 matrix where most rows have three or five or a dozen nonzero columns. With this trick you only need to allocate space for 10^7 or 10^8 data points, rather than 10^12. That might make the difference between a tractable and an intractable problem.

Yeah, but you could also use a binary tree or hashtable, and you'd only lose log2(10^12) time, which is a lot but it doesn't make things impossible.
posted by delmoi at 10:36 PM on March 15, 2008
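
For scale: log2(10^12) = 12 * log2(10) ≈ 39.9, so a balanced tree over a trillion possible keys costs about 40 comparisons per lookup, versus one guarded array probe (plus a confirming read) for the article's structure.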


So, running this operation an infinite number of times, at some point couldn't it occur that dense[n] == sparse[i], just through random chance? Though what are the odds of that? I suppose it depends on the size of the integers being stored in the array positions. 1/[integer size] * 1/[integer size], at least. Which means it wouldn't happen often, granted. But when it does happen -- KAZAM! It'll erase your TiVo.
posted by barnacles at 10:40 PM on March 15, 2008 [1 favorite]


Pretty cool, thanks. It reminds me of Appel and Li's virtual memory tricks.
posted by equalpants at 10:40 PM on March 15, 2008


at some point couldn't it occur that dense[n] == sparse[i], just through random chance?

No, because sparse *is* initialized.
posted by tkolar at 10:45 PM on March 15, 2008


tkolar,

sparse is not initialized, that's the whole point. sparse could point at anything. *dense*, by contrast, is initialized. dense[0] through dense[n] are guaranteed to contain legitimate values. If the value at dense[sparse[i]] is indeed i, there is no bug.

Provided n is unsigned.

If n is a signed int, all hell breaks loose :)
posted by effugas at 10:50 PM on March 15, 2008


sparse is not initialized, that's the whole point. sparse could point at anything. *dense*, by contrast, is initialized.

Doh! That's what I meant.

Tired now. Bed.
posted by tkolar at 10:53 PM on March 15, 2008


Ah, okay, very cool. Carry on!
posted by barnacles at 10:56 PM on March 15, 2008


barnacles writes "So, running this operation an infinite number of times, at some point couldn't it occur that dense[n] == sparse[i], just through random chance?"

Yes, it could. But we're not checking for that. We're checking for dense[ sparse[ i ] ] == 1.

And we only do that check if sparse[i] < n. If dense[ sparse[ i ] ] == i, our element of sparse is one that we actually set to point at that element of dense. Otherwise, it's some element of sparse that hasn't been initialized and only coincidentally points at that element of dense, which means i is not a value in our set.
posted by orthogonality at 11:07 PM on March 15, 2008


Er, typo. "we're checking for dense[ sparse[ i ] ] == 1." --> We're checking for dense[ sparse[ i ] ] == i.
posted by orthogonality at 11:08 PM on March 15, 2008
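
To make the corrected test concrete with the sketch upthread (values invented): after inserting 5 and then 9, n = 2, dense begins {5, 9}, sparse[5] = 0, and sparse[9] = 1. Now probe 7: sparse[7] is garbage, say 1. The guard 1 < 2 passes, but dense[1] is 9, not 7, so the test correctly reports 7 absent. A pointer, but the wrong (inconsistent) data.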


Cool. I used to work 4 doors down the hall from Preston. I'll have to point him over here.

To put this in a bit more context, this is the sort of operation that becomes critically important on a machine with randomized global memory. For example, the MTA (multithreaded architecture), developed by Tera Computer Company and currently incorporated into certain Cray Inc. platforms.
posted by Araucaria at 11:12 PM on March 15, 2008


I saw this before on reddit. Of course, I love this sort of thing, but it's also sort of impractical. But that I am forbid to tell the secrets of my prison-house, I'd give you the gory details; suffice it to say that these days the limiting factor for computations is nearly always RAM, and so this sort of thing is almost never needed.
posted by lupus_yonderboy at 11:14 PM on March 15, 2008


(Not that sparse matrix tricks aren't totally useful at times!)
posted by lupus_yonderboy at 11:14 PM on March 15, 2008


As a note, this is one of the best blogs I've ever seen posted to the blue. There are some incredible posts here, such as:

Buridan's Ass
Play Tic-Tac-Toe with Knuth
Leaping Years and Drawing Lines
Crabs, The Bitmap Terror!

[this is good]
posted by effugas at 11:26 PM on March 15, 2008 [2 favorites]


To me, this seems a lot like a hash table, using sparse[] as the hash function (with the odd but convenient property that you can modify the hash function's result for specific keys in order to make it a perfect hash for your data). Just as in a hash table, you deal with hash collisions by looking at the entry in dense[] and checking whether it's the entry you expected. Interesting post.
posted by hattifattener at 11:40 PM on March 15, 2008


Layman's attempt to digest: isn't this just a description of indices

Optimized, yes, but still the same.
posted by sourwookie at 11:53 PM on March 15, 2008


Yeah, this blog is really, really good braincandy. Thanks!
posted by amery at 11:53 PM on March 15, 2008


...and I'm in over my head. I get it in glimpses, but....
posted by sourwookie at 11:54 PM on March 15, 2008


Flagged: no left-handed red head Zoroastrian midget lion-tamers in FPP; disappointed.
posted by not_on_display at 11:55 PM on March 15, 2008


514?! I know a ho in that area code!
posted by sourwookie at 11:58 PM on March 15, 2008 [1 favorite]


Tries are pretty cool too.
posted by Hello Dad, I'm in Jail at 12:24 AM on March 16, 2008 [1 favorite]


I was intending to get around to posting an AskMetafilter question looking to find some good programming weblogs, and here one is. Thanks, ortho, you read my mind.
posted by cmonkey at 12:56 AM on March 16, 2008


thanks for sharing, orthogonality. this was neat.
posted by episteborg at 1:26 AM on March 16, 2008


Wait, what?
posted by Fupped Duck at 1:44 AM on March 16, 2008


Thanks for reminding me why I like CS so much!
posted by Anything at 2:11 AM on March 16, 2008


So, running this operation an infinite number of times, at some point couldn't it occur that dense[n] == sparse[i], just through random chance?

It will happen if you insert numbers in order, beginning with 0. Then i is stored in dense[i]. When you insert i, sparse[i] is set to the index of dense that contains the value i, which in this case will be i. Then sparse=dense on all of the valid entries.

Sooooo...I guess, if you're drawing k integers uniformly at random from {0,1,...,m} and inserting them into your set, then the probability of having sparse=dense after drawing all k is 1/choose(m,k), where choose(m,k) is the number of ways that you can draw k elements from {0,...,m} without replacement.
posted by agent at 2:14 AM on March 16, 2008


What is really amusing is that most of this thread occurred late on a Saturday night.
posted by ardgedee at 2:21 AM on March 16, 2008 [2 favorites]


Ach, I left out a term in the denominator above. That should be 1/(k! choose(m,k)), since 1/choose(m,k) is the probability of choosing the set {0,1,...,k} out of {0,1,...,m}, and 1/k! is the probability of choosing the exact sequence (0,1,...,k) out of the set {0,1,...,k}.
posted by agent at 2:25 AM on March 16, 2008


Metafilter: a big thing we deal with is compressed sparse matrices.
posted by rdone at 6:34 AM on March 16, 2008


Thanks for this, orthogonality. I have to solve just this sort of problem involving large sparse matrices very, very soon, and it's not exactly something that was covered in my undergraduate biochemistry classes.
posted by grouse at 7:16 AM on March 16, 2008


What is really amusing is that most of this thread occurred late on a Saturday night.

We all had to do *something* while compiling.
posted by tkolar at 8:38 AM on March 16, 2008 [1 favorite]


That was cool, thanks! The big a-ha moment for me was figuring out why testing for membership requires checking both sparse[i] < n and dense[sparse[i]] == i. Why not just check sparse? Because, duh, it's uninitialized and there might be a 0 or something in an unset slot. I had thought dense was just to optimize or something, didn't realize it was necessary for correctness.
posted by Nelson at 9:06 AM on March 16, 2008


I saw this briefly on reddit too but all the arrows distracted me. Now I'm going to have to actually read the thing, aren't I?

More niche CS and Math posts!
posted by Skorgu at 9:45 AM on March 16, 2008


When the wizards of the mid-20th century were inventing the digital computer, it wasn't just a technological advance. A science was born: not quite math or logic, not quite linguistics or philosophy, but something new and exciting.

You mean computability theory? What always amazed me about Turing is that he envisioned computers in their theoretical fullness before there was any well-known physical analog (except perhaps the human brain). Personally, I'd say that computability theory IS specifically mathematics and logic, but it has important implications for linguistics and philosophy. Not to be picky...
posted by Edgewise at 11:16 AM on March 16, 2008


I think he means computer science, Edgewise.
posted by grouse at 11:19 AM on March 16, 2008


What always amazed me about Turing is that he envisioned computers in their theoretical fullness before there was any well-known physical analog[...]

Not entirely true; Charles Babbage put together designs for computing machines well before Turing. Not to belittle Turing's work, since it (in parallel with similar work by Alonzo Church) was fundamental in establishing the bounds on what computers can do.
posted by agent at 11:36 AM on March 16, 2008


I wouldn't disregard von Neumann's contributions to the idea of a modern computing architecture.
posted by onalark at 12:31 PM on March 16, 2008


Hmmm. This would have been good to know when I recently interviewed for a C++ position and got a bit-twiddling question: find the unique letters in a string. This would have been spot-on for answering that.
posted by A-Train at 2:27 PM on March 16, 2008


A-Train,

I generally just use the language's hash table implementation for stuff like that -- for each character, if the key doesn't exist, create the key in the table. Then list out all the keys.
posted by effugas at 3:35 PM on March 16, 2008


Hash table? You can store that in 52 bits with a single pass algorithm. 26 bits if you're willing to waste the CPU.

(Sorry, I'm stuck with the monstrously inefficient Cocoa framework right now and it's getting on my nerves.)
posted by tkolar at 3:47 PM on March 16, 2008


effugas:
[use] language's hash table
Yeah, well, my first code example did exactly that and they were not impressed. I gave an alternate solution using a sparse array (but not a dense one, too) and I think they were similarly unimpressed. They were serious bit twiddlers: they considered the initialization time of a 256 character array worth eliminating. A solution that ran in constant time and used uninitialized stack space would have impressed them.
posted by A-Train at 4:13 PM on March 16, 2008


tkolar: I can't see how you can do it in 26 bits in a single pass. In fact I'm not sure I can come up with an O(n) way of doing it using only 26 bits of extra storage.
posted by aspo at 4:18 PM on March 16, 2008


(Well ok you can do something lame like "do 2 passes, first pass = first half of the alphabet, second pass = second half of the alphabet", but that's, well, lame)
posted by aspo at 4:53 PM on March 16, 2008


For ANSI character encodings, using a 26-bit hash table with a perfect hash function:
v = [26 zeroed bits]
for letter in string:
    v[tolower(letter) - 'a'] = 1

If they were asking you about finding unique Unicode characters, I would just stab them in the eyes, burn down the building, and be done with it.
posted by enkiwa at 5:32 PM on March 16, 2008
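
In C, that one-pass presence mask fits in a single 32-bit word. A sketch, assuming ASCII input and skipping non-letters:

#include <ctype.h>
#include <stdint.h>

/* bit i of the result is set iff letter 'a'+i appears in s */
uint32_t letters_present(const char *s)
{
    uint32_t seen = 0;
    for (; *s; s++)
        if (isalpha((unsigned char)*s))
            seen |= 1u << (tolower((unsigned char)*s) - 'a');
    return seen;
}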


Oh I guess I had a different def of unique. I thought it meant "there is one and only one of this letter in the string" not "there is at least one".
posted by aspo at 5:36 PM on March 16, 2008


Ah. I misunderstood. n and 52 bits, or n^2 and 26 bits, to find uniques. First way, as above, but you keep two arrays-that-are-perfect-hashtables: "seen once" and "seen more than once". Second way, you just check all remaining indices to see if the character you're on is around more than once, and mark appropriately.
posted by enkiwa at 5:44 PM on March 16, 2008
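
And tkolar's 52-bit, single-pass variant might look like this (again a sketch, same assumptions, with "unique" meaning "appears exactly once"):

#include <ctype.h>
#include <stdint.h>

/* two 26-bit masks = 52 bits of state: "seen once" and "seen again" */
uint32_t unique_letters(const char *s)
{
    uint32_t once = 0, more = 0;
    for (; *s; s++) {
        if (!isalpha((unsigned char)*s))
            continue;
        uint32_t bit = 1u << (tolower((unsigned char)*s) - 'a');
        if (once & bit)
            more |= bit;  /* second sighting: demote from unique */
        once |= bit;
    }
    return once & ~more;  /* bits for letters seen exactly once */
}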


Why is this better than a linked list?
posted by snoktruix at 6:17 PM on March 16, 2008


snoktruix,

Because you need to traverse the entire linked list every time you want to see if something is a member of the set. For this solution, you just take a look at the index to see if the value's been set.

A-Train,

The funny part is all that bit-twiddling logic is coming back with parallel computing, because the strength of each individual processor isn't much (meaning also any savings from the algorithm get multiplied by the number of processors).
posted by effugas at 6:22 PM on March 16, 2008
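
In code, the difference effugas describes (a sketch; the set half reuses the invented names from upthread):

/* linked list: membership is an O(n) walk */
struct node { unsigned v; struct node *next; };

int list_member(const struct node *head, unsigned v)
{
    for (; head; head = head->next)
        if (head->v == v)
            return 1;
    return 0;
}

/* sparse set upthread: O(1), two loads and two compares:
   return s->sparse[v] < s->n && s->dense[s->sparse[v]] == v; */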


OK, yeah, I guess it's faster if you need to check if an element is present. It seems restricted to storing values that can be used as indices into some allocated array, so no good for pointers, floating point numbers, long integers.
posted by snoktruix at 9:01 PM on March 16, 2008


snoktruix: Welcome to hash functions. You would have to tweak things a bit, but this implementation makes sense for sparse hashtables. Especially sparse hashtables that get duplicated or cleaned out often. Plus you have an easy way to iterate through the items in the table. Double win.
posted by aspo at 9:26 PM on March 16, 2008 [1 favorite]
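
The "cleaned out often" part is where this structure shines. Using the invented names from the sketch upthread, clearing is a single store and iteration touches only live entries:

#include <stdio.h>

/* struct sparse_set as defined in the sketch upthread */

/* O(1) clear: the garbage left behind in sparse and dense is exactly the
   uninitialized state the membership test already tolerates */
void ss_clear(struct sparse_set *s)
{
    s->n = 0;
}

/* iterate members in insertion order, never touching dead slots */
void ss_print(const struct sparse_set *s)
{
    for (unsigned k = 0; k < s->n; k++)
        printf("%u\n", s->dense[k]);
}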


snoktruix: Nah, you can store pointers or floats or whatever you want in either sparse or dense, as long as you also store an index. The elements of each can be any fixed-size structure.
posted by equalpants at 9:27 PM on March 16, 2008


equalpants: really? What if I store pointers, and I want to know if element 0x1d7c40 is in the set? With this method you'd need an array of length 2^32, whose 0x1d7c40-th element contains the index into dense. As aspo mentions, maybe you could hash the pointers into an index into sparse first, but you have to deal with collisions.
posted by snoktruix at 10:55 PM on March 16, 2008


Sorry, by "storing pointers" I thought you meant having an array containing pointers, not having an array indexed by pointers.
posted by equalpants at 11:07 PM on March 16, 2008


I thought that was the point of this scheme. What you want to store is used as an index into sparse. If I want to store pointers, I have to use them as an index, so I need to allocate an array big enough to hold all possible pointers, in the original scheme.
posted by snoktruix at 11:09 PM on March 16, 2008


aspo wrote...
I can't see how you can do it in 26 bits in a single pass. In fact I'm not sure I can come up with an O(n) way of doing it using only 26 bits of extra storage.

Sorry, should have been more clear. I was saying that I could do it in 52 bits in O(n) but if I were willing to spend a lot more processing (say, O(n^2)) I could do it in 26 bits.

However, now you've got me curious and I need to think about doing it with 26 bits in O(n).
posted by tkolar at 11:25 PM on March 16, 2008


I thought that was the point of this scheme. What you want to store is used as an index into sparse.

No, it's more general than that--the interface is the same as any array. The example given uses the same values as both the index and the stored value, but that's not required. Sparse could be indexed by employee number, for example, and each entry of dense could contain the full employee record, including the employee number (i.e. index into sparse).

BTW, the paper I linked to above has an interesting uninitialized-memory sparse array implementation, if you have sufficient control over virtual memory: you just "allocate" enough address space to hold the whole thing, but you don't actually give it any pages until values are inserted. When you test the membership of a non-member, you just get a page fault. Essentially, you're making the MMU assume part of the functionality of this method's "dense" array. There's a lot of really cool stuff in that paper...
posted by equalpants at 11:40 PM on March 16, 2008
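
The reservation half of that idea, in rough POSIX terms. This is my sketch, not the paper's code, and the SIGSEGV handler that turns a fault into "not a member" is elided:

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

/* "Allocate" a huge range of address space with no physical pages behind
   it; any touch of an uncommitted page faults. */
void *reserve(size_t len)
{
    return mmap(NULL, len, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Before writing slot i of an int table, commit just the page holding it. */
void commit_slot(int *table, size_t i)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t page = (uintptr_t)&table[i] & ~(pagesz - 1);
    mprotect((void *)page, pagesz, PROT_READ | PROT_WRITE);
}

/* A read of a slot on a never-committed page raises SIGSEGV, which a
   handler can translate into "not a member". */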


Mythos as Gentility: Dismembering Existential Opposition in orthogonality's “An Efficient Representation for Sparse Sets”
posted by mrmojoflying at 12:42 PM on March 17, 2008

