Hash Tables
February 11, 2024 10:26 AM

About 70 years ago, an engineer at IBM named Hans Peter Luhn quietly changed the course of computer science ... in a 1953 internal IBM paper, he proposed a new technique for storing and retrieving information that is now built into just about all computational systems: the hash table. (article title: "Scientists Find Optimal Balance of Data Storage and Time"). A hash table is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values. Here's a video explaining hash tables.

Earliest available paper by W. W. Peterson from 1957: "Addressing for Random-Access Storage" (best available scan that I could find).
posted by ShooBoo (52 comments total) 43 users marked this as a favorite
 
Hash hash hooray! I would be interested in hearing learned Mefites' opinions on these new hash-table techniques as discussed in the first link.

But also, is 67 years "about 70"? Is Quanta just using one significant digit now? "We just celebrated my daughter's sixth birthday. She's about ten."
posted by Not A Thing at 10:54 AM on February 11 [5 favorites]


That's ok: those four years are 2/3 of her life, while the three-year discrepancy here is under 5% of the total. Nobody rounds off in a way that changes a value by more than half of the original, but a few percent off is usually fine for casual purposes.
posted by SaltySalticid at 11:10 AM on February 11 [10 favorites]


Hash tables are amazing. Amazon's DynamoDB is basically a giant hashtable and it has more or less sworn me off of relational databases for life, with a few exceptions.
posted by grumpybear69 at 11:20 AM on February 11 [4 favorites]


Oooh, this is Relevant to My Interests. (I really need a more fluid understanding of anti-collision algorithms, and also of how hash tables work under the hood.)

Thanks, ShooBoo!
posted by palmcorder_yajna at 11:31 AM on February 11


Man, as if I didn't feel stupid enough already.
posted by y2karl at 11:37 AM on February 11 [6 favorites]


I found an implementation on GitHub from a few years ago but it doesn't appear to be flying off the shelves. It seems to be designed for PMEM (non-volatile RAM but accessed like DRAM) and is optimized to reduce writes at the expense of query performance which isn't appropriate for all workloads. But the papers survey some good hashy techniques that are understandable by rock-bashing programmers like myself.
posted by credulous at 11:37 AM on February 11


Man, as if I didn't feel stupid enough already.

Don't feel stupid. I appreciate what they've accomplished but have no idea how they did it, and I write software for a living. This is extremely specific and nerdy and, while awesome, totally OK to not comprehend.
posted by grumpybear69 at 11:47 AM on February 11 [3 favorites]


Came here looking for something to roll joints on.
Leaving disappointed.
posted by chronkite at 11:53 AM on February 11 [15 favorites]


Since many of us are techies but not all of us are (and for the benefit of those who aren't, who may still be browsing this thread and wanting to understand why so many people care about this sort of thing...)

One of the fundamental compromises when building systems that deal with large collections of information is finding a way to represent that information that is efficient according to multiple criteria. You will therefore very frequently hear computing people talk about "data structures" when they are designing a system, and the selection of the right data structure for storing and accessing a collection of information can have a huge effect on the efficiency of any system that uses it. Different data structures are good at different things. For example, one data structure might be very efficient in the amount of work required to add a new piece of information, but not so efficient in the amount of work required to find that information once it is stored.

Hash tables have a combination of efficiencies that are very good for many uses: pretty efficient to add a new piece of information to the collection, and very fast to answer the question "does this item exist in this collection" and access it if it does. This comes at the cost of a couple of other things. They are not the most compact option, in terms of using the smallest possible amount of storage for a collection. One must be able to define a function (called the "hash function") that determines into which part of the table any new piece of information goes; the same function is used to look up a piece of information to see if it exists in the collection. And they are really bad at answering questions such as "if this particular item doesn't exist within the collection, what is the item in the collection that is closest in value?" without having to look at every piece of data in the collection.

But for those applications for which they are suited, hash tables are a pretty great compromise in terms of insert efficiency, storage efficiency, and retrieval efficiency.
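
To make that concrete, here's a rough sketch in Python, whose built-in dict is a hash table under the hood (the details are invented for illustration):

    users = {}                            # Python's dict is a hash table

    # Adding a new piece of information: cheap.
    users["ShooBoo"] = {"joined": 2004}

    # "Does this item exist in this collection?" -- a single hashed lookup.
    if "ShooBoo" in users:
        profile = users["ShooBoo"]

    # "What key is *closest* to this one?" -- the hash table can't help;
    # you have to examine every entry.
    closest = min(users, key=lambda name: abs(len(name) - len("ShooBoo")))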
posted by Nerd of the North at 12:13 PM on February 11 [30 favorites]


I learned about hash tables in the 1980s in a computer science class in college. For my first job out of college I implemented a hash table in PL/I for an application running on an IBM mainframe. A place where I was recently working had a proprietary database that implemented a cuckoo hash table, where inserting a new key may push an older key to a different location in the table.
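
If you're curious, a toy sketch of the cuckoo idea in Python (nothing like that database's actual code, just an illustration):

    SIZE = 11                        # toy capacity; a real table grows
    table1 = [None] * SIZE
    table2 = [None] * SIZE

    def h1(key):
        return hash(key) % SIZE

    def h2(key):
        return (hash(key) // SIZE) % SIZE

    def insert(key, value, max_kicks=SIZE):
        # Toy version: doesn't check whether the key is already present.
        entry = (key, value)
        for _ in range(max_kicks):
            for table, h in ((table1, h1), (table2, h2)):
                i = h(entry[0])
                if table[i] is None:
                    table[i] = entry
                    return True
                # Cuckoo step: evict the occupant and re-home it.
                table[i], entry = entry, table[i]
        return False    # too many kicks; a real table would rehash or grow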
posted by ShooBoo at 12:18 PM on February 11 [5 favorites]


My favorite Sunday brunch spot features a hash table - smothered, chunked, diced, any way you like.
posted by Greg_Ace at 12:31 PM on February 11 [5 favorites]


Not A Thing: "We just celebrated my daughter's sixth birthday. She's about ten."

It's in the 5 <= age < 10 hash-bucket, so that's okay for the hashing index to find.
posted by k3ninho at 12:47 PM on February 11 [8 favorites]


I have seen hashing done in FORTRAN-66 (a language which didn't really have a concept of text data, and stored a machine-dependent number of characters in a machine integer) for early computational linguistics. It wasn't pretty, but it did work.
posted by scruss at 12:58 PM on February 11 [1 favorite]


The headline is "Scientists Find Optimal Balance of Data Storage and Time," yet the article is not about being independently wealthy and having an extensive personal library
posted by Gerald Bostock at 12:59 PM on February 11 [2 favorites]


5 <= age < 10 hash-bucket

[computer science maitre d'hotel voice] You have contiguous keys hashing to the same value? [snobbishly turns up nose] Why, you might as well radix sort your data. Here, let me show you to our special tables for punch card users
posted by phooky at 1:20 PM on February 11 [10 favorites]


Don't you know who I am? Abe Froman, the B-tree king of Chicago
posted by credulous at 1:27 PM on February 11 [15 favorites]


very fast to answer the question "does this item exist in this collection"

To understand why it's fast (simplified explanation), consider a list of every Metafilter user name. To find whether a name is in use (or to find the user info associated with that name) you'd have to scan the entire list, one name at a time. On average, you'll need to look at n/2 names for a list of n items.

If you store the list in alphabetical order, you can reduce the scan to log(n) with binary search, but with hashing, there's a function f(name) that takes you straight to the right entry in a single step.
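
In Python terms (its set and dict types are hash tables), roughly:

    import bisect

    usernames = ["grumpybear69", "ShooBoo", "y2karl", "scruss"]

    "y2karl" in usernames        # a list: scans name by name, ~n/2 on average

    sorted_names = sorted(usernames)
    i = bisect.bisect_left(sorted_names, "y2karl")   # binary search: ~log(n) probes
    found = i < len(sorted_names) and sorted_names[i] == "y2karl"

    name_set = set(usernames)    # a hash table
    "y2karl" in name_set         # hash the name, jump straight to its bucket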
posted by CheeseDigestsAll at 1:59 PM on February 11 [1 favorite]


Even simpler:

There is a hotel with 10,000,000 rooms. When you get to the hotel, you give them your name and they put your name into a machine that spits out a number between 1 and 10,000,000. That is your room! The machine will always spit out the same number every time it gets your name. And that way, if someone wants to know what room you're in, they just give the machine your name and get the number. There is no need to even keep a record of who is in which room, since this machine is so reliable. Occasionally two or more people might get put in the same room. That's OK! It just means finding you consists of turning your name into a number and then looking around the room a bit. It is much, much faster than knocking on 10 million doors, or even looking you up in a room directory.

That's a hash table!
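
Or in Python, something like this (a much smaller hotel, same idea):

    ROOMS = 1_000       # scaled down from 10,000,000 for the sketch
    hotel = [[] for _ in range(ROOMS)]    # each room can hold a few guests

    def room_for(name):
        # The front-desk machine: same name in, same number out.
        return hash(name) % ROOMS

    def check_in(name):
        hotel[room_for(name)].append(name)

    def is_here(name):
        # Go straight to the one room and "look around a bit".
        return name in hotel[room_for(name)]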
posted by grumpybear69 at 2:30 PM on February 11 [26 favorites]


Hash tables are one of those things that seem like black magic until you understand them, and then it's like "Why didn't I think of that?"

Array lookups are very simple. You want the nth item in a list, you do "item = array[n]" and there's your item. Under the hood it's a simple memory lookup: Take the memory address of the start of the array, add an offset that is just the size of each item multiplied by n, and that's the memory address of the nth item.

But what if you want to do a lookup based on something other than an integer? For example, a string?

Simple: Turn the string into an integer! Then you can do an array lookup like usual. But how do you turn the string into an integer? With a hash function. Mash the bits of the string together into a randomish-looking integer that represents the string. For example: "hash = hash * 33 + str[i]". Then force the number to be within the size of the array (this is just a modulo operation: divide by the size of the array and take the remainder), and that's your n for "item = array[n]".

That's the gist of it. It's a bit more complicated because you can get collisions, where two different strings hash to the same index in the array, but there are multiple ways to handle that, such as storing a list of key/value pairs in place of a single item.
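
Spelled out as a toy Python version -- that multiply-by-33 hash plus the list-of-pairs trick for collisions:

    def str_hash(s):
        h = 5381                    # hash = hash * 33 + str[i]
        for ch in s:
            h = h * 33 + ord(ch)
        return h

    SIZE = 64
    table = [[] for _ in range(SIZE)]    # each slot holds key/value pairs

    def put(key, value):
        bucket = table[str_hash(key) % SIZE]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: replace
                return
        bucket.append((key, value))       # collision: chain onto the list

    def get(key):
        for k, v in table[str_hash(key) % SIZE]:
            if k == key:
                return v
        raise KeyError(key)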
posted by swr at 2:33 PM on February 11 [3 favorites]


I've never had to deal with a hash table, having always worked with higher-level code. But a cool idea. Always assumed it was behind all the indexing and stuff in SQL.

Have just started trying to write some code, after a few years of just hooking up components and setting properties. I have forgotten it all...
posted by Windopaene at 2:53 PM on February 11 [1 favorite]


This is timely for me. I'm working on a way to query xml-like documents and, in order to avoid re-doing work, I'm using hash tables to remember which parts of the documents satisfy which parts of the query. If your queries are of the kind that my querying thingy can handle, it can't take any longer than (number of terms in your query) times (number of elements in your document).

There is a standard hash table in the .NET collections library that I am using for now. However, in my application, it's never necessary to delete individual items from the table unless it's time to delete everything and start over, and it's likely that reads will outnumber writes by quite a bit. That means that I can use a hash table that prioritizes speedy reading over speedy writes, and I don't have to care about deletes at all. Taken all together, it may be worth it to use a Robin Hood hash table.
posted by a faded photo of their beloved at 3:03 PM on February 11 [1 favorite]


Ya ya, now go down the Skip List rathole.
posted by sammyo at 3:28 PM on February 11 [1 favorite]


Or if you're a botanist, go for the... Bloom Filter.
posted by sammyo at 3:33 PM on February 11 [2 favorites]


There is a hotel with 10,000,000 rooms.

Then Hilbert’s Bus Tours shows up, and you have to build a new wing?
posted by GenjiandProust at 3:33 PM on February 11 [6 favorites]


SpookyHash
posted by sammyo at 3:35 PM on February 11 [1 favorite]


I've always found hash tables to be a good way to contain data that can change. Like instead of a hotel room, you can track the number of comments that a metafilter user makes over the course of a day or whatever.

If the data is static, you can hold it in a regular array, even with two data points per element, like "user1, 50 comments", and then split the element apart when you use the data. But when the number of comments keeps changing, an array forces you into awkward operations: find the element you're looking for ("user1, 50 comments") and then rewrite it as "user1, 85 comments". With a hash, you just look up user1 and change the value in its key/value pair to 85 — no database table, no fiddly array surgery.

Or to use the hotel analogy, you have room1, user A checks out and user B checks in. Very easy to manipulate the data in the hash vs an array with one data point.
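
In Python, that's just (toy example):

    comment_counts = {}    # username -> comments so far today

    def record_comment(user):
        # Hash the name to find the slot, then bump the value in place.
        comment_counts[user] = comment_counts.get(user, 0) + 1

    for _ in range(85):
        record_comment("user1")    # comment_counts == {"user1": 85}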
posted by The_Vegetables at 4:26 PM on February 11


Hash tables aren't that complicated, but the explanations I'm seeing here assume a fair amount of programming/computer science understanding. It's basically similar to a flat filing system in a filing cabinet.

Say you have a bunch of records with people's name on them, and you want to file them. One way you could do it is alphabetically by the first letter of their name. In a hash table, that would be like your hashing function. It's just a way of saying, for any record that could go into the filing cabinet (hash table), here's the folder it would go under. So a record belonging to "Simon" would go under "S", and I'd look for Simon's record under the "S" folder, instead of having to rifle through all of the records to see which one belonged to "Simon".

Now, let's say you had a lot of S names, and even though you've narrowed down the number of records you're looking at, you're still looking through a bunch of records to find "Susan" or "Stella" or "Sam", since there's no real order to the stuff inside a folder (most hash tables just put new stuff at the end of the folder, conceptually). So you make an improved "hash function" that looks at the first two letters and files things based on that, and now you have a separate folder for "Si-" names, "Su-" names, and so forth, so you're back to looking at a smaller set of records inside each folder.

Now, pulling back to how computers do this: neither of these hash functions ("first letter or two of a name") is particularly good, since some letters (like S) have a lot of names under them, while others (like X) have far fewer. So ideally, you come up with a hash function that distributes the records much more evenly across your filing system, and that's something computers are really good at. In this case, it'd convert the name into a number, do some math on that number, and use the result to determine where to file the record. To retrieve the record, all it has to do is run that same conversion on the name to get the number and pull the record back out of the right folder. The beauty of this, and the reason it's so important to computer science, is that it makes storing and finding stored data very quick: the computer doesn't have to spend a bunch of time looking at each individual record to find the one it wants, it can just do this one operation to locate the record.
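
For the programmers following along, that first filing scheme might look like this in Python (with the deliberately lopsided first-letter "hash function"):

    cabinet = {}    # folder label -> records filed under it

    def folder_for(name):
        return name[0].upper()    # the (bad) hash function: first letter

    def file_record(name, record):
        cabinet.setdefault(folder_for(name), []).append((name, record))

    def find_record(name):
        # Jump straight to one folder; rifle through only that folder.
        for n, record in cabinet.get(folder_for(name), []):
            if n == name:
                return record
        return None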
posted by Aleyn at 5:32 PM on February 11 [8 favorites]


About 70 years ago, an engineer at IBM named Hans Peter Luhn quietly changed the course of computer science ... in a 1953 internal IBM paper...

But also, is 67 years "about 70"?

2024 - 1953 = 71

Seems pretty 70-ish.
posted by Reverend John at 5:51 PM on February 11 [2 favorites]


So in the design of this one, you get assigned a suite number based on your name. If there are no free rooms in your suite, you're moved to a second smaller hotel -- and you get *two* new suite numbers based on your name, and you get assigned whichever suite is less full. If both of those suites are full, you get to put up a tent in the parking lot, and your parking space is the same as your first suite number (they're really long parking spaces). If either hotel gets above a certain % capacity, the size of the hotel is doubled, and half of the suites are relocated.
posted by credulous at 6:07 PM on February 11 [1 favorite]


2024 - 1953 = 71

D'oh, good point, I fail at reading.
posted by Not A Thing at 6:16 PM on February 11 [1 favorite]


Hash tables are all you really need, too — it’s my understanding that the Lua programming language uses them exclusively (its arrays are secretly hash tables too).
posted by dbx at 7:23 PM on February 11 [1 favorite]


Always assumed it was behind all the indexing and stuff in SQL.

Behind the scenes these usually use a combination of B+-Trees. More distributed database systems like BigTable also use Bloom Filters and Log Structured Merge Trees.

As for the finding itself, I think the key thing people are leaving out of the ELI5 posts is that hash tables sorta work like library shelves. There's a ton of wasted space in them, because if you tried to get as close as possible to completely full, adding one more book would require you to take one out of the shelf it belongs to, and move that one to the next shelf down. Which would require you to move a book out of that shelf, etc., in order to maintain the correct sorting. You do not want to have to move a million books every time a patron returns one!

Hash tables deliberately add slack to the space so that you can easily add and remove items without having to do so much work to maintain sorting. Ironically, they rely on keeping things as unsorted as possible instead!
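
(The slack has a name, the load factor, and the usual move is to grow and rehash once the table gets too full. A sketch, with an illustrative threshold:)

    MAX_LOAD = 2 / 3    # grow once the table is about two-thirds full

    def maybe_grow(buckets, count):
        if count / len(buckets) <= MAX_LOAD:
            return buckets
        bigger = [[] for _ in range(2 * len(buckets))]
        for bucket in buckets:
            for key, value in bucket:
                # Every entry gets re-homed under the new modulus.
                bigger[hash(key) % len(bigger)].append((key, value))
        return bigger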

that they are really bad at answering questions such as "if this particular item doesn't exist within the collection, what is the item in the collection that is closest in value?" without having to look at every piece of data in the collection.

As I understand it, one of the fun use cases for hash tables is exactly this: a bloom filter is a hashtable where looking up k yields a boolean value. Pair this with a variety of distance functions and you can build a locality sensitive hash. But sure, for "find the book in the collection right next to the one I know of" seeking, they are bad.
posted by pwnguin at 7:56 PM on February 11


Hash tables are all you really need, too

I would gently disagree. Hash tables are great, whatever the implementation, but sometimes you want to write to or read from a contiguous chunk of memory for certain performance characteristics, which is what a proper array usually does.

I'm sure the Lua people are smart but there is a reason Lua is compared against (much faster) interpreted languages like Python or Javascript, as opposed to C, which is much faster than all of them.
posted by They sucked his brains out! at 7:57 PM on February 11 [1 favorite]


a bloom filter is a hashtable where looking up k yields a boolean value

Kind of. There are booleans and then there are probabilistic booleans. A Bloom filter is useful in, for instance, genomics work for k-mer searches, specifically where you might care more about a true negative ("definitely not in the set") than a false positive ("might be in the set, not sure").

A k-mer is a short stretch of DNA sequence. Organisms have unique patterns of these sequences. A Bloom filter lets you use the k-mers you have from a sequencing experiment to filter out sources of sequence data that could be contaminating your experiment.

False positives make their use a bit more complicated. One trick to reduce the false positive rate is to use multiple different hash functions that are all unlikely to collide. Or fall back to conventional hashing methods.

In addition to added speed in finding true negatives, another advantage of a Bloom filter is that the hash table takes up a lot less memory. This can be advantageous in situations when k gets larger than, say, 32 and other more canonical approaches start to take up exponentially larger amounts of memory.
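
A minimal sketch in Python, for the curious -- deriving the K hash functions by salting a single one, which is a standard trick (sizes here are arbitrary):

    import hashlib

    M = 1 << 20                  # bits in the filter
    K = 3                        # number of hash functions
    bits = bytearray(M // 8)

    def positions(item):
        for salt in range(K):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % M

    def add(item):
        for p in positions(item):
            bits[p // 8] |= 1 << (p % 8)

    def might_contain(item):
        # False: definitely not in the set. True: *probably* in the set.
        return all(bits[p // 8] & (1 << (p % 8)) for p in positions(item))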
posted by They sucked his brains out! at 8:11 PM on February 11 [2 favorites]


Lua uses a data structure it calls a "table", which is backed by both an array and a hash table. So you can use it as an array or as a hash table, and it's mostly fast either way.
posted by Monday, stony Monday at 8:11 PM on February 11


I've never had to deal with a hash table, having always worked with higher-level code. But a cool idea. Always assumed it was behind all the indexing and stuff in SQL.

When a language has a built-in dictionary/map/associative array that’s usually a hash table (or at least involves one).

The most common approach for DB indexing is tree-based but some DBs do offer a hash index type. Also they may build and use a hash for a join when there isn’t an index available.
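
The join case is neat: build a hash table on one side, then probe it while streaming the other. Toy Python version:

    users = {1: "alice", 2: "bob"}                      # build side (the hash table)
    comments = [(1, "first!"), (2, "hmm"), (1, "+1")]   # probe side

    joined = [(users[uid], text) for uid, text in comments if uid in users]
    # [("alice", "first!"), ("bob", "hmm"), ("alice", "+1")]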
posted by atoxyl at 9:01 PM on February 11


I will say this: Nerd of the North is a well earned user name. As are the rest of you brainiacs dumbing it down for us Oh, the innumeracy! types here but just sayin'...
In the words of the late Bob Marley --
Who the cap fit, let them wear it..
posted by y2karl at 9:54 PM on February 11 [1 favorite]


scruss: I have seen hashing done in FORTRAN-66 (a language which didn't really have a concept of text data, and stored a machine-dependent number of characters in a machine integer) for early computational linguistics. It wasn't pretty, but it did work.
I/we used to process DNA and protein sequences using FORTRAN in the early 90s. My code was lumpy and inefficient but I was confident that I really understood how it worked, so I was happy publishing the results - incl the "first investigation of the long range primary structure of a eukaryotic chromosome" (yeast Chr III).

Man, as if I didn't feel stupid enough already.
grumpybear69: Don't feel stupid. I appreciate what they've accomplished but have no idea how they did it, and I write software for a living. This is extremely specific and nerdy and, while awesome, totally OK to not comprehend.

In the 00s, I was back trying to make sense of genomic data. In the intervening 7 years, genomic datasets had grown 10^4-fold from yeast Chr III to the human genome. I gave up on FORTRAN and learned Perl because that was the local standard. The youngsters in the lab would tell me to use hashes, but my brain melted and I clanked through whatever using reg'lar arrays. If associative arrays were a black-box to my 'mind', I couldn't stand over the results; so I wasn't going to use them. I also stopped doing genomic analysis.
posted by BobTheScientist at 12:47 AM on February 12


There are booleans and then there are probabilistic booleans.

As a computer nerd, I know what you are a talking about, but I just have to make this joke:

enum megabool {
   TRUE,
   FALSE,
   NEITHER,
   BOTH,
   MAYBE,
   TRUEISH,
   FALSEISH,
   IT_DEPENDS,
   OSCILLATING,
   ITS_COMPLICATED,
   DOUBLE_TRUE,
   DOUBLE_FALSE,
   ...
posted by DreamerFi at 3:31 AM on February 12 [9 favorites]


I'm sure the Lua people are smart but there is a reason Lua is compared against (much faster) interpreted languages like Python or Javascript

Aside: if the statement that Python and javascript are much faster than lua is coming from looking at the Debian benchmarks game, note that they have removed LuaJIT from their comparisons. v8/Node is indeed really fast, but to my (potentially outdated) knowledge it is at best tied with LuaJIT.
posted by a faded photo of their beloved at 5:32 AM on February 12


Thank you Mr. Luhn. For aesthetic reasons, which is totally illogical, I've found certain data structures more pleasurable to work with than others. Hash tables and stacks being the top two.
posted by falsedmitri at 6:41 AM on February 12


Back when I did those horrible engineer interviews at Google, basically every system design question I asked could be answered by "use a hash table". It's a good way to divide the classical academic computer science education from practical hackery. CS classes are all about O() complexity and fancy sorts and B-trees and complex stuff that work nice in math land. Meanwhile, when building actual working code a big ol' array with a prime number modulus is pretty much going to handle most data you can throw at it.

The cheat with hash tables is you get around the usual O() complexity measures by adding the calculation of a hash function which is fast / ignorable in many circumstances. It also helps that most data fits in RAM so the non-locality of a hash table doesn't really matter.

Memcached (and now Redis) became even bigger arguments for using hash tables. Because you could get a really fast, easy to manage in memory hash table for free with simple open source software. (Arguably MySQL and now sqlite have done the same for B-Trees.)
posted by Nelson at 6:53 AM on February 12 [3 favorites]


Also the inventor of the 'Luhn number' or 'Luhn algorithm', used in a huge number of citizen- and consumer-facing applications, see the section marked 'Uses' below:

https://en.wikipedia.org/wiki/Luhn_algorithm
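
(A quick Python rendering of the checksum, per the Wikipedia description:)

    def luhn_valid(number):
        # Double every second digit from the right; if doubling yields a
        # two-digit number, subtract 9 (same as summing its digits).
        total = 0
        for i, ch in enumerate(reversed(number)):
            d = int(ch)
            if i % 2 == 1:
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    luhn_valid("79927398713")    # True -- the usual example number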
posted by northtwilight at 9:43 AM on February 12


The cheat with hash tables is you get around the usual O() complexity measures by adding the calculation of a hash function which is fast / ignorable in many circumstances. It also helps that most data fits in RAM so the non-locality of a hash table doesn't really matter.

On the other hand, somebody who does more low-level programming than me is probably complaining as we speak about CS grads not knowing when a small-N, cache-local loop over an array is faster than a hash lookup.

Arguably MySQL and now sqlite have done the same for B-Trees

It’s interesting how many companies seem to have built some kind of key-value, schemaless DB solution on top of MySQL at one point or another.
posted by atoxyl at 10:14 AM on February 12 [2 favorites]


If you're looking for random things from a chunk of memory, like a cache, and you have lots of memory to spare, then use a hash table. Then you don't need to worry too much about speed. Just watch out that you don't clobber data with inserts.

If you're retrieving some data from a chunk o'memory, and that chunk is either ordered in some way known to you, or you know what indices you're retrieving over, then an array is almost always more efficient. Especially so when storage is at a premium; ordinary hash tables use a fair bit more space than an array, there are no free lunches.

Other data structures can be useful for specialized purposes (B-trees for databases or file systems, tries for prefix searches, etc.) but generally people can use a thoughtfully-structured array or get away with just throwing a hash table at a problem most of the time.
posted by They sucked his brains out! at 12:28 PM on February 12


(its arrays are secretly hash tables too)

awk's arrays are all associative. It's a lovely language, and can be screamingly fast for text work. More people should know about it
posted by scruss at 6:41 PM on February 12 [1 favorite]


Or, "about 70 years ago, Luhn reinvented an idea that librarians had been using since the 19th century." (Yes, I know they're not quite the same, but saying a Cutter number is a hash is the easiest way to explain it to programmers.)
posted by djfiander at 8:46 AM on February 13 [1 favorite]


I suspect there’s a lot of prior art for hash tables because, as somebody else said, it’s a filing system. I think there are some claims to independent invention of some variations. I think the idea of the output of the mapping function being uniformly, apparently randomly distributed is pretty core to hash tables as we know them now, though, and I would imagine this property to be more desirable in computing than in libraries.
posted by atoxyl at 10:03 AM on February 13


Although non-adjacency of related data does have a downside for computers, too, because of caching.
posted by atoxyl at 10:09 AM on February 13


The Pick database/OS is based on open-bucket hash tables, as are all of its modern progeny.
https://en.wikipedia.org/wiki/Pick_operating_system

https://www.rocketsoftware.com/products/rocket-multivalue-application-development-platform/rocket-universe
posted by cmdnc0 at 11:28 AM on February 13 [1 favorite]


Poor Dick Pick died at 56 in 1994 but was fortunate in a way that he wasn’t around for the smartphone era.
posted by atoxyl at 11:40 AM on February 13 [5 favorites]


I owe part of my current livelihood to on-going work with the hash-table-based UniVerse database and will probably do so to some extent until retirement! However, I wasn’t aware of the history of this data structure until now so my thanks to the OP.
posted by Lesser Spotted Potoroo at 10:16 PM on February 14




This thread has been archived and is closed to new comments