Encrypted database queries
December 20, 2011 2:08 AM   Subscribe

CryptDB executes database queries over encrypted data without ever decrypting it.

CryptDB uses a substantial fragment of SQL for the query language and penalizing query times by only about 15-26%. (pdfs : journal, tech report, applications)

“The insight we had, the cool idea, is that SQL queries in a database are composed of relatively few types of operations: equal to, less than, summing up, sorting. .. For each operation, we were able to find an encryption scheme that is quite efficient at computing on encrypted data.”

Initial applications would resemble securing a datacenter against hackers and insider threats, outsourcing database hosting to an untrusted cloud provider, such as Amazon or IBM, or rigorously enforcing privacy obligations in an application itself. If however an application permitted query construction on a user's machine, then potentially even the application's provider need not be trusted.

Craig Gentry has another approach that probably leaks less user data to sophisticated attackers, but runs prohibitively slowly.

See previous thread on Homomorphic Encryption.
posted by jeffburdges (36 comments total) 33 users marked this as a favorite
 
CryptDB has its limits, the MIT researchers warn–no square roots, for one example.

Why would this be? Don't most computer languages natively support this function? Does it impose some special penalty on the query times of this method that would make these claims far less impressive if included?

Nice that there are still some things out of reach despite the massive developments in computer power.
posted by three blind mice at 2:55 AM on December 20, 2011


A Cloud that Can't Leak

Researchers at Microsoft have built a virtual vault that could work on medical data without ever decrypting it.
posted by Blazecock Pileon at 3:11 AM on December 20, 2011


Three blind mice: there apparently exist very clever techniques to add encrypted number a to encrypted number b, giving the correct encrypted number c, even if you don't know what a, b, and c, actually are. Similar techniques exist for some other basic operations - but not for taking square roots, so far.

So let's say I stored my financial data on a database owned by someone I didn't trust. I could store my data in an encrypted form on the database without the owner ever being able to see the unencrypted information. I could even ask that database to add together my assets, subtract my liabilities, and give me the result. At no point would the database have to decrypt my information, and at no point would the owner be able to tell what the data actually was.
posted by Joe in Australia at 3:21 AM on December 20, 2011 [8 favorites]


Seems like there are several types of encryption used in CryptDb to support different types of operation. Deterministic for equality since the same plaintext should always generate the same cyphertext. Homomorphic algorithm used so support mathematical functions, a function applied to the cypthertext is the same as applying some function to the plainext. The homomorphic algorithm used must not be fully homomorphic, that is that there are some functions to the plaintext that cannot be applied by performing some function on the cyphertext. These algorithms are considered "somewhat homomorphic"
posted by Ad hominem at 3:23 AM on December 20, 2011 [3 favorites]


Or what Joe said.
posted by Ad hominem at 3:24 AM on December 20, 2011


Interesting. At first glance, this does avoid most of the failings of homomorphic encryption, and may actually be useful.

I note a query type as simple as "select * from foo where x>100" can't be supported. I note also that a compromised proxy should yield at least the word contents of each searchable field, since you can brute force through a dictionary. Finally, you won't be able to do multiword queries, like "foo bar".

My initial reaction is that this:

1) Is being done very nicely, with a full source release
2) Doesn't support all queries, but is supporting real world applications -- very nice
3) Is vulnerable to brute forcing of the user's passwords (but then, what isn't)
4) Could probably be done much faster by throwing away the neat crypto and just having encrypted blobs with extra columns (hashes for equality checking, floats for order assignment).

Need to think about this a bit more, though.
posted by effugas at 3:28 AM on December 20, 2011 [1 favorite]


(To be fair, "can't be supported" is a bit excessive. You can binary search the ordered table for the value that is more or less than the bound.)
posted by effugas at 3:30 AM on December 20, 2011


Effugas, you can actually do "select * from foo where x>100" if you're willing to compromise your level of encryption. You set up the column with a form of encryption that preserves order, so that even though an attacker can deduce the order of your data he/she can't get precise knowledge of it. This sounds pretty leaky to me, but presumably there's some point to it.
posted by Joe in Australia at 3:33 AM on December 20, 2011


Probably a good idea to cite Translucent Databases.
posted by effugas at 3:36 AM on December 20, 2011 [1 favorite]


Joe,

Yes, but just because you have order doesn't mean you know the values (which is the point). You need to binary search to find the order bounds, using decryption at each query. I believe their present implementation would require the proxy to expand a single query to quite a few to implement this, though the discovered bounds could be cached in the proxy.
posted by effugas at 3:37 AM on December 20, 2011


If an attacker knows a columns' meaning and order, they could deduce the values after defining their probability distribution from survey data.

If for-example you had an marijuana usage column, then authorities could compare the ordering with a known distribution for marijuana usage, which they might take as probable cause for other crimes, like dealing.
posted by jeffburdges at 3:48 AM on December 20, 2011 [1 favorite]


From the quoted bit in the post, there appears to be less than and equal to support, so I don't see why greater than can't be implemented?
posted by vanar sena at 4:37 AM on December 20, 2011


I do not know how this works.

I don't like not knowing how things work, so this upsets me. But every explanation I've read is either itself encrypted in high-end mathematics which, not being a mathematician, is unhelpful, or so simplistic as to not have anything useful I can think about. I get that there are operations that let you Do Things to encrypted data, and that they are somewhat like the operations we know and love that Do Things to unencrypted data, but between there and arbitrary depth circuits that utilise ideal lattices... unfathomable chasms.

Recommendations for something that informs, most welcome.
posted by Devonian at 4:50 AM on December 20, 2011


If for-example you had an marijuana usage column, then authorities could compare the ordering...

I don't know of any databases that won't let you encrypt column names. For instance, you could use the "NOT!" encryption where the column name "marijuana usage" would be encrypted to "marijuana usage...NOT!". Another common one is the "terrible programmer" scheme where all column names are named "x", "y", "z" etc. (Alternate version: "x", "xx", "xxx").
posted by DU at 5:42 AM on December 20, 2011 [2 favorites]


Devonian,

Imagine you have a column, where instead of:

a
b
c

You have, for each a, b, and c:

Encrypt(A,Password)
Hash(A)
Position(A) (some floating point value between 0 and infinity)
Encrypt(A/Word1)
Encrypt(A/Word2)
Encrypt(A/Word3)...

(I'm oversimplifying slightly.) Now you can do queries, in-database, for anything with the same value as A (by comparing hash), greater or less than a (by looking at Position), and even the words that A contains, without actually knowing the value of a or the full phrase within.

This hack gets even more clever, by having a very limited amount of arithmetic operations it can run on numeric columns. Basically it's possible to have crypto functions that add or multiply plaintext blindly. They seem to have some trick where they can average a large number of encrypted blobs quickly, without learning the contents of that blob. Sort of neat, if it works (and they released their source).

To be fair, this is my first read of the paper. I'll need to go back over it.
posted by effugas at 5:44 AM on December 20, 2011 [1 favorite]


Yes, but just because you have order doesn't mean you know the values (which is the point). You need to binary search to find the order bounds, using decryption at each query.

Couldn't you encrypt the bound and have the database operate on that? If you've got an encryption scheme that supports ordering, you could do select * from foo where x>StRx3xpq
posted by atrazine at 5:54 AM on December 20, 2011


Ok, stupid question. If less than and equal to in the operation set, can't the untrusted cloud provider brute-force the data pretty easily? SELECT WHERE salary > 50000 AND salary < 50000;
posted by Leon at 5:57 AM on December 20, 2011


< 55000; I mean
posted by Leon at 5:57 AM on December 20, 2011


atrazine,

I suppose you could encrypt every integer from min to max, and then store the position that integer was greater than...
posted by effugas at 5:57 AM on December 20, 2011


(basically, this is the same problem DNSSEC has, where there's a finite number of records but an infinite number of nonrecords)
posted by effugas at 5:58 AM on December 20, 2011


Leon,

Yeah, that's kind of what we're discussing. The order preserving encryption (which, I think, should really just be a float followed by a traditionally encrypted blob) doesn't allow greater than or less than; the proxy seems to need to actually binary search along the ordered column, decrypting values until it finds the bound (which is cacheable).
posted by effugas at 6:00 AM on December 20, 2011


A Cloud that Can't Leak

Researchers at Microsoft have built a virtual vault that could work on medical data without ever decrypting it.


Blazecock Pileon, an encryption layman here. That article sounds a lot to me like it's the same basic idea as the source of this FPP. True? Or different, and if so, how?
posted by IAmBroom at 6:21 AM on December 20, 2011


IAmBroom,

Fully homomorphic encryption will almost certainly never work.

This is more a hodge-podge approach that gets close enough for your average real world app. Think of it like taking individual slices of SQL and saying, "Yeah, we can encode those concepts in a magic index".
posted by effugas at 6:56 AM on December 20, 2011


If you can do operations on numbers without decrypting them, do you think it would be possible to perform a set of operations to determine what those numbers were originally? What if you have some restrictions on the space of numbers? (credit card numbers, for example?)
posted by empath at 7:08 AM on December 20, 2011


The issue with this is that you can do lots of sequential queries that narrow down the value you care about. So you can't ask "What is Sally's age?," but you can ask "Give me all people who are 18," "give me all people who are 19", etc., and look for Sally. There are formal proofs that this is impossible to prevent in encrypted databases.

So it's an interesting result, but it has no practical ramifications for computer security. If you care about your database, don't let anyone query it!
posted by miyabo at 7:09 AM on December 20, 2011 [1 favorite]


empath,

I think the idea is that the database is usually only handed keys for equality and position, and neither is enough to brute force with (despite the CC# space being small).

miyabo,

I'm not quite sure there's no practical ramifications here. I'm not sure there are, but they've done the rigor to require more than just a trivial brushoff.
posted by effugas at 7:45 AM on December 20, 2011 [1 favorite]


One thing to realize is that the proxy basically implements something like a "least privilege" key distribution policy, whereby it only sends the DB what is necessary for a given query. So if only equality is required, the order preserving encryption key doesn't get sent.

The idea is to avoid the situation where a single god-mode query dumps the database. Instead, the attacker must (at minimum) acquire some creds from the front end, push arbitrary queries through the proxy, and examine the replies. Now, in the real world, data like CC#'s just isn't going to manage the security model we have here. But this is still cool.
posted by effugas at 8:05 AM on December 20, 2011


The issue with this is that you can do lots of sequential queries that narrow down the value you care about.

The purpose is not to hide data from the user executing the query, but from the system performing the query. At no point does the database system decrypt anything: It performs encrypted operations on an encrypted database, producing an encrypted answer, which the query user can then decrypt.

This is hugely relevant for cloud computing. Many organizations would love to make use of cheap hosted virtual servers (e.g., Amazon EC2), but can't trust their database to a third-party hosting service. With a traditional database, you need to have absolute trust in the server, because even if the database is stored encrypted parts of it need to be decrypted to execute queries. Anyone with access to that server is a threat, and that includes the server administrators and any other hosting clients who happen to be using a virtual machine on the same host.
posted by qxntpqbbbqxl at 8:26 AM on December 20, 2011 [1 favorite]


Eh, sounds pretty impractical to me. If a lot of records share keys, so if a bunch of things cost $3, you'll be able to see that they all cost the same thing. If you have people's names in the DB, couldn't you do a frequency analysis and then a dictionary attack?
posted by delmoi at 8:45 AM on December 20, 2011


I suppose you could encrypt every integer from min to max, and then store the position that integer was greater than...

I'm not sure that's necessary though. If you're using an Order Preserving Encryption scheme, then the SQL server can maintain a sorted index, the same as it would for unencrypted data. You're right that finding where a number falls in the list requires some kind of search through the index, but isn't that also true for an unencrypted inequality?

As I understand it, if you have a sorted list like so:

30
45
65
66
67
70
100
120
150

and you want to find all numbers greater than 50, you can just do a binary search. The first iteration compares 50>67, the second 50>65, the third 50>45 and from there the result is all entries in the sorted list coming after the 45 entry.

As long as you have a sorted list (which you can sort using any sorting algorithm as long as inequalities work), you can do a search through the list again using inequalities and find the list of indices to return.
posted by atrazine at 10:03 AM on December 20, 2011


Crap, once they know that 2fcab58712467eab4004583eb8fb7f89 = 2fcab58712467eab4004583eb8fb7f89 then we are all c69821bcb6a8839396f9652b6ff72a70!
posted by furtive at 10:17 AM on December 20, 2011 [2 favorites]


That article sounds a lot to me like it's the same basic idea as the source of this FPP

The basic idea is no new at all (see the links at the end of the FPP). What's new and interesting is that Popa et al. have actually implemented it in a way that demonstrates it could useful for real-world applications.
posted by hattifattener at 10:24 AM on December 20, 2011


DU : An attacker can analyze your queries to divine column meaning, i.e. don't mistake security through obscurity for security.

delmoi : There are an awful lot of opportunities for screwing this up in deployment, especially since it asks database designers to compromise their application's features with cryptographic necessities.

Your dictionary attack should not work though because you aren't sharing the cryptographic keys with the database hosting provider. You might for-example deploy trusted webheads that communicate with untrusted databases, which makes sense if your data storage requirements far exceed your access requirements, ala medical records.

Devonian : CryptDB isn't user-friendly cryptography like Tor, SSL, off-the-record messaging, or GPG/PGP. You'll need mathematical expertise to build an application around this system, but you won't need quite so much mathematical expertise and database optimization expertise all rolled up into individual developers.

You cannot really escape understanding the mathematics when using Hadoop's Mahout for machine learning either, but building systems sure requires less effort. Btw, check out the Hadoop MapReduce Hello World by Common Crawl.
posted by jeffburdges at 11:16 AM on December 20, 2011


Devonian : CryptDB isn't user-friendly cryptography like Tor, SSL, off-the-record messaging, or GPG/PGP.

User-friendly cryptography, eh?
posted by effugas at 5:04 PM on December 20, 2011


You got me on GPG/PGP, fair enough, but SSL, Tor, and OtR are extremely user-friendly. And I meant that GPG/PGP is user-friendly relative to correctly implementing a CryptDB applications.
posted by jeffburdges at 5:48 PM on December 20, 2011


SSL has some issues. Client side, do you know what the failure rate is of the HTTP to HTTPS upgrade lots of sites use?

By failure rate, I mean how often does the user detect that a MITM has prevented the upgrade from happening at all.

Well, Moxie Marlinspike checked. He ran a malicious Tor node that sslstrip'd everything going through. He had a success rate of...100%.

Seriously. Not a single user failed to log in after the downgrade occurred.

And lets not even talk about the server side issues with deploying SSL. The fact that vhosting is still a disaster in 2011 in hilarious; SNI exists, but isn't universally supported. So you still have to blow one IP per domain, or deal with the CA's to get a massive multiparty cert.

That's the other problem. CA's.

As for Tor and OtR, do they even count in this space? Neither solve problems for server operators. They're all about extra needs for the client.
posted by effugas at 10:24 PM on December 20, 2011


« Older Rock Me, Falco   |   my baby left me, start'd me drinkin' on christmas... Newer »


This thread has been archived and is closed to new comments