Living in the database, database
October 14, 2015 6:34 AM Subscribe

Why SQL is neither legacy, nor low-level, nor difficult, nor the wrong place for (business) data logic, but simply awesome! (allegedly)
posted by MartinWisse (132 comments total) 61 users marked this as a favorite

Literally yesterday, someone asked me to write a MongoDB query that would have take 30 seconds in SQL, but needed several layers of tricky loops and a couple hours work to do in MongoDB. Argh.
posted by miyabo at 6:39 AM on October 14, 2015 [19 favorites]

Anybody who thinks SQL is low-level is 19 years old and has never seen assembler and needs to get used to the concept of the right tool for the right job, as opposed to whatever is cool on the internet this month.

NoSQL is great for many uses cases. We'll even be able to get a report out of it for LOB applications soon.
posted by GallonOfAlan at 6:40 AM on October 14, 2015 [11 favorites]

Similarly, would you prefer to work with SQL or with Hibernate?

Uh, is that a trick question? SQL.
posted by thelonius at 6:44 AM on October 14, 2015 [15 favorites]

oh god, I get the title reference, nerrrrrrrrrrd
posted by maryr at 6:59 AM on October 14, 2015 [7 favorites]

NoSQL is great for many uses cases. We'll even be able to get a report out of it for LOB applications soon.

I love it when you talk acronym; it makes me shiver.
posted by GenjiandProust at 7:01 AM on October 14, 2015 [2 favorites]

My experience with Mingo is it works astonishingly well with some things and then dies horribly on others, usually the kind of fiddly SELECT stuff that SQL is good for, so it's very much a tool for specific jobs.
posted by Artw at 7:01 AM on October 14, 2015

This comes out at the same time as this, an article pointing out that flux (the basic dataflow pattern/implementation for reactjs sites, the current hotness in JS site development) is basically WndProc--the pattern for gui development in all Windows OSes from 95 through 7.

Uh, is that a trick question? SQL.

So fast your hard drive would spin again.

someone asked me to write a MongoDB query that would have take 30 seconds in SQL

I wasn't serious about Mongo-punching until I tried to write the equivalent of "set Y = Y + 1" a couple weeks ago. What do you mean I can't refer to the value being updated in the update?

That said, the article doesn't mention several real issues with doing it all in SQL that are much more difficult problems: harder to do development with source control; deployment stories that aren't nearly as advanced; tight coupling/lack of independent scaling between layers; and SQL had callback hell in the form of triggers long before you suffered with Node.
posted by fatbird at 7:02 AM on October 14, 2015 [16 favorites]

With apologies to Churchill, SQL is the worst form of query language, except for all the others.

Personally I try to avoid SQL like the plague, but working on the entity layer is something that gives me the jibblies anyway. And don't get me started with stored procedures - SQL-as-middleware makes me want to punch kittens.
posted by the painkiller at 7:04 AM on October 14, 2015 [5 favorites]

There is no single solution that works the best. SQL and relational databases are fantastic for large datasets which require multiple views and lots of calculation. Hibernate is great for CRUD stuff, terrible for reporting. Stored procedures are really efficient and fantastic for complicated data crunching but require the requisite staff to create and maintain the code and, in larger orgs, administrative barriers between the DBA and app dev teams can make maintaining and changing a codebase that involves stored procedures as well as non-SQL code something of a nightmare.

MongoDB is also a great fit for CRUD, especially when paired with NodeJS. It is not a great fit for a bank or a data warehouse.

The real problem when dealing with various approaches to solving software problems is that people feel compelled to take sides and advocate for one methodology and against another. It is like Team Hammer vs Team Wrench.
posted by grumpybear69 at 7:06 AM on October 14, 2015 [23 favorites]

Counteraction:
☑ Rising
☐ Not Rising

With the ever worsening clusterfuck that is our firm's databases, this article has convinced me to learn SQL. Anyone got a recommendation for learning better than W3Schools?
posted by charred husk at 7:08 AM on October 14, 2015 [4 favorites]

I have a number of disconnected objections.

1. Who even calls SQL "low-level"? I've never heard that.
2. "objects = xml = json" is no more true than objects = sets of sets of relations. Objects are graphs; they can be encoded in any of the other three (xml and json need a convention about allowing one node in the tree to refer to another node and some way of interpreting that, whereas to use sets of sets of relations you need to know how far to follow the foreign keys before you've got the data for "one" object).
3. How do you intend to demonstrate the correctness of business logic in the DB?

Only tangentially related, but I also am a little grumpy that relational databases just use SQL as an interface despite it having a dumb syntax.
posted by a snickering nuthatch at 7:14 AM on October 14, 2015 [3 favorites]

I feel like these misconceptions would not be had by anyone who took database classes in school alongside their Java programming classes. My university did not require us to study databases; I just lucked into it because I was doing some web programming and it seemed like it would be useful to know more about the DB layer.

Like, maybe you don't need to throw 6 nodes, a cache layer, and incomplete record cleanup at a problem that could be easily solved by a small, well-designed database leveraging transactions, its own cache, some appropriate indexes, and a trigger or two!
posted by mantecol at 7:14 AM on October 14, 2015 [2 favorites]

relational databases just use SQL as an interface despite it having a dumb syntax

Relational databases use SQL because it is a standard. And thank the FSM for that.
posted by grumpybear69 at 7:18 AM on October 14, 2015 [17 favorites]

For learning SQL try GalaxQL.

It's a fun little interactive tutorial, and it's about stars and galaxies instead of products and sales lines or whatever most tutorials are about.
I used it and I am now a professional SQL doing person.
posted by Just this guy, y'know at 7:18 AM on October 14, 2015 [36 favorites]

AdventureWorks fo lyfe.
posted by mantecol at 7:22 AM on October 14, 2015 [3 favorites]

I've only been working with SQL Server for the last three years. I'm astounded to learn that there might be people -- technical people even -- who would think that SQL is low-level. It's practically English. The only thing less complicated than SQL would be if I could just make a series of mean faces at my CRM database to make it return data.
posted by Strange Interlude at 7:25 AM on October 14, 2015 [20 favorites]

"Why SQL is neither legacy, nor low-level, nor difficult, nor the wrong place for (business) data logic, but simply awesome! "

You might want to redo that with fewer joins.
posted by srboisvert at 7:34 AM on October 14, 2015 [36 favorites]

SQL is fine and I've been using it a long time but I have some beefs with it, stuff that could be much improved.

1. update and insert syntax is different for no good reason

insert into mytable(col1, col2, col3) values ('val1', 'val2', 'val3')
update mytable set col1='val1', col2='val2', col3='val3' where foo=1

there's no reason they couldn't be the same and I don't even care which is which (well, actually I think I prefer the col=val format because otherwise you do a lot of counting to see which is which)

2. order depended phrases that don't actually depend on order. This is just nit-picking but for example who care where I put the "having ..." phrase? It only has one potential meaning. A lot of times when I am constructing SQL by hand for a one-off query I have to move that to the right place because I forget.

3. I would really like it if group by semantics had an... auto operator? Sooooo many times you are going to select, like, 3 columns or functions of columns, and then a few things that are operating on the group. You pretty much *have* to group by all the columns except the ones operating on the group (sum, count etc) so making me specify them again in the group by is a pain in the ass.

I can probably come up with more. It doesn't really matter, like everyone else I've just programmed around these (i.e. made update and insert generating functions that have the same syntax, functions that automatically handle the group bys, etc). It just seems like they could stream line it a little
posted by RustyBrooks at 7:35 AM on October 14, 2015 [4 favorites]

With apologies to Churchill, SQL is the worst form of query language, except for all the others.

Exactly.

My problem with SQL is that it's perfectly possible to write a query that gives the correct answer, but abuses the DB badly -- and alas, developers who are not truly expert at SQL and thus don't understand the ramifications of certain queries end up doing just that repeatedly.

"Why is my query slow? "Well, you're joining against 470 million rows..."
posted by eriko at 7:35 AM on October 14, 2015 [5 favorites]

Does anyone have any good reading on common pitfalls that can be avoided by "leveraging transactions?"
posted by I-Write-Essays at 7:37 AM on October 14, 2015

"Why is my query slow? "Well, you're joining against 470 million rows..."

That might account for some of the popularity of Mongo and the like, which tend to be an upside down land where SELECT * FROM ENTIREUNIVERSE type operations are preferable to picking something specific.
posted by Artw at 7:38 AM on October 14, 2015 [6 favorites]

select * from craps where given > 0;

(0 row(s) affected)

posted by graymouser at 7:40 AM on October 14, 2015 [21 favorites]

his is just nit-picking but for example who care where I put the "having ..." phrase? It only has one potential meaning.

Well, it makes sense to me where it is now, because it filters results after aggregation, it filters them based on the results of that aggregation, and it comes (as a clause) after WHERE and GROUP BY.
posted by thelonius at 7:40 AM on October 14, 2015 [1 favorite]

UPDATE and INSERT being different used to drive me crazy and whenever I try to revive my dormant SQL skills mixing u the two is always trips me up.
posted by Artw at 7:41 AM on October 14, 2015 [1 favorite]

See also: “Cache is the New RAM” which contains the choice phrase

“To paraphrase Paul Graham’s unbearably smug comment about Lisp: once you add group by, filter, & join, you can no longer claim to have invented a new query language, only a new dialect of SQL. With worse syntax and no optimizer.”

In other words: Those who do not use SQL are doomed to re-invent it...poorly.
posted by pharm at 7:42 AM on October 14, 2015 [15 favorites]

MySQL allows you to do inserts like update. But then of course MySQL has a bunch of non-standard "shortcuts", for better or worse.
posted by kmz at 7:44 AM on October 14, 2015 [2 favorites]

But seriously. I have spent a lot of time with SQL Server and the more time I spend with it, the more I like the idea of NoSQL data stores. Having a pattern where developers create stored procedures with 75 optional parameters to execute a search is nightmarish. Things like user-defined functions being used to parse XML passed in as query strings. That's just wrong and there are better ways to do it.

A good SQL solution should work with its strengths, which doesn't involve JOINs against massive tables. Database normalization has led to massive amounts of data in SQL databases that is royally fucked beyond repair. I don't think that SQL is a good solution for a lot of data needs you see today, like being able to search against a very large data set using an open-ended wildcard. That doesn't mean that it should be done away with, but it is good that we are rethinking these things.
posted by graymouser at 7:47 AM on October 14, 2015 [2 favorites]

I really don't get people's aversion to SQL. It's reasonably easy to learn. The main issues are RDBMSes that only support a subset (or worse, a bastardization) of the language and keeping track of differences in performance for certain operations between different RDBMSes.

I think the real problem is that it is seen as old and stodgy. Never mind that it is often the best tool for the job when you want to store and query data reliably.
posted by wierdo at 7:48 AM on October 14, 2015 [2 favorites]

You pretty much *have* to group by all the columns except the ones operating on the group (sum, count etc) so making me specify them again in the group by is a pain in the ass.

Maybe it won't work on your database but try "group by 1, 2, 3". Changed my life.
posted by mantecol at 7:48 AM on October 14, 2015 [1 favorite]

Well, it makes sense to me where it is now, because it filters results after aggregation, it filters them based on the results of that aggregation, and it comes (as a clause) after WHERE and GROUP BY.

I'm not saying it doesn't make sense where it is, I'm saying it doesn't matter where it goes. It's like a checkbox. You have or do not have a "having" clause, so you ought to just be able to put it anywhere. Really the same goes for most of the clauses, insisting on them being in some particular order seems like pedantry.
posted by RustyBrooks at 7:50 AM on October 14, 2015

Yeah, upsert-type statements are now implemented on all the major databases I believe, after enough people complained loudly and bitterly enough. Naturally they all behave slightly differently: Plus ça change, etc.
posted by pharm at 7:50 AM on October 14, 2015 [2 favorites]

I'll have to try group by 1, 2, 3 because a lot of times the group bys are going to be like... a date column formatted to some specific format or a mathematically expression and man I hate duplicating those.
posted by RustyBrooks at 7:51 AM on October 14, 2015 [1 favorite]

"And another huge advantage of this approach is that the new business rule is implemented only at a single place, if you’re doing it right:" Uh... the application logic level? The author has, and links to, a number of good sources, but I deal with people who think it's a great idea to put application logic in the database, and they clearly can not be trusted after seeing the fruits of their labor (by which I mean disastrously difficult to maintain T-SQL spaghetti code). There's no reason not to leverage the SQL server to provide better data, just don't try to encode application logic there. The transaction example is weak, especially when you realize that a non-trivial use case involves multiple transactions across multiple database from different vendors - there's no way to embed application logic in 'a' database in that situation, and that scenario is not at all uncommon.

The first presentation on using modern additions to SQL? Fantastic! I've been forcing that presentation on as many software engineers as possible.

And I will end my rant with a distressingly accurate portrayal of NoSQL
posted by combinatorial explosion at 7:51 AM on October 14, 2015 [13 favorites]

> Anyone got a recommendation for learning better than W3Schools?

One of the last paragraphs in the article has a link to to an SQL tutorial that in hindsight the article might be a thinly-veiled advertisement for.
posted by Gev at 7:53 AM on October 14, 2015

I'm not saying it doesn't make sense where it is, I'm saying it doesn't matter where it goes. It's like a checkbox. You have or do not have a "having" clause, so you ought to just be able to put it anywhere. Really the same goes for most of the clauses, insisting on them being in some particular order seems like pedantry.

When reading someone else's SQL with nested subqueries 4 layers deep, it's nice to know exactly where to look for things.

I may occasionally be the someone writing nested subqueries 4 layers deep and hoping my coworkers can read it without too much trouble.
posted by mantecol at 7:55 AM on October 14, 2015 [2 favorites]

Reached many of the same conclusions myself a few months ago. I work on a smallish, data heavy webapp. SQL works very well for our particular needs. It iss only lack of understanding it's implementation that leads to performance problems. Only lack of study that makes is seem low level. It is a decent DSL. Of course, it's be cool to have a modern QUEL or D based system to compare. I have had the crazy idea of writing a database using Clojure... its set operators are a pretty nice expression of the relational algebra. No one complains about prefix operators when it comes to query languages.

What I really want from RDBMSes are better ways to compose queries. I don't like jamming together query fragment strings and remembering to always use bind variables. Makes me want to look at Datalog and Datomic more closely.
posted by Mister Cheese at 8:00 AM on October 14, 2015 [4 favorites]

-- P.S. comment your SQL! It won't hurt the database, but might help your coworkers avoid bashing their heads on their desks.
posted by mantecol at 8:00 AM on October 14, 2015 [7 favorites]

When reading someone else's SQL with nested subqueries 4 layers deep, it's nice to know exactly where to look for things.

I had co-workers who created a database schema with few foreign keys, but then built something akin to them using character-delimited string fields (occasionally delimited by commas, semicolons, or spaces - no consistency). I created a query that was 75 lines long of fairly clear SQL code (well, as clear as it could be made) to build weekly/monthly aggregated reports. I also created a beautiful tapestry of swears.
posted by combinatorial explosion at 8:01 AM on October 14, 2015 [2 favorites]

The following things, I think, are true:

- SQL is very good, if not great. MySQL also.
- You should almost never be writing SQL unless you're writing libraries. Use libraries.
- Sanitize your inputs, fool.
- If your SQL queries are not very fast you are almost certainly doing something important wrong.
- Programmers are not DBAs, who are in turn not sysadmins. There are subtle and pernicious problems in each of those problem domains that are not easily resolved with a superficial understanding inherited from the others.
- If you have less than 100k things to index, you don't need a DB. If your dataset fits in RAM you don't need a DB.
- You cannot understand performance if you don't understand statistics.
- If you really do need a DB, you really do need a DBA. "DevOps" doesn't cut it.
- If you anticipate scaling challenges, start with Postgres.
- Oracle is Satan.
posted by mhoye at 8:01 AM on October 14, 2015 [28 favorites]

I feel like when people talk about SQL vs. NoSQL, there are really closely related issues involved. One is the language and the other is the underlying technology. SQL is one of the few languages where I have to pause and think before writing a single line of it, pretty much every time, but it is quite powerful. Meanwhile the NoSQL API ("language") is typically just a glorified hash table (get and set).

The technology underlying SQL is correspondingly more complex. You typically get data integrity (assuming you set up the necessary constraints), query planning (which works best if you set up indexes correctly), and you can join information from one table against another to aggregate the things you really want to look at as the result of one query. It's pretty awesome. If you use NoSQL you forfeit a lot of that and have to build that functionality yourself.

Don't get me wrong, I've used NoSQL for various things, like Elasticache, Redis, and Amazon's Dynamodb, but for the first two just storing data short-term, and for Dynamo for data where constraints (and to a lesser extent indexes) don't matter so much. They work quite well for those purposes, it's a matter of picking your tools.

As for database model objects, they do give me a similar feel to say C++ versus assembly. In some cases it's pretty damn convenient to just pull a single row out of a table and put it in an object in the programming language you're working with. But I've seen the other end of the continuum too, with a 50 line SQL query jammed into some Ruby code because the developer was maybe overly concerned with efficiency of a bulk query. It was almost as jarring as seeing assembly code there. So there's a balance to find, and it's not always easy.
posted by A dead Quaker at 8:04 AM on October 14, 2015 [3 favorites]

MetaFilter: a beautiful tapestry of swears.
posted by wenestvedt at 8:05 AM on October 14, 2015 [1 favorite]

My problem with SQL is that it's perfectly possible to write a query that gives the correct answer, but abuses the DB badly -- and alas, developers who are not truly expert at SQL and thus don't understand the ramifications of certain queries end up doing just that repeatedly.

"Why is my query slow? "Well, you're joining against 470 million rows..."

Isn't this the problem with high level languages in general? The abstraction lets people mindlessly implement O(n!) operations everywhere and the system has to just throw up its hands and cry.

What we need is a "Are you sure you want to do that Dave?" to pop up when the query optimizer hits something that borked.
posted by pan at 8:07 AM on October 14, 2015 [3 favorites]

mhoye: Oracle is Satan.

Let's be clear here: Larry Ellison is satan, Oracle is only Satan's handiworks.
posted by wenestvedt at 8:08 AM on October 14, 2015 [11 favorites]

"Huh," he said to the cat in an otherwise empty room, "you know it would be kinda cool if I could run SQL queries against that gigabyte of voxel terrain data...I wonder if there's a reason I've never heard of anyone attempting that?"

So today might be interesting.
posted by Ryvar at 8:10 AM on October 14, 2015

MongoDB is also a great fit for CRUD

Well, anything is a fine fit for CRUD if you use it correctly. The problem I have isn't that NoSQL is bad, it's that people think it means you don't need to plan out your data relations because "We can change them on the fly." Learning SQL helps you learn how to plan things out. You may not ever support 3rd Form (or whichever ) Normalization in your final production apps, but understanding why it matters makes a huge difference when you want to avoid the 470 million row problem mentioned above. I think the most important part for developers (IMO) from the article is, "Furthermore, it is based on relational algebra, a theory that is even older than SQL itself. "

SQL is very good, if not great. MySQL also.

That would be joined with an OR clause in my world. Go with your later advice and just use Postgres. Or pretty much anything else.
posted by yerfatma at 8:12 AM on October 14, 2015 [2 favorites]

I don't think that SQL is a good solution for a lot of data needs you see today, like being able to search against a very large data set using an open-ended wildcard.

Isn't that just poor db/process design though? I mean, we have all these rules discussions and then someone makes up a list of (plainly incomprehensible) category values off the top of their head and implements it. "Oh, we can just do text searches instead."
posted by sneebler at 8:12 AM on October 14, 2015 [1 favorite]

This could probably make for a very nerdy FPP, but whenever I evaluate a new database technology I make sure to check if Aphyr has first seen whether it sucks or not. Perhaps you are okay with losing some of your data during a network partition. I prefer for data loss to come from my own code and not the database itself.
posted by Mister Cheese at 8:12 AM on October 14, 2015 [4 favorites]

It is stunning how many professional developers have no understanding of SQL, beyond that their ORM spits out queries somehow "behind the scenes".

I have the opposite problem. Every time I try to work with code that uses ActiveRecord or something of its ilk, I get stuck because I'm trying to understand what it's really going to do behind the scenes, and there's too much magic for me to grok it. All my Rails-loving friends tell me to "just trust it to do the right thing" but I have way too many scars from technologies I was supposed to "just trust" to make that mistake again. I'm ok with tools that help me get a job done, but I have a very hard time using them if I can't first understand what they're actually DOING.
posted by primethyme at 8:13 AM on October 14, 2015 [6 favorites]

Lately I've been using SQLite with Python to do some local data-munging. It doesn't suck at all. It lacks a lot of features of grown-up databases, but the performance is often surprising, and it beats rolling your own cPickle'd data structures. (The documentation for SQLite is stellar, too.)
posted by RobotVoodooPower at 8:14 AM on October 14, 2015 [4 favorites]

I don't think that SQL is a good solution for a lot of data needs you see today, like being able to search against a very large data set using an open-ended wildcard.

Do you mean a fuzzy/ free-text search? That's really not what it's for, but of course you can pipe your database right into your favorite search tool with some SQL.

Every time I try to work with code that uses ActiveRecord or something of its ilk, I get stuck because I'm trying to understand what it's really going to do behind the scenes

I use Django, not Rails but I'm guessing the same thing is possible: update your logging settings to make it dump all the SQL queries to the console or a file.
posted by yerfatma at 8:14 AM on October 14, 2015 [2 favorites]

"Huh," he said to the cat in an otherwise empty room, "you know it would be kinda cool if I could run SQL queries against that gigabyte of voxel terrain data...I wonder if there's a reason I've never heard of anyone attempting that?"

PostGIS might help you there.
posted by skymt at 8:16 AM on October 14, 2015 [3 favorites]

Also, I'd guess SQLite is the most widely-deployed SQL implementation in the world, being an integral part of over 1 billion Android and iOS smartphones and tablets. It solves different problems than an online transaction-processing database, but there are still SQL queries bouncing around under the glass.
posted by RobotVoodooPower at 8:22 AM on October 14, 2015 [2 favorites]

I use Django, not Rails but I'm guessing the same thing is possible: update your logging settings to make it dump all the SQL queries to the console or a file.

Then prepare to weep and look up how to make it not do 100 single queries where it should be doing a join
posted by RustyBrooks at 8:22 AM on October 14, 2015 [7 favorites]

If your dataset fits in RAM you don't need a DB.

In the modern world of “OK, so we have a 10Tb working set, so lets shard that across 10 1Tb machines” I don’t think this is really true any more. You still need a DB, even if your data fits in memory, for many reasons including but not limited to: the DB provides a query interface to your 10Tb dataset, the DB lets you migrate to a new schema without losing several years of your life to stress, the DB provides a consistent backing store in case of power failure, in a perfect world the DB also takes care of availability (hah!) . etc etc.
posted by pharm at 8:23 AM on October 14, 2015 [14 favorites]

You still need a DB, even if your data fits in memory, for many reasons including but not limited to:

Sometimes the best reason to use a database is the built-in support for transactions, rollback, etc.
posted by Edgewise at 8:33 AM on October 14, 2015 [5 favorites]

fatbird, you can use the $inc operator for that. db.collection.update( { "y" : 6 }, { $inc : { "y" : 1 } } ) where "6" is an arbitrary number that you wanted to increment if you found it.
posted by Xoder at 8:36 AM on October 14, 2015 [1 favorite]

Then prepare to weep and look up how to make it not do 100 single queries where it should be doing a join

Not sure if that's Rails-specific or not, but Django does provide some options to work around N+1 queries and I recently came across a middleware that will flag them automatically to help run them down.
posted by yerfatma at 8:48 AM on October 14, 2015

Thanks for this article. It and the wonderful GalaxQL tutorial mentioned above may have saved me in the nick of time from embarking on months of doing stupid things with JSON documents.
posted by johngoren at 8:49 AM on October 14, 2015 [4 favorites]

Sanitize your inputs, fool.

Oblig. XKCD. I also recommend that 'Cache is the new RAM' post.
posted by eclectist at 8:49 AM on October 14, 2015 [1 favorite]

SQLite is the most widely-deployed SQL implementation in the world, being an integral part of over 1 billion Android and iOS smartphones and tablets.

Fun fact: it ships in Windows 10, too.
posted by Slothrup at 8:52 AM on October 14, 2015 [3 favorites]

Then prepare to weep and look up how to make it not do 100 single queries where it should be doing a join

Are you thinking something like .includes that eager-loads associations? ActiveRecord defaults to logging its queries to the console. For the most part, I think it's pretty straightforward. If you want to go under the hood, you can, but that's kinda defeating the purpose of an ORM. Bullet is the name of the game to avoid N+1 in Rails/Sinatra apps. I drop it into the dev gem group on all of our apps.

Back on topic, stored procedures can go to hell.
posted by protocoach at 8:54 AM on October 14, 2015 [1 favorite]

You should almost never be writing SQL unless you're writing libraries. Use libraries.

I guess the truth value of this statement really depends on what type of work you (as a developer) do. If you're writing a webapp that rarely ventures beyond simple CRUD functionality, sure. But if you're writing complex one-offs to get an exploratory look at your data, and your data happens to be massive, writing SQL is probably way faster than any of:
- cooking up a full application to interface with the data
- teaching your data model to a 3rd-party data exploration tool and hoping it can figure out how to write an efficient query
- offloading the SQL-writing (but not the analytic ownership of the problem) to someone else

It's worth it (to me, anyway) to know SQL backwards and forwards.
posted by mantecol at 9:07 AM on October 14, 2015 [6 favorites]

Database normalization has led to massive amounts of data in SQL databases that is royally fucked beyond repair.

I don't think that word means what you think it means.

You must mean 'not normalizing', because normalizing to 3NF, then enforcing with primary key and foreign key constraints, is what you do to achieve data integrity.

stored procedures can go to hell

What's the alternative that (a) doesn't require ad-hoc sql catenation in the client [the definition of tight-coupling between tiers] and (b) mitigates sql injection vulnerabilities consistently?
posted by j_curiouser at 9:13 AM on October 14, 2015 [6 favorites]

The Punchline:
☑ Ready
☐ Not Ready

Thanks, Just this guy, y'know. GalaxQL looks fantastic.
posted by charred husk at 9:20 AM on October 14, 2015 [1 favorite]

Third Normal Form: a mythical approach to database design said to be used by future advanced civilizations to avoid data redundancy and improve performance. Depicted primarily in works of high science fiction such as E. F. Codd's "Further Normalization of the Data Base Relational Model," IBM Research Report RJ909 (August 31st, 1971).
posted by Joey Buttafoucault at 9:24 AM on October 14, 2015 [9 favorites]

You have a problem. You use a stored procedure to solve it.

Now you have two problems.
posted by clvrmnky at 9:24 AM on October 14, 2015 [4 favorites]

In theory, I can agree with most if not all of what this article says. In practice, I think it's a mess to implement business logic in stored procedures. It can probably work if you do it right, but I've never seen people doing it right. Typically, you get a situation where the people writing the stored procedures aren't the same people writing the non-SQL code, and things end up getting partitioned in non-intuitive ways. Debugging this code can be a nightmare, and I've even seen problems making sure the stored procedures in the database are in sync with source control. There are tools and techniques to manage these processes...I just haven't seen people using them. So the author might be right, but for me, this is currently all theory.

I certainly do agree that RDBMS technology isn't going anywhere, soon. NoSQL databases cannot match it for performance when you're trying to query across tables. So far, NoSQL seems to be a good approach for certain specialized applications, like text searches on Elasticsearch and the like. I really like RDBMS technology, though I think SQL syntax could stand for an UPDATE.
posted by Edgewise at 9:31 AM on October 14, 2015

What's the alternative that (a) doesn't require ad-hoc sql catenation in the client [the definition of tight-coupling between tiers] and (b) mitigates sql injection vulnerabilities consistently?

Trading some coupling for version control, testing, and keeping your logic in one place is, in my opinion, worth it. And input scrubbing and parameterized SQL inputs in ORMs mitigate the SQL injection vulnerabilities. Stored procs have a place, but I think they're usually unnecessary and add an extra layer of complexity to apps that don't need it.
posted by protocoach at 9:38 AM on October 14, 2015 [1 favorite]

"...you want to do away with the database? And what, have our team build JSON objects and just dump them around application memory? What does that even mean?"

Desktop RAM, I should have specified, as in if you're under the 8GB mark.

Yeah, if you've got 2TB of data you should have a DB, but if you've got 2TB of RAM you already know that.
posted by mhoye at 9:43 AM on October 14, 2015

As a developer turned dba turned 'big data' guy, these discussions fascinate me.

Mostly because "a database" means so many wildly different things to different people. (not to mention to different teams working on relatively similar problems)
posted by DigDoug at 9:45 AM on October 14, 2015 [4 favorites]

I guess the truth value of this statement really depends on what type of work you (as a developer) do.

My experience has been that there's no such thing as a one-offs, only first times. Odds are excellent you'll need to run that "one-off" query again in six months or a year, so do the right thing, wrap it up nicely and check it in to version control.
posted by mhoye at 9:46 AM on October 14, 2015 [5 favorites]

I find a lot of SQL allergics have largely experienced MSSQL or MySQL. For developers who use Postrgres, SQLite and Firebird opinions seem to be a bit different.

Also, I wonder how many SQL disdainers realize how central SQLite is to both Firefox and Chrome?
posted by Matt Oneiros at 9:54 AM on October 14, 2015 [2 favorites]

There's a lot of nitpicking about SQL syntax, which is indeed ugly, but you can get over it. There's a deeper problem that annoys me much more: SQL's syntax actually hides a lot of the functionality of an RDBMS.

There are 3 basic operators in relational algebra: reduce (group), filter, and join (cross product). In a single SQL query without subqueries, you can ONLY do join, then filter, then group. With subqueries, you can express any arbitrary sequence of these operators, and the optimizer in an RDBMS is totally capable of optimizing an arbitrary sequence of these operators. But the only way to do that is to do crazy messy stuff with subqueries and views. People get frustrated, and write code when they should be writing queries.

So even though the superficial functionality exposed by SQL is pretty great, there's actually a lot that an RDBMS database can do that's hidden by SQL.

What's the alternative that (a) doesn't require ad-hoc sql catenation in the client [the definition of tight-coupling between tiers] and (b) mitigates sql injection vulnerabilities consistently?

Parameterized queries.
posted by miyabo at 9:55 AM on October 14, 2015 [8 favorites]

After data integrity and consistency, the main selling point for trusting and using a database correctly is that people really care about their data. They use it in ways that will surprise you, ways that your application doesn't or could not handle. Walk through a business office and you'll see people with spreadsheets, 4GL tools, and yes, even SQL interpreters looking at their data.

When data is done correctly applications can come and go and be rewritten in different languages, or even cooperate with each other without knowing that other applications even exist.
posted by alpheus at 10:02 AM on October 14, 2015 [4 favorites]

j_curiouser: "What's the alternative that (a) doesn't require ad-hoc sql catenation in the client [the definition of tight-coupling between tiers] and (b) mitigates sql injection vulnerabilities consistently?"

I've been itching to use JOOQ in a project, because it does almost exactly this (without the overhead and awkwardness of an ORM like Hibernate). Typesafe, structured SQL queries that still look like SQL.

I've also heard many serious arguments from extremely-talented developers that LINQ singlehandedly justifies developing on top of Windows and .NET.
posted by schmod at 10:03 AM on October 14, 2015

If you have less than 100k things to index, you don't need a DB. If your dataset fits in RAM you don't need a DB.

SQLite would like a word with you.

(Seriously, SQLite is great and I wish I had more excuses to use it. Except I'm a front-end dev these days* so I don't get to mess with SQL at all anymore...)

* frontend is the new backend
posted by neckro23 at 10:05 AM on October 14, 2015 [3 favorites]

whenever I evaluate a new database technology I make sure to check if Aphyr has first seen whether it sucks or not

Jepsen is an amazing and amazingly useful project, but the one problem with it is (SFAICT) is that aphyr's managed to prove that every single db technology will lose data under some circumstances. Which lessens its basic utility somewhat (although it's still great for helping you figure out if you're likely to hit those particular circumstances in your application).
posted by asterix at 10:06 AM on October 14, 2015 [1 favorite]

This morning, I wrote a recursive query, and I haven't geeked out so much about new technology in a long time.

You can traverse tree-like structures in almost-vanilla SQL, it's not painfully awkward to write, and it's reasonably performant!

Yet another thing I can cross off of the "okay, well maybe NoSQL makes more sense for that" list.
posted by schmod at 10:09 AM on October 14, 2015 [4 favorites]

fatbird, you can use the $inc operator for that.

Unfortunately, what I wanted to do was some string manipulation on the existing value, so passing a function to an interator over the collection is what was required. Which, okay, I've done enough Javascript to follow the logic there, but jeez guys, SOLVED FUCKING PROBLEM.
posted by fatbird at 10:14 AM on October 14, 2015

Jepsen is an amazing and amazingly useful project, but the one problem with it is (SFAICT) is that aphyr's managed to prove that every single db technology will lose data under some circumstances. Which lessens its basic utility somewhat (although it's still great for helping you figure out if you're likely to hit those particular circumstances in your application).

Jepsen mainly tests databases for partition-resistance (the "P" in CAP) , and has also demonstrated that many distributed systems advertising themselves as CA can be neither consistent nor available under extreme circumstances.

However, if your database isn't distributed (ie. you are running a single-node PostgreSQL server, possibly with a failover), CAP becomes a lot less important.
posted by schmod at 10:14 AM on October 14, 2015

This morning, I wrote a recursive query

like this?
posted by thelonius at 10:16 AM on October 14, 2015 [5 favorites]

The only use cases where relational databases really still make sense is when doing lots of joins (which suck and are slow, even for RDBMS) or when doing aggregate calculations on smaller scale. If you're scaling up, or looking at situations with pre-existing data variety (basically everywhere else in the world) you de-normalize and use a different technology. What if you need to integrate across a bunch of trusty pre-existing systems that you're not about to re-build/re-write, all with their own schemas or even RDBMS vendors or even different versions from the same vendor? Hint: yet another RDBMS instance to glue them all together is going to be a bad time.

There's a reason the big web properties write their own instead of using off the shelf databases. There's also a reason Amazon announced a schema migration service last week - and that's targeted solely at RDBMS users (though SQL is allegedly already standard).

Any technology looks perfect against one dude's use case. The test is how it does when trying to fold in against a bunch of other still valid use cases. If the only way to make it all work is to re-build/re-write everything, you're backing the wrong horse.
posted by NoRelationToLea at 10:17 AM on October 14, 2015

Both Facebook and Twitter use MySQL as a component of larger distributed systems.
posted by RobotVoodooPower at 10:43 AM on October 14, 2015

The only use cases where relational databases really still make sense is when doing lots of joins

This reads to me like "The only use cases where cars still make sense are where you need to drive on roads." Joins are everything -- you're either doing them at the database level or in your application, but they still need to get done.
posted by miyabo at 10:44 AM on October 14, 2015 [10 favorites]

Third Normal Form: a mythical approach to database design...

well..you can choose not to use 3nf just like you can choose not to have 100% test coverage or choose not to enforce no-globally-scoped-vars. That doesn't mean there's a problem with 3nf. There's a problem with your design trade-offs.

3nf is simple enough to implement, and anyone in this thread could learn to do a join across an n:m junction table in 30 minutes or less. if you don't want to, i get it, but that doesn't make the idea itself unreasonable.
posted by j_curiouser at 10:47 AM on October 14, 2015 [2 favorites]

I've also heard many serious arguments from extremely-talented developers that LINQ singlehandedly justifies developing on top of Windows and .NET.

as an alternative approach to SP, i probably mostly agree. depends on if your working 'tables' in ram are not insanely large - so there is hw coupling involved. also, i'd caveat that db-tier constraints still need to be implemented. but i agree the linq syntax and implementation is a huge deal.
posted by j_curiouser at 10:54 AM on October 14, 2015

That doesn't mean there's a problem with 3nf. There's a problem with your design trade-offs.

To be clear, that was kind of my point: everybody complains about their databases performing poorly, or being cluttered with redundant data, or requiring 6000 character queries, but then it turns out they've never heard of normalization, or are thoroughly misunderstanding the term, or have just never bothered despite knowing and understanding the concept. Basically if your experience with an RDBMS sucks it's often because your approach sucks, not because of anything inherent to the DBMS itself (or SQL).
posted by Joey Buttafoucault at 11:08 AM on October 14, 2015 [3 favorites]

Joins are everything -- you're either doing them at the database level or in your application, but they still need to get done.

That's really the issue here: SQL implementations have had people thinking in terms of normalized databases for 15 years or so now. The question that NoSQL has raised is whether or not normalization is worth the cost of JOINs*, which are bulky and expensive. There are data patterns that are not normalized, for example background tasks, that have the same long term impact and different trade-offs.

* and foreign keys, and indexes, and an optimizer, and a DBA to keep it all straight ....
posted by graymouser at 11:21 AM on October 14, 2015 [1 favorite]

This reads to me like "The only use cases where cars still make sense are where you need to drive on roads." Joins are everything -- you're either doing them at the database level or in your application, but they still need to get done.

Joins are a downstream consequence of decisions made in the 70s.

Given bell-bottoms and a big-ass mustache, I look around and see that disks are expensive. Therefore we need to normalize our data to only have one copy of any given bit of data. That means we also get data integrity. You also get serious performance issues as requests need to queue up for reads or writes on that single copy. You also get the wonder of joins, as your data is now scattered across a bunch of different normalized tables.

So yeah, in that environment, joins are really important.

40-50 years later, disks are crazy cheap. That means you don't have to normalize any more. That means your data can be localized/grouped in units that make more sense in terms of data retrieval/query instead of scattershot over a bunch of tables requiring a computationally expensive scavenger hunt to try and retrieve it all. That means you don't really have to do joins any more unless you screwed up how you're localizing/grouping your data. That means you can now optimize on your users' search/query/retrieval times instead of the cost of drive space.

Short version: if you're stuck with legacy database technology, with legacy schemas, then yes, joins are everything. If you're scaling out horizontally, or looking to go fast, you de-normalize like crazy to get rid of as many of your joins as you can. Your analogy to cars is an apt one. Joins and cars are absolutely necessary given a certain state of technology - what happens once the airplane is invented?
posted by NoRelationToLea at 11:30 AM on October 14, 2015 [1 favorite]

That's really the issue here: SQL implementations have had people thinking in terms of normalized databases for 15 years or so now. The question that NoSQL has raised is whether or not normalization is worth the cost of JOINs*, which are bulky and expensive.

Word.
posted by NoRelationToLea at 11:31 AM on October 14, 2015

NoRelationToLea: "That means you don't have to normalize any more."

Eponysterical?
posted by WaylandSmith at 11:39 AM on October 14, 2015 [5 favorites]

you don't really have to do joins any more unless you screwed up how you're localizing/grouping your data.

honest question...where can i find a reference for these failure modes and the patterns used to avoid them? because i mostly see junior devs pushing nosql as a way to use flat unconstrained tables, with lip-service to search performance, which let's face it, is mostly never, ever an issue for small & medium web resources.
posted by j_curiouser at 11:44 AM on October 14, 2015 [1 favorite]

the cost of JOINs*, which are bulky and expensive

This is just... not true. It depends on the scale of the joins, the structure of the data, indices, etc., and how the query optimizer can handle it. And you will be doing joins. If you relocate those to your code, you're just throwing away decades of effective research into how best to do those joins.

disks are crazy cheap. That means you don't have to normalize any more

Normalization wasn't about the costs of disks, it was about logical consistency.
posted by fatbird at 11:51 AM on October 14, 2015 [27 favorites]

RDMBSs provide astonishing importance in the all-important area of hours I have to spend dealing with race conditions and synchronization failures.
posted by The Gaffer at 12:04 PM on October 14, 2015 [3 favorites]

Third Normal Form: a mythical approach to database design said to be used by future advanced civilizations to avoid data redundancy and improve performance.

This is both funny and sad because of how true it is. I am not a highly-trained or credentialed professional. I am an autodidact (read: dilettante) and philosophy major dropout who has been taken on a career trajectory that has taken me from a ski-bum and retail store manager looking for a better way of tracking inventory to someone who is now procrastinating by writing this comment about SQL on Metafilter instead of trying to figure out how a transaction managed to get half-posted to a client's DB.

Here are a few lessons I learned the hard way, and I see people constantly making and remaking the same mistakes, mostly because of tight timelines and developers and project managers designing tables without involving the expertise of dedicated DBAs:

Even if you have the client demanding huge changes to the requirements at the last minute, and are pushing for the same deadline, take the time and effort to make sure your database is normalized enough to prevent inconsistencies. Creating a relationship diagram is something you do before the first CREATE TABLE, either that or something you do to plan an outrageously complicated set of writes to fix production data while your phone is ringing off the hook. Then it lives on as a stored procedure everyone forgets about until the problems the stored procedure introduced are discovered.
If normalization means that common queries on large numbers of records are going to require a ton of joins, and you don't want the performance hit, then that means you need an indexed view. Sure, a little more work at first, but now you have something that will maintain consistency while still being reasonably fast.
A corollary to the axiom about sanitizing your inputs: never assume the uniqueness of a value that isn't a necessary consequence of your database design. Identity properties for your primary keys are your friend. For an extreme example, take a situation I found that had been created by a certain entity that rhymes with 'grittycoop': credit card numbers are unique, right? No? Well, active credit card numbers are certainly unique, right? No? Well, the combination of credit card number and expiry date would never be shared by two otherwise completely unrelated accounts, right? Not necessarily. Now what happens when relying on that incorrect assumption when posting transactions; I wonder why those people refused to pay their bills?

As for NoSQL, as I said above, I'm not really what you would call an expert, just someone who has managed to muddle through somehow. That said, I can see that these new technologies offer tremendous promise, and are designed for the hardware constraints and use cases common today, not in the 1970s. I remain open to learning how these crazily denormalized beasts can handle transaction-processing applications where consistency and accuracy matter most, while preserving their inherent advantages.
posted by [expletive deleted] at 12:06 PM on October 14, 2015 [1 favorite]

A whole lot of real-world business applications fundamentally depend on normalized and consistent data.

Additionally, the "disk space" argument rings hollow, because most big databases have a few indices, which can take up quite a bit of disk storage. (In fact, I've seen this taken to the extreme where the size of the indices exceeded the size of the actual data)

Admittedly, document stores are actually rather nice when you want to break out of this model. If you've got lots of unstructured data, want to do version-control, or want your data to exist as a "snapshot" from a given time, a document store will do all of those things much better than an RDBMS.

If you've got a system that doesn't need to do lots of cross-cutting or ad-hoc analytics, microservices and NoSQL might be a good choice for you, and I envy your ability to reduce complexity and eliminate inter-dependencies! In practice, this ends up being really, really difficult.

IMO, I'd love to see an RDBMS that integrates seamlessly with a document store that can scale horizontally. There's room for both paradigms, but it's currently really difficult for them to coexist. I'd love to be able to store a product catalog in a traditional RDBMS, and then spit out invoices as documents that we (hopefully) never have to write to again. Postgres' JSON support is an interesting first step, but there's a lot more that could be done.
posted by schmod at 12:06 PM on October 14, 2015 [1 favorite]

Jepsen mainly tests databases for partition-resistance (the "P" in CAP) , and has also demonstrated that many distributed systems advertising themselves as CA can be neither consistent nor available under extreme circumstances.

No, Jepsen tests for consistency and availability under network partitions. The only meaningful CAP "options" are AP and CP -- meaning, respectively, the system retains the ability for every node to accept writes in the event of a network partition, or that the system retains linearizable consistency* in the event of a network partition. CA is not a coherent option, and not one that I've ever heard a distributed database lay claim to, since it means that the system retains those two attributes but is not partition tolerant, aka its consistency and availability semantics are completely undefined in the event of a partition.

Anyway, as someone who thinks that the affordances of RDBMSes are being thrown away way too lightly, I hope that F1 comes into the limelight at some point. It takes clustering into account from the very beginning and incorporates data locality into its flavor of SQL, which does foist a certain amount of complexity on the developer, but otherwise lets you keep all of the wonderful relational and transactional semantics you usually get in an RDBMS context.

* It's precisely the strength of this requirement that makes CAP a poor tool for analyzing real-world distributed system performance, since many databases can and do inforce other, weaker consistency constraints under partition. The options are not linearizable consistency or nothing, but folk reinterpretations of CAP often construe it that way.
posted by invitapriore at 12:09 PM on October 14, 2015 [7 favorites]

This is just... not true. It depends on the scale of the joins, the structure of the data, indices, etc., and how the query optimizer can handle it.

But how are the things you're talking about not costs of normalization? I'm better than a lot of developers at figuring out the situation of a SQL Server database and its indexes, but the iterative process of tuning a heavily normalized database is a cost. It's not free and it takes a lot of work to get it right. Like most technologies, this is really a story of tradeoffs, and the thing that noSQL solutions have made people realize is that there are other approaches and different tradeoffs you can make.

Normalization wasn't about the costs of disks, it was about logical consistency.

That's true. It's also not the only pattern that can achieve it.

Look at it this way. Let's say I could change my Metafilter username to "greymouser." In a normalized database, I change one record from "graymouser" to "greymouser" – that's efficient. But every time one of my posts gets updated the server has to join my record from Users to Comments. In a denormalized database, a background job has to update my 1808 (well, 1809 after this) comments to "greymouser" - but it never hits the Users table. In the aggregate, the latter process might actually be cheaper. And that's where we need to be thoughtful about our data storage.
posted by graymouser at 12:18 PM on October 14, 2015

I hope that F1 comes into the limelight at some point.

That's a great point. There are distributed, fault-tolerant relational databases out there -- F1 is Google's internal thing, Teradata is the famous commercial one. They're just really really complicated to design and implement, so they aren't free and open source. Maybe someday though.
posted by miyabo at 12:32 PM on October 14, 2015

Using SELECT * across the board, against tables consisting mainly of VARCHAR(MAX) columns.

I read that and my body couldn't decide whether to throw up or get hives or both. Or throw up hives.

It wasn't pretty.
posted by Mr. Bad Example at 12:46 PM on October 14, 2015 [1 favorite]

a background job has to update my 1808 (well, 1809 after this) comments to "greymouser"

And all the Mefi mail. And the posts table. Etc. And did you bother to key all of those so we know they're all the same person or are we just trusting the code somewhere perfectly enforces the requirement? I mean, we are denormalizing now, right?

But every time one of my posts gets updated the server has to join my record from Users to Comments.

It seems to me a lot of people who argue for only NoSQL instead of the saner "There are uses for both" are basically arguing for the database as a front-end cache. If doing a join from users to comments is a problem in an RDBMS, I think you have larger issues. Having been forced to work with Mongo a lot in my current gig, I've come to the conclusion I really like NoSQL for temporary data, denormalizing stuff with a time-to-live, etc. I'm happy to use that as a datasource but I can't sleep at night without knowing my "real" data is safe in the database.

The other conclusion is not to use Mongo. Like MySQL the popularity seems to be down to availability or name recognition or something and people wind up overlooking all the features/ safety belts that aren't there.
posted by yerfatma at 1:04 PM on October 14, 2015 [4 favorites]

Mongo and Redis are about equally suitable as canonical backing stores.
posted by invitapriore at 1:08 PM on October 14, 2015

I use mongo a lot at work, and for what we use it for, it's fine (we are mostly storing... "documents" with varied structure and keys and stuff). When we try to use it like a sql database, it sucks (i.e. a bunch of records all with the same "columns"). We are storing a very large number of documents with essentially no problems.

I do think the mongo query language sucks. Insisting that your query be a valid json document leads you to some contortions when it comes to structuring queries. I find reading them to only be kind of a pain but putting them together to be very frustrating.

But mostly, we are not doing complex queries on our mongo data. Usually there is a combination of key values that gets the small number of documents we are interested in "right now"

A real problem in our organization, imo, is that the SQL databases are locked down - getting them changed is difficult, and requires intervention of several people. But the mongo dbs are wide open. I can go create a new database or collection right now myself in our production database. The end result is that when people want to try something fairly fast... they do it in mongo.

Whether it makes sense to do it in mongo or not.
posted by RustyBrooks at 1:26 PM on October 14, 2015

But how are the things you're talking about not costs of normalization?

They're costs of normalization. No one said it was free. But neither is NoSQL--you've just shifted the data modeling overhead to partitioning and runtime checking instead of index building and tuning for the query planner.

Given the decades behind RDBMS research and practical experience, I tend to trust that body of work over the Mongo guys deferring write guarantees to the hard drive controller not lying to you.

That's true. It's also not the only pattern that can achieve it.

Absolutely. I was reacting to the somewhat absolute feeling of the statements by you and NoRelationToLea.

You raise data locality as an advantage of NoSQL and something that obviates the need for a lot of indices, but this short of sharding is well established in the relational world already as a normal tactic for scaling a database.

But every time one of my posts gets updated the server has to join my record from Users to Comments. In a denormalized database, a background job has to update my 1808 (well, 1809 after this) comments to "greymouser"

Except this is exactly the case where denormalization isn't a great strategy for performance--good caching is. When you update your username, the update operation poisons the cache so that the next page hit that pulls up your comments gets them fresh from the database--which isn't a problem because most comments for recent threads are already in cache because people aren't constantly changing their username.

You're right that there's a cost to all this. I think the fundamental good NoSQL databases are doing is forcing a re-evaluation of the necessity of those costs. Part of the result is that a lot of use cases are being established where eventual consistency is good enough, for example. But part of it is also that removing your schema from the database and putting it in the code is a false economy. You still enforce constraints, you still face schema migration issues, you lose system wide consistency, and you're further behind overall.
posted by fatbird at 1:31 PM on October 14, 2015 [7 favorites]

As yerfatma implies, there is a cost to denormalization as well, and that cost is maintainability (ever the enemy of performance). A normalized database goes hand in hand with well-factored code, while a denormalized database goes hand-in-hand with spaghetti code.

And it will be spaghetti code; don't tell me you've got a framework that keeps track of all your relations and generates all the denormalized code for every background job you need.

Normalized Database [bijection] Factored Code
posted by I-Write-Essays at 2:04 PM on October 14, 2015 [1 favorite]

Architect: We're putting all the state information in a SQL database.
Dev: I want to put this stuff here in an XML object.
Architect: No. "All state in SQL DB" means, well, that. Put it in a table.
Dev: <makes a table with one row of one column: VARCHAR[SHITLOADS] in which he stores aforementioned stuff as XML>
Architect: LOL WUT
Dev: No, this is better because I can use LINQ.
Everybody, including lost undiscovered tribes in the Amazon: LINQ to SQL is a thing. Also, what Dev did is bad and dumb.
Universe: Oh, hey, it's in production now. Too late to fix it!

I wish this were not a true story.
posted by sourcequench at 2:15 PM on October 14, 2015 [17 favorites]

LINQ to SQL is a thing.

LINQ and WebAPI are the reasons I am not campaigning for my shop to give up on Microsoft technologies entirely.
posted by graymouser at 2:21 PM on October 14, 2015 [1 favorite]

I'm not sure whether it's good or bad, sourcequench, but Microsoft SQL Server has been an XML data store for a long time, and Postgres has made great strides in adding fields that are basically JSON objects indexed in various ways.

I think that denormalization is a performance strategy that's lost importance over the years and caching of various kinds has replaced it, mainly because good caching will always be faster than a datastore--relying on a denormalized model doesn't get you as much as a well-tuned cache layer. I saw a Facebook storage-layer engineer describe their database as a giant, geographically distributed, in-memory key-value store representing their graph DB, that's occasionally dumped to MySQL so it's on disk at some point.
posted by fatbird at 2:21 PM on October 14, 2015 [1 favorite]

Joins and cars are absolutely necessary given a certain state of technology - what happens once the airplane is invented?

You still have a shitload of cars around because they are appropriate to different use cases?
posted by atoxyl at 2:55 PM on October 14, 2015 [2 favorites]

primethyme: "It is stunning how many professional developers have no understanding of SQL, beyond that their ORM spits out queries somehow "behind the scenes".

I have the opposite problem. Every time I try to work with code that uses ActiveRecord or something of its ilk, I get stuck because I'm trying to understand what it's really going to do behind the scenes, and there's too much magic for me to grok it. All my Rails-loving friends tell me to "just trust it to do the right thing" but I have way too many scars from technologies I was supposed to "just trust" to make that mistake again. I'm ok with tools that help me get a job done, but I have a very hard time using them if I can't first understand what they're actually DOING."

Rails dumps all of its SQL queries into $RAILS_ROOT/log/development.log (or test.log for the test environment.)

You can (and should) peek behind the curtain. That's how you can tell if ActiveRecord is optimizing your queries properly. ActiveRecord does a fairly good job of abstracting queries, but there's no law that says you can't roll your own SQL into an ActiveRecord query. Sometimes you have to.

You can even mix ActiveRecord and raw SQL into the same query. I have a voodoo doll* of the guy who did that (and worse) at our company.

* - in my mind
posted by double block and bleed at 4:37 PM on October 14, 2015 [2 favorites]

And, no, you shouldn't trust that Rails will always do the right thing. It usually does, but it's easy to forget that even with a great framework, computers are really, really dumb and neither know nor care what your real intentions are.
posted by double block and bleed at 4:41 PM on October 14, 2015 [1 favorite]

People like people who do SQL thing?

This is amazing! I do SQL thing! Data-driven surveys using transitive closures!

I love SQL! It is the * in my life.
posted by mikurski at 6:36 PM on October 14, 2015 [2 favorites]

LINQ and WebAPI are the reasons I am not campaigning for my shop to give up on Microsoft technologies entirely.

That conversation you're quoting happens all over, all the time.

Programmer: "I want my data in format X, because of read() and reasons."
Architect: "You really want to put your datas in a database, that is what databases are for and good at."
Hardware Vendor: "We support DBs."
Programmer: "I don't want a DB, I don't like DBs"
DBA: "You're only saying that because you don't know what a DB is or does or how they work"
Programmer: "you're not the boss of me and I want my data in format X"
Architect: "You're a moron. Use a database."
Hardware Vendor: "We support DBs."
Programmer: "Fine I'll use a database, here's a single table full of encoded strings that when you decode them are oh hey look what a surprise it's format X what a convenient coincidence what are the odds of that golly"
Architect: "you dogfucker what have you done"
Client: "This is great we love it ship it!"
Programmer: "Yay for me I shipped!"
Architect: "Me, you say?"
Client: "Oh hey why is everything so slow just before close of business?"
Architect: "Because...."
Client: "that's not OK"
Programmer: "It must be the database's fault!"
DBA: "you dogfucker what have you done"
Client: "Well we need this go go faster in time for close of business we need to change something."
Hardware Vendor: "You have a pretty mouth."
DBA: "You're going to tell everyone this project failed because of the database part aren't you?"
Programmer: "Yes."

That conversation's not unique to Microsoft tech, not even a little. An awful lot of the world is built to that exacting standard, where "database" secretly means "a spreadsheet full of the structured-data-format flavor of the month."
posted by mhoye at 6:40 PM on October 14, 2015 [35 favorites]

( ^ Now that I've recovered from hysteria )
I've seen it asked whether a database supports more than 65535 columns. I can't even imagine...
posted by I-Write-Essays at 7:08 PM on October 14, 2015 [3 favorites]

With apologies to Churchill, SQL is the worst form of query language, except for all the others.

Nope.

Good golly, nope.

Not even close.

KDB+/Q wins *that* dubious prize by a mile.
posted by 43rdAnd9th at 7:10 PM on October 14, 2015

I haven't actually seen that pattern. I have seen architects say "we don't need a real DB for that, we'll just use a binary flat file." And then you have 250 different threads spinlocking on access to that flat file and 3 other processes mmaping it (some in rw for some random reason) and an attempt at backing it up using a cron job and a shell script and it all ends in disaster.
posted by miyabo at 7:36 PM on October 14, 2015

SQL 4 L1F3!
posted by putzface_dickman at 7:39 PM on October 14, 2015

I'll add my own vote that I'd like to see wider adoption of the difference between relational databases and the SQL language—not long ago I was pressed into teaching the DB class here for a semester, and between seeing their scores and talking with my students it was clear they were much better at writing SQL queries than relational algebra expressions. And that surprised me a little because I've always found the functional feel of relational algebra far more understandable: SQL feels like this by-the-numbers declarative template where, if your query matches that SELECT...FROM...WHERE template well then it seems all handy and nice, but if not then you're looking up whether this DBMS supports WITH clauses or debugging think-o's with nested grouping. To me there's no good reason for WHERE and HAVING to be different clauses other than to fit filtering after grouping into this one-level declarative template format.

And since I'm voting for clearer differences I'd like to distinguish between the "structured database" versus NoSQL heap-o-data dichotomy and the relational versus other data model dichotomy. Some people want object stores, they need object stores, and fitting any other DBMS model to that role is possible (and widespread) but there will be hidden pitfalls. (I've got another neat anecdote about the Large Hadron Collider on that subject). And I really disagree about object databases being fundamentally hierarchical just like XML or JSON data—they clearly resemble graph databases more than anything.

Really I think the only person who would embrace relational-everywhere would be an ORM masochist. Oh wait—this guy cut his teeth on Hibernate, and now has a company that's selling their own fancy Java ORM solution! Color me, not surprised, but perhaps a shade of Pepsi blue?
posted by traveler_ at 11:56 PM on October 14, 2015 [1 favorite]

Look at it this way. Let's say I could change my Metafilter username to "greymouser." In a normalized database, I change one record from "graymouser" to "greymouser" – that's efficient. But every time one of my posts gets updated the server has to join my record from Users to Comments. In a denormalized database, a background job has to update my 1808 (well, 1809 after this) comments to "greymouser" - but it never hits the Users table. In the aggregate, the latter process might actually be cheaper. And that's where we need to be thoughtful about our data storage.

And how about uniqueness? You'd (hopefully) want to ensure your name change does not conflict with an existing name. Now you have to scan the entire comments table twice, once to ensure the name was unique and once to do the update. To ensure no conflict with another user proposing the same name change, you'd have to lock the table while all this was going on. And of course, there are lots of other affected tables which will all need updating. Regardless of the approach you'd still need transactions. Which ever way, this starts to look massively inefficient, it may of course pay off if there are many more reads than writes.

Its about more than data storage it is also about the impact on other users and the time taken to complete an operation, while maintaining correctness and consistency.
posted by epo at 2:08 AM on October 15, 2015 [3 favorites]

Caching has an impact obviously but if your entire data set fits into cache then it hardly matters what you do, so long as you trust that it will be up to date and consistent on disc in the event of a crash.
posted by epo at 2:20 AM on October 15, 2015

Oh wait—this guy cut his teeth on Hibernate

Tangentially related: I actually dipped my toe into the ORM space first with ATG's Relational Views, a bespoke ORM solution for their (bespoke) Dynamo appserver. This was back before JSP or J2EE were around. After moving over to WebLogic and encountering the horror of EJBs, I collaborated with a coworker to create another bespoke ORM solution called Relational Beans, based loosely off of my (mis)understanding of a T-Spaces whitepaper I had recently read. That was for a site called rentanything.com which I'm sure nobody has ever heard about, but was, in 2000, attempting to invent what we now call the sharing economy. When I moved on to work at a large financial institution which was transitioning off of ATG to JBoss I was tasked with doing a best-of-breed analysis around existing ORM solutions. Actual contenders in the space, aside from my home-brew solution, were Hibernate and iBatis. I ended up recommending Hibernate because configuring it was less of a Sysiphean nightmare, and that became the ORM standard at the company.

This, of course, resulted in lunacy where developers were Hibernating entire databases. I was lucky enough to get appointed to the architecture board and developed a (mostly toothless) policy around the use of ORM, which basically went: if the data is mutable, use Hibernate. If it is read-only (in the context of the webapp), use our JDBC libraries, which essentially alllowed execution of raw SQL, prepared statements and stored procedures and returned result sets as List and multiple result sets as >. With simple caching in place for the read-only data and limited activity for the mutable data (this was a marketing site with no e-commerce component) performance was borderline stellar.

For BI reports the company started using Sybase IQ, which is a type of RDBMS which handles denormalized data and ad-hoc queries very efficitently. To move data between the main DB (which was just regular Sybase, aka Microsoft SQLServer) we used an enterprise ETL tool called Informatica which is kind of like Pentaho. Tools like that are a godsend when you have multiple incompatible datasources which need to synchronize.

For the last couple of years I've been working with ATG again, utilizing their loosely-typed key-value-based ORM solution called, somewhat generically, the Repository. The loose typing, which amounts to String username = (String)repositoryItem.getPropertyValue("userName"); can be frustrating but allows for a lot of flexibility in terms of building business rules.

ORM: it comes in many flavors.
posted by grumpybear69 at 6:41 AM on October 15, 2015

The last time I wrote code for a living every day was just a LEETLE bit before web frameworks with automatic ORM and CRUD took over. We worked with a perl-based tool called Mason, which was awesome for site control, but which also did absolutely nothing for ORM. We had to build that part on our own, which to us make sense because have you SEEN the kind of bullshit SQL you get when that gets automated?

Imagine my surprise to discover, a year or two later, a whole plethora of web frameworks that did exactly that and LA LA LA LA JUST THROW HARDWARE AT IT. I mean, Christ, even with our hand-coded CRUD layer, we were forever re-tuning things at the DB and SQL layer to get optimum perf as we better understood the use cases. Deciding you'll just ramp up the hardware to fix shitty SQL seems like malpractice.

In my current company, SQL is a given. The data burden we have in our problem domain is so large that only an enormous fool would try to use anything else, so we don't have to fight this battle. And since I've been here for 8 years, it's allowed me to shake my head in disbelief from a nice, safe distance every time some fresh-faced 22 year old starts yammering about how SQL is dead.
posted by uberchet at 7:44 AM on October 15, 2015 [2 favorites]

mhoye:
> [...] where "database" secretly means "a spreadsheet full of the
> structured-data-format flavor of the month."

You are absolutely right. Just to clarify: That is not the pattern I was complaining about.

Having a database table be a spreadsheet with rows and columns, where some cells are an XML document or JSON object or binary blob can be evil and dumb, but isn't necessarily always horrific in and of itself. There are times where it's justifiable.

Having a (relational) database table with two columns called "key" and "value" should trigger a gut check about your design, but isn't a guarantee your application will open a portal to hell.

The case I was griping about was where there's one row and one column. The spreadsheet has one single cell, which is an XML document the size of the moon, and every time you want to do anything with it you have to parse the whole thing, and every time you want to modify it, you have to lock the whole thing, parse it, make your change, serialize it and undo the lock.

That is a pessimal design. It is N-trans-meta awful. It is worse than literally any other option I can think of, including things like writing your own filesystem from scratch, or building a robot to spray-paint your state data on a passing herd of cattle in base-64 encoded Deep Enochian.
posted by sourcequench at 8:03 AM on October 15, 2015 [8 favorites]

The spreadsheet has one single cell

IT'S NOT EVEN THE WEEK OF HALLOWEEN
posted by Nonsteroidal Anti-Inflammatory Drug at 8:11 AM on October 15, 2015 [5 favorites]

The spreadsheet has one single cell, which is an XML document the size of the moon

I really hope you were holding a flashlight under your chin and using a spooky voice while typing that.
posted by Mr. Bad Example at 8:36 AM on October 15, 2015 [8 favorites]

The king is forgotten, so many contenders, counteraction rising.
At the end of the story, yeah, we're all ready for the punchline.
They all got their killer apps, but there's no use with all your gimmicks.
But SQL's back baby, so check this out

oh god, an anime/j-pop/database query language mash up, wtf am i doing
posted by Eleven at 9:19 AM on October 15, 2015 [2 favorites]

Having a (relational) database table with two columns called "key" and "value" should trigger a gut check about your design, but isn't a guarantee your application will open a portal to hell.

There are actually quite a few high volume sites designed in exactly this way using conventional relational databases (reddit, for example). Others realize what they're doing and migrate off relational to key-value datastores.

The case I was griping about was where there's one row and one column. The spreadsheet has one single cell, which is an XML document the size of the moon, and every time you want to do anything with it you have to parse the whole thing, and every time you want to modify it, you have to lock the whole thing, parse it, make your change, serialize it and undo the lock.

That is a pessimal design. It is N-trans-meta awful. It is worse than literally any other option I can think of, including things like writing your own filesystem from scratch, or building a robot to spray-paint your state data on a passing herd of cattle in base-64 encoded Deep Enochian.

Happens all the time - can't you just solve this with joins?

Sorry, I mean first you'd need to shred the document into the appropriate tables (assuming they exist, assuming the XML or JSON or whatever format doc doesn't require adjustments to your schema, etc.), then do joins - both to query as well as to get the original XML back out? At least, that's what you'd have to do if your only option is a relational database. Well, either that or just shove the whole shebang into a single field, retrieve the whole thing to the client, parse on the client, etc.

But hey, SQL is awesome. Everything is cool when you're part of a team.

In all seriousness, the reason you get useful data crammed into a single field like that is the same reason you get applications in the enterprise that make zero sense to end users that don't also know 3NF (i.e. - Salesforce, Netsuite, etc. etc. etc.). Which is, the number of use cases for which a relational databases fits well are dwindling. "Fits well" btw is different than "made to fit by lots of banging with hammers and/or unholy acts of contortion." Just cause people have done it in the past via pain and impedance and stubbornness doesn't mean they should continue to do it that way.

There are better tools today, and most of the world's data would be better served by those new tools than continually trying to cram the data that's not tabular and not relational into relational data models.
posted by NoRelationToLea at 9:37 AM on October 15, 2015

Deciding you'll just ramp up the hardware to fix shitty SQL seems like malpractice.

That seems to be a common sentiment in this thread, but hardware is often cheaper than developer time.
posted by a snickering nuthatch at 12:12 PM on October 15, 2015 [1 favorite]

I've worked with people who are so concerned about conserving hardware, they won't allocate more than 500 megs to log files, which forced developing and maintaining completely useless log archiving and cleaning jobs. The databases are terabytes large, but we can't even afford a gig for the logs! There's more than one cargo cult on the market.
posted by I-Write-Essays at 12:18 PM on October 15, 2015 [2 favorites]

I work in a place with no less than four different relational databases - SQL Server, Postgresql, MySQL and SQLite, at least two no-SQL stores (mongo and redis), and maybe a couple others I don't know about. In the SQL Server (which is our main database) sphere, we have different databases for development, qa and production and they sometimes have different tables named differently. In theory the development database gets loaded with recent data from time to time so we can use it to play around in, but in practice this rarely happens. Just to make it more fun, the various db machines have a variety of interesting and unpredictable names (like one will be "abc101" and another "x-1-35") most of which are undocumented and which change from time to time to keep us on our toes. Just finding where things are stored can be a challenge.

The two billion rows (or is it several times that) of what should be ephemeral log data make queries a pain. More than once the DBA has stopped my exploratory queries and rebuked me for not using selects from the proper columns - the ones that had indexes. We can't, it seems, index on the natural primary key because it is built in such a way that inserts would be very expensive.

Still, I do like relational databases for the kinds of data that they do well. We could, of course go completely to triple stores, but I can only imagine what that would look like at scale.
posted by Death and Gravity at 12:56 PM on October 15, 2015 [1 favorite]

different databases for development, qa and production and they sometimes have different tables named differently

/weeps quietly
posted by yerfatma at 5:40 AM on October 16, 2015

Death and Gravity: "I work in a place with no less than four different relational databases - SQL Server, Postgresql, MySQL and SQLite, at least two no-SQL stores (mongo and redis), and maybe a couple others I don't know about...we have different databases for development, qa and production and they sometimes have different tables named differently..."

Do you relax after work by hitting yourself in the head with a brick?
posted by double block and bleed at 7:58 PM on October 18, 2015 [1 favorite]

« Older The Red Drum Getaway | RIP Carey Lander, keyboardist Newer »

This thread has been archived and is closed to new comments

MetaFilter

Living in the database, database
October 14, 2015 6:34 AM Subscribe

Tags

Share

Living in the database, database October 14, 2015 6:34 AM Subscribe

Tags

Share

Living in the database, database
October 14, 2015 6:34 AM Subscribe