ENORMOUS SCALABILITY
April 25, 2012 10:52 AM   Subscribe

Leverage the power of CLOUD COMPUTING using standard Unix tools!
posted by kenko (31 comments total) 23 users marked this as a favorite

I'm trying to decide if this is comically useless or epic genius.
posted by Tomorrowful at 10:55 AM on April 25, 2012


I'm going to go with somewhat useless and also hilariously genius.
posted by mikeh at 10:57 AM on April 25, 2012


I'm going with "kind of silly."
posted by swift at 10:58 AM on April 25, 2012


Hehe, yeah, the comments are hilarious.

It's not a *completely* useless tool -- you'll find, for example, that tools like Sun Grid Engine are essentially wrappers around this basic idea. :-)

The difficulty is not invoking the command remotely on a server. The difficulty is keeping the servers up and running, provisioning them with the right software, figuring out how best to schedule the tasks (how many per server?), how data gets migrated, and so on.

But if you have your own wee cluster, a tool like this isn't beyond the pale when you're doing lots of command line work with tools that are cpu-bound (e.g. a lot of research tools) -- it's quick and dirty -- even if, yeah, you could certainly find other more robust solutions for scheduling them. :-)
posted by smidgen at 11:06 AM on April 25, 2012 [1 favorite]


I'm trying to decide if this is comically useless or epic genius.

Both.
posted by empath at 11:18 AM on April 25, 2012


One caution with this approach. Suppose, hypothetically of course, that you're running a highly parallel distributed task built in Java. Further suppose that you're lazy and didn't build a fancy distributed architecture, so this task involves a bit of network IO on each worker thread. Finally, suppose that you, or perhaps one of your partners, failed to close said IO sockets after use. For the sake of this example, we'll state that you've used a script rather like the one linked here to quickly task all 16 machines of your CS department's high-end cluster to run this job with a couple dozen worker threads on each machine. Smile as you watch the results pour in.

Expected result? Realize that neither you nor anyone else can any longer ssh to any of the departmental machines. Further realize that the error message makes it look suspiciously like every machine in the cluster has run out of available global file descriptors. Come to the conclusion that no one ever anticipated such silliness, so no one ever locked down any such resource limits. Come to the further conclusion that it's late on a Friday and no one will be around to actually fix the problem. Send off an email to the department sysadmin that reports the problem without exactly taking full responsibility, as you're not entirely sure what happened. Smile and nod as your classmates complain that they can't get their work done for the better part of the weekend.

This is a strictly hypothetical caution, mind you.
posted by zachlipton at 11:26 AM on April 25, 2012 [20 favorites]
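[For what it's worth, the ceiling that story hinges on is easy to inspect from the shell. A minimal sketch, assuming a POSIX shell; the actual limit value varies by system and configuration:

```shell
# Show the soft limit on open file descriptors for this process.
# Leaked sockets count against this; exhaust the system-wide pool
# and new connections (including inbound sshd sessions) start failing.
ulimit -n
```

Raising or lowering it per-user is what the hypothetical sysadmin never got around to.]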


cluster ssh is probably my favorite tool along these lines. It presents N terminal windows plus one command window, and commands typed into that command window are sent to all servers.

On the more automated side, there's Fabric, Chef, and Puppet.
posted by fragmede at 11:28 AM on April 25, 2012 [3 favorites]
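[The fan-out idea these tools automate can be sketched in a few lines of shell. Hostnames here are made up, and `echo` stands in for a real `ssh` invocation so the sketch runs anywhere:

```shell
# Send the same command to every host in a list, cluster-ssh style.
# Hypothetical hostnames; swap echo for ssh to actually run it.
hosts="web1 web2 web3"
cmd="uptime"
for h in $hosts; do
    echo "ssh $h $cmd"
done
```

Tools like cssh add the interactive multi-terminal view on top of exactly this loop.]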


I really think a law should be passed allowing the murder of anyone who uses "leverage" as a verb. Maybe torture, too. Something has to be done.
posted by Decani at 11:28 AM on April 25, 2012 [2 favorites]


Everything old is new again.
posted by sourcequench at 11:50 AM on April 25, 2012


I wonder if you would get your results faster if you renamed the script "fog"
posted by mmrtnt at 11:56 AM on April 25, 2012


Leverage the power of CLOUD COMPUTING using standard Unix tools!
This sounds way too much like something from a 'Town Hall Meeting' at my work.
I believe it's a corporate requirement to use leverage as a verb in such meetings.
Over and over...
posted by MtDewd at 11:58 AM on April 25, 2012 [1 favorite]


I'm leveraging the power of cloud computing to post the comment. LEVERAGING!
posted by blue_beetle at 12:02 PM on April 25, 2012 [3 favorites]


Also not laughing here. In the 90s my employer had a much more complicated version of this type of thing to distribute compiles on a cluster of machines of different architectures. I wrote scripts to distribute mp3 encoding over my local network. Even today, "distcc" is an important tool for fast software builds on unix networks…
posted by jepler at 12:02 PM on April 25, 2012


Tomorrowful: "I'm trying to decide if this is comically useless or epic genius."

That's a description of all of the best Unix command line tools. So: success!
posted by Plutor at 12:33 PM on April 25, 2012 [1 favorite]


I wish distcc existed for ghc.
posted by jeffburdges at 12:37 PM on April 25, 2012


I wrote my own version of this in, oh, 1992. Between rsh and a common NFS server it was quite easy to farm out computational jobs, do parallel builds, etc. The best part was where we gave our Sequent parallel Unix box "itchy" multiple hostnames: "itchy-1", "itchy-2", etc. That way it was easy to treat it as 10 totally separate computers for purposes of parallelism, instead of all this hideously complicated really impressive multitasking stuff they had innovated.
posted by Nelson at 12:38 PM on April 25, 2012


Lame. It's not clear what exactly he is trying to satirise, or why.
posted by Joe Chip at 2:14 PM on April 25, 2012


It's satirising the overuse of the word "cloud" to describe anything and everything. Which is justified, in my opinion. At this point you have to resort to clunky neologisms like "infrastructure as a service" in order to get across what a cloud company actually does.
posted by miyabo at 2:36 PM on April 25, 2012


Huh, I never knew sort had a -R option.
posted by These Premises Are Alarmed at 3:20 PM on April 25, 2012


Huh, I never knew sort had a -R option.

It's pretty neat, though unlike shuf it isn't a true uniform shuffle: GNU sort -R sorts by a random hash of each key, so identical lines end up adjacent. I often use it as a quick way to permute genomic data on the command line.
posted by Blazecock Pileon at 3:28 PM on April 25, 2012 [1 favorite]
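[A quick illustration, assuming GNU coreutils sort (BSD sort's -R behaves differently): the shuffle permutes lines but preserves the multiset, which is easy to confirm by re-sorting. The chr1/chr2/chr3 input lines are just placeholder genomic-style data.

```shell
# Shuffle lines with sort -R, then re-sort to confirm nothing was
# lost or duplicated. Note: GNU sort -R sorts by a random hash of
# each key, so duplicate lines stay adjacent (unlike shuf).
printf 'chr1\nchr2\nchr3\n' | sort -R | sort
# prints chr1, chr2, chr3, one per line
```
]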


I admit I find the -R option to sort conceptually confusing, but it is convenient.
posted by kenko at 3:34 PM on April 25, 2012


jepler: "Also not laughing here. In the 90s my employer had a much more complicated version of this type of thing to distribute compiles on a cluster of machines of different architectures. I wrote scripts to distribute mp3 encoding over my local network. Even today, "distcc" is an important tool for fast software builds on unix networks…"

distcc has the distinction of running on multiple servers. This chooses one at random, in mockery of the idea of "cloud computing" as a thing where you don't know where your computers are anymore.

BashReduce is the thing you are looking for.
posted by pwnguin at 3:37 PM on April 25, 2012 [1 favorite]
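[The random-dispatch idea being mocked fits in a line or two. A hypothetical reconstruction, with made-up hostnames and `echo` standing in for the real `ssh`:

```shell
# Pick one host at random from a list and "run" the command there.
# This is the whole trick: randomness in place of actual scheduling.
host=$(printf 'alpha\nbeta\ngamma\n' | sort -R | head -n 1)
echo "would run: ssh $host uptime"
```
]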


GNU Parallel can actually do this in a useful way.

GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.

If you use xargs and tee today you will find GNU parallel very easy to use as GNU parallel is written to have the same options as xargs. If you write loops in shell, you will find GNU parallel may be able to replace most of the loops and make them run faster by running several jobs in parallel.

GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.

For each line of input GNU parallel will execute command with the line as arguments. If no command is given, the line of input is executed. Several lines will be run in parallel. GNU parallel can often be used as a substitute for xargs or cat | bash.

posted by PueExMachina at 6:39 PM on April 25, 2012 [5 favorites]
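[Since GNU parallel may not be installed everywhere, here's a hedged sketch of the same idea using the xargs options it mirrors; with parallel itself, `seq 3 | parallel echo job {}` would be roughly equivalent. The `seq` input is just stand-in work:

```shell
# Run three trivial jobs, at most two at a time, xargs-style.
# -P sets the parallelism; -I {} substitutes one input line per job.
seq 3 | xargs -P 2 -I {} echo "job {}"
```

Output order isn't guaranteed under -P, which is exactly the sequencing problem GNU parallel's output buffering solves.]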


Please more bash-y posts!!! Very good meditative reading!
posted by flyinghamster at 7:06 PM on April 25, 2012 [4 favorites]


I have thought that someone should extend GNU parallel to start a pool of EC2 instances and distribute processes among them. But that's just daydreaming.
posted by miyabo at 9:25 PM on April 25, 2012


There's also "rshall", which does what the description of this script implies it should... only properly.
posted by felspar at 10:20 PM on April 25, 2012


If I'm following this, I'm going to consider this a joke, because that example violates data locality horribly, making no attempt to filter data before sending it across the network.
posted by Pronoiac at 3:48 AM on April 26, 2012


Oh, hang on, it runs each command on only one server, not on every server. Never mind.
posted by Pronoiac at 3:51 AM on April 26, 2012


I was saying this is lame because it is such a woefully inadequate solution that it makes a mockery of itself rather than of "cloud" computing.
posted by Joe Chip at 4:33 AM on April 26, 2012


Yeah, came in here to mention GNU parallel, which does this for reals. It's also a lot less ridiculous than Hadoop.
posted by DU at 7:22 AM on April 26, 2012


kenko: "I admit I find the -R option to sort conceptually confusing, but it is convenient."

See also `yes no`.
posted by Plutor at 10:15 AM on April 27, 2012




This thread has been archived and is closed to new comments