"The data that we actually used."
December 8, 2012 8:24 PM

Rosalind.info is a website with bioinformatics problems inspired by Project Euler (previously, previouslier.)

From the about page: "We hope that Rosalind will inspire a new generation of bioinformatics students by attracting biologists who want to develop vital programming skills at their own pace in a unique environment as well as programmers who have never been exposed to some of the stimulating computational problems generated by molecular biology." It is named after Rosalind Franklin because of her work on X-ray crystallography with Raymond Gosling that helped Watson and Crick discover the DNA double helix.

"Rosalind is a joint project between the University of California at San Diego and Saint Petersburg Academic University along with the Russian Academy of Sciences."

Once you have downloaded a dataset you have five minutes to upload your solution. This forces your solutions to be efficient and makes it hard to pass by random chance.

Y Combinator had a post about it recently as well.
posted by lizarrd (21 comments total) 46 users marked this as a favorite
I've been working on these a while and really enjoying them. I'm a Novice Amoeba according to my standings page. (Though I think my "Phage Lambda" Achievement is more appropriate, as I'm trying to solve them with Haskell.)

There isn't one fixed order to work through them. Each problem has prerequisite problems (that you can view in tree form), and as you solve the pre-req's new problems become available to you.

Be aware that they are still adding problems, so when you go back you may do a double-take. "Hey, I thought I had solved all of the first 10. Why's that one near the top unsolved?" They've just added a new one.
posted by benito.strauss at 8:55 PM on December 8, 2012 [1 favorite]

This is excellent. I'm passing this on to colleagues. Cheers.
posted by Blazecock Pileon at 9:11 PM on December 8, 2012

I've been playing with these for a few days too (I actually saw them through lizarrd elsewhere, thanks again!), and it's an interesting contrast with Project Euler, an older and math-ier variation on this that was actually one of the inspirations for Project Rosalind. If there are similar websites for other fields, I'd love to hear about them. As for Rosalind, I definitely appreciate the problems being put in a domain-specific context and the pre-req arrangement is neat if you want to pay special attention to a specific topic. On the other hand, it would be nice to have social features that don't require you to link external accounts and if solvers who aren't part of a class were able to ignore pre-requisite requirements.

All in, I'm having a lot of fun with it and it's nice to have some good string-based problems to add to Euler's math-based problems when I'm trying out a new language.
posted by jdherg at 10:50 PM on December 8, 2012

Neat. My 13 year old is obsessed with Project Euler (he's solved over 100 problems now). I'll have to show him this.

And I agree with having some string-based complements to the math-based Euler problems to round out learning a language.
posted by DU at 5:51 AM on December 9, 2012 [1 favorite]

Each problem has prerequisite problems (that you can view in tree form)

Is tree-form not working for me or do I not understand it? I don't see a tree, I just see a bunch of text.
posted by DU at 6:38 AM on December 9, 2012

Also worth mentioning along these lines is 4clojure, which is very much in the vein of Project Euler but specifically aimed at learning Clojure (so you enter code, not solutions).
posted by invitapriore at 7:34 AM on December 9, 2012 [1 favorite]

Is tree-form not working for me or do I not understand it? I don't see a tree, I just see a bunch of text.
Maybe your browser doesn't support SVG? http://rosalind.info/problems/tree-view/ works for me.
posted by dfan at 7:53 AM on December 9, 2012

Is tree-form not working for me ...

I thought it might be because you weren't logged in to Rosalind, but I logged out and still saw a tree. Like dfan says, it's a big old SVG diagram. It shows up for me in FF 17 on Win7.
posted by benito.strauss at 9:10 AM on December 9, 2012

I'm using FF (iceweasel, actually) and it supports SVG just fine. Dunno.
posted by DU at 11:05 AM on December 9, 2012

Planning on looking at this at work tomorrow - is actual knowledge of DNA a hindrance? For example, looking quickly at the REVC example problem, I wouldn't bother with a formula, I'd just read the string backwards, because I don't need to think about ATG becoming CAT.
posted by maryr at 9:48 PM on December 9, 2012

Thanks for this. Set myself up to learn Python this holiday season. Both this and the Euler set fit in neatly with that goal. :)

Incidentally, if anyone is planning to use C# for this, I'd recommend using LinqPad instead of Visual Studio, which, as awesome as it may be, is more suited for projects, rather than standalone snippets.
posted by the cydonian at 10:11 PM on December 9, 2012

maryr, I don't know what you mean with "formula", but just look at the sample dataset, and be able to produce the sample output.

If you're saying you don't need to write a program because you can just read it off, the dataset they actually give you is usually significantly bigger than the sample. For REVC my dataset had 1000 bp.
posted by benito.strauss at 10:15 PM on December 9, 2012
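
For anyone following along with REVC, here is a minimal Python sketch of the reverse complement. It assumes the input is an uppercase DNA string containing only A, C, G and T; the name reverse_complement is an illustrative choice, not something Rosalind prescribes.

```python
# Map each base to its complement (assumes uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    # translate() swaps each base; [::-1] reverses the result.
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATG"))         # CAT
print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

The same function should handle a 1000 bp dataset just as easily as the three-base example, which is the point of writing it as code rather than reading it off by eye.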

Good link from that YC post: PLOS Computational Biology: An Online Bioinformatics Curriculum (2012)
posted by rollick at 3:20 AM on December 10, 2012 [1 favorite]

The thing I really like about this is the 'Python Village'.

When I've tried to get my math teacher friends interested in Euler, they've all been put off by the coding. The introduction to Python is just simple enough to be immediately accessible, and it will show people that the programming concepts they need to understand are not that tricky (at first...)

Great find!
posted by man down under at 12:10 PM on December 10, 2012

benito, yeah, that's what I was wondering - like, if all I needed to do was produce an answer to the sample, boom, done in my head. I could do a thousand bp on paper, but gods would it be tedious. There are websites for that.

If nothing else, maybe this will get me to finally re-learn programming via Python - when I asked the bioinformatics guy at work what I could do to learn more, he just said "Learn Perl" which wasn't terribly helpful. I haven't done any real programming since 2000.
posted by maryr at 12:08 PM on December 11, 2012

PS: I started playing with this at work this evening and accidentally stayed until 11:30 PM programming for fun.

It's nice to have a challenge and I'm glad for the Python tutorial, but I wish there were slightly more obvious resources for more programming tricks. For example, I'm doing a lot more with FOR loops than seems necessary - how do I check the length of a damn string? Why is my GC content calculation off by 0.3%? ARGL (as they say in Scheme) !
posted by maryr at 10:20 PM on December 11, 2012

It's nice to have a challenge and I'm glad for the Python tutorial, but I wish there were slightly more obvious resources for more programming tricks.

One of the neat things about Rosalind is that once you've attempted a problem (that is, requested a dataset and submitted your attempted solution), you get access to a "Questions" board that users can use to ask clarifying questions about the problems. Once you successfully solve a problem, you get access to a "Solutions" board where people post (and comment on) solutions they're particularly proud of. You can often learn new tricks from seeing how other people chose to solve the problem you just completed.

For anyone who wants to approach the Rosalind problems as practice while they learn Python but is looking for documentation, you can find tutorials and the like at the Beginner's Guide to Python. I've found it useful to familiarize myself with the standard library, especially the built-in types and functions. For project Rosalind specifically, I've also found it useful to get comfortable with the string, collections, itertools, and math modules. I'm sure there are others but those have definitely saved me some solving time.

If you're interested in seeing examples of the Pythonic way to implement algorithms you're already familiar with, you may have some luck with Rosetta Code (previously), specifically the Python page.

If anyone is having trouble getting through the earlier problems or wants more Python resources, feel free to MeMail me. I don't have an extensive Python background, but I'm familiar enough that I might be able to answer the sorts of questions you'd have getting through the first few dozen Rosalind problems. I'm also happy to hint or help you talk through any of the Rosalind problems I've passed.

how do I check the length of a damn string?
len(s) will give you the length of any of the sequence types including strings.

posted by jdherg at 12:28 PM on December 13, 2012 [3 favorites]
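
As a small illustration of the len() and collections suggestions above, here is a hedged sketch that measures a string and tallies its bases without a hand-written loop. The example string is made up for illustration; a real dataset would be much longer.

```python
from collections import Counter

s = "AGCTTTTCATTCTGACTGCA"

# len() works on any sequence type, strings included.
print(len(s))  # 20

# Counter tallies every character in one pass, no explicit for loop needed.
counts = Counter(s)
print(counts["A"], counts["C"], counts["G"], counts["T"])  # 4 5 3 8
```

Counter returns 0 for a key it has never seen, so asking for a base that does not occur in the string is safe.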

Yeah, it's the Solutions that are making me mad/jealous, because I know I'm kludging my way through some of the programming when there are simpler solutions.

BTW, my solution was off by 0.3% because of these fucking line breaks I hate them so much die die die. But the biologist part of me knows that this FASTA formatting is going to come up again, so I'm going to need a way to parse it. Which would be easier if I knew more about Python - you can't change a string or something? Wha?

Anyway, maybe I'll harass you via MeMail at some point. Too bad you can't seem to add "friends" or anything in Rosalind.
posted by maryr at 12:51 PM on December 13, 2012

fucking line breaks

Don't let FASTA make you furious: How to trim whitespace (including tabs)?

You might want to try posting your questions on stackexchange. They don't seem to mind answering low-level questions, and I've gotten really quick responses sometimes.

But I agree with jdherg that you should at least glance over a short intro guide so you have some idea of what's available in Python, even if you can't immediately remember what it's called. Specifically, you might not think to use the itertools and comprehensions if you've never seen them before, and dictionaries are incredibly useful too.
posted by benito.strauss at 1:04 PM on December 13, 2012
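
To make the line-break problem concrete, here is a rough sketch of one way to read FASTA text into a dictionary and compute GC content with a comprehension. The names parse_fasta and gc_content and the labels seq_1 and seq_2 are hypothetical, and the sketch assumes well-formed input with non-empty sequences.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {label: sequence}."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()  # drop stray spaces, tabs and line endings
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]
            records[label] = []
        elif label is not None:
            records[label].append(line)
    # Join the wrapped lines so no line breaks end up inside a sequence.
    return {name: "".join(parts) for name, parts in records.items()}


def gc_content(seq):
    """Return the percentage of G and C bases in seq (assumes seq is non-empty)."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


# Made-up sample data: two short records, each wrapped over two lines.
sample = ">seq_1\nGGCC\nAATT\n>seq_2\nGCGC\nGCAT\n"
for name, seq in parse_fasta(sample).items():
    print(name, len(seq), gc_content(seq))
# seq_1 8 50.0
# seq_2 8 75.0
```

Because the newlines are stripped before the pieces are joined, they never get counted in len(seq), which is the kind of thing that can throw a GC percentage off by a fraction of a percent.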

In real life work, FASTA makes me quite happy. But then, I'm not really on the programming end of things. =)

And again, I did the recommended intro exercises on Rosalind. I'm a bit annoyed that some of the string manipulation stuff like len() that would obviously be useful for this wasn't included there. Additionally, if you're going to tell me that Python slices strings the way it does lists and they teach a bunch of list commands, you might mention that I can't use foo.append on a string. I'm missing the kind of information a proper lecture would teach you.

Sorry, I'll keep my whining to myself from here out. I haven't had any time to work on the problem in the past couple days because Actual Work so I'm stuck in a frustrating step at the moment.
posted by maryr at 7:37 PM on December 13, 2012
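
On the slicing and append point, a short sketch of how strings differ from lists in Python: slicing works the same way, but strings are immutable, so you build a new string instead of changing one in place. The variable names here are only for illustration.

```python
s = "GATTACA"

# Slicing works on strings just as it does on lists.
print(s[:3])  # GAT

# Strings are immutable: there is no s.append(), and s[0] = "C" raises TypeError.
# Concatenation builds a brand-new string instead.
s2 = s + "T"

# To build a string piece by piece, collect the pieces in a list, then join them.
pieces = []
for base in s:
    pieces.append(base.lower())
lowered = "".join(pieces)

print(s2, lowered)  # GATTACAT gattaca
```

Collecting pieces in a list and calling "".join() at the end is the usual idiom when a string has to be assembled in a loop.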

I should note - this project might also highlight my fears about bioinformatics. Which is to say: That I am left much farther behind as a biologist, who actually uses these tools, than I would be as a computer scientist, who builds these tools and doesn't use them. I use BLAST regularly at work and understand why and when I might want to, say, filter low complexity regions, but it would probably be easier to build a tool to do that with a solid background in string manipulation.

Similarly, I suppose, it'd be a lot easier to build a thermocycler with a knowledge of electronics even if I'd never run a PCR in my life. Maybe I just need to keep that analogy in my head - that building the tool and using the tool don't come from the same place. Or that they do, but they take different routes to get there - like the 80 and the 87. (Nested metaphors?)
posted by maryr at 7:53 PM on December 13, 2012

This thread has been archived and is closed to new comments