[TITLE] Hi, I'm Beth Skwarecki and this is "Under the hood at a bioinformatics project".

[SGN] The bioinformatics project in question is SGN. (That URL is sgn.cornell.edu). If you poke around the website you'll see we have all kinds of genomic data, and lots of tools to explore it.

[SOLANACEAE] SGN's focus is the Solanaceae, a family of plants that most famously includes the tomato. It also includes deadly nightshade, peppers, potatoes, eggplants, petunia, and tobacco. We also store data from a few related families like Rubiaceae, the coffee family.

As a bioinformatics project, we don't just display this data - we also do a lot of data processing. I'd like to talk about our data processing pipelines.

*** 2:00

[PIPELINES] These are assembly pipelines, which take short, disconnected DNA sequences and try to overlap them into longer, more meaningful strings, which hopefully represent genes.

Assembly sounds like a task that's easy to understand and fun to code. In reality, though, most of the bioinformatics tools that sound fun to code have already been coded, so if you work in bioinformatics you'll spend a lot of time writing glue code and adapters to fit these pieces together, and to work around any limitations or bugs they might have.

All bioinformatics projects need to run pipelines, but there's no standard way to build a pipeline. Every one has to be a little bit customized in unexpected ways. So we've developed our own pipelines in-house, and they've evolved over the years. In fact, we have three forks of what was originally the same pipeline.

The first pipeline I'll describe is from a sister project of SGN. It's the most primitive fork on our pipelines' evolutionary tree, and it consists of about 15 programs. There are 5 critical steps in an assembly pipeline:
(1) translating the raw output of a sequencing machine into a string of text
(2) throwing out sequences that don't pass certain tests
(3) trimming the ends of the sequences
(4) running the sequences through an assembly program, and
(5) comparing the assembled sequences to known genes, so we can annotate each of them with a guess about its function.
(I'll show a rough sketch of a driver for these five stages in a minute.)

In this primitive version of the pipeline, even though only five important things happen, there are upwards of 15 programs that need to be individually run. Most of them do clerical work like moving or renaming files or twiddling rows in a database. A human has to run the pipeline and eyeball the output of each stage. Meanwhile, the inputs and outputs of the programs are constantly being moved around, rewritten into different formats, and sometimes deleted.

When the data is good, and the filenames are predictable, and the programs don't run into any of their bugs, and the human running the pipeline knows exactly what they're doing, data can whoosh through this thing in about a day or two (which includes a couple hours of cluster time). But, as you might guess, the thing is a maintenance nightmare. It's fragile if its environment changes at all - say, if you run it from a different person's home directory, or if you run it on a different machine. The incoming data also changes sometimes, requiring pieces of the pipeline to be rewritten.

We fixed some of these problems in the SGN unigene pipeline, which started out as a fork of the one I just described. Over the years it got bigger and cruftier and more incomprehensible. Meanwhile, the incoming data was totally unpredictable.
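Here's that rough sketch of the five stages, just to make the shape of the thing concrete. This is illustrative Perl, not any of our actual pipeline code; the program names, commands, and file names in it are invented for this talk.

    #!/usr/bin/perl
    # A minimal, hypothetical driver for the five assembly stages.
    # The stage names, external programs, and file names are made up;
    # this is a sketch of the idea, not real pipeline code.
    use strict;
    use warnings;

    my $build = shift @ARGV or die "usage: $0 build_name\n";

    # Each stage reads the previous stage's output file and writes its own.
    my @stages = (
        { name => 'basecall', cmd => "phred_wrapper  $build.chromat   > $build.raw.seq"   },
        { name => 'screen',   cmd => "screen_vector  $build.raw.seq   > $build.clean.seq" },
        { name => 'trim',     cmd => "trim_quality   $build.clean.seq > $build.trim.seq"  },
        { name => 'assemble', cmd => "run_assembler  $build.trim.seq  > $build.contigs"   },
        { name => 'annotate', cmd => "blast_annotate $build.contigs   > $build.report"    },
    );

    for my $stage (@stages) {
        print "[$build] running $stage->{name}...\n";
        system($stage->{cmd}) == 0
            or die "stage '$stage->{name}' failed: $?\n";
    }
    print "[$build] done.\n";

In the real primitive pipeline, of course, each of those five steps is surrounded by the clerical file-shuffling, database twiddling, and human eyeballing I just described, which is where the rest of those fifteen programs come from.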
You'd imagine the point of a pipeline is to be modular, so you can easily swap out pieces of it, but any feature that made the pipeline more modular also made it more complex. Murphy's law applies: whatever you design to be changeable will actually remain constant, and vice-versa. The programmer who inherited this pipeline, Marty Kreuter, found he needed to rewrite big sections of the program almost every time he had to do a new build.

So here's the success story: he scrapped the whole thing in favor of a short makefile. There's now very little glue code or database twiddling. The whole pipeline is in one short file. It also delays all the database inserts until the end of the pipeline, so that they can be done in one atomic transaction. (I'll show a quick sketch of that transaction trick in a minute.) He still needs to rewrite parts of the pipeline every time, but now the rewriting is easier, more like editing a config file.

The third pipeline is a bit different from the other two. Rather than doing these scrappy little unigene builds, it's the Big Important Pipeline for the International Tomato Genome Sequencing Project. The programmer who developed this pipeline, Rob Buels, met with many of the scientists who are now submitting data. They all agreed on how they would upload the data, what the file formats and naming conventions would be, and exactly what processing would be done. Because the incoming data and the desired results are so consistent, Rob was actually able to automate the darn thing. Submitters upload their files, and a cron job sends them through the pipeline.

[FILE FORMATS] Not only did the scientists agree on data formats, Rob's pipeline actually enforces them (imagine that!). We've learned an important lesson about data formats: they only work when the format is defined by the programmer.

Submitters will ask how you want the data. NEVER say: "Oh, it doesn't matter. I can deal with whatever format." Yes, you can, but it's work to munge formats, and that's work you can't automate! Soon you'll be dragged into running programs manually, eyeballing files to see if they look right, and writing a new first step to your pipeline for every guy off the street who wants to send you some data.

It's even WORSE to say: "You know what you're sending us. Why don't you decide on the format?" When they design the format, you're expecting them to stick to it. They might say something like "every experiment will have a unique number" or "every X will have exactly one Y". After you code up your database and your programs, you might get a batch of data that has two experiments called 1234a and 1234b. You say, what the heck is this? And they say, oh, that experiment had two results. Since they defined the format, they see no problem in redefining it as needed.

So here's the funny thing: inflexibility makes life easier for both of you. If you define a format, and act like it's the only format you can accept, suddenly the scientists will be very concerned about giving you exactly what you asked for. They won't protest; they'll figure it's their responsibility to fit the format. (If you automate the acceptance of the file, the program can reject files that are in the wrong format, and while it would be rude for a person to say "sorry, there's a typo, try again", it's totally normal for a machine to say the same thing.)
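To show what that automated rudeness can look like, here's a small sketch in Perl of a submission checker. This is not Rob's actual code, and the naming convention and format rules in it are invented for illustration; the point is just that the program, not a person, gets to say "try again".

    #!/usr/bin/perl
    # Hypothetical submission checker: validate an uploaded file against an
    # agreed-on format and reject it with a reason. The naming convention
    # and format rules here are invented for illustration.
    use strict;
    use warnings;

    my $file = shift @ARGV or die "usage: $0 submission.fasta\n";

    # Invented convention: the file is named after a clone, the sequences are
    # in FASTA format, and every defline must start with that clone name.
    my ($clone) = $file =~ m{([A-Z]\d{2}[A-Za-z]{3}\d{4}[A-Z]\d{2})\.fasta$}
        or reject("file name doesn't match the agreed naming convention");

    open my $fh, '<', $file or reject("can't read file: $!");
    my $seqs = 0;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {
            $seqs++;
            $line =~ /^>\Q$clone\E\b/
                or reject("line $.: defline doesn't start with clone name $clone");
        }
        elsif ($line =~ /\S/ && $line !~ /^[ACGTNacgtn]+$/) {
            reject("line $.: sequence contains characters other than ACGTN");
        }
    }
    $seqs > 0 or reject("no sequences found");

    print "OK: $file looks like a valid submission ($seqs sequence(s))\n";

    sub reject {
        my ($why) = @_;
        # A person saying this would be rude; a program saying it is just normal.
        die "REJECTED: $why -- please fix and resubmit.\n";
    }

Because a check like this runs automatically at upload time, nobody has to eyeball files or write a polite email about a typo.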
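One more sketch before I move on to tools: the "delay all the database inserts until the end, then do them in one atomic transaction" trick from the unigene makefile pipeline. Again, this is illustrative Perl using DBI, not the real loader; the database name, table, and columns are invented.

    #!/usr/bin/perl
    # Illustrative sketch: collect results first, then load them inside a
    # single transaction so a failed build leaves the database untouched.
    # Database, table, and column names are invented; not the real SGN loader.
    use strict;
    use warnings;
    use DBI;

    # Pretend the earlier pipeline stages left us a tab-delimited results file.
    my $results_file = shift @ARGV or die "usage: $0 unigene_build.tab\n";

    my $dbh = DBI->connect('dbi:Pg:dbname=sgn_sandbox', 'postgres', '',
                           { RaiseError => 1, AutoCommit => 1 });

    my $insert = $dbh->prepare(
        'INSERT INTO unigene (build_name, sequence, member_count) VALUES (?, ?, ?)'
    );

    $dbh->begin_work;                     # everything below is one transaction
    eval {
        open my $fh, '<', $results_file or die "can't read $results_file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($build, $seq, $members) = split /\t/, $line;
            $insert->execute($build, $seq, $members);
        }
        $dbh->commit;                     # all rows appear at once, or not at all
    };
    if (my $err = $@) {
        $dbh->rollback;                   # a bad build leaves no half-loaded data
        die "load failed, rolled back: $err";
    }
    $dbh->disconnect;

The nice property is that a build that dies halfway through leaves the database exactly as it was, so you can fix the problem and simply rerun.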
*** 8:00

[FREE SOFTWARE] So what runs this massive project? Well, first of all, we run on 99% free software. Our servers all run debian gnu/linux, with web serving by apache and databases by postgresql. Programmers and interns use various flavors of linux and unix, and most of our code is in Perl.

[NIFTY TRICKS] Since we run debian everywhere, we've made our life easier by having one of our machines run a caching apt proxy, so the same update or install doesn't need to be downloaded from the outside world more than once. We can set up new machines quickly because we also have an apt server that serves out custom packages. One package has dependencies on necessary programs like apache, perl, and postgres; another depends on biological software like blast and crossmatch; and another depends on all the perl modules we would otherwise need to download from CPAN. If a particular package doesn't already exist in debian, we create one and serve it ourselves from this internal server.

We didn't choose Colombia as an exotic faraway location; Colombia chose us. A scientist from a coffee research company there spent a summer with us learning about bioinformatics, and when he got back home, his company set up a mirror of SGN. (Coffee isn't in the tomato family, but it's a close enough relative that our data is useful to them.) We gave them our site on a hard drive; every (week? month?) they svn switch to the latest release, reload their database from our latest dump, and rsync to grab any new data files. Having a mirror is a great way to be sure that it's actually possible to reconstruct your website from production data.

Nifty tools:
* wiki
* mailman (public and bugs lists)
* bug mailer
* vhost generator
* all_project_message
* spong with pucebaboon thermometer (NORAD screen)
* schemaspy
* and we view our svn repository with trac
* blast & cluster

[OTHER TOOLS WE USE]

[SCHEMASPY]

[SPONG]

[HARDWARE] Most of our important services have an understudy on another machine. If our database server goes down, we can use our database sandbox server (with a copy of the production data, of course, which is backed up but might take an hour to load).

Our server room contains these essential elements: [photo]
* one production webserver
* one sandbox webserver
* one failover webserver (not shown)
* one production database server
* one sandbox/failover database server
* two clusters of blade servers, for computationally intensive jobs
* one fileserver that also provides user accounts and serves the svn repository
* miscellaneous helpful machines (not shown)

[THANKS] Questions?