Would you like to use the same operating system as 92% of the top 500 supercomputers? Do you dig Japanese bullet trains, NASA, or the financial awesomeness that is the New York Stock Exchange? Do you hate having to shell out hard-earned money for new software or upgrades? Most importantly, are you interested in next-generation sequencing and bioinformatics/genomics? If you answered yes to any of these questions, then you should be eager to learn more about Linux operating systems. Linux is a free and open-source, Unix-like operating system whose kernel was created by Linus Torvalds (hence the portmanteau Linux). Because Linux is open source, I will be focusing on Linux (as opposed to Unix) systems exclusively. Below, I will (1) elaborate on the importance of Linux in next-gen sequencing projects, (2) provide some illustrative examples of the power of Linux commands, and (3) discuss Linux as an operating system for your own desktop or laptop.
The Role of Linux in Next Gen Sequencing:
I am currently using RNA-Seq data collected from an Illumina HiSeq to investigate patterns of gene expression. At Oregon State University, we are fortunate to have the Center for Genome Research and Biocomputing, which provides us with access to a cluster of high-performance computers that (you guessed it) all run on Linux. In fact, Linux is the operating system of choice for most computing clusters for a wide variety of reasons (stability and ease of use chief among them). Many universities have similar large-scale computing clusters, and there are also cloud-computing options available through places like Amazon. Another option is to run the analyses on your own computer, though I would recommend using a fairly new machine in order to avoid excessive run times. One of my computers is an eight-year-old laptop that runs Xubuntu (a lightweight Linux distribution), and while it runs superbly given the hardware specs, it would not be very efficient at running large-scale genomic analyses.
One of the main attractions (and also deterrents) of Linux is the power of the command line. Using relatively simple commands, I can copy all of my next-generation sequencing files to my hard disk, compress them to save disk space, place them in new directories, create backup and archived copies, and begin processing them. All of this takes a matter of minutes and acts on multiple files at once. While I am sure it would be possible to do this in Windows, it would not be nearly as convenient. Interestingly, the operating system for Macs is also based on a *nix platform. As such, Mac users can run commands that are very similar (though not always identical) to those on Linux. If you are a Mac user, I suggest taking a look at this website and corresponding article in Molecular Ecology Resources that has a lot of useful tips. If you do not own a Mac, don’t worry: simply choose one of the many Linux distributions that can run alongside Windows (see last section).
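As a minimal sketch of what such commands can look like, the snippet below copies, compresses, and archives a handful of files in a few lines. The directory layout, file names, and placeholder reads are all made up for illustration, and everything happens in a temporary scratch directory so it can be run without real sequencing data.

```shell
#!/bin/sh
set -eu

# Hypothetical layout: a scratch directory stands in for both the
# sequencer output and the local hard disk.
work=$(mktemp -d)
mkdir -p "$work/incoming" "$work/project/raw" "$work/backup"

# Create three tiny placeholder FASTQ files (stand-ins for real reads).
for sample in sampleA sampleB sampleC; do
    printf '@read1\nACGT\n+\nIIII\n' > "$work/incoming/$sample.fastq"
done

# Copy every FASTQ file into the project directory with one command.
cp "$work"/incoming/*.fastq "$work/project/raw/"

# Compress all of the copies to save disk space (each becomes .fastq.gz).
gzip "$work"/project/raw/*.fastq

# Bundle the compressed files into a single archive as a backup copy.
tar -cf "$work/backup/raw_reads.tar" -C "$work/project" raw

# Show what ended up where.
ls "$work/project/raw" "$work/backup"
```

The wildcard (`*.fastq`) is what lets each command act on every matching file at once; the same lines would work unchanged on three files or three hundred.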
So why is Linux like glue? Linux allows a user to easily process and manipulate multiple files, and to easily feed the output of one set of programs in as the input for another set of programs. Using Linux, for example, I can run Perl scripts to filter my Illumina reads, take the output and run Bowtie to align my reads to a reference transcriptome, use Perl to count the reads, and export the results to R to run statistical tests for differential gene expression. Although I could do this with a graphical user interface (many of which are rapidly becoming more user friendly, e.g., Easy Terminal Alternative), it is far more efficient to use the command line. Furthermore, Linux comes with many built-in functions and commands that can be quite powerful, which leads us to shell scripting…
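To make the glue idea concrete, here is a small sketch of a filter-then-count pipeline built only from standard tools. It is not my actual workflow: `awk` stands in for the Perl filtering and counting scripts, the alignment and R steps are left out, and the reads, the file names, and the minimum length of 6 are invented for illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Tiny made-up FASTQ file: four reads, one of them too short to keep.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n@r3\nACGTACGT\n+\nIIIIIIII\n@r4\nTTTTGGGG\n+\nIIIIIIII\n' > "$work/reads.fastq"

# Step 1 (filter): keep only records whose sequence has at least 6 bases.
# Step 2 (extract): pull out the sequence line of each surviving record.
# Step 3 (count): tally identical sequences, most frequent first.
# Each "|" hands the output of one program to the next as its input.
awk -v min=6 '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
' "$work/reads.fastq" \
    | awk 'NR % 4 == 2' \
    | sort \
    | uniq -c \
    | sort -k1,1nr \
    > "$work/counts.txt"

cat "$work/counts.txt"
```

No intermediate files are written between the steps, which is a large part of why chaining programs this way stays fast and light on disk space.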
Shell scripting:
Linux comes with many great commands that can speedily manipulate multiple large files. In fact, Linux users can write shell scripts to accomplish many specific tasks. I often come across complicated Perl scripts that could more easily have been written as a shell script. Before you write a Perl or Python script, it may be worth asking whether you can do the same thing with a shell script; you may find it to be faster and more memory-efficient in the long run. Rather than give specific examples, I suggest trying to solve issues with your own current data analysis. In my opinion, solving made-up problems is far less interesting than working with your own data. Also keep in mind that trial and error (expect lots of errors!) and the internet are your greatest allies: a few well-directed searches will reveal answers to most of your questions. The head of our computing cluster insists on the 10, 100, 1000 rule: test your new scripts on increasingly large data sets so that you can closely monitor memory usage and quickly identify and eliminate any bugs or unexpected behaviors.
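One way the 10, 100, 1000 rule might be put into practice is sketched below. Reading the numbers as record counts is my assumption, the FASTQ file is generated on the spot, and `analyze` is a hypothetical placeholder that only counts reads; swap in whatever script you are actually testing.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Generate a made-up FASTQ file with 1000 records as a stand-in for real data.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "@read" i "\nACGTACGT\n+\nIIIIIIII" }' > "$work/all.fastq"

# Hypothetical analysis step: it only counts the sequence lines in a file.
analyze() {
    awk 'NR % 4 == 2' "$1" | wc -l
}

for n in 10 100 1000; do
    # Each FASTQ record spans four lines, so n records are the first 4n lines.
    head -n $((n * 4)) "$work/all.fastq" > "$work/subset_$n.fastq"
    echo "records processed in subset_$n: $(analyze "$work/subset_$n.fastq")"
done
```

If the smallest subset already misbehaves, you find out in seconds rather than after hours on the full data set.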
TuxLife: Life with Linux
So you have decided that Linux is pretty neat. In fact, you would like to install it on your computer. Because Linux is open source, many different people and groups have created their own versions of Linux (known as distributions). This can initially be confusing because it is analogous to choosing between 50 different versions of Windows or OS X. That being said, only a much smaller number of distributions are commonly used. I have mainly used Ubuntu on my personal computers and have been quite happy with its performance. One nice feature of many Linux operating systems is that you can install them alongside Windows, or simply run them from a CD or thumb drive. Even though I enjoy Ubuntu as an operating system, I still use Windows as well. Perhaps this is heretical to say, but I really do like Microsoft Office for word processing, simple data manipulation, and creating visual presentations. Also, many population genetics programs are designed solely for Windows. I realize that I could probably use a Windows compatibility layer, like Wine, but is it really worth it if you already have Windows installed? (That is not a rhetorical question; I am genuinely interested in your thoughts.) I envision that in the future it will be easier to run Linux-only PCs as greater numbers of software packages migrate to the cloud. Thoughts?