MCIC Wooster, OSU
2023-04-04
Today, I will give an overview of a typical computational environment/infrastructure used for all kinds of genomics projects.
Of course, I won’t have time to teach you the different components, but I would like to orient you on this topic so that it’s not as much of a black box as it may seem now.
This will hopefully give you a starting point for learning more – and I will point you to some specific resources as well.
Any project in which you generate high-throughput sequencing data, e.g.:
Whole-genome sequencing – de novo assembly, pangenomics, “resequencing”
Reduced-representation sequencing (GBS, etc) for population genomics
Microbiomics – both shotgun metagenomics and amplicon metabarcoding
Transcriptomics with RNAseq
Introduction
Command-line genomics?
Ohio Supercomputer Center (OSC) overview
Command-line software
The VS Code editor and the whole game
A supercomputer – in our case, the Ohio Supercomputer Center (OSC)
(Cloud computing is an alternative, but won’t be covered today)
The Unix shell (AKA the terminal)
A text editor – I recommend and will demonstrate VS Code
Command-line genomics
For the purposes of this talk, I will use the term command-line genomics for working with the above elements by running command-line programs non-interactively, as “batch jobs”.
A highly interconnected set of many computer processors and storage units. You can think of it simply as a network of computers.
Supercomputers are also commonly referred to as High-Performance Computing (HPC) clusters or simply compute clusters.
Your genomics dataset is likely to be too large to be handled efficiently, or even at all, by a laptop or desktop computer.
To speed up long-running analyses by using more computing power.
To speed up analyses where some part needs to be repeated many times, like the independent mapping of reads for different samples to a reference genome.
It’s also a great place to store large amounts of data.
A computer’s shell is also referred to as a Terminal or “the command line”, and allows you to interact with your computer by typing commands rather than pointing-and-clicking.
The Unix shell is the shell of Unix-based computers, which include Mac and Linux (but not Windows) operating systems.
A genomics project usually involves sequentially running a whole array of bioinformatics programs (or “tools”).
For instance, an RNAseq project may include:
raw read QC => raw read trimming => trimmed read mapping => gene counting
Many of these tools can only be run through a “command-line interface” (CLI)
Even those that have a “graphical user interface” (GUI) are more efficiently and reproducibly run through a CLI:
Efficiency: A CLI allows you to write a simple loop that runs a program in the same way for many samples. (In combination with the computing power of a supercomputer, this in turn allows you to process hundreds of samples in parallel.)
Reproducibility: You can easily save all commands and scripts which would allow you to rerun a project rather straightforwardly.
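As a minimal sketch of the loop idea mentioned under Efficiency: the snippet below creates three empty placeholder files in a temporary folder (the sample names are hypothetical) and then loops over them. It only prints the command that would be run for each file, rather than running a real tool.

```shell
# Create a temporary folder with three empty placeholder FASTQ files
# (the sample names are hypothetical, for illustration only)
data_dir=$(mktemp -d)
touch "$data_dir"/sampleA.fastq.gz "$data_dir"/sampleB.fastq.gz "$data_dir"/sampleC.fastq.gz

# Loop over all FASTQ files and treat each one in exactly the same way.
# Here we only print the command: removing 'echo' would actually run the tool.
n_files=0
for fastq_file in "$data_dir"/*.fastq.gz; do
    echo fastqc "$fastq_file"
    n_files=$((n_files + 1))
done

echo "Number of files processed: $n_files"
```

The same loop works unchanged for 3 or 300 files, which is where the efficiency gain comes from.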
If you will often do genomics projects like the ones mentioned above, it’s hard to avoid command-line genomics as described.
But here are some conditions in which you might reasonably avoid it:
You’re doing a single genomics project and your main research focus is elsewhere.
You have data which can be analyzed with no or a relatively small command-line-based part, such as proteomics/metabolomics/metabarcoding/RNAseq.
In such cases, you might be able to get someone else to do the command line work, or you could try Galaxy, a cloud-based bioinformatics platform with a web browser interface and no coding.
The Ohio Supercomputer Center (OSC) provides computing resources to researchers (and others) across Ohio.
OSC has two individual supercomputers/clusters (named Owens and Pitzer), and lots of infrastructure for their usage.
Research usage is charged but at heavily subsidized rates, and most or all OSU colleges absorb these costs at the college level (!)
Educational usage is entirely free, like for the PAS2250 project you have been added to for this lecture.
The OSC OnDemand web portal allows you to access OSC resources through a web browser, such as:
A file browser where you can also create and rename folders and files, and download and upload files.
A Unix shell
More than a dozen different “Interactive Apps”, or programs with a GUI, such as RStudio, Jupyter, QGIS, and more.
(Project and user management goes through a separate website, https://my.osc.edu.)
Choose a folder as a starting point for the file browser:
Here you can view, create and rename folders and files, and download and upload files:
Here you can access a Unix shell on either of the two clusters:
(Since the two clusters share the file systems, and they have fairly similar capabilities, it generally doesn’t matter which cluster you connect to).
When you click on one of the shell options, a new browser tab with a shell will open. There are some welcome messages, and some storage usage/quota info, and then you get a prompt to type commands:
If you click on “System status”, you’ll get an overview of the current usage of the two clusters:
Here you can mostly access programs with a GUI that will run on a compute node. We’ll try Code Server and RStudio Server.
Fill out this form to start an RStudio session. This will run on a compute node and is therefore charged: for that reason, it needs the OSC account number so as to bill the correct account.
Once that top bar below is green and says “Running”, you can click “Connect to RStudio Server” way at the bottom:
Now, you’ll have RStudio running in your browser!
It looks identical to the Desktop App version:
(That is, even compared to command-line computing on a personal Linux computer.)
Software
Because you don’t have administrator rights, and because the system is shared by so many people, you can’t install and use software “the regular way”.
For system-wide installed software, load it with module commands.
If something is not installed, ask OSC or use Conda or containers.
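A sketch of how module commands are typically used. The module name fastqc is an assumption: the names and versions that are actually installed are listed by module avail. The check at the top is only there so that the snippet is safe to run on a computer without a module system.

```shell
# The 'module' command only exists on systems with a module system (such as OSC),
# so check for it first to keep this snippet safe to run anywhere
if type module > /dev/null 2>&1; then
    has_modules=yes
    module avail           # List the software that can be loaded
    module load fastqc     # Load a program (this module name is an assumption)
    module list            # List what is currently loaded
else
    has_modules=no
    echo "No module system found on this computer"
fi
```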
A useful example of a genomics tool with a CLI is FastQC, a program for quality control of FASTQ files.
It is ubiquitous because nearly all high-throughput sequencing data comes in FASTQ files, and your first step is always to check the quality of the reads.
FastQC produces visualizations and assessments of aspects of your reads such as adapter content, and, as shown below, per-base quality:
To run FastQC, you use the command fastqc.
Command-line programs are typically run non-interactively, so we don’t fire up the program first, and tell it what to do later, like we would with a program with a GUI.
Instead, we at once issue a complete set of instructions for the program to do what we would like it to.
For example, say we want to analyze one of the FASTQ files that I put in /fs/scratch/PAS2250/data, with default FastQC settings. In that case, our complete FastQC command would be:
So, it is simply fastqc followed by a space and the name of the file!
FastQC is available to us at OSC, but we first have to load it. Here is what happens when we try to run the program in a fresh shell session at OSC:
We can load the software as follows:
Now, let’s try again:
[jelmer@owens-login04 ~]$ fastqc /fs/scratch/PAS2250/data/sample1.fastq.gz
#> Started analysis of sample1.fastq.gz
#> Approx 5% complete for sample1.fastq.gz
#> Approx 10% complete for sample1.fastq.gz
#> Approx 15% complete for sample1.fastq.gz
#> [truncated]
Success!
I mentioned earlier that one benefit of running programs at the command-line is reproducibility – but how do we save the commands that we run?
We need to not just save them, but to keep a detailed digital notebook that will enable us to redo our analysis.
We also need to wrap these commands in little scripts, so that we can run programs non-interactively and in parallel.
For all of this, we will need a good text editor.
VS Code (in full, “Visual Studio Code”) is a great modern GUI-based text editor.
We can use a version of this editor (often referred to as Code Server) directly in our browser via the OSC OnDemand website.
I do the vast majority of my bioinformatics work using VS Code at OSC. Because it also has an integrated terminal to access a Unix shell, this effectively combines 3 of the 4 mentioned aspects of the genomics computational environment:
That is, all except R, for which I prefer RStudio.
Click Interactive Apps and then, near the bottom of the dropdown menu, click Code Server.
In the form that appears, select the project PAS2250 and the value 4.8.
Once the session shows Running, click Connect to VS Code.
Once the session is running, you can click “Connect to VS Code”:
The main part of VS Code is the editor pane.
Whenever you open VS Code, an editor pane tab with a Get Started document is automatically opened.
We can use this Get Started document to open a new text file by clicking New file below Start, which opens as a second tab in the editor pane.
You can open a terminal with a Unix shell by clicking Terminal => New Terminal in the menu. The great thing with this setup is that we can keep notes and write shell scripts in the same window as our shell and a file browser!
I’ve shown you the main pieces of the computational infrastructure for “command-line genomics”:
We’ve seen a very basic example of loading and running a command-line tool at OSC.
The missing pieces for a fuller example of how such tools are run in the context of an actual genomics project are (if we stay with FastQC):
(… sbatch in front of the script name.)
The core skills:
Unix shell basics – the commonly used commands
Some shell scripting basics
SLURM basics to submit and manage your batch jobs
R for “downstream”, statistical and visualization tasks
When you start doing genomics projects more often:
Using conda or containers for software
Unix data tools (grep, sed, awk, etc.)
When you want to become proficient in applied bioinformatics:
Version control with git
More advanced: formal workflow/pipeline management tools (e.g. Nextflow)
More advanced: Python (or advanced R) for custom data processing
OSU courses and workshops
Jonathan Fresnedo Ramirez’s “Genome Analytics” course (HCS 7004)
Microbiome Informatics (MICRBIO 8161)
The online materials for the workshop “Command line basics for genomic analysis at OSC” that Mike Sovic and I gave last August:
https://mcic-osu.github.io/cl-workshop-22/
I have a course “Practical Computing Skills for Omics Data” (PLNTPTH 5006) that I am planning to teach in Spring 2024. All materials for the 2021 version of this course (“Practical Computing Skills for Biologists”) are at: https://mcic-osu.github.io/pracs-sp21/
Some particularly useful books:
The Linux Command Line (William Shotts, 2019)
Bioinformatics Data Skills (Vince Buffalo, 2015)
Computing Skills for Biologists: A Toolbox (Allesina & Wilmes, 2019)
A Primer for Computational Biology (Shawn T. O’Neil, 2019) https://open.oregonstate.education/computationalbiology/
Due to time constraints, I have skipped over the details of these steps. But an example shell script to run FastQC can be found at /fs/scratch/PAS2250/scripts/fastqc.sh, which contains the following code:
Indicate it’s a Bash script and tell SLURM which OSC account to use
Strict settings make the script stop on failure
The script takes “arguments”, which are stored as placeholder variables $1 and $2. This allows us to run the script for different files.
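The contents of fastqc.sh are not reproduced here. As a rough sketch only, and not the actual script, a script matching the three annotations above could look like the one below. The module name fastqc and the variable names fastq_file and outdir are assumptions. The snippet writes the sketch to a temporary file and checks its syntax without running it.

```shell
# Write a hypothetical FastQC batch script to a temporary file
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --account=PAS2250

# Strict settings: stop on errors, unset variables, and failures in pipes
set -euo pipefail

# Arguments passed to the script on the command line
fastq_file=$1    # First argument: the input FASTQ file
outdir=$2        # Second argument: the output folder

# Load the software (this module name is an assumption)
module load fastqc

# Create the output folder if needed, then run FastQC
mkdir -p "$outdir"
fastqc --outdir "$outdir" "$fastq_file"
EOF

# Check the syntax of the sketch without executing it
if bash -n "$script"; then syntax_ok=yes; else syntax_ok=no; fi
echo "Syntax OK: $syntax_ok"
```

Because the input file and output folder are passed as arguments, the same script can be submitted once per FASTQ file.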
Here is how I would submit that script as a batch job to analyze one FASTQ file:
fastq_file=/fs/scratch/PAS2250/data/sample1.fastq.gz
sbatch /fs/scratch/PAS2250/scripts/fastqc.sh "$fastq_file" results_jelmer
#> Submitted batch job 16323144
And how you can loop over all FASTQ files to submit as many jobs in parallel as you have FASTQ files:
for fastq_file in /fs/scratch/PAS2250/data/*.fastq.gz; do
    sbatch /fs/scratch/PAS2250/scripts/fastqc.sh "$fastq_file" results_jelmer
done
#> Submitted batch job 16323145
#> Submitted batch job 16323146
#> Submitted batch job 16323147
#> Submitted batch job 16323148
#> Submitted batch job 16323149
#> Submitted batch job 16323110
#> Submitted batch job 16323111
#> Submitted batch job 16323112