Blythe Dry Lab Guide¶
- Blythe Dry Lab Guide
- Introduction/Overview
- New Users
- Using Terminal or Something Fancier
- Using the Command Line and bash
- Signing into Quest
- Login Nodes
- Home Directory
- Projects
- Running interactive jobs on Quest
- Using packages already installed on Quest
- Submitting jobs to Quest
- NU Resources for learning about Quest:
- Deciding How Much Resources to Ask For
- Running jobs in parallel, rather than in series
- Monitoring SLURM Jobs on Quest
- Transfer From BaseSpace to QUEST
- Current Approach for Routine Processing of Paired-End ChIP/ATAC-seq Data
- Using R on Quest
Many thanks to the Andersen Lab, whose own "Dry Guide" is the basis for this adaptation.
The computing cluster available to researchers at Northwestern is called Quest. Please see the link below to go to the table of contents for Northwestern IT's support docs for Quest.
Introduction/Overview¶
New users will likely have some confusion about how all of this works. This guide is meant to help you with that. As a high-level overview: we can use a remote computing cluster called Quest to run processor-intensive jobs without tying down our local computation resources. Quest is a collection of "Nodes" of many multithreaded processors, as well as disk storage space. You can think of Quest as a remote computer that you can interact with solely through the command line. The computational resources on Quest are parceled out into 'allocations'. The two most important allocations for our purposes are b1042 and b1182. b1042 is the "Genomics" buy-in allocation. Anyone at NU can sign up for access to b1042, and once they have access, they can begin submitting jobs to this system. b1182 is the Blythe Lab's personal allocation, with (at the moment) 4 TB of storage space. Jobs for b1042 can be stored in directories within b1182.
When you log in to Quest, however, you aren't 'within' b1042 or b1182, instead, when you log in, you are dropped off in your home directory (in terms of the file-system), and are using (in terms of computational resources) what is referred to as a login node. While b1042 has many cores and much memory available for computational processes, login nodes have relatively less computational power. Therefore, if we have processor-instensive jobs to perform, we don't want to use a login node, instead we want to use b1042. To do this, we have to ask Quest to do our jobs on b1042 through one of two methods: as an 'interactive session' or as a 'batch submission'. Below, I describe how to initiate both of these approaches.
Storage is another matter: each user's home directory has a limited amount of private storage that will be indefinitely maintained (and backed up). This is on the order of 80 Gigabytes, which isn't much. There are three other storage options: within the Scratch Space within b1042, within the lab's allocation in b1182, as well as within the lab's own dedicated project, p31603. This latter project is an older allocation that may eventually be migrated to b1182. Scratch Space is essentially a wild-west first-come-first-served disk space for all users of b1042 to temoprarily store files that they are using in a project. It is vast (~200 Terabytes), not backed up, and files over 30-days in age are deleted regularly. The Blythe Lab has a directory within b1042 created within this Scratch Space that lab members can use to temporarily store files if necessary.
The Blythe Lab has a dedicated "project" within Quest (p31603)that affords us 500 Gigabytes of storage space. As mentioned above, this is an older allocation and may eventually be migrated to b1182. For now, all projects should be staged on b1182, although they may make reference to resources currently housed in p31603.
In the following, I describe how to get access to Quest, how to use some of its unique features, and how to submit jobs.
New Users¶
To gain access to Quest:
Apply to be added to the lab's partition/project/allocations (p31603 and b1182) here.
Allocation manager: Shelby Blythe, shelby.blythe@northwestern.edu
Apply to be added to partition b1042 here.
Allocation manager: Janna Nugent, janna.nugent@northwestern.edu
Use the Quest Help email: quest-help@northwestern.edu if necessary to get help
Note
If any of these links are dead, please tell Shelby so he can update them, and search for "Quest access Northwestern" to find hopefully current, non-dead links.
Using Terminal or Something Fancier¶
Mac users can access Quest using Terminal. If you haven't used Terminal, search for it (command + Space, then enter "Terminal"). It is possible, however, to configure text editors to access Quest directly. This way, you can use the text editor to directly modify scripts without having to rely on Command Line text editors (like nano). I particularly like Visual Studio Code (VSCode). VSCode runs on both Mac and Windows machines.
You can then configure VSCode to connect to the ssh host (Quest) by following these instructions.
Using the Command Line and bash¶
In general, you will be using the command line and bash when you use Quest. There are many available resources to learn the basics (Google it). To get started, Northwestern has collected some tutorials here.
Signing into Quest¶
Once you have been granted access to the cluster and have sorted out how you will connect, you can login using:
ssh <netid>@quest.northwestern.edu
If you are using Terminal, to avoid typing in the password everytime, one can set up a ssh key.
You can set an alias in your ~/.bash_profile to make logging in quicker:
alias quest="ssh <netid>@quest.it.northwestern.edu"
The above line makes it so you simply type quest and the login process is initiated.
If you are not familiar with what a bash profile is, take a look at this.
Important
When you login it is important to be conscientious of the fact that you are on a login node. You should not be running any sort of heavy duty operation on these nodes. Instead, any analysis you perform should be submitted as a job or run using an interactive job (see below).
Login Nodes¶
There are four login nodes we use: quser21-24. When you login you will be assigned to a random login node. You can switch login nodes by typing ssh and the node desired (e.g. ssh quser21).
You can easily determine your current login node by reading the command prompt, which takes the form: <netID>@quser21, if --for instance-- you are on the quser21 login node.
Home Directory¶
Logging in places you in your home directory. You can install software in your home directory for use when you run jobs, or for small files/analysis.
Your home directory has a quota of 80 Gb. More information on quotas, storage, etc. You can check your storage space using the following command (from within your home directory):
du -hs *
This command du will estimate the disk usage of the current directory. -hs returns the summary of human-readable (i.e., not hidden) files.
From anywhere, you can determine the disk usage of your home directory using
homedu
More information is provided below to help install and use software.
Note
In general, we will not be saving files long-term anywhere on Quest! Most of your work will take place within the lab's allocation b1182. All files should be stored on the lab's network accessible storage device/server.
Projects¶
Quest is broadly organized into projects (or partitions, or allocations). Projects have associated with them storage, nodes, and users.
To determine what "groups" you have access to, enter groups on the command line. You should see at the very least b1042, b1182, and p31603 in the returned text.
The Blythe lab has access to one specialized allocation, b1182. This is where we have 4 TB of total storage. This is the current location to perform Quest based tasks. Please keep in mind that while this is nice storage, any important products of an analysis should be backed up to the lab server. No one but you is responsible for ensuring that your data is properly backed up!
The Blythe lab has access to one project, p31603. This is an older workspace and will eventually be migrated to b1182. This is where we have 5000 GB of total storage. It currently houses some repositories for mapping, as well as a virtual environment for running our preferred read trimmer.
b1042 - The 'Genomics' Project has 200 Tb of space and 100 nodes associated with it. This space is shared with other labs and is designed for temporary use only. The space is available at /projects/b1042/BlytheLab/. By default, files are deleted after 30 days. One can submit jobs to --partition=genomicsguestA to use this partition, with a max job duration of 48hr.
Note
Anyone who uses Quest should build your own folder under /projects/b1182 with your name. You should only write and revise files under your project folder. You can read/copy data from elsewhere in b1182 (or p31603) but don't write/save any data to locations outside of your project folder. You could also store things on /projects/b1042/BlytheLab, but storage anywhere on b1042 is for 30 days only, so do not store anything here that you hope to maintain access to over a long period. We have limited space on p31603 and b1182. This is also not a long-term location for storage, but rather a site where you import raw data, and save the output of processing, before you download the data back to long-term storage on the lab server.
Checking available storage space on b1182¶
If you would like to check the available storage on any project, including b1182, the command checkproject b1182 will report current storage usage, as well as provide other information about the allocation's status.
Running interactive jobs on Quest¶
If you are running a few simple commands or want to experiment with files directly you can start an interactive session on Quest. The command below will give you access to a node where you can run your commands.
srun -A b1042 \
--partition=genomicsguestA \
-N 1 \
-n 24 \
--mem=64G \
--time=12:00:00 \
--pty bash \
-i
Important
Do not run commands for big data on quser21-24. These are login nodes and are not meant for running heavy-load workflows. Using srun allows you to access (interactively) the correct nodes for your work. Another option (below) is to schedule a batch job.
Using packages already installed on Quest¶
Quest has a collection of packages installed. You can run module avail to see what packages are currently available on Quest. You can use module load bowtie2 or module load bowtie2/2.4.1 to load packages with the default or a specific version (module add does the same thing). If you do echo $PATH before and after loading modules, you can see what module does is simply appending paths to the packages into your $PATH so the packages can be found in your environment.
module purge will remove all loaded packages and can be helpful to try when troubleshooting.
Submitting jobs to Quest¶
If you are all set up to use the lab's resources on Quest, you should be able to run the following test script to confirm that everything is operating as expected. The lab's project allocation contains a directory (.../TestData) that we use to check that everything is configured properly.
Jobs on Quest are managed by SLURM. Scripts for SLURM submissions differ slightly from regular bash scripts in that they have an extensive header that passes commands to SLURM that helps it determine how many resources to allocate to the job. Please see the /projects/p31603/Quest_Setup_Test_Script_PE.sh for an example of a script that is set up to run via the batch scheduler. A set of brief instructions on how to run the test script is included on a later page in this guide.
NU Resources for learning about Quest:¶
There are useful guides provided by NU Research Computing on how to use Quest. Please refer to this guide for general inquiries on use of this resource.
Deciding How Much Resources to Ask For¶
This is tricky to guess. In general, you want to ask for about 2x more resources than your job actually takes. On the job scheduler, you need to request an amount of cluster up-time, a number of cores/cpus, and an amount of memory per cpu.
The trade-off is that the more resources you ask for, the longer it takes for the scheduler to start your job. On the other hand, if you don't ask for sufficient resources, the process will terminate if you run out of time, or perhaps throw an error if memory resources are exceeded.
Note
All of our jobs are only performed on a single node. The line on the job scheduler for nodes should read: #SBATCH --nodes=1
Once you have an idea of how your jobs run, you can adjust the following as necessary. In a recent run, tasks like trimming a set of paired-end reads (~20M reads/mate pair/sample) took around 10 minutes per sample. Mapping with bowtie2 (plus samtools and picard) took an additional 30 minutes. Therefore, if you have 12 samples of this type, a trimming and mapping script would take at least 8 hours of clock time. Pad this. Ask for 10 hours of compute time, the appropriate number of cores (8?), and sufficient memory (8 GB).
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem-per-cpu=8G
As mentioned above, this could take some time to schedule, but it should be sufficient to complete the job.
Determining the priority of your jobs in the scheduler¶
If you are curious about the priority of your jobs in the scheduler, the command sprio can be used to display job priorities.
sprio -n --users=yourNetIDhere
The above command would show all of your ongoing and scheduled job priorities. The -n option normalizes the displayed scores from 0 to 1, where 1 is high priority and 0 is low.
Instead of displaying all your ongoing and scheduled jobs, the -j option can be used with a specific JobID to display that information.
Running sprio without any options displays all pending jobs with their weighted priorities.
Running jobs in parallel, rather than in series¶
A typical task we perform on Quest is to trim and map Illumina reads. We will have anywhere from 1 to >12 samples that we need to perform this on, and we essentially do the same thing to each sample. An old-timey way to do this is to write for-loops in our scripts and loop over the input filenames for each stage of the procedure. This "in series" approach works, but it doesn't take advantage of the fact that we can start all of the tasks at the same time in parallel for each input file. Running samples in parallel will accomplish the task of trimming and mapping in a fraction of the time compared with running them in series, typically because we end up reserving more resources in these cases for each individual task.
For instance, mapping a complete PE150 HiSeq4000 dataset (~400 M pairs of reads) from 12 pooled libraries can take around 21 hours if performed in series, allocating 12 cores and 8 GB of memory per core. In contrast, when run in parallel (12 cores/sample, 8 GB of memory per core), the timing can be considerably shorter. In this case, an individual sample can be processed in as little as 1.5 to 2 hours, but the whole job can take longer than 2 hours since the job scheduler may penalize you for requesting so many resources (in total: 12 x 12 cores, 144 x 8 GB of memory).
Parallel processing is accomplished on SLURM by using the array option, and by carefully setting up a batch request to allocate all of the necessary resources.
Within the header for a SBATCH script, one can specify the total number of jobs to perform in parallel using the array option. The batch job then sort of operates like a for-loop, in that we can take each starting file, and perform a task on it, by addressing it by an index number in a list. Say we have 12 samples in an experiment that we want to process in parallel. Then, you specify this by using a (zero-based) array:
...
#SBATCH --array=0-11
...
What this will do is set up 12 processes to run in parallel, each with a SLURM_ARRAY_TASK_ID from 0 to 11. Then, if you make a list of file names (in this example, housed within a directory "Test"):
# assign to the object `name_list` the filenames in the Raw_Data directory
# that contain the characters "_R1_".
name_list=(/projects/b1182/Test/Raw_Data/*_R1_*)
Then, assign an input filename that will be specific to the specific array that lanuches. Use the index position of the arrays (reported as an environment variable) to select one of the list elements:
file=${name_list[$SLURM_ARRAY_TASK_ID]}
Such an array script will initiate N independent instances of your script, each with a single array task id value. The environment variable SLURM_ARRAY_TASK_ID will report that single value, and subset the name list, so that particular instance of the script will only operate on that single indexed file.
Important
When re-writing scripts to operate on an array, be mindful that all of your individual samples will be processed simultaneously, and if you write temporary files to an output directory, they need to have unique filenames. See the example script below for a way of dealing with this (in the calls to samtools sort and picard CleanSam). Similarly, scripts should be re-written to purge temporary files as soon as they are no longer needed, rather than just throwing them away at the end of the process. The reason for this is because required disk space for storage can balloon significantly when using temporary files, especially if all those temporary files are being generated simultaneously, instead of once per sample.
Running array scripts are best handled by dividing the task into two parts: a script that handles the dispensing of arrayed tasks, and a script that will be run by each of the individual arrays.
For instance, the following script handles the setup of an array for trimming and mapping 12 ChIP-seq samples. Using one node, it sets aside 12 cpus per 12 tasks (144 total cpus). Each cpu will have at most 8 GB of memory. This memory value was arrived at empirically, as described in the following section. It turns out that this is not optimal, as we discuss further below. The scheduler names the jobs based on the task ID. Unlike typical jobs, each arrayed job will show up as an independent line in the job scheduler. Both for the job name, as well as for the output logs, it is good to include identifiers that distinguish between the different arrayed instances.
#!/bin/bash
#SBATCH --account=b1042
#SBATCH --partition=genomicsguestA
#SBATCH -t 4:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=12
# #SBATCH --mem=0 ## For troubleshooting only
#SBATCH --mem-per-cpu=8G
#SBATCH --array=0-11
#SBATCH --job-name="ArrayTest_\${SLURM_ARRAY_TASK_ID}"
#SBATCH --output=Logs/outlog.%A_%a
#SBATCH --error=Logs/screenlog.%A_%a
#SBATCH --mail-type=ALL
#SBATCH --mail-user=shelby.blythe@northwestern.edu
# This script will run our standard trimming and mapping approach on
# all samples in parallel. It is much faster than doing them in series.
# 1) ensure that the 'array' parameter above is the size of the number
# of samples in the experiment. Start with zero.
# 2) change the 'basedir' and the path to the filenames ("name_list" definition)
# below.
# 3) Ensure that you've made a directory (Logs) in the base directory.
#
# You should be good to go. It refers to a separate script to run the
# trimming and mapping. This script just passes the files to the script.
basedir=/projects/b1182/Test
## make destinations
mkdir ${basedir}/Trimmed_Reads
mkdir ${basedir}/Mapped_Reads
name_list=(/projects/b1182/Test/Raw_Data/*_R1_*) # the length of this list must equal the array length above.
input=${name_list[$SLURM_ARRAY_TASK_ID]}
# We've packed the usual script into a new script that is designed to take
# single filename inputs (rather than looping). It has three input values:
# 1) the base directory
# 2) the value of "input", defined above
# 3) the number of cores, defined above
script_path="/projects/b1182/Paired_End_Trimming_and_Mapping_Single_Sample.sh"
. "$script_path" \
${basedir} \
${input} \
${SLURM_CPUS_PER_TASK}
# END
The script above refers to a trimming and mapping script, Paired_End_Trimming_and_Mapping_Single_Sample.sh. It does the actual trimming and mapping, as follows:
#!/bin/bash
# this is a basic script for trimming and mapping paired end reads.
# It takes as input options:
# 1) for the base directory
# 2) for the filename of read 1
# 3) for the number of cores per task, as defined in slurm
# IT ASSUMES: that the raw data is in the base directory, and that
# the destination directories will be in the base directory as well.
# It uses 4G of memory per cpu, which is hard-coded.
basedir=$1
filename=$(basename $2)
cores=$3
echo "Base directory: $basedir";
echo "Read1 Filename: $filename";
module load python/3.8.4
module load fastqc/0.11.5
module load bowtie2/2.4.1
module load samtools/1.10.1
module load picard/2.21.4
rawdir=${basedir}/Raw_Data
trimdir=Trimmed_Reads
mapdir=Mapped_Reads
index=/projects/p31603/Bowtie_Indices/dm6/dm6
source /projects/p31603/TrimGaloreEnv/bin/activate
echo "path to the first read"
echo ${rawdir}/${filename}
filename2=$(echo $filename | sed "s/_R1_/_R2_/g")
echo "path to the second read"
echo ${rawdir}/${filename2}
echo
trim_galore \
-o ${basedir}/${trimdir} \
--gzip \
--fastqc \
--cores ${cores} \
--paired ${rawdir}/${filename} ${rawdir}/${filename2}
deactivate
tfilename=$(echo $filename | sed "s/.fastq.gz/_val_1.fq.gz/g")
tfilename2=$(echo $tfilename | sed "s/_val_1/_val_2/g")
tfilename2=$(echo $tfilename2 | sed "s/_R1_/_R2_/g")
echo "Next we use Bowtie2 on the output of Trimgalore"
echo ${basedir}/${trimdir}/${tfilename}
echo ${basedir}/${trimdir}/${tfilename2}
shortname=$(echo $tfilename |cut -d_ -f1)
echo
echo "The output for this step will have a filename that contains"
echo ${shortname}
echo
bowtie2 \
-p ${cores} \
-X 2000 \
-x $index \
-1 ${basedir}/${trimdir}/${tfilename} \
-2 ${basedir}/${trimdir}/${tfilename2} \
2> ${basedir}/${mapdir}/${shortname}.bowtie.mapping.log.txt | \
samtools view \
-bS - > ${basedir}/${mapdir}/${shortname}.mapped.bam
samtools sort -@ ${cores} -m 4G -T ${basedir}/${mapdir}/temp${shortname} \
-o ${basedir}/${mapdir}/temp${shortname}1.bam \
${basedir}/${mapdir}/${shortname}.mapped.bam
rm ${basedir}/${mapdir}/${shortname}.mapped.bam
java -jar /software/picard/2.21.4/bin/picard.jar CleanSam \
INPUT=${basedir}/${mapdir}/temp${shortname}1.bam \
OUTPUT=${basedir}/${mapdir}/temp${shortname}2.bam
rm ${basedir}/${mapdir}/temp${shortname}1.bam
java -jar /software/picard/2.21.4/bin/picard.jar MarkDuplicates \
INPUT=${basedir}/${mapdir}/temp${shortname}2.bam \
OUTPUT=${basedir}/${mapdir}/${shortname}.mapped.md.bam \
METRICS_FILE=${basedir}/${mapdir}/${shortname}.dup.metrics.txt
rm ${basedir}/${mapdir}/temp${shortname}2.bam
samtools flagstat ${basedir}/${mapdir}/${shortname}.mapped.md.bam > \
${basedir}/${mapdir}/${shortname}.mapped.md.bam.FLAGSTAT.txt
samtools index ${basedir}/${mapdir}/${shortname}.mapped.md.bam
The first script would be passed to sbatch, and so long as the second one is found at the path indicated in the first, it should operate on the designated files.
Note
The above script examples are meant to provide a starting point for developing your own approaches. The resource allocation was sufficient for mapping PE150 samples with between 30M and 50M pairs of reads. If you have fewer reads (or maybe shorter reads), this will be over-specified. Similarly, larger datasets may require additional resources (most likely more RAM). The section below suggests a way to figure out how much RAM to allocate to this kind of a request.
Issues with running arrays¶
In setting this up, I've run into problems from time to time with memory allocations. When the script above is executed, it will likely set up 12 12-core tasks, with each core being allocated 8 GB of RAM. This is more resource heavy than when this is done in series. If not enough memory is allocated per core, then the task will fail with an "out of memory" error code. Sometimes you might see a more foreboding "Node Fail" error message. To troubleshoot this, the line:
#SBATCH --mem-per-cpu=8G
can be commented out and replaced with:
#SBATCH --mem=0
Specifying --mem=0 will allocate the maximum available memory for the node to the task at hand, and you can then determine how much memory the task took to run by querying the effort needed to complete the job:
seff <job_ID>
Where <job_ID> is the identification of a particular job that you can see by asking:
sacct -X
These commands for monitoring SLURM jobs are described in more detail in the next section. In the scripts above, I eventually settled on 12 cores and 8 GB per core based on trial and error, seeking to have the shortest per-sample wall clock time (as reported by seff). However:
seff reports that in the end, we only use around 25% of the allocated total 96 GB of RAM (~25 GB). It turns out that we need 8 GB per core (even though we only use 4 GB per core in the samtools sort command) because the picard MarkDuplicates commands end up needing >6 GB per core to do their job. We are running MarkDuplicates with default parameters, so we could dive in and figure out the memory management options for this program, but I have not yet done that. If we specify 6 GB or less memory per core, some of the jobs fail because picard will require more than that to complete.
Note
Another issue that comes up is that asking for 12 x 12 cores plus 8 GB RAM per core may be seen as a 'big ask' depending on how busy Quest is. If each array takes a while to transition from "Pending" to "Running", then the time savings associated with the arrayed jobs is reduced. This script above works as well with 4 cores (instead of 12) per job, and I haven't had any issues with slurm delays when asking for this amount of resources. The trade off is that it could take 5-6 hours for the job to complete. The point here is that there are a number of ways to ask for resources, and if you observed significant delays in the job running, try reducing your ask for cores.
Monitoring SLURM Jobs on Quest¶
Here are some commmon commands used to interact with SLURM and check on job status. For more, check out this page.
-
squeue -u <netid>to check all jobs of a user. -
scancel <jobid>to cancel a job.scancel {30000..32000}to cancel a range of jobs (job id between 30000 and 32000). -
sinfo | grep genomicsguestAto check status of nodes.allocmeans all cores on a node are completely engaged;mixmeans some cores on a node are engaged;idlemeans all cores on a node are available to accept jobs. -
seff <jobID>to check how much 'effort' a job took -
sacct -Xto display all accounting data for all jobs for a user.-Xoptions only shows stats relevant to the job allocation itself, leaving out different steps in the process. This is a fine way to check on the status of submitted jobs and to get the job id.
Transfer From BaseSpace to QUEST¶
Illumina provides command line tools that facilitate the direct transfer of project .fastq files from BaseSpace to QUEST. To do this you either need to know the project ID, or know how to look it up. The code below will set up a transfer, which is much much faster than downloading through your personal computer and uploading to QUEST.
# this should probably be done using an interactive slurm session
# add the basespace command line tools
module add basespace
# login to BaseSpace using your credentials
# note, this may only have to be done once...
bs auth
# optional: list available projects newer than 10 days old
bs list project --newer-than=10d
# download to a (target directory)
bs download project -i (projectID) -o (path/to/target/directory)
Current Approach for Routine Processing of Paired-End ChIP/ATAC-seq Data¶
We recently overhauled our core sequencing pipeline to automatically perform a number of routine jobs that were previously handled by individual standalone scripts. A significant amount of the improvement in this pipeline comes from how it utilizes scratch space for fast file read/write. To use this pipeline, you must have a scratch space directory.. If you have one, you can cd to it using cd /scratch/{your netID}. If this directory doesn't exist, then you must apply for access to scratch space.
Overview of the approach¶
We want to trim adapters, map, mark duplicates, and index each individual set of fastQ files. Once this is done, we want to merge bam files across biological replicates, generate bigwig files for visualization, assess overall mapping quality, convert both individual and merged bams to a GRangesList object in R, and clean up intermediate files.
All of these steps are now handled by a single wrapper script, submit_pipeline_v2_1.sh. The master script is located in /projects/b1182/Scripts/Pipeline_v2, and this master script should be copied into the project directory for a particular sequencing run before edits are made. Once formatted, the wrapper submits a series of dependent SLURM jobs that execute each stage in sequence: trimming and mapping (as an array across samples), merging replicates and generating bigwigs, QC summary, GRanges conversion, and cleanup/deletion of trimmed reads. We've decided that keeping trimmed reads is unnecessary.
The scripts that actually do the work (the "operator" scripts) live in the same Scripts/Pipeline_v2 directory. Users should not need to edit these. All user-facing configuration happens within the wrapper itself.
The pipeline will impose the following directory structure on your project directory:
projects
|__ b1182
|__ User
|__ My_Experiment
|__ Raw_Data
|__ Logs
|__ Trimmed_Reads (deleted on successful completion, unless configured otherwise)
|__ Mapped_Reads
|__ Merged
|__ GRanges
|__ Merged
|__ BigWigs
|__ Merged
Once the pipeline is complete, everything should be copied back to the lab server for long-term storage and for local analysis. A copy of the Mapped_Reads and the GRanges can be left on QUEST if a user is interested in performing analysis on the cluster's instance of RStudio.
Running the pipeline¶
This assumes that the user has obtained sequencing data via download from Basespace as described above and that these data have been placed in the top level of a subdirectory named Raw_Data.
1) Prepare a sample manifest.
The wrapper requires a two-column CSV file named following the convention 23003-XX_sample_manifest.csv, placed in your project directory. Admera numbers our sequencing samples using the lab ID (23003) followed by a serial number for the particular run (XX, in this example). We should use this naming convention to keep all of these runs straight.
For the manifest:
- Column 1 is a human-readable name for the biological sample. Replicates must be identified with the suffix
_rep1,_rep2, etc., as the pipeline uses this convention to group samples for merging. If multiple users have contributed samples to the same pool and share similar sample names (e.g.,w_NC14_GFP_rep1), add a distinguishing identifier to prevent unintended merging (e.g.,w_NC14_GFP_user1_rep1). Try to keep these human-readable names short and direct, lacking any unnecessary information. Usuallygenotype_stage_target_(user)_repwill suffice. - Column 2 is the serial number corresponding to the sequencing facility's sample numbering. Use two-digit formatting (e.g.,
01rather than1). This follows the order that the samples were provided to Admera in the sample submission sheet, which is a direct copy-paste from our pooling spreadsheet. Therefore, use the order on the pooling spreadsheet to compute the serial number for your samples.
The pipeline has been designed to accept a .csv file produced even by Excel, so users can use this to generate the manifest, or another text editor if they so desire. Either should work.
2) Copy the wrapper to your project directory and fill in the required variables.
cp /projects/b1182/Scripts/Pipeline_v2/submit_pipeline_v2_1.sh /path/to/My_Experiment/
Open your copy of the script and update the following variables near the top:
| Variable | What to change |
|---|---|
basedir |
Full path to your project directory |
manifest |
Filename of your sample manifest CSV |
scratchdir |
Your personal scratch directory (e.g., /scratch/abc1234) |
PROJ |
Two-digit project identifier (e.g., "07") |
mail_user |
Your email address for SLURM notifications |
The USER_PREFIX and ROUND variables are unlikely to need changing. The PROJ project identifier is the number that follows 23003FL in each of the FastQ files provided by Admera. Review the DELETE_TRIMMED, SKIP_*, and GRanges options at the top of the script and adjust as needed for your situation.
3) Make the script executable and run it from your project directory.
cd /path/to/My_Experiment
chmod u+x submit_pipeline_v2_1.sh
./submit_pipeline_v2_1.sh
The wrapper will submit all of the SLURM jobs, print a summary of the submitted job IDs, and (if DASHBOARD="TRUE") launch a live watch-based dashboard in your terminal showing the progress of each stage. You can quit the dashboard at any time with Ctrl-C without affecting any of the running jobs.
Note
The pipeline detects the number of samples automatically by counting non-blank rows in the manifest, and detects the number of replicate groups for the merging step. Double-check that your manifest has no stray blank lines, and that replicate suffixes are formatted consistently.
Note
If a subset of samples failed and you need to re-run just those array indices, set ARRAY_RANGE in the wrapper to the relevant indices (e.g., "8-9" or "8,9,12") before re-submitting. Individual pipeline stages can also be skipped using the SKIP_MAPPING, SKIP_MERGING, and SKIP_GRANGES flags, which is useful when resuming from a partial run. However you may need to do some troubleshooting to determine why some samples failed. The Logs directory should have information about any errors for a particular sample at a particular stage of the pipeline. This can be a lot to dig through, but remember that the serial number for a sample in Logs is zero-based, so if sample 6 fails, then you are looking for files that end in 05.
Resource allocation. The wrapper uses 12 cpus and 12G per cpu for the mapping arrays by default, and 8 cpus with 6G per cpu for the merging step. These have been tested empirically and should be appropriate for a typical paired-end ChIP-seq or ATAC run. If you observe significant scheduler delays (jobs sitting in "Pending" for a long time), try reducing cpus or mem_per_cpu in the wrapper.
Initial benchmarking on a 48-sample NovaSeqX run took about 2.5 hours for trimming and mapping, 30 minutes for merging and making bigwigs, and ~1.5 hours to export GRanges. So, overall, you can start working with your data in 4-5 hours after initiating the pipeline.
Default Assumptions¶
The .bigwigs that are produced from this pipeline are filtered for mapQuality ≥ 20, are deduplicated, and only reflect primary reads. They are CPM normalized. If these are suboptimal for your needs, re-running the bigwig script is fast.
The GRanges that are produced from this pipeline can be specified for the desired mapQ cutoff and whether or not to exclude duplicates. This is handled within the wrapper script under the GRanges Options heading. Writing the big GRangesList objects here is computationally intensive, and re-doing this will take more work.
Using R on Quest¶
There are two options for using R on Quest, from the command line (no GUI) or with an online instance of RStudio. NUIT has a guide for using R on quest that you can find here.