Genome assembly and slurm management system - Masterclass

> [!tip] Note > This document was written for a masterclass ran by myself and Laura Campbell in February 2023. It was written as an introduction to supercomputing on Durham University's supercomputer Hamilton for 2nd and 3rd year undergraduates. We hope it offers a broad introduction to the topic of bioinformatics for those just starting out. The files used in this practical can be download from https://github.com/ChristophePatterson/My-first-Genome # Introduction to supercomputers ## What is a supercomputer? A supercomputer is just a computer that has very high levels of performance. Super computers are becoming increasingly important as the volume of data created, for example by full genome sequencing, increases. A typical laptop will have 4 cores. Hamilton 8, Durham's supercomputer, has over 15,000 cores. A core is part of a computer capable of completing a computation/task. ## Using the command line When using the Hamilton 8 supercomputer we access files and programmes using the command line instead of a graphical user interface (GUI) like we have on our Windows PCs and Macs. This is primarily to save computational resources. Thankfully using the command line is easier than it may initially seem - Google is your friend! Here are some useful commands for this practical: List - this will list all files in a given directory (directory = folder) ```bash ls ``` This command is useful to check your files are all present, as well as a way to make sure you are in the correct directory. Make a directory - the above command will make a new directory called "genome_assembly" ``` mkdir genome_assembly ``` Nano is a Linux text editor. The above command will produce a new text file called "hifiasm_name.q" and then open it in Nano. If there is already a file called "hifiasm_name.q" in the current directory then Nano will open that file. ```bash nano hifiasm_name.q ``` Prints the first 10 lines of a file ```bash head file_name ``` Prints the last 10 lines of a file ```bash tail file_name ``` Loads the bioinformatics modules available on Hamilton 8 ```bash module load bioinformatics ``` Prints a list of the available modules (programs) including the bioinformatics modules ```bash module avail ``` Loads hifiasm software ```bash module load hifiasm ``` ## SLURM queueing system **S**imple **L**inux **U**tility for **R**esource **M**anagement Supercomputers are almost always shared between a lot of people. Even though Hamilton 8 is a highly powerful computer, the resources are not infinite! Therefore it is important that resources are shared fairly between all users. To complete tasks on Hamilton 8 you have to submit your code as a "job" using the SLURM queuing system. Below is an example of a SLURM script. ``` #!/bin/bash # Request resources: #SBATCH -c 1 # 1 CPU core #SBATCH --mem=1G # memory required, up to 250G on standard nodes #SBATCH --gres=tmp:1G # temporary disk space ($TMPDIR), up to 400G #SBATCH -t 01:00:00 # time limit for job (format: days-hours:minutes:seconds) #SBATCH --mail-user=<EMAIL> # Email address to send status report #SBATCH --mail-type=BEGIN,END,FAIL # Types of status update to email #SBATCH -p shared # Designated queue for job (defaults to shared) ``` For this practical we have provided suitable SLURM scripts for you, but if you do go onto working on Hamilton 8 yourself (or any other supercomputer that utilises SLURM) it is worth experimenting with how many resources you need. Jobs will fail if you do not ask for enough computational resources, however, if you ask for a lot of resources you will spend longer in the queue. Often, increasing the number of cores will make your job run faster. # Introduction to sequence data We are going to go through some of the stages of genome assembly, assessment, and annotation. Due to the large computational tasks that genome assembly involves we have reduced the scope of the data we are using. The process is the same but the output will differ slightly from that of a whole genome assembly. # Getting the data for this practical We have set up some files for you to view and work with. To access these files type in the following commands. ``` # copy the files to your own space scp login1:/tmp/masterclass_data_files.tgz # extract the data files from the compressed tar file tar xvf masterclass_data_files.tgz ``` You should now have a few files and a folder contained within you home directory. Check this by using the `ls` command. # Types of Sequences files In order to construct a draft genome it's important to understand how DNA data is stored and can be viewed. DNA data comes in a variety of different formats but the main two are *FASTA* and *FASTQ*. ### FASTA files FASTA format files store data over two lines. The first always starts with a "`>`" symbol and provides some information about the DNA sequence that follows it. This may be the name of organism it comes from, how it was sequenced or how it the DNA sequences was assembled. The following line contains the DNA sequences typical composed of `ATCG`, but other symbols are possible such as `N` for unknown. Here is a typical example. ``` >Clibanarius_erythropus_Sample_Ke8_Assembly_consensus_sequence_for_COX1 GGTCAACAAATCATAAAGATATTGGCACTTTATATTTTATTTTTGGAGCTTGAGCTGGTATGGTAGGATCTTCACTTAGACTAGTTATTCGAGCCGAACTAGGCCARCCCGGCAGTTTAATTGGAGATGACCAAATCTACAACGTAGTTGTCACAGCCCATGCTTTTGTAATAATTTTTTTCATRGTAATACCTATTATAATTGGTGGKTTTGGTAATTGATTAGTTCCTTTAATATTAGGAGCYCCGGATATAGCTTTCCCACGAATAAACAACATAAGATTCTGATTACTRCCCCCGTCTCTCACTTTATTATTAATAAGAGGAATGGTTGAAAGAGGTGTCGGAACAGGATGAACTGTTTACCCTCCYYTGTCTGCTGCWATTGCCCACGCGGGTGCTTCGGTYGATCTGGGTATTTTTTCTTTACACTTAGCAGGAGTGTCCTCAATCTTAGGCGCCATCAATTTTATAACCACAGTGATCAATATACGGCCCCGTGGKATAAGAATAGACCGAATACCTTTATTCGTGTGRTCTATYTTTATTACCACTATTTTACTTTTATTRTCCCTACCTGTTTTAGCAGGYGCCATTACCATATTATTAACAGACCGAAACTTAAATACTTCTTTCTTTGACCCRGCTGGTGGAGGRGA-CCCTGTTTTATATCAACATTTATTTTGATTTTTTGGTCACCC ``` We have uploaded a FASTA sequence data file to your home drive on Hamilton. You can view it using the `head` command. FASTQ files FASTQ files contain the same information as FASTA files but with the added information of quality information about the confidence of each nucleotide in the sequence. Sequencing machines are never 100% accurate and the quality of the base call is preserved within the FASTQ file format. The first two lines of a fastq file are the same as fasta file, but there are two additional lines that contains the quality of each base call. This is displayed by a new line containing the symbol "+" and then the following line contains coded information for the base call quality. Here's an example: ``` @m64157e_211022_191939/7/ccs ACATCAGCTTTAATCTTGGGTAGGACTAATGAAGCCTTTAGGTTTCCAAACATGGTTTATTCATTACTTAGAGCTGAG + cSFRTWRR+L?:INRNbBddZYBOSRXE_ASKaSH_5FLR@T8HV1R1XD@T4CY9HF=HbH*=\E%1UXRYRRVSX8 ``` The first line contains the sequence identifier, in this case it corresponds to a sequencing machine and then the specific sequencing run used. The next line contains the DNA sequences The third line always contains a "+". The forth line displays the encoded quality information. Each symbol corresponds to a numerical quality score from 0 to 93. Zero is complete rubbish we have no idea what that base could be really. 93 is great we are very confident that individual base has been called correctly. The quality information is not provided as numbers so that each symbol on the forth line corresponds the the same base on the second. For instance the first base has been called as an "`A`" and has a quality score of "`c`". "`c`" is the code for "`66`". The final base in this sequence was called as a "`G`" and has a quality score of "`8`" which is code for "`23`". ## Viewing data on the command line. We have provided you with some sequence data. Use the `head` command to view the FASTQ file stored in the "data" directory. Can you spot each of the four components of the fastq file? # Genome construction We are now going to construct our first genome. In the interest of time we are going to be assembling a small section of the genome of the rubyspot damselfly, *Hetaerina titia*, specifically this female *H titia*. ![[P1010242_HiFi_Female_HXRCb13.jpeg]] To construct our first genome we are going to use the programme [*Hifiasm*](https://hifiasm.readthedocs.io/en/latest/). The code to use Hifisam is surprisingly simple. ```bash hifiasm -o <OUTPUT_FILE_NAME> -t <NUMBER_OF_CORES> /path/to/sequence_data.fastq ``` Remember we are using the SLURM mangement system so we need to create a slurm script first. This should look like the following. Hifiasm has already been installed on Hamilton and you can activate it by using `module load`. ```bash #!/bin/bash #SBATCH -c 8 # The number of cores needed #SBATCH --mem=200G # memory required, in units of k,M or G, up to 250G. #SBATCH --gres=tmp:300G # $TMPDIR space required on each compute node, up to 400G. #SBATCH --time=03:00:00 # time limit in format dd-hh:mm:ss # Loading modules module load bioinformatics module load hifiasm/0.16.1 # Run Hifiasm hifiasm -o H_titia.asm -t 8 gbk.HiFiMapped.bam.fasta ``` Once you have prepared your script, send it to the queue so that your code runs by using the sbatch command ```bash sbatch my_SLURM_script.q ``` You can check the status of your job by either using `sacct` or `squeue` ```bash sacct ``` ```bash squeue -u [username] ``` This will likely take a few mins to run. That's is! You've run your first genome assembly. You can view the draft genome by using the code `cat /path/to/assembly/`. This likely prints out a LOT of ATCGs, to reduces this output we can port the output of cat into the `head` command by using the `|` symbol. `cat /path/to/assembly | head`. How large is the assembly file compared to the raw sequence data? # Genome assement We are now going to use a tools called [Quast](https://quast.sourceforge.net/). Quast is used to asses the quality of a genome assembly through the calculation of statisitcs such as N50 and L50. We have uploaded three draft genome from three different organisms. All three can be found and download from NCBI (National Center for Biotechnology Information). - The red squirrel (*Sciurus vulgaris*) - CACRXJ02.1 - *Genlisea aurea* (A canivours plant) - GCA_000441915.1 - The buff-tailed bumblebee (*bombus terrestris*) - GCF_910591885.1 We are going to run Quast assesments on all of these genome assemblies. The assemblies are stored as `.gz` files which means they are compressed to save memory. For instance, if we had a run of twelve A's (`AAAAAAAAAAAA`) we could store it as 12 A (12 bytes) but alternatively we could compress this information into `Ax12` (4 bytes) This isn't how .gz files compress data but does show that a typlical genome can be easily compressed into a smaller file. The following scripts runs quast on the three genomes ```bash #!/bin/bash #SBATCH -c 6 #SBATCH --mem=30G # memory required, in units of k,M or G, up to 250G. #SBATCH --gres=tmp:30G # $TMPDIR space required on each compute node, up to 400G. #SBATCH --time=02:00:00 # time limit in format dd-hh:mm:ss #SBATCH --output=slurm-%x.%j.out # Commands to execute start here module load bioinformatics module load quast/5.2.0 quast.py GCF_910591885.1_iyBomTerr1.2_genomic.fna.gz \ Bombus_terrertris.gz \ Genlisea_aurea.gz \ Sciurus_vulgaris.gz \ -o Quast \ -t 6 \ --eukaryote --large ``` You can then view the outputed report by opening the `report.txt` file in hamilton or by downloading the `report.html` on to you personal computer (if you are using winSCP). ``` cat Quast/report.txt ``` Take you time to look over the report. What information can you extract? # END ## Christophe Patterson and Laura Campbell