This is my personal GitHub page. I am a bioinformatician in the Department of Anatomy at the University of Otago in Dunedin, New Zealand. This site will be the central place to link to my GitHub projects and repositories. I have added some basic (but hopefully useful) scripts, and will populate this site with more content as new projects are developed.
I have been running a bioinformatics help session in the Anatomy department tea room. This was postponed during the pandemic but will be starting again soon. Most of the material from these sessions is available on the hacky hour webpage. These sessions range from getting started writing scripts to some guidelines on troubleshooting command line errors. More material will be coming soon.
As part of the Otago Carpentries group (main link) I have helped put on frequent workshops and training sessions. Here is a link to past workshops, which contain a great deal of content. The Bioinformatics Spring School, held in November 2020, provides a framework for future large events through Otago Carpentries. It consisted of both basic lessons and specific training in several areas, such as gene expression and population genetics.
I have taught several workshops on environmental DNA (eDNA), and organised an eDNA conference attended by researchers from throughout New Zealand and the Pacific region. As part of the conference, I taught a one-day workshop on metabarcoding, which included an optional day for beginners to learn the basics of the command line and R. This was organised with the help of Otago Carpentries. We ran the workshop again as part of the Bioinformatics Spring School.
The growing interest in environmental DNA has led to the creation of an eDNA Hub at the University of Otago. Our inaugural meeting will be held on 25 March. This initiative was organised by Dr. Tina Summerfield (Department of Botany) and me. The Hub is supported by Genetics Otago and Genomics Aotearoa.
I do most of my Python development using Jupyter notebooks. I have started to export the notebooks in HTML format so I can reference them easily. Here is a link to the index page.
In the repository seq_tools I have put a few scripts that I use for manipulating DNA sequence data. A useful one is seq_extractor.py, which extracts a subset of sequences from a larger fasta file, given a second file that lists the sequence ids to keep. The page contains example files. I have used this script a great deal, as I often want to take a subset of sequences from a file (e.g. differentially expressed transcripts from a transcriptome file). I often use it interactively in Jupyter Notebook. I have extracted sequences from very large files (e.g. spruce genome, 8 Gb with about 12 million scaffolds). The speed improved tremendously (from hours to minutes) when I converted the list of sequence ids to a Python set, because checking membership in a set takes roughly constant time, whereas checking a list means scanning it item by item. Just goes to show that data structures matter!
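To illustrate the set-based lookup described above, here is a minimal sketch. It is not the code in seq_extractor.py: the function names `read_fasta` and `extract_subset` are made up for this example, and it assumes that the sequence id is the first whitespace-delimited word of each fasta header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_subset(fasta_lines, wanted_ids):
    """Yield only the records whose id is in wanted_ids."""
    # Building the set once makes each membership test roughly constant time,
    # instead of a scan through a list for every sequence in the file.
    wanted = set(wanted_ids)
    for header, seq in read_fasta(fasta_lines):
        parts = header.split()
        seq_id = parts[0] if parts else ""  # assumption: id is the first word
        if seq_id in wanted:
            yield header, seq
```

Because both functions are generators, an open file handle can be passed as `fasta_lines` and the large file is read one line at a time rather than loaded into memory.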
Another useful tool is revcomp.py, which simply returns the reverse complement of either a single sequence or a fasta file.
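For readers who have not met the operation before, a reverse complement can be sketched in a few lines of plain Python. This is an illustration only, not the contents of revcomp.py; the name `reverse_complement` is made up here, and the sketch handles only A, C, G, T and N (any other character is passed through unchanged).

```python
# Translation table mapping each base to its complement, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string."""
    # Complement every base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`.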
The script seq_name_changer.py is a bit more involved, mostly because it has several options. I like to rename the sequences in each file so the names carry more information than the random code that the sequencing machine spits out. I know there are heaps of programs and scripts out there that can change read names in a sequence file, or have that capability among other options. However, I have often found that I want to change the names in a way that the usual programs don't support. This script is a combination of the various name-changing scripts I have written. The option I use often makes the names conform to the RNAseq paired-end naming format that a few programs require. The most common (although not the only) format is to have forward reads end in '/1' and reverse reads end in '/2', with nothing after this in the name. This option can be invoked with '-f fwd' or '-f rev' (f for format). Not all programs are fussy about the naming, but some of them will spit out errors, or just stop without telling you what happened. I find it is good practice to rename all my paired-end data to stick to this scheme.
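The '/1' and '/2' convention can be sketched as below. This is a hypothetical helper, not the logic of seq_name_changer.py: the name `to_paired_name` is invented, and the choices to keep only the first whitespace-delimited word of the read name and to strip an existing '/1' or '/2' are assumptions made for the example.

```python
PAIR_SUFFIX = {"fwd": "/1", "rev": "/2"}


def to_paired_name(name, direction):
    """Return a read name ending in '/1' (fwd) or '/2' (rev), with nothing after it."""
    if direction not in PAIR_SUFFIX:
        raise ValueError("direction must be 'fwd' or 'rev'")
    # Assumption: everything after the first whitespace is extra description
    # and can be dropped so that the suffix is the last thing in the name.
    parts = name.split()
    base = parts[0] if parts else ""
    # Avoid doubling up if the name already carries a pair suffix.
    if base.endswith(("/1", "/2")):
        base = base[:-2]
    return base + PAIR_SUFFIX[direction]
```

For example, `to_paired_name("read42 1:N:0:ATCACG", "fwd")` gives `"read42/1"`.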
My repository genbanking contains a couple of scripts for parsing NCBI files, including BLAST files and Genbank format sequence files. I use these kinds of files so often that I put together some general tools for parsing and extracting information from them. I describe two below, but I will be adding more.
For running BLAST searches, my preferred output format is tabular (option -outfmt 6). Tabular files are human readable for a quick look, but are also tab-delimited and so easy to parse. There are also many options to customize the columns. I have written so many scripts that parse one format or another, based on different needs, that I finally wrote a function that allows any of the fields from a tabular BLAST output to be used, with either the default or a custom tabular format. I realize that Biopython has good modules for parsing BLAST files, but only from xml format (last I looked), which is less convenient when you want to examine the files before parsing. My script filter_blast_results.py can filter a BLAST file using up to 22 different parameters, such as minimum alignment length or evalue.
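As a rough idea of how tabular BLAST output can be parsed and filtered, here is a minimal sketch. It is not the code in filter_blast_results.py: the names `parse_blast_tab` and `filter_hits` are invented, and only two of the possible filters are shown. The default column list is the standard twelve-column layout that BLAST+ writes for -outfmt 6; a custom layout can be passed in as `fields`.

```python
DEFAULT_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_blast_tab(lines, fields=DEFAULT_FIELDS):
    """Yield one dict per hit from tab-delimited BLAST output lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        hit = {}
        for name, value in zip(fields, line.split("\t")):
            if name in INT_FIELDS:
                hit[name] = int(value)
            elif name in FLOAT_FIELDS:
                hit[name] = float(value)
            else:
                hit[name] = value
        yield hit


def filter_hits(hits, min_length=0, max_evalue=float("inf")):
    """Yield hits that meet a minimum alignment length and a maximum evalue."""
    for hit in hits:
        if hit["length"] >= min_length and hit["evalue"] <= max_evalue:
            yield hit
```

A call such as `filter_hits(parse_blast_tab(handle), min_length=100, max_evalue=1e-5)` would then keep only long, significant hits, where `handle` is an open BLAST results file.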
I have also written several functions for extracting information from genbank files (.gb). The script genbank_to_fasta.py uses these functions and the Biopython modules SeqIO and SeqFeature to convert a genbank file to a fasta file. Biopython alone can do this, but the output sequence names are just the GI number, which makes it hard to keep track of your sequences in an alignment or tree output (without renaming them). The included script uses the taxon name and either the GI or accession number to name the sequences. There are other options to modify these names, including adding the gene name or description after the sequence name.
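The naming idea can be sketched without Biopython, as below. This is only an illustration, not the code in genbank_to_fasta.py: the helpers `make_fasta_name` and `format_fasta` are invented, joining the parts with underscores is an assumption, and the accession in the example is made up. In a real script the organism and accession would come from the parsed genbank record.

```python
import re


def make_fasta_name(organism, accession, gene=None):
    """Build a name such as 'Genus_species_ACCESSION', optionally with a gene."""
    # Replace runs of whitespace so the name survives alignment and tree programs.
    taxon = re.sub(r"\s+", "_", organism.strip())
    parts = [taxon, accession]
    if gene:
        parts.append(gene)
    return "_".join(parts)


def format_fasta(name, sequence, width=60):
    """Return one fasta record as text, wrapping the sequence at `width` bases."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, `make_fasta_name("Picea abies", "AB000001.1", gene="rbcL")` gives `"Picea_abies_AB000001.1_rbcL"`.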
More content coming!