Hugh Cross' GitHub Page

Bioinformatic Tools for Biodiversity and Evolutionary Genomics

This is my personal GitHub page. I am the Microbial Ecologist at the National Ecological Observatory Network (NEON). This site is the central place to link to my GitHub projects and repositories. I have added some basic (but hopefully useful) scripts, and will populate the site with more content as it is developed.

Packages for Data Analysis

I have helped develop several tools for bioinformatics, from small, basic scripts to larger packages. With my colleague Gert-Jan Jeunen, I built the Python package Crabs, a tool to build and curate DNA sequence reference databases for metabarcoding studies. This tool can source sequences from a range of online databases, including NCBI, Midori, and UNITE. The program includes utilities for in silico PCR, dereplication, and multiple filtering options.

Another package currently in development is phyloNEON, an R package that converts NEON Microbial Community Taxonomy (MCT) data for use in the phyloseq package and other metabarcoding tools. The MCT data are the analyzed products of the raw DNA metabarcoding sequence files that NEON provides from fungal ITS and bacterial/archaeal 16S amplicons. In the past year I revised and rebuilt the data analysis pipeline to run on NEON's Google Cloud Platform. The NEON database has over 12,000 samples for both ITS and 16S from 81 sites across North America. The MCT data are intended for researchers who do not have the computational or bioinformatics resources to analyze the raw sequence data, so I built this package to let users quickly convert the analyzed data into formats that their favorite programs can read for downstream analyses. Also in development are tools to integrate the amplicon data with the great wealth of other NEON data from the field sites.

Teaching Resources

In my previous position as bioinformatician in the Department of Anatomy at the University of Otago, I ran a regular bioinformatics help session for the department. Most of the material from these sessions is available on the hacky hour webpage. The sessions range from getting started with writing scripts to guidelines on troubleshooting command line errors. I also added some reference material to the site, including a section on the building blocks of bioinformatics, which I hope is useful.

As part of the Otago Carpentries group (main link) I have helped put on frequent workshops and training sessions. Here is a link to past workshops, which contain a great deal of content. The Bioinformatics Spring School, held in November 2020, provides a framework for future large events through Otago Carpentries. It consisted of both basic lessons and specific training in several areas, such as gene expression and population genetics.

I have taught several workshops on environmental DNA (eDNA), and organised an eDNA conference attended by researchers from throughout New Zealand and the Pacific region. As part of the conference, I taught a one-day workshop on metabarcoding, which included an optional day for beginners to learn the basics of the command line and R. This was organised with the help of Otago Carpentries. We ran the workshop again as part of the Bioinformatics Spring School.

Jupyter Notebooks

I do most of my Python development using Jupyter notebooks. I have started to export the notebooks in HTML format so I can reference them easily. Here is a link to the index page.

Some Basic Tools

In the repository seq_tools I have put a few scripts that I use for manipulating DNA sequence data. A useful one is seq_extractor.py, which extracts a subset of sequences from a larger fasta file, given a second file that lists the sequence ids to keep. The page contains example files. I have used this script a great deal, as I often want to take a subset of sequences from a file (e.g. differentially expressed transcripts from a transcriptome file). I often use it interactively in Jupyter Notebook. I have extracted sequences from very large files (e.g. spruce genome, 8 Gb with about 12 million scaffolds). The speed improved tremendously (from hours to minutes) when I converted the list of sequence ids to a Python set, because checking membership in a set does not require scanning the whole collection the way a list does. Just goes to show that data structures matter!
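To illustrate why the set matters, here is a minimal sketch of set-based extraction. It is not the code in seq_extractor.py; the function names `read_fasta` and `extract_sequences` are made up for this example, and it assumes the sequence id is the first whitespace-delimited word of each fasta header.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_sequences(
    fasta_lines: Iterable[str], wanted_ids: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only the records whose id appears in wanted_ids."""
    # Converting to a set makes each membership test roughly constant time,
    # instead of a scan through the whole list of ids.
    wanted = set(wanted_ids)
    for header, seq in read_fasta(fasta_lines):
        parts = header.split()
        seq_id = parts[0] if parts else ""
        if seq_id in wanted:
            yield header, seq
```

Because both functions take any iterable of lines, an open file handle can be passed directly, so a large fasta file is streamed rather than loaded into memory.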

Another useful tool is revcomp.py, which returns the reverse complement of either a single sequence or every sequence in a fasta file.
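For readers new to the operation, a reverse complement can be sketched in a few lines of Python. This is an illustration rather than the contents of revcomp.py; the name `reverse_complement` and the decision to handle IUPAC ambiguity codes and to preserve case are assumptions of this example.

```python
# Translation table for DNA/RNA bases, including IUPAC ambiguity codes.
# S and W are their own complements, so they are left untranslated.
_COMPLEMENT = str.maketrans(
    "ACGTURYKMBDHVNacgturykmbdhvn",
    "TGCAAYRMKVHDBNtgcaayrmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]
```

Characters outside the table (gaps, for example) pass through unchanged and are simply reversed along with the rest of the sequence.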

The script seq_name_changer.py is a bit more involved, mostly because it has several options. I like to rename the sequences in each file so that their names carry more information than the random code that the sequencing machine spits out. I know there are heaps of programs and scripts out there that can change read names in a sequence file, or have that capability among other options. However, I have often found that I want to change the names in a way that the usual programs do not support. This script is a combination of the various name-changing scripts I have written. The option I use most often is to conform to the RNAseq paired-end naming format that a few programs require. The most common (although not the only) format is to have forward reads end in '/1' and reverse reads end in '/2', with nothing after this in the name. This option can be invoked with '-f fwd' or '-f rev' (f for format). Not all programs are fussy about the naming, but some of them will spit out errors, or just stop, without telling you what happened. I find it is good practice to rename all my paired-end data to stick to this scheme.
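The '/1' and '/2' convention is simple to express in code. The sketch below is hypothetical and is not taken from seq_name_changer.py: the function name `format_paired_name` is invented, and it assumes that everything after the first whitespace in a read name can be discarded.

```python
def format_paired_name(name: str, direction: str) -> str:
    """Return a read name ending in '/1' (fwd) or '/2' (rev)."""
    suffixes = {"fwd": "/1", "rev": "/2"}
    if direction not in suffixes:
        raise ValueError("direction must be 'fwd' or 'rev'")
    parts = name.split()
    if not parts:
        raise ValueError("read name is empty")
    # Keep only the first word so nothing trails the pair suffix.
    base = parts[0]
    # Drop an existing pair suffix so it is never doubled up.
    if base.endswith(("/1", "/2")):
        base = base[:-2]
    return base + suffixes[direction]


# Example: an Illumina-style name with a trailing comment field.
forward_name = format_paired_name("M01234:55:000 1:N:0:1", "fwd")
```

Here `forward_name` becomes `M01234:55:000/1`: the comment field after the space is dropped and the suffix is appended.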

Useful Tools for NCBI files

My repository genbanking contains a couple of scripts for parsing NCBI files, including BLAST output files and GenBank format sequence files. I use these kinds of files so often that I put together some general tools for parsing and extracting information from them. I describe two below, but I will be adding more.

For running BLAST searches, my preferred output format is tabular (option -outfmt 6). Tabular output is human readable for a quick look, but it is also tab-delimited and so easy to parse, and there are many options to customize which fields it contains. I have written so many scripts that parse one format or another, based on different needs, that I finally wrote a function that allows any of the fields from a tabular BLAST output to be used, with either the default or a custom tab format. I realize that Biopython has good modules for parsing BLAST files, but only from xml format (last I looked), which is less convenient when you want to examine the files before parsing. My script filter_blast_results.py can filter a BLAST file using up to 22 different parameters, such as minimum alignment length or e-value.
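As a rough illustration of the approach (not the code in filter_blast_results.py), the sketch below parses tabular BLAST output into dictionaries keyed by field name and filters on two of the fields. The twelve names in `DEFAULT_FIELDS` are the standard default columns of -outfmt 6; the function names `parse_blast_tab` and `filter_hits` and their parameters are assumptions made for this example.

```python
import csv
from typing import Dict, Iterable, Iterator, Sequence

# Default column order of BLAST+ tabular output (-outfmt 6).
DEFAULT_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_blast_tab(
    lines: Iterable[str], fields: Sequence[str] = DEFAULT_FIELDS
) -> Iterator[Dict[str, str]]:
    """Yield one dict per hit, keyed by the given field names."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the comment lines written by -outfmt 7.
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(fields, row))


def filter_hits(
    hits: Iterable[Dict[str, str]],
    min_length: int = 0,
    max_evalue: float = float("inf"),
) -> Iterator[Dict[str, str]]:
    """Keep hits that are long enough and significant enough."""
    for hit in hits:
        if int(hit["length"]) >= min_length and float(hit["evalue"]) <= max_evalue:
            yield hit
```

For a custom tab format, pass the field names in the same order they were given to BLAST, for example `parse_blast_tab(handle, fields=("qseqid", "sseqid", "length", "evalue"))`. Values are left as strings so that each filter decides how to convert them.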

I have also written several functions for extracting information from GenBank files (.gb). The script genbank_to_fasta.py uses these functions and the Biopython modules SeqIO and SeqFeature to convert a GenBank file to a fasta file. Biopython alone can do this, but the sequence names it outputs are just the GI number, which makes it hard to keep track of your sequences in an alignment or tree output (without renaming them). The included script uses the taxon name and either the GI or Accession number to name the sequences. There are other options to modify these names, including adding the gene name or description after the sequence name.