Thesis (Ph.D., Bioinformatics & Computational Biology)--University of Idaho, June 2014 | DNA microarrays and high-throughput sequencing (HTS) have allowed investigators to characterize nucleic acids (DNA and RNA) across tissues, treatments, samples, and species, providing a wealth of information efficiently and cost-effectively. This dissertation presents an application of DNA microarrays and two novel methods for the analysis of HTS data.
Bacterial and fungal species that live in and on the human body are believed to play important roles in the maintenance of health and prevention of disease. To study these communities in the human vagina, we developed the VChip, a DNA microarray with probes representing 313 strains of bacteria as well as 716 human immunity genes. This array was validated using mock bacterial communities and tested using DNA and cDNA from vaginal swabs. The VChip results accurately reflected the composition of the mock bacterial communities and were similar to those obtained from 16S rRNA amplicon pyrosequencing.
Assembly by Reduced Complexity (ARC) is a software package that facilitates iterative, reference-seeded assembly of HTS datasets. This strategy is useful for datasets that can be divided into several discrete subsets, each of which can be assembled independently. A set of reference or "target" sequences is used to recruit initial subsets of reads. Each subset is assembled independently into contigs, and these contigs are then used to recruit a new set of reads. This process is iterated to grow the assemblies until stopping conditions are met. I show that ARC works well even with moderately divergent references and is not plagued by reference bias, a serious limitation of mapping-based strategies.
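The recruit, assemble, and re-recruit loop described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not ARC's implementation: reads are recruited by exact shared k-mers rather than by a read mapper, assembly is a naive greedy overlap merge rather than a real assembler, only a single target is handled, and the stopping conditions (no new reads recruited, or an iteration cap) are chosen for illustration only. All function and parameter names (`recruit`, `toy_assemble`, `iterative_assembly`, `k`, `min_overlap`, `max_iterations`) are hypothetical.

```python
from itertools import permutations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit(reads, seeds, k):
    """Select reads sharing at least one k-mer with any seed (stand-in for mapping)."""
    seed_kmers = set()
    for seed in seeds:
        seed_kmers |= kmers(seed, k)
    return {read for read in reads if kmers(read, k) & seed_kmers}


def merge_once(contigs, min_overlap):
    """Perform one merge step, or return None if nothing can be merged."""
    for a, b in permutations(contigs, 2):
        if b in a:  # b is fully contained in a, so drop it
            return [c for c in contigs if c != b]
        longest = min(len(a), len(b)) - 1
        for n in range(longest, min_overlap - 1, -1):
            if a.endswith(b[:n]):  # suffix of a matches prefix of b
                rest = [c for c in contigs if c not in (a, b)]
                return rest + [a + b[n:]]
    return None


def toy_assemble(reads, min_overlap):
    """Greedy overlap merging; a placeholder for a real assembler."""
    contigs = sorted(set(reads))
    while True:
        merged = merge_once(contigs, min_overlap)
        if merged is None:
            return contigs
        contigs = sorted(set(merged))


def iterative_assembly(reads, target, k=5, min_overlap=5, max_iterations=10):
    """Seed with a target, then alternate recruitment and assembly."""
    seeds = [target]
    recruited = set()
    contigs = []
    for _ in range(max_iterations):
        newly_recruited = recruit(reads, seeds, k)
        if newly_recruited == recruited:  # assumed stopping condition
            break
        recruited = newly_recruited
        contigs = toy_assemble(recruited, min_overlap)
        seeds = contigs  # contigs from this round seed the next round
    return contigs
```

Because each round's contigs extend beyond the original target, they can recruit reads that the target alone would have missed, which is how the assembly grows outward over successive iterations.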
StopGap is a strategy for improving genome assemblies. Gap-spanning Pacific Biosciences continuous long reads are identified and used to guide assembly of high-quality Illumina or 454 reads with the ARC pipeline. Two assembly-merging programs were tested for their ability to take advantage of these gap-bridging contigs. I show that this approach produced more contiguous assemblies and better represented repeated sequences within the assembly. Although StopGap was used here to improve the assembly of a bacterial genome, this approach could also be used in the assembly of more complex eukaryotic genomes.