COBRE: UID: Pilot: Algorithmic Improvement of Calls and Reads
This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. Primary support for the subproject and the subproject's principal investigator may have been provided by other sources, including other NIH sources. The Total Cost listed for the subproject likely represents the estimated amount of Center infrastructure utilized by the subproject, not direct funding provided by the NCRR grant to the subproject or subproject staff.

The introduction of 454 Sequencing has led to significant improvements in sequencing throughput and dramatically reduced sequencing costs. Comparisons estimate that its throughput is 100-fold better than that of Sanger sequencing, at significantly reduced cost. Thus, it represents a clear advance in sequencing technologies.

However, 454 Sequencing is still limited by several significant weaknesses. Compared to Sanger sequencing, the read lengths are relatively short, although improvements in the technology are leading to longer reads. Additionally, 454 reads have comparatively high error rates. These weaknesses are closely related: error rates increase significantly as read lengths increase, making errors one of the major limiting factors on read length. Past a certain point, the read data is so error-filled as to be useless.

Shorter, more error-prone reads make it more difficult to generate successful, accurate sequences of longer (e.g., non-bacterial) genomes. This weakness can be partially overcome through higher coverage: multiple overlapping reads are used to identify erroneous reads and build 'consensus' reads.
However, generating additional coverage increases cost and time, reducing the advantages of 454 Sequencing.

We propose to address the problem of read errors by 1) measuring the correlations between measured intensities of 454 wells and using those correlations to model and correct calling errors, and 2) using the raw intensity data (rather than just the called bases, as is currently done) in combination with hidden Markov models to produce more accurate consensus reads. The resulting error-correction algorithms will be packaged into software that is easy to insert into the current 454 data processing pipeline. The results of this research would both improve the quality of generated 454 sequences and help to maximize 454 Sequencing's advantages of high throughput and low cost by limiting the need for redundant reads for error correction.
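As a rough illustration of why raw intensities might carry more information than called bases, the sketch below contrasts two ways of building a consensus from overlapping reads. It is not the proposed hidden Markov model approach; it swaps in simple per-flow averaging as a stand-in. The flow order `TACG`, the function names, the assumption that reads are already aligned flow-by-flow, and the example intensity values are all hypothetical and chosen only for demonstration.

```python
from collections import Counter
from statistics import mean

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order (illustrative)


def call_flows(intensities):
    """Round each flow intensity to the nearest homopolymer length."""
    return [max(0, int(x + 0.5)) for x in intensities]


def flows_to_bases(lengths):
    """Expand per-flow homopolymer lengths into a base string."""
    return "".join(FLOW_ORDER[i % 4] * n for i, n in enumerate(lengths))


def consensus_from_calls(reads):
    """Call each read separately, then take a majority vote per flow."""
    called = [call_flows(r) for r in reads]
    return [Counter(col).most_common(1)[0][0] for col in zip(*called)]


def consensus_from_intensities(reads):
    """Average the raw intensities per flow, then call once."""
    return call_flows([mean(col) for col in zip(*reads)])


# Three hypothetical reads covering the same four flows (made-up values).
reads = [
    [1.02, 0.05, 1.90, 0.97],
    [0.96, 0.10, 2.60, 1.05],
    [1.10, 0.02, 2.55, 0.90],
]

by_calls = flows_to_bases(consensus_from_calls(reads))
by_intensity = flows_to_bases(consensus_from_intensities(reads))
print(by_calls, by_intensity)
```

With these made-up values the two consensus strategies disagree on the length of the C homopolymer: voting on called bases discards how far each intensity sat from the rounding threshold, while averaging keeps it. Which answer is correct depends on the actual noise characteristics of the wells, which is what the proposed modeling would need to establish.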