Mathematics of Information Technology and Complex Systems

Research

Under the Mprime project "Assembly and Analysis of 2-base Encoded Sequencing Data", we are using advanced computational methods to build better tools for working with dibase-encoded, or "color-space", data generated by the ABI SOLiD technology. Our research involves four separate directions: read mapping, genome assembly, variation discovery, and data visualization.
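To make the term "color-space" concrete, the following is a minimal sketch of the standard SOLiD dibase scheme, in which each color encodes a pair of adjacent bases. The function names are hypothetical and chosen for illustration; they are not taken from any of the tools described here.

```python
# 2-bit base codes; in the dibase scheme the color of a base pair
# is the XOR of the two codes (so AA, CC, GG, TT all give color 0).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(seq):
    """Return one color (0-3) per pair of adjacent bases in seq."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the letter-space sequence from a known first base and its colors."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # each color maps the previous base to the next one
        bases.append(CODE_BASE[code])
    return "".join(bases)


def color_differences(colors_a, colors_b):
    """Return the positions at which two equal-length color strings differ."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


reference = encode_colors("ACGT")   # [1, 3, 1]
with_snp = encode_colors("ACTT")    # one base changed: G -> T
print(color_differences(reference, with_snp))   # two adjacent colors change
print(decode_colors("A", [1, 0, 1]))            # a single miscalled color
```

A single-base change alters two adjacent colors, whereas a single miscalled color corrupts every base after it when decoded naively. This is why color-space reads are better handled by color-aware tools than by translating them to letters first.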

SHRiMP: Read Mapping

We have developed SHRiMP, a widely used package for mapping short reads generated by "next-generation" sequencing platforms. SHRiMP featured the first full alignment algorithm for color-space data, allowing not only for sequencing errors and SNPs but also for insertions and deletions. The latest release of the SHRiMP package (version 1.3.0) is nearly five times faster than previous versions. This speedup was achieved mainly by caching signatures (hashes) of previous hits, so that alignments for multiple similar regions do not have to be recomputed.
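The caching idea can be illustrated with a small sketch. This is not SHRiMP's actual implementation, whose details are not given here. It assumes a plain letter-space Smith-Waterman score and a hypothetical `CachedAligner` class that keys results on a hash of the read and the candidate reference window.

```python
import hashlib


def sw_score(read, window, match=2, mismatch=-1, gap=-2):
    """Plain Smith-Waterman local alignment score (no traceback).

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(window) + 1)
    best = 0
    for r in read:
        cur = [0]
        for j, w in enumerate(window, start=1):
            diag = prev[j - 1] + (match if r == w else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


class CachedAligner:
    """Hypothetical wrapper that skips alignments it has already computed."""

    def __init__(self):
        self.cache = {}      # signature -> alignment score
        self.computed = 0    # number of full alignments actually run

    @staticmethod
    def signature(read, window):
        # Hash of the (read, window) pair, used as the cache key.
        return hashlib.sha1(f"{read}|{window}".encode()).digest()

    def score(self, read, window):
        key = self.signature(read, window)
        if key not in self.cache:
            self.cache[key] = sw_score(read, window)
            self.computed += 1
        return self.cache[key]


aligner = CachedAligner()
windows = ["ACGTACGT", "TTTTACGT", "ACGTACGT"]  # the third repeats the first
scores = [aligner.score("ACGT", w) for w in windows]
print(scores, aligner.computed)
```

In this sketch the repeated window is served from the cache, so only two of the three candidate hits trigger the dynamic-programming step. The benefit grows with the number of repetitive regions a read hits.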

VARiD: Identification of SNPs and Small Indels

VARiD is a tool for detecting SNPs and small indel polymorphisms from alignments of reads to a reference genome (e.g. alignments produced by a tool like SHRiMP). One of the key advantages of VARiD is its ability to combine color-space reads with regular, letter-space reads to achieve significantly higher accuracy than is possible with either type of data alone. On color-space data alone, VARiD outperforms ABI's Corona Lite pipeline. An alpha version of VARiD is currently available in MATLAB, and a full C version is under development.

Savant: Genome Browser

Savant, the Sequence Annotation, Visualization and ANalysis Tool, is a desktop visualization and analysis browser for genomic data. Savant was developed for visualizing and analyzing high-throughput sequencing data, with special care taken to enable dynamic visualization in the presence of gigabases of genomic reads and references the size of the human genome. Savant supports the visualization of genome-based sequence, point, interval, and continuous datasets. It also offers multiple visualization modes that enable easy identification of genomic variants (including SNPs, structural variants, and copy-number variants) and functional genomic information (e.g. peaks in ChIP-seq data) in the context of genomic annotations.