?:abstract
|
-
Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe, and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind data acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results on two important infectious diseases, SARS and influenza. SARS and influenza have similarities and important differences, both as biological problems and as comparative genomic analysis problems. Influenza viruses (Orthomyxoviridae) have RNA-based genomes. Current evidence indicates that influenza viruses originate in wild populations of aquatic birds. Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of (1) the various transmission pathways, such as the role of intermediate hosts like swine and domestic birds, and (2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from the HA and N loci have become an important data type. In the past year, full genomic data have become prominent.
These data present exciting opportunities to address unanswered questions about influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA-based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002, and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. A major goal in science today is to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing the diverse information brought together by various researchers is important for finding out what can be done to prevent future outbreaks [JON03]. Thus, comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. Moreover, when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor–descendant relationships and an inferred summary of mutations, recombination events, host shifts, and the geographic and temporal spread of the viruses. However, this synthesis comes at a price.
The computational cost of phylogenetic analysis grows combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with the synergistic development of heuristic tree search strategies and parallel computing.
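To give a sense of the combinatorial growth mentioned above, the following minimal sketch counts the distinct unrooted, strictly bifurcating tree topologies for a given number of isolates, using the standard double-factorial result (2n − 5)!!. It is an illustration only, not the search method used in this work; the function name `num_unrooted_trees` is hypothetical.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, strictly bifurcating tree topologies for n_taxa leaves.

    Uses the standard result (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5),
    valid for n >= 3. For fewer than three taxa there is one topology.
    """
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Print the count for a few dataset sizes to show how fast it grows.
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_trees(n))
```

Even at 10 isolates there are already more than two million candidate topologies, which is why exhaustive evaluation is impractical and heuristic search is used for datasets of realistic size.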
|