The pace of human genomic sequencing has outstripped the ability of sequencing centers to annotate and understand the sequence prior to submitting it to the archival databases. Multiple third-party groups have stepped into the breach and are currently annotating the human sequence with a combination of computational and experimental methods. Their analytic tools, data models, and visualization methods are diverse, and it is self-evident that this diversity enhances, rather than diminishes, the value of their work.
The main risk of third-party annotation is that it may fracture knowledge about the genome. Instead of having a convenient one-stop source for genomic annotation, such as Entrez, researchers may have to check multiple Web sites for information about a particular region of interest, download the data in several different formats, and perform a manual integration in order to get the whole picture. Clearly, this is undesirable.
There are several possible approaches to this problem. One is for each of the annotation centers to submit their annotations to a centralized database, such as GenBank. However, this option raises a number of political and technical problems, not the least of which is the long-held tradition of GenBank and its sister databases of allowing only the sequence submitter to modify or comment on a GenBank entry. Another option would be a system which uses Web links to point from the GenBank entry to one or more annotation Web sites. Such a system is available now in the form of the NCBI LinkOut service. However, while this makes it easier for researchers to find third-party annotation sites, it does not solve the problem of data integration.
The solution that we advocate allows sequence annotation to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. A single server is designated the "reference server." It serves essential structural information about the genome: the physical map which relates one entry to another (where an "entry" is an arbitrary segment of the sequence, such as a sequenced BAC or a contig), the DNA sequence for each entry, and the standard authorship information. Multiple sites then act as third-party "annotation servers." Using a Web browser-like application, researchers can interrogate one or more annotation servers to retrieve features in a region of interest. The servers return the results using a standard data format, allowing the sequence browser to integrate the annotations and display them in graphical or tabular form. No attempt is made to automatically resolve contradictions between different third-party annotations. Indeed, it is the ability to facilitate comparison among different centers' annotations that distinguishes this proposal. We currently have a working prototype of this system based on ACeDB servers and CGI scripts, and are now generalizing this architecture to support other client and server combinations.
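The client-side integration step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the annotation servers are modelled as in-memory Python callables rather than network services, and every name here (`Feature`, `make_server`, `integrate`, the center names and coordinates) is hypothetical and not part of any DAS specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Feature:
    source: str  # annotation server that supplied the feature (hypothetical name)
    entry: str   # entry known to the reference server, e.g. a contig or BAC
    start: int   # 1-based inclusive start on the entry (assumed convention)
    end: int     # 1-based inclusive end on the entry
    kind: str    # feature type, e.g. "exon"


# An annotation server is modelled as a callable: (entry, start, end) -> features.
AnnotationServer = Callable[[str, int, int], List[Feature]]


def make_server(name: str, records: List[Tuple[str, int, int, str]]) -> AnnotationServer:
    """Build a stand-in annotation server from (entry, start, end, kind) records."""
    def query(entry: str, start: int, end: int) -> List[Feature]:
        # Return every feature on the entry that overlaps the requested region.
        return [
            Feature(name, e, s, t, k)
            for (e, s, t, k) in records
            if e == entry and s <= end and t >= start
        ]
    return query


def integrate(servers: Dict[str, AnnotationServer],
              entry: str, start: int, end: int) -> List[Feature]:
    """Query each server and merge the results, ordered by position.

    Overlapping or contradictory features from different servers are all
    kept, so that the user can compare them side by side.
    """
    merged: List[Feature] = []
    for name in sorted(servers):
        merged.extend(servers[name](entry, start, end))
    merged.sort(key=lambda f: (f.start, f.end, f.source))
    return merged


# Two hypothetical centers annotate the same region slightly differently.
center_a = make_server("centerA", [("contig1", 100, 200, "exon"),
                                   ("contig1", 900, 950, "exon")])
center_b = make_server("centerB", [("contig1", 120, 210, "exon"),
                                   ("contig2", 1, 50, "polyA_signal")])

result = integrate({"centerA": center_a, "centerB": center_b}, "contig1", 50, 300)
for feature in result:
    print(feature.source, feature.kind, feature.start, feature.end)
```

Both exon predictions survive the merge. The sketch deliberately leaves any judgment about which center is correct to the researcher viewing the display.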
The key development that is necessary for a successful distributed annotation system is the adoption of a standard format to describe sequence features. While almost any one of the existing standards could be adapted for this purpose, certain characteristics are very desirable:
- Handling of multiple levels of relative coordinates
In an ideal world, the genome would be finished to the base pair, and we would be able to refer unambiguously to an annotation based on its position from the top of the chromosome. This will not happen for a very long time. For the foreseeable future, the genome will consist of multiple segments of high confidence, related to one another by mapping information of lower confidence. In order to handle annotations in this dynamic and changeable environment, the format must support relative coordinates in which annotations are related to arbitrary hierarchical landmarks. For example, a "clone end" annotation may be related to the start of a contig, an "mRNA" annotation may be related to the clone end, and an "exon" annotation may be related to the start of the mRNA.
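One way to picture hierarchical relative coordinates is as a chain of landmarks, each positioned relative to its parent. The sketch below walks that chain to translate a position into the coordinates of the top-level landmark. It assumes 1-based positions and forward-strand features only, and the landmark names and offsets are invented for illustration.

```python
from typing import Dict, Optional, Tuple

# landmark name -> (parent landmark or None, 1-based start relative to the parent)
Landmarks = Dict[str, Tuple[Optional[str], int]]


def to_root_coordinate(landmarks: Landmarks, name: str,
                       position: int = 1) -> Tuple[str, int]:
    """Translate a 1-based position on `name` into root-landmark coordinates.

    Assumes every landmark lies on the forward strand of its parent.
    """
    seen = set()
    while True:
        if name in seen:
            raise ValueError(f"cycle in landmark hierarchy at {name!r}")
        seen.add(name)
        parent, start = landmarks[name]
        if parent is None:
            # Reached the top-level landmark; the position is already in its frame.
            return name, position
        # Shift into the parent's frame: position 1 on the child maps to `start`.
        position = start + position - 1
        name = parent


# Hypothetical hierarchy mirroring the example in the text:
# exon -> mRNA -> clone end -> contig.
landmarks: Landmarks = {
    "contig1": (None, 1),
    "clone_end_1": ("contig1", 5000),
    "mRNA_1": ("clone_end_1", 201),
    "exon_1": ("mRNA_1", 51),
}

print(to_root_coordinate(landmarks, "exon_1"))      # first base of the exon
print(to_root_coordinate(landmarks, "exon_1", 10))  # tenth base of the exon
```

Because only the link between a landmark and its parent is stored, a change to the lower-confidence mapping information (for example, moving the clone end on the contig) requires updating a single offset rather than every annotation beneath it.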
- Easily generated and parsed
Experience has shown that it is difficult to convince groups to adopt complex and sophisticated data formats. For this reason, a "lowest common denominator" format is desirable, even if it sacrifices some of the expressiveness of the more sophisticated formats. A human-readable format, such as tab-delimited tables, XML, or even ".ace" format, is also desirable.
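To give a sense of how little code a "lowest common denominator" format demands, here is a minimal sketch that parses and regenerates a tab-delimited feature table using only the Python standard library. The five-column layout (`reference`, `start`, `end`, `type`, `note`) and the `#` comment convention are assumptions made for this example, not a format defined by the proposal.

```python
import csv
import io
from typing import Dict, List, Union

# Hypothetical column layout for one feature per line.
COLUMNS = ("reference", "start", "end", "type", "note")

Record = Dict[str, Union[str, int]]


def parse_features(text: str) -> List[Record]:
    """Parse tab-delimited feature lines, skipping blank and '#' comment lines."""
    records: List[Record] = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        record: Record = dict(zip(COLUMNS, fields))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        records.append(record)
    return records


def format_features(records: List[Record]) -> str:
    """Write records back out, one tab-delimited line per feature."""
    return "".join(
        "\t".join(str(record[column]) for column in COLUMNS) + "\n"
        for record in records
    )


sample = (
    "# reference\tstart\tend\ttype\tnote\n"
    "contig1\t100\t200\texon\tpredicted\n"
    "mRNA_1\t1\t50\texon\trelative to mRNA start\n"
)

parsed = parse_features(sample)
print(parsed[0])
print(format_features(parsed), end="")
```

The same table can be read by eye, edited in a text editor, or produced by a short script in almost any language, which is the property the paragraph above argues for.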
- Extensibility
Any format must be extensible to allow for new types of annotations. Specifically, we feel that it is desirable to create a category of annotation that has to do with the availability of experimental data concerning the region of interest. For example, the format should allow a researcher to note the presence of RNAi results overlapping the region of interest. The format should also provide a mechanism for pointing the researcher to a location where he or she can get more information about a selected annotation. In the ACeDB-based system, each annotation contains a pointer into an ACeDB entry somewhere on the Internet. This entry is in turn linked to related biological and experimental information.
- Functional groupings of annotations
To further enhance the extensibility of the format, it is desirable to group specific annotations into functional categories rather than maintaining an unsorted "laundry list" of feature types. For example, splice sites, polyA signals, introns and exons are all annotations having to do with a generic "mRNA" category, while clone ends, primer pairs, and hybridization probes are "structural" features. Grouping annotations into conceptual categories makes the data more manageable, and facilitates formulating biologically relevant queries on the annotation servers.
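The grouping idea can be illustrated with a small lookup from feature type to functional category, using the two categories named above. The type spellings, the `uncategorized` fallback, and the helper names are assumptions for this sketch only.

```python
from typing import Dict, List, Tuple

# (feature type, start, end) on some entry; coordinates are illustrative.
FeatureRow = Tuple[str, int, int]

# Feature types from the text, mapped to their functional category.
CATEGORY_OF: Dict[str, str] = {
    "splice_site": "mRNA",
    "polyA_signal": "mRNA",
    "intron": "mRNA",
    "exon": "mRNA",
    "clone_end": "structural",
    "primer_pair": "structural",
    "hybridization_probe": "structural",
}


def group_by_category(features: List[FeatureRow]) -> Dict[str, List[FeatureRow]]:
    """Bucket features by functional category; unknown types are kept, not dropped."""
    grouped: Dict[str, List[FeatureRow]] = {}
    for row in features:
        category = CATEGORY_OF.get(row[0], "uncategorized")
        grouped.setdefault(category, []).append(row)
    return grouped


def select_category(features: List[FeatureRow], category: str) -> List[FeatureRow]:
    """Answer a query such as 'all mRNA-related features in this region'."""
    return group_by_category(features).get(category, [])


features: List[FeatureRow] = [
    ("exon", 100, 200),
    ("clone_end", 1, 1),
    ("intron", 201, 350),
    ("primer_pair", 400, 420),
    ("RNAi_result", 150, 300),  # a type added later, not yet categorized
]

print(select_category(features, "mRNA"))
print(sorted(group_by_category(features)))
```

A new feature type needs only one new entry in the lookup table to become queryable by category, which is how grouping supports extensibility.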
Additional background information on DAS can be found in the following project proposal.
The DAS specification describes a simple client/server system that satisfies many of these requirements.