The distributed annotation system (DAS) relies on there being a common "reference sequence" on which to base annotations. The reference sequence consists of a set of "entry points" into the sequence, and the lengths of each entry point. The identity of an entry point will vary from genome to genome. For some genome projects, entry points correspond to entire chromosomes. For others, entry points may be a series of contigs.
The entry points describe the top level items on the reference sequence map. It is possible for each entry point to have substructure, basically a series of subsequences (components) and their start and end points. This structure is recursive. Each annotation is unambiguously located by providing its position as the start and stop positions relative to a "reference sequence." The reference sequence can be one of the entry points, or any of the subsequences within the entry point.
To give a concrete example, the C. elegans reference map consists of six chromosome-length entry points. Each chromosome is formed from several contigs called "superlinks", and each superlink contains one or more smaller contigs called "links". Links in turn are composed of one or more fully-sequenced clones. One could refer to an annotation by specifying its start or stop positions in clone, link, superlink, or chromosome coordinates. The distributed annotation system automatically converts any coordinate system into any other. Because coordinates within clones are more stable to revisions than coordinates within links or chromosomes, it is recommended that annotation coordinates be stored relative to the smallest sequencing unit.
The hierarchy is extensible. If the C. elegans gene predictions were stable, it would make sense to store certain annotations, such as the positions of exons, relative to the transcriptional unit.
Reference and Annotation Servers
The DAS consists of a reference sequence server, and one or more annotation servers.
Annotation servers are specialized for returning lists of annotations across a certain region of the genome. Each annotation is anchored to the genome map by way of a start and stop position relative to one of the reference subsequences. Annotations have an ID that is unique to the server and a structured description that describes its nature and attributes. Annotations may also be associated with Web URLs that provide additional human readable information about the annotation.
Annotations have types, methods and categories. The annotation type is selected from a list of types that have biological significance, and correspond roughly to EMBL/GenBank feature table tags. Examples of annotation types include "exon", "intron", "CDS" and "splice3." The annotation method is intended to describe how the annotated feature was discovered, and may include a reference to a software program. The annotation category is a broad functional category that can be used to filter, group and sort annotations. "Homology", "variation" and "transcribed" are all valid categories. The existence of these categories allows researchers to add new annotation types if the existing list is inadequate without entirely losing all semantic value. The Annotation Categories section contains a list of the annotation types in use in the C. elegans project.
It is intended that larger annotation servers provide pointers to human-readable data that describes its types, methods and categories in more detail. Another optional feature of annotation servers is the ability to provide hints to clients on how the annotations should be rendered visually. This is done by returning a XML "stylesheet".
The reference sequence server is an annotation server that provides the following additional services:
- Given a reference sequence id, it can return the raw DNA of that sequence.
- Given a reference sequence id, it can return annotations of the category "component". Component annotations describe how the sequence is assembled from smaller parts into large parts from the top down.
- Given a reference sequence id, it can return annotations of the category "supercomponent". Component parent annotations describe the assembly of the sequence from the bottom up.
Although the servers are conceptually divided between reference servers and annotation servers, there is in fact no key difference between them. A single server can provide both reference sequence information and annotation information. The main functional difference is that the reference sequence server is required to serve the sequence map and the raw DNA, while annotation servers have no such requirement.
The DAS is Web-based. Clients query the reference and annotation servers by sending a formatted URL request to the server. This request must follow the conventions of the HTTP/1.0 protocol (see RFC2616. Servers process the request and return a response in the form of a formatted XML document (see W3C Extensible Markup Language).
All DAS requests take the form of a URL. Each URL has a site-specific prefix, followed by a standardized path and query string. The standardized path begins with the string /das. This is followed by URL components containing the data source name and a command. For example:
http://www.wormbase.org/db/das/elegans/features?segment=CHROMOSOME_I:1000,2000 ^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^ ^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ site-specific prefix das data command arguments src
In this case, the site-specific prefix is http://www.wormbase.org/db. The request begins with the standardized path /das, and the data source, in this case /elegans. This is followed by the command /features, which requests a list of features, and a query string providing named arguments to the /features command.
The data source component allows a single server to provide information on several genomes.
More information on the format of the request and the various available commands is given below.
The query string portion of the request (the "?" symbol rightward) can be POSTed to the URL following conventional HTTP standards. Since some queries can be quite large, this is the recommended way of argument passing.
Warning: The request may be replaced with a SOAP-style XML-encapsulated document in future versions of this specification.
The response from the server to the client consists of a standard HTTP header with DAS status information within that header followed optionally by an XML file that contains the answer to the query. The DAS status portion of the header consists of two lines. The first is X-DAS-Version and gives the current protocol version number, currently DAS/1.0. The second line is X-DAS-Status and contains a three digit status code which indicates the outcome of the request.
Here is an example HTTP header: (provided by Web server)
HTTP/1.1 200 OK Date: Sun, 12 Mar 2000 16:13:51 GMT Server: Apache/1.3.6 (Unix) mod_perl/1.19 Last-Modified: Fri, 18 Feb 2000 20:57:52 GMT Connection: close Content-Type: text/plain X-DAS-Version: DAS/1.5 X-DAS-Status: 200 X-DAS-Capabilities: error-segment/1.0; unknown-segment/1.0; unknown-feature/1.0; ... data follows...
The defined status codes are listed in Table 1.
|200||OK, data follows|
|400||Bad command (command not recognized)|
|401||Bad data source (data source unknown)|
|402||Bad command arguments (arguments invalid)|
|403||Bad reference object (reference sequence unknown)|
|404||Bad stylesheet (requested stylesheet unknown)|
|405||Coordinate error (sequence coordinate is out of bounds/invalid)|
|500||Server error, not otherwise specified|
The HTTP/1.0 protocol allows web clients to request byte-level compression of the response by sending the HTTP header Accept-Encoding header. Web servers that are capable of it can reply with a Content-transfer-encoding header and a compressed body. Implementors of DAS clients and servers may wish to implement this HTTP feature.
|New in version 1.5|
|The X-Das-Capabilities header provides an extensible list of the capabilities that the server provides. This can be used by those writing experimental extensions to DAS to flag clients that those extensions are available. Capabilities have the form CapabilityName/Version and are separated by semicolon, space, as in "capabilityA/1.0; capabilityB/1.4; capabilityC/1.0". The following standard capabilities are present in the DAS/1.5 protocol: