Notes from the DAS/2 teleconference for the code sprint, 6 Feb 2006

$Id: das2-teleconf-2006-02-06.txt,v 1.2 2006/02/06 19:57:05 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Andreas Prlic, Thomas Down, Roy
  Sweden: Andrew Dalke
  UC Berkeley: Nomi Harris
  UCLA: Allen Day

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

Gregg's topics for discussion:
  * Status report
  * DAS/2 XML - valid or not valid?
  * CATEGORY elements -- constructing query URLs
  * MAINTAINER information
  * Use of xml:base
  * update on feature properties - searching, etc.

Status Reports - what people are working on for the code sprint
------------------------------------------------------------

andrew
- getting folks up to speed on the spec changes, what he wrote.
- getting a feel for ensembl schema.
- change today: time zone specification b/c td's java time lib did
  something different than iso did.

aday: tag & branch?
gh: no branch, maybe tag
ad: tagging probably not necessary
gh: brings up a related issue: what is our mechanism for versioning
    - client & spec to understand which version of the spec they
    are/should be implementing - can talk about it later during the
    xml validation issue discussion

ap: [missed it -- sorry!]

td: java om, feature xml done, can read and write.

roy: zmap das2 client, read/write das2, written in C. working with ed
     griffith who's not available this week.
     currently just a reader. from james gilbert, based on fmap from Acedb

gh: updating client and server (mostly client). top down syncing in
    parallel, one command at a time. sources request is working on
    both sides. will start w/ allen's server today, doing gh's sources
    query against allen's server. segments and types today.

nh: apollo das2 client. reads das2 xml from andrew's example, writes
    out features in das2, now working on get, testing with server.

sc: affy das2 server stuff. streamlining updating it with feature data
    from UCSC. also working on updating exon array data for use in IGB
    client. working w/ gregg on other server-related work.
gh: graph data as well.

ee: working on igb client. talk w/ gregg later to get specifics.
gh: lots of ui stuff

Topic: xml validation
---------------------
ad: dtd's don't support namespaces, so we can't support dtds
gh: not that simple. where do we add namespaces?
ad: schemas have ns's testing....
gh: concern #1 is one of perception. don't like telling people we
    don't have valid xml
ad: only means supports the dtd, not in human sense.
gh: it's one of perception
td: self-contained document + validation
gh: getting rid of doctype declaration is an issue of versioning. how
    will client know which version of spec it's supposed to be
    implementing? need to deal with spec crawl. The only way i'm aware
    of is via looking at dtd pointer changing.
gh: not worried about new categories, but changing things like
    optional vs req'd attributes/elements.
ad: content-type contains version
td: or content negotiation
ap: xml schema validator at w3c.org. can use that and claim it is
    valid. can upload your files, push a button.
ad: I have an extension of properties with arbitrary binary data vs
    text vs href. this is ok with relaxng, not by xsd.
ad: we could say what is valid das2 since we're the arbiters of what
    is a valid das xml document. e.g., well-formed, validates against
    the rng schemas
gh: the rng we now have allows arbitrary xml?
ad: yes.
    can say there are arbitrary elements under some node. checked in
    as file named common.rnc
gh: ok, getting rid of requirement for doctype declaration. any
    versioning is done via content-type
gh: if we don't do content neg, a sources query goes out, whatever
    version that the server supports comes back. this will be the
    latest version of the spec the server supports.
ad: for backwards compatibility that won't be needed. extensibility
    will be sufficient for a few years.
gh: don't believe it.
td: spec is churning fast now. there'll be less churn once there are
    impls.
gh: there were impls 3 or 4 mos ago (allen, gregg). so there has been
    plenty of churn even with impls. so we'll need versioning, ok on
    content-type.
aday: we definitely need versioning. need it now. also want a tagged
    version we can work at same time.
ad: content-type-xdas;version=1.1
    in general not the right solution (not general purpose), but for
    this case, makes sense.
aday: can impl, header says 1.1
gh/ad: contents are a subset of the specification. so it's tied to a
    version of the rng schema.
ad: the tag will be the cvs revision #
gh: this isn't temporary. there will not be a time when we are not
    generating churn.
ad: believes this is temporary, won't have to have it long-term
aday: no mechanism for it now.
ad: need a way to turn it into meaning. agreement on what string
    means which version of a program.
nh: second gregg. will always be an issue. ad says it's not good
    long-term, maybe we should come up with it.
gh: we have some basis to go forward.

[A] das/2 server will specify spec version via content-type-xdas;version=X.X

Topic: category elements, how to construct a query url
------------------------------------------------------
ad: what is syntax of string used to specify ontology? SO:?
aday: attribute for it
gh: ontol term is a uri
aday: type element has ontology
gh: id of type is not nec an ontol term
ad: the attrib of feat type, ontol=something
gh: that's a uri, abs or rel, pointing to a frag in so/fa ontol
ad: can't find how this should look. said SO:0000001. that should be
    a uri?
gh: yes. in types xml that's returned, id and ontol are uri's. a
    server will pick one for its xml base. the other will have to be
    a full uri.
ad: how do diff clients know a given term corresponds to what term in
    the ontol?
gh: they will have to understand sofa/so.
ad: do they have persistent ids?
gh: my understanding is that they can use fragment notation for a
    stable url for the term
aday: ontol docs aren't xml, no anchors for pointing to a fragment.
    they're their own format. nervous about building dependency on
    fragment record uris into our system
gh: good point. would be happier if it was recast as xml
aday: is now pointing to an xml document for ontology nodes
ad: happier if we could use "SO:xxx" i.e., a urn
gh: would like a re-cast as xml document, hosted at so/sofa website.
    that xml would be like a std ontology representation so you could
    extend it. so someone could point to an extension of it.

Category elements -- constructing query URLs
--------------------------------------------
gh: andreas' point (email): query id attribute, constructing these out
    of relative uri, or based on base uri. agree with andreas: we know
    what those will be. for clarity of spec, we should specify: here's
    base uri, here's how you construct the segments query, etc.
ad: trouble for segments - could be on ref server
gh: doubt that people will impl this way. will be specific to server
    and will be related to everyone else's notion of chromosomes and
    assemblies.
ad: where does the distributed nature of das come from? ref server
gh: das/1: ref server has residues to serve, regions (entry pts)
    served up by everyone. this was the notion of ref vs non-ref
    server to carry forward.
    non-ref server still serves up segments. will have segments in its
    reference space. reference would be genome assembly version +
    organism. sufficient to globally identify it.
ap: had discussions about this. query id
td: issue comes from seqs being urls rather than opaque ids in a ns
    defined by coord system. have a set of servers that share common
    coord syst. then a seq identified by stringx on one server is same
    as on the other server. the remaining q: server that doesn't want
    to serve up seqs, what urls does it use? can it use an opaque seq
    name that is known by that name on the ref server?
gh: restating concerns here: using query string to construct uri's
    1. confusion: arbitrary uri means more confusing spec, and how to
       implement it (can't just say /segment, but 'whatever is pointed
       at by such and such uri')
    2. size of documents. right now, can use same xml:base for
       features document, can make feat ids and location id relative
       to it, nice and short. if seg is on other server, need to
       expand one of the ids. compresses well, but that will take
       longer than transmission. this is only for features xml.
    can use coords or assembly info to determine identity between
    urls. want a defined ns.
ad: you want a way to say: these are relative urls to a base url for
    that data type. so that this type url is relative to some base
    url for types, similar for segments, features.
gh: we have this now, can be relative or absolute
ad: there is a default xml base like thing: one for type, segment,
    features. so you could have relative ids to those bases.
gh: possibly, but not ideal. It's better to use a std xml base for
    all of them. each server has its own unique uris for segments.
    I'm proposing that we decouple segments from residues, so that
    having segments doesn't mean we can serve residues.
    reasoning:
    - this leads to smaller xml docs
    - simplifies the spec if we didn't have to construct query ids
      from category element
    would rather specify the string that's appended in the spec.
sc: might be able to deal with this issue by adding structure to the
    document in order to add different xml:bases for different data
    types. e.g., use different parent elements that could define
    their own xml:bases, one for types, segments, and features. might
    complicate the spec tho.
ad: a single genome has the same types across all dbs.
gh: across servers, dangerous.
ad/td: globally unique ids, could have everything in the same
    directory.
td: can we just use seq/name, type/name. i.e., codifying what the
    convention now is.
ad: name is put at end of base url. a feature document may give
    types, segments, other features.
td: just use simple strings, not urls.
gh: std uri syntax isn't important, but a std query mechanism to get
    all of these is. some uri you put a '/types' on or a '/segments'.
ad: you have this right now.
gh: but it's only defined for a server, not the whole spec. there's
    nowhere in the spec that says this. confusing for people
    reading/implementing the spec.
ap: If you make it free text, you don't know what to put for a given
    server?
ad: you get a document
ap: I already know the server, not necessarily a document.
ad: taking out the mention of any hierarchy, just refer to things as
    feat query url.

[note taker is having trouble following the thread of this discussion.]

gh: let's sleep on it, discuss tomorrow, vote then.
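
Illustration of the two ideas under discussion above: a fixed suffix
('/types', '/segments') appended to a base uri, and relative ids
resolved against a single xml:base. This is only a minimal Python
sketch. The base uri, the host names, and the helper names query_url
and resolve_id are made up for illustration and are not taken from
the DAS/2 spec or from any server mentioned in the call.

```python
from urllib.parse import urljoin

# Hypothetical base uri for one versioned source (assumption, not a real server).
BASE = "http://example.org/das2/hg17/"


def query_url(base, category):
    """Fixed-suffix convention: append the category name to the base uri."""
    return urljoin(base, category)


def resolve_id(xml_base, ident):
    """Resolve an id against xml:base; absolute uris pass through unchanged."""
    return urljoin(xml_base, ident)


# Query urls a client could derive without reading a CATEGORY element.
print(query_url(BASE, "types"))
print(query_url(BASE, "segments"))

# A short relative id when the segment lives on the same server ...
print(resolve_id(BASE, "segment/chr1"))

# ... and a full uri when the segment lives on a separate reference server.
print(resolve_id(BASE, "http://ref.example.org/das2/hg17/segment/chr1"))
```

The last two calls show the document-size concern gh raised: ids on
the same server stay short, while ids pointing at another server have
to be written out in full.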