Notes from DAS/2 code sprint #2, day three, 15 Mar 2006

$Id: das2-teleconf-2006-03-15.txt,v 1.1 2006/03/16 20:45:35 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Thomas Down, Andreas Prlic
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

[Notetaker: joining 10 min into the discussion]

ls: how does synonym business work?
ad: if server has access to data...
ls: we ask server for the global id, uses same global id for segments, and uses same global id for the sequence.
gh: to do this in the capabilities for annot server, the global id for segments query points to reference server.
ls: if the local machine, the current server, has sequence capabilities, then it passes global id for segments to current server and it gets the sequence. if it doesn't have that capability, then we need to figure out a way for it to get the sequence. the easiest way to do that would be to resolve that url and fetch it. I'm open to any suggestion. I don't see how this uri/synonym is getting us any closer to being able to find the server where sequence can be fetched. The synonym isn't always a fetchable thing.
ad: syn is a global id
ad: look at the uri for the segment and fetch it from there
ls: could be a remote url.
gh: segments query is only thing that gives segment url.
    segments capabilities for the annot server should point
ls: break apart segments into: id=a string, then have an attribute seq_url, when fetched returns the seq. returns the bases.
ad: is that what's there already?
ls: no, uri is an id
ad: every url is an id, but it's up to whim of the server
ls: i don't want people to think it's for an id. want an agreed upon uri identifier, then optionally have a url.
    turn synonym into uri, turn uri into resolver.
    make uri required, bases not required.
ad: additional constraint is 'agreed upon'. what about a group that starts a new sequencing project? There is no globally known uri for it yet.
ls: they just create their own ids
td: the natural authority is the creator of the assembly.
gh: ncbi won't do it. they don't have a das server, unlikely to.
ls: can point to genome assembly. can create a url that will return bases from ncbi in a supported format. this approach will disentangle issue of resolvable vs non-resolvable, local vs non-local segment ids and how to get segment dna.
gh: I think this will work.
ad: 'this' changing key names?
ls: key semantics:
    uri is required, global identifier
    sequence is an optional pointer
gh: you say that for feat xml, the id for seq will be the globally agreed on id.
ls: yes
ad: if you don't have a local copy, if you have ability to map global identifiers, then you know what it is from the coordinates. there are two ways to specify coordinates: coordinates and segments
ad: if you just need the segments and some identifier. only when you need to do an overlay with someone else that you need the coords.
gh: no, coords don't say anything about ids of coord (?)
gh: if we do it the way lincoln proposed, then the logical way to relate those is that the segments capabilities points to ref server.
ad: when feat returns a location is it in global or local space?
gh: lincoln - global space
ls: every annot server will know length of its landmarks (chrms). some people will not want to be served dna, they will point somewhere else where to get the dna. There will be many places to get dna for a given global id, they choose one they like.
ls: feature locations are given in global id
ad: this changes the way it's been working. xml:base issues
ls: I know.
gh: if base of sequence and base of features are different, the xml will get bigger.
ls: so an argument for having local ids is so you can make location string shorter.
gh: yes.
ls: probably not worth it
ad: also makes it easier to set up a basic server. if you want to overlay them, yes you do.
ls: you can always set up a local server if you
gh: segments response local and global id as we talked about yesterday (which one feature location is relative to)
gh: if the only way to overlay for a client to know things are in the same coord system is segid=xxxx and globalid=yyyy, how much harder is it for server to use global ids.
ls: server can have configuration file to know where its global ids are coming from
aday: would need to think about it more.
ad: who will set up these identifiers (yeast, human)
ls: I'll do it for model org databases, I will specify segments, and their dna fetchers and will look up their lengths.
gh: versions?
ls: most recent. community can then keep it up to date. I bet ensembl will be happy to generate this file automatically with every build (for vertebrates)
ad: local id uri, and a bunch of synonyms. People will set up own server not referencing a global system.
ls: then client would do a closure over all systems. imagine three servers:
    server-a says here is my segment
    server-b says it can be b or c
    server-c says it can be c or a
    so you have to do a join over all servers
gh: not encourage people to do that with local seq ids, encourage people to use. need a global referencing system to say this uri is same as that uri.
ad: bad logic for the web.
    If one is wrong, could be a problem
td: (proposal - based on genomic coord alignments)
ad: that says only alignable things are the same.
ad: don't think it will work, they will already have local servers
gh: what about 'the stick': people who want to register their server with central registry can only do so if they use global ids for their segments.
ls, td: fine
ad: if they've been working for a while in house, they would have a big effort to retrofit their system to comply. just won't do.
ls: in draft 3, where's assembly info?
ad: same as before. ask segments for agp format. draft not complete.
gh: the thing that ids which assembly you're on is the coordinates element (authority, taxonomy, ...)
ls: authority is a recognized, globally unique organization. Should it be a uri?
ad: authority and version is human visible so people can search by it.
ls: fine.
gh: can invoke the 'stick' idea here: if you're trying to register something on same genome assembly, then registry can check your segments to verify they are agreed upon.
ls: taxon, source, authority, version all must match
ad: also an id
ap: we discussed in email
ad: the only stuff that is complete is in the ucla subdir.
ls: the examples are definitive
ad: yes, unless we change things today.
ls: what if taxon, source, version match but uri doesn't? registry gets submission. makes a segments request on submitter, if it gets a list of same segment identifiers, it accepts it. what if it gets a subset?
gh: ok
ls: superset is not ok.
aday: why?
gh: if you allow subset and superset, you can have everything.
aday: use case: bacteria with extra plasmid identifier.
nh: signing off. will be at affy tomorrow.
ls: you would have to create your own coord system.
gh: could argue with maintainer to add it.
ls: can you have multiple coordinates in a given assembly?
aday: proposal: make coords an attribute of the segment. could keep your segment references local.
ls: we shouldn't give people ways to create new names.
    human chr1 ncbi build 35 should be something that everybody can agree on.
gh: then we wouldn't allow allen's use case where someone wants a superset of what's in reference?
ls: add new coord tag to source version entry, says I'm creating a superset consisting of coords from ref 1, 2, 3, any of these can be a new namespace that I set up.
gh: how do you know which ones come from where? right now there's no way to get coord for a segment.
ad: can as of yesterday afternoon.
ls: to indicate which segments come from which auth. put coord id into segments tag.
aday: thank you!
ad: alternative proposal - multiple segments
    use case: when you have scaffolds or chromosomes, or mouse and yeast
ls: say you want human mouse scaffolds + chrms, and human chrms.
    three diff coords tags in the sources document.
    each one gives auth, taxon, etc.
    when client goes to get segments, it will get human chromosomes, mouse chrms, and mouse scaffolds, in one big list, each will point back to coord it got in features request.
gh: knowing what coordinates doesn't tell you global id for segment
aday: ok.
gh: multiple segments elements vs mult coords in a segment work for me.
ad: what does a client do
gh: ...
ls: three types of entry points, hu chrms, mo chrms, mo scaffolds, now tell me what you want to start browsing. human readable. scaffold on mouse with name xxx from two
ad: displaying all together vs one or the other or the other.
ee: affymetrix use case in igb. [probe
gh: doesn't seem to matter
aday: the tag values are easier to implement
td: not a big difference to me
gh: drawing on whiteboard...
ls: let's rename das to distributed annotation research network. then we can say "darn1, darn2"!
ad: gregg's request for search to find everything identical (start and end are same)
td: if you have contained and inside, you can do identical with an and operation.
ls: doesn't make server any more complicated, for completeness you may want to do that.
ad: how about includes 1-5000 and excludes ...
    some of this is aesthetic.
ls: overlaps, contains, contained-in have good use cases for. exact match - maybe searching for curated exons that exactly match predicted.

[Lincoln has to leave.]

gh: drawing options for segments and coordinate systems.
    [whether you put a coords tag per segment, or ome capabilities one for each coord system]
    allen's approach - one query with filter or multiple fetches
aday: uniprot example
gh: separate segments query.
ap: can we leave it out and add later if necessary?
ad: these are things that haven't been discussed in last two years
aday: uri
ad: xml namespace issue - what do we call it (see email)
gh: you pick it
ad: required syntax for entry points /das/source
gh: recommended, but not required
ad: lincoln was only one who felt strongly about it being required, and he's not here.
gh: feature xml, every feature can have multiple locations.
    features can represent alignments (collapsed alignment tag into feature tag)
td: like it
gh: naive user - given a feat with multiple locations on genome, represent as multiple locations, or parent child relations
td: don't see as a problem. using parent-child you have things to say about child features specific to them
gh: genscan prediction, a problem: one server can serve them up as parent child or as multiple locations on parent.
    four child exons in one case
    four diff locations in other case
    problem is with feat filters. if you do an overlaps query and any children meet the condition, you have to return the parent as well and its parent on up. agreed?
ad: yes
gh: works fine for parent child, but for multiple location situation, if inside query fully contains only two exons, do you return parent?
td: I'd assume inside query would return both. as long as one exon is inside the region, the parent is returned. define inside as applying to any level.
gh: so even though the transcript is not inside, you still return it?
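[Illustration, not from the meeting] The range filters discussed above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not code from any DAS/2 server: the Feature class, the function names, and the half-open [start, end) coordinate convention are all hypothetical. It shows 'identical' built as 'inside' AND 'contains' (td's point), and the one rule everyone agreed on: an overlaps hit on a child also returns its parent, and that parent's parent, on up. What an inside query should return for parents was still open at this point, so it is not encoded here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Feature:
    # Hypothetical record; coordinates are assumed half-open: [start, end).
    id: str
    start: int
    end: int
    parent: Optional[str] = None  # id of the parent feature, if any


def overlaps(f: Feature, start: int, end: int) -> bool:
    """Feature shares at least one base with the query range."""
    return f.start < end and start < f.end


def inside(f: Feature, start: int, end: int) -> bool:
    """Feature lies entirely within the query range."""
    return start <= f.start and f.end <= end


def contains(f: Feature, start: int, end: int) -> bool:
    """Feature entirely covers the query range."""
    return f.start <= start and end <= f.end


def identical(f: Feature, start: int, end: int) -> bool:
    """Exact match, expressed as inside AND contains."""
    return inside(f, start, end) and contains(f, start, end)


def overlaps_query(features: Dict[str, Feature], start: int, end: int) -> Set[str]:
    """Ids of overlapping features plus all of their ancestors."""
    hits: Set[str] = set()
    for f in features.values():
        if not overlaps(f, start, end):
            continue
        node: Optional[Feature] = f
        # Walk up the parent chain; stop at the root or at a feature
        # already collected (its ancestors were added on an earlier walk).
        while node is not None and node.id not in hits:
            hits.add(node.id)
            node = features.get(node.parent) if node.parent else None
    return hits


feats = {
    "tx": Feature("tx", 100, 900),
    "e1": Feature("e1", 100, 200, parent="tx"),
    "e2": Feature("e2", 300, 400, parent="tx"),
    # A child outside its parent's bounds, which the notes say is not ruled out.
    "out": Feature("out", 1000, 1100, parent="tx"),
}
print(sorted(overlaps_query(feats, 1050, 1060)))
```

In the example data the query range 1050-1060 touches only the child 'out', yet the parent 'tx' is returned as well, because the rule follows the parent link rather than the parent's own coordinates.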
td: using the get parent-if-get-children rule
gh: rule must apply to all of them, so you don't get transcript since it doesn't meet the inside condition.
aday: multiple locations makes sense - just aligned mult times. human alu feature 100,000s, do you want to create a single feature, or just a single identifier and put it in many different locations.
ee: that is for alignments not parent-child relationship
aday: you consider location as an attribute of the object..
ee: I agree. alu is only one object, but the exon-transcript are different
ad: would someone want to annotate the separate exons differently?
aday: you would split it off
ad: eg blast alignment, hsp is part of the conceptual alignment.
gh: in bioperl, some people will go one path, some go the other path, so we need to figure out how to deal with it. feat filters is clear for parent child relationship.
aday: inside and overlaps
gh: if your overlap query only grazes one child, you return the parent. this is the only one I'm certain about.
gh: we haven't specified that the child is within bounds of parent. with insides, we have a difference of opinion. one exon is within, do you return it?
ad: most clients will be doing overlaps, you are the only one doing insides. what do you want?
gh: the multiple locations muddies the issue. if parent child rule is you only return it if parent is inside (and recursive parent), I've already optimized for that. For multiple locations, I can catch that and handle it. the way I want, the behaviour of mult location will be diff than parent child.
td: for me, the overlaps is the most important thing. Andreas just gets everything.
ad: can we delegate to gregg here for what to do in case of inside.

[A] gregg will write up description for inside query and multiple locations

Status reports
-----------------

gh: updating server. overlaps, insides, types, and each
    good news: latest genome assembly on human on affy server overlaid with allen's server.
    using hardcoded knowledge in igb for assembly id, not coordinates yet.
    with andrew: making sure clients can understand any variants of namespace usage in the xml.
    get client to use more capabilities like links
ad: example data set together, updated schema to latest spec, but forgot cigar thing. update validator to use most recent version of rnc schemas.
gh: even if your server isn't public you can cut and paste into your validator at http://cgi.biodas.org:8080
aday: biopackages up to date with version 200 of spec file. issues for nomi, and gregg. off by one error.
bo: small code refactor in the das server. testing that today.
ee: nothing das related yet, but will. implementing style sheets to get colors for features.
ap: registry ui for upload of a das/2 source. coding for that
gh: what about registry rejecting segment ids if they don't match standard ids for that coord system. sound good to you?
ap: basically yes.
td: not done a great deal
gh: Nomi has been here working on apollo client. we'll hear from her tomorrow.

-----------------------
post teleconf discussion re: using global identifiers for uri

[Notetaker: just a few morsels were captured here.]

ad: most folks i work with get something going locally, then after it's going, hook it up with the rest of the world, integrate with other people. they don't want to revamp their work in order to do that.
gh: slightly in favor with andrew
ad: get what we have now. they are still uri's so it's just an interpretation. will change attributes to be 'uri' and 'reference_uri'
gh: how does it get length of segments?
ad: good idea to have coordinates and segments in the document. add your own track to ensembl, you don't need to give it a segments, just specify coordinates.
gh: seems like it will encourage servers that can only work with particular clients.
ad: what about getting rid of coordinates, just needed by Andreas for registry.
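
[Illustration, not from the meeting] The registry check discussed earlier in the call (taxon, source, authority and version must all match; a submitter's segment list may be the same as or a subset of the reference list, but not a superset) could look roughly like the Python sketch below. Everything named here is an assumption made for illustration: the Coordinates class, the check_submission function, and the sample values are hypothetical and are not part of the registry code or the DAS/2 spec.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Coordinates:
    # Hypothetical stand-in for the coordinates element fields named in the notes.
    taxon: str
    source: str
    authority: str
    version: str


def check_submission(
    ref_coords: Coordinates,
    ref_segments: Iterable[str],
    sub_coords: Coordinates,
    sub_segments: Iterable[str],
) -> Tuple[bool, List[str]]:
    """Return (accepted, unknown segment ids).

    Accepted only if all four coordinate fields match and every submitted
    segment id is already in the reference list (same list or a subset).
    """
    if sub_coords != ref_coords:
        return False, []
    unknown = sorted(set(sub_segments) - set(ref_segments))
    return (not unknown), unknown


# Sample values are made up for the example.
reference = Coordinates("9606", "Chromosome", "NCBI", "35")
reference_segments = ["chr1", "chr2", "chr3"]

print(check_submission(reference, reference_segments, reference, ["chr1", "chr2"]))
print(check_submission(reference, reference_segments, reference, ["chr1", "plasmidX"]))
```

Returning the unknown ids along with the verdict lets a submitter see which segments fall outside the agreed set, which matters for Allen's plasmid use case: such a server would need its own coordinate system or would need the maintainer to add the segment.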