Handle blank BLAT results, and fix BLAT results for some targets #51
sallybg merged 4 commits into mavedb-dev
If a target has only protein-level variants, but the provided target sequence is a nucleotide sequence, translate the nucleotide sequence to an amino acid sequence immediately after metadata ingestion. This change avoids alignment errors that can occur when a target sequence has been codon-optimized for a non-human organism. Since we do not have sufficient metadata to determine whether a target sequence has been codon-optimized, always perform translation when there are no nucleotide-level variants for a target.
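A minimal sketch of the rule described above, not the code in this PR. The function signature, the `"dna"`/`"protein"` labels, and the `has_nucleotide_variants` flag are assumptions made for illustration; the name `patch_target_sequence_type` follows the rename suggested in review. The sketch uses the standard genetic code and keeps stop codons as `*`.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G base order to match the amino acid string below.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence; stop codons are emitted as '*'."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i : i + 3]] for i in range(0, len(dna), 3))


def patch_target_sequence_type(
    sequence: str,
    sequence_type: str,  # assumed labels: "dna" or "protein"
    has_nucleotide_variants: bool,  # hypothetical flag derived from variant records
) -> tuple[str, str]:
    """Return the (sequence, type) pair to use for the mapping job.

    A nucleotide target with no nucleotide-level variants is translated up
    front; every other target is returned unchanged.
    """
    if sequence_type == "dna" and not has_nucleotide_variants:
        return translate(sequence), "protein"
    return sequence, sequence_type
```

Because the patch is applied to a copy of the metadata held by the mapping job, the originally submitted sequence is not modified.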
bencap left a comment
Thanks Sally, code looks good. Left one minor comment that doesn't matter too much.
Question on how this fits into the broader mapping routine: I think we had discussed that this target sequence correction would only happen if the alignment to DNA failed. Is that still true in these changes, or am I misremembering the conversation?
src/dcd_mapping/mavedb_data.py
Outdated
    return _load_scoreset_records(scores_csv, metadata)


def correct_target_sequence_type(
Pretty minor comment, but maybe we can name this patch_target_sequence_type? I think correct has an implication this change is persistent.
Maybe we should keep it though because it's persistent in the context of the mapper. If you think that argument is the relevant scope, then it seems reasonable to keep it as correct.
I agree that patch is the better term here!
@bencap You are right that we originally discussed only patching the target sequence type if BLAT failed! After that discussion, Alan suggested that we could patch it in every case where there are no nucleotide-level variants, because we would eventually translate the target sequence anyway and return the protein sequence in our metadata, and pre-mapped objects would be against the translated target sequence. So this doesn't impact the output directly, although it does change what sequence is used for BLAT. I don't think it's too risky to use the protein sequence for BLAT, and it allows us to save a little time by not attempting a BLAT that will fail. It is also easier, given the code structure, to adjust it right away instead of within the aligner, because 1. we would need to pass variant records to the align function just for this purpose, and 2. we align all targets at once, so we would need to pull out any failing targets, create a new file for them, and then merge them with the previous BLAT results for the ones that worked.
Gotcha, yeah that all makes sense. Yeah let's merge this into the dev branch and then we can also merge the other changes I'm working on into there, bump the version number, and deploy these next week.
This update changes how alignment is performed for some score sets, so bump major version.
Previously, blank BLAT results were not caught immediately after alignment, resulting in an opaque key error.
Catch blank BLAT results for any target and return the error to the client.
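One way to surface a blank result right after alignment, sketched under assumptions: `AlignmentError`, `require_blat_hits`, and the shape of `hits_by_target` (target name mapped to a list of hits) are hypothetical and may not match the mapper's actual types. The point is to fail with a message naming the affected targets instead of letting a later lookup raise a bare key error.

```python
class AlignmentError(Exception):
    """Raised when alignment produces no usable result (hypothetical name)."""


def require_blat_hits(hits_by_target: dict[str, list], target_names: list[str]) -> None:
    """Raise AlignmentError if any expected target has no BLAT hits.

    A target counts as blank when it is absent from the results or when
    its list of hits is empty.
    """
    missing = [name for name in target_names if not hits_by_target.get(name)]
    if missing:
        raise AlignmentError(
            "BLAT returned no hits for target(s): " + ", ".join(missing)
        )
```

Called once after all targets have been aligned, a check like this lets the caller report the failure to the client with the target names attached.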
Some blank BLAT results were previously due to poor alignment of codon-optimized nucleotide target sequences. This occurs when a target sequence was codon-optimized for a non-human organism and then provided as a nucleotide sequence. If no nucleotide-level variants are reported for such a target, translate the target to an amino acid sequence immediately after score set metadata ingestion, and use this sequence as the target sequence for the mapping job.