Show the variants that have problem in error message.#420
Show the variants that have problem in error message.#420EstelleDa merged 31 commits intorelease-2025.2.0from
Conversation
Refactors dataframe validation logic into 3 component files: column.py, dataframe.py, and variant.py. This simplifies the validation structure and logically separates validation function based on the part of the df they operate on.
Refactors most of the test suite to better identify dependency separation problems. Validation tests may now be run with only core (and dev) dependencies installed, and fixtures which operate on server dependencies are conditionally loaded based on the installed modules. With this change, it will be much more straightforward to identify dependency 'leaks', or server dependencies which mistakenly are leaked into validation type code.
…ssing from targets list
This allows the use of the vs-code pytest extension but still prevents the use of external connections. Enabling this socket makes it easier to test within the code editor.
The hgvs package is not able to parse allelic variation (multi-variants denoted by brackets), which are often a key variant string in base editor data. We work around this by: - Parsing the multi-variant into MaveHGVS without any target info to ascertain whether it is syntactically valid - Parsing each subvariant against the provided transcript to ascertain whether it is informationally valid
Adds tests for multi-variant validation for accession based variants. As part of this change, an additional transcript was added to tests genomic based protein variants in addition to just testing nucleotide based variants.
Prior to this, we weren't really using SeqRepo to do transcript resolution (unintentionally). Note that to use SeqRepo in this manner, a new environment variable `HGVS_SEQREPO_DIR` should be set.
| invalid_accessions = {v for v in variants if str(v).split(":")[0] not in targets} | ||
| if invalid_accessions: | ||
| raise ValidationError( | ||
| f"variant column '{column.name}' has invalid accession identifiers; " | ||
| "some accession identifiers present in the score file were not added as targets." | ||
| "Validation errors found:\n" + "\n".join(invalid_accessions)) |
There was a problem hiding this comment.
Thanks for breaking these out, this will be really useful for users I think. I had two suggestions:
- It's true that the prior code created sets, but I would use a list or tuple here now since our goals have changed a little. We'd like to be able to tell the user about every error that we find, and to do that we should preserve variants even when they aren't unique. It would be confusing for the user to receive an error like 'validation errors found: line 1, line 2', only for them to fix those errors and have identical errors in lines 3 and 4.
- I would make use of the
triggerspart of theValidationErrorclass, here is an example:. This class will be used by the UI to display themavedb-api/src/mavedb/lib/validation/dataframe.py
Lines 447 to 451 in cbacd63
triggersas a list of errors automatically. We can combine this with theenumeratefunction to provide granular feedback to the user:
invalid_accessions = {f"accession identifier {str(v).split(":")[0]} from row {idx}, variant {v} not found" for idx, v in enumerate(variants) if str(v).split(":")[0] not in targets}
if invalid_accessions:
raise ValidationError(
f"variant column '{column.name}' has invalid accession identifiers; "
"{len(invalid_accessions)} accession identifiers present in the score file were not added as targets.",
triggers=invalid_accessions
)
Now, we provide a useful top level error message within the validation error itself in addition to helpful individual errors that will help the user fix the overall error. I only commented on this one, but the same principal should apply for the other validation checks as well.
| "Validation errors found:\n" + "\n".join(invalid_accessions)) | ||
|
|
||
| else: | ||
| if len(set(v[:2] for v in variants)) > 1: |
There was a problem hiding this comment.
We should be able to do the same thing as above for these checks as well.
767b0a5 to
6226c93
Compare
|
Thanks Estelle I looked at the changes you made and they looked good. I'm not sure what happened to the branch though. Would you be able to rebase off of |
|
Thanks Ben! No idea why these weird conflicts happened. I have merged release-2025.2.0 to it. |
bencap
left a comment
There was a problem hiding this comment.
Thanks Estelle, looks good besides the one suggestion I noted which was probably a rebasing issue.
Co-authored-by: Benjamin Capodanno <31941502+bencap@users.noreply.github.com>
|
Thanks Ben! I think so. |
Hi @bencap , I choose to merge to your base-editor-data. Feel free to change it to any other branch that you think is better. Mine is based on the base-editor-data branch.