Skip to content

Show the variants that have problem in error message.#420

Merged
EstelleDa merged 31 commits intorelease-2025.2.0from
improve/estelle/350/surfaceValidationErrors
May 21, 2025
Merged

Show the variants that have problem in error message.#420
EstelleDa merged 31 commits intorelease-2025.2.0from
improve/estelle/350/surfaceValidationErrors

Conversation

@EstelleDa
Copy link
Copy Markdown
Member

Hi @bencap , I choose to merge to your base-editor-data. Feel free to change it to any other branch that you think is better. Mine is based on the base-editor-data branch.

bencap and others added 28 commits March 29, 2025 10:57
Refactors dataframe validation logic into 3 component files: column.py, dataframe.py, and variant.py.
This simplifies the validation structure and logically separates validation function based on the part of the df they operate on.
Refactors most of the test suite to better identify dependency separation problems. Validation tests
may now be run with only core (and dev) dependencies installed, and fixtures which operate on server dependencies are conditionally loaded based on the installed modules.
With this change, it will be much more straightforward to identify dependency 'leaks', or server dependencies which mistakenly are leaked into validation type code.
This allows the use of the vs-code pytest extension but still prevents the use of external connections. Enabling this socket makes it easier to test within the code editor.
The hgvs package is not able to parse allelic variation (multi-variants denoted by brackets), which
are often a key variant string in base editor data. We work around this by:
- Parsing the multi-variant into MaveHGVS without any target info to ascertain whether it is syntactically valid
- Parsing each subvariant against the provided transcript to ascertain whether it is informationally valid
Adds tests for multi-variant validation for accession based variants. As part of this change,
an additional transcript was added to tests genomic based protein variants in addition to
just testing nucleotide based variants.
Prior to this, we weren't really using SeqRepo to do transcript resolution (unintentionally). Note that to use SeqRepo in this manner, a new environment variable `HGVS_SEQREPO_DIR` should be set.
@EstelleDa EstelleDa requested a review from bencap April 3, 2025 07:48
@EstelleDa EstelleDa linked an issue Apr 3, 2025 that may be closed by this pull request
Comment on lines +93 to +98
invalid_accessions = {v for v in variants if str(v).split(":")[0] not in targets}
if invalid_accessions:
raise ValidationError(
f"variant column '{column.name}' has invalid accession identifiers; "
"some accession identifiers present in the score file were not added as targets."
"Validation errors found:\n" + "\n".join(invalid_accessions))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking these out, this will be really useful for users I think. I had two suggestions:

  • It's true that the prior code created sets, but I would use a list or tuple here now since our goals have changed a little. We'd like to be able to tell the user about every error that we find, and to do that we should preserve variants even when they aren't unique. It would be confusing for the user to receive an error like 'validation errors found: line 1, line 2', only for them to fix those errors and have identical errors in lines 3 and 4.
  • I would make use of the triggers part of the ValidationError class, here is an example:
    # format and raise an error message that contains all invalid variants
    if len(invalid_variants) > 0:
    raise ValidationError(
    f"encountered {len(invalid_variants)} invalid variant strings.", triggers=invalid_variants
    )
    . This class will be used by the UI to display the triggers as a list of errors automatically. We can combine this with the enumerate function to provide granular feedback to the user:
invalid_accessions = {f"accession identifier {str(v).split(":")[0]} from row {idx}, variant {v} not found" for idx, v in enumerate(variants) if str(v).split(":")[0] not in targets}
if invalid_accessions:
    raise ValidationError(
        f"variant column '{column.name}' has invalid accession identifiers; "
        "{len(invalid_accessions)} accession identifiers present in the score file were not added as targets.",
        triggers=invalid_accessions
    )

Now, we provide a useful top level error message within the validation error itself in addition to helpful individual errors that will help the user fix the overall error. I only commented on this one, but the same principal should apply for the other validation checks as well.

"Validation errors found:\n" + "\n".join(invalid_accessions))

else:
if len(set(v[:2] for v in variants)) > 1:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to do the same thing as above for these checks as well.

@jstone-dev jstone-dev force-pushed the feature/bencap/317/base-editor-data branch from 767b0a5 to 6226c93 Compare May 6, 2025 16:52
Base automatically changed from feature/bencap/317/base-editor-data to release-2025.2.0 May 8, 2025 23:09
@bencap
Copy link
Copy Markdown
Collaborator

bencap commented May 14, 2025

Thanks Estelle I looked at the changes you made and they looked good. I'm not sure what happened to the branch though. Would you be able to rebase off of release-2025.2.0 when you get a chance? After that I can take a final look and we can finish it up.

@EstelleDa
Copy link
Copy Markdown
Member Author

Thanks Ben! No idea why these weird conflicts happened. I have merged release-2025.2.0 to it.

Copy link
Copy Markdown
Collaborator

@bencap bencap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Estelle, looks good besides the one suggestion I noted which was probably a rebasing issue.

Co-authored-by: Benjamin Capodanno <31941502+bencap@users.noreply.github.com>
@EstelleDa
Copy link
Copy Markdown
Member Author

Thanks Ben! I think so.

@EstelleDa EstelleDa merged commit a16e7b3 into release-2025.2.0 May 21, 2025
14 of 16 checks passed
@EstelleDa EstelleDa deleted the improve/estelle/350/surfaceValidationErrors branch May 21, 2025 01:40
@bencap bencap mentioned this pull request Jun 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface Validation Errors as Issues with Specific Lines

2 participants