Show the variants that have problem in error message. by EstelleDa · Pull Request #420 · VariantEffect/mavedb-api

EstelleDa · 2025-04-03T07:48:44Z

Hi @bencap, I chose to merge into your base-editor-data branch, since mine is based on it. Feel free to change the target to any other branch that you think is better.

Refactors dataframe validation logic into 3 component files: column.py, dataframe.py, and variant.py. This simplifies the validation structure and logically separates validation functions based on the part of the df they operate on.

Refactors most of the test suite to better identify dependency separation problems. Validation tests may now be run with only core (and dev) dependencies installed, and fixtures which operate on server dependencies are conditionally loaded based on the installed modules. With this change, it will be much more straightforward to identify dependency 'leaks', or server dependencies which mistakenly are leaked into validation type code.

This allows the use of the vs-code pytest extension but still prevents the use of external connections. Enabling this socket makes it easier to test within the code editor.

The hgvs package is not able to parse allelic variation (multi-variants denoted by brackets), which are often a key variant string in base editor data. We work around this by:
- Parsing the multi-variant into MaveHGVS without any target info to ascertain whether it is syntactically valid
- Parsing each subvariant against the provided transcript to ascertain whether it is informationally valid
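The second step of this workaround can be sketched roughly as follows. The regular expression is a simplified stand-in for the MaveHGVS parser and does not cover the full grammar, and `is_valid` is a hypothetical callback standing in for the per-subvariant transcript check.

```python
import re
from typing import Callable

# Simplified pattern for "<accession>:<prefix>.[sub1;sub2;...]". This is an
# illustrative stand-in for a real MaveHGVS parser, not a complete grammar.
MULTI_VARIANT = re.compile(
    r"^(?P<accession>[^:]+):(?P<prefix>[cgnp])\.\[(?P<body>[^\[\]]+)\]$"
)


def split_multi_variant(variant: str) -> list[str]:
    """Return one fully qualified variant string per sub-variant."""
    match = MULTI_VARIANT.match(variant)
    if match is None:
        # Not a bracketed multi-variant: treat it as a single variant.
        return [variant]
    accession, prefix = match["accession"], match["prefix"]
    return [f"{accession}:{prefix}.{sub}" for sub in match["body"].split(";")]


def invalid_subvariants(variant: str, is_valid: Callable[[str], bool]) -> list[str]:
    """Collect the sub-variants rejected by the supplied (hypothetical) checker."""
    return [sub for sub in split_multi_variant(variant) if not is_valid(sub)]
```

Each returned string has the single-variant form, so it can be handed to a parser that does not understand bracket notation.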

Adds tests for multi-variant validation for accession based variants. As part of this change, an additional transcript was added to test genomic based protein variants in addition to just testing nucleotide based variants.

Prior to this, we weren't really using SeqRepo to do transcript resolution (unintentionally). Note that to use SeqRepo in this manner, a new environment variable `HGVS_SEQREPO_DIR` should be set.
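One way to catch a missing `HGVS_SEQREPO_DIR` early is a small startup check. This is only a sketch: the helper name `resolve_seqrepo_dir` is hypothetical, and it assumes nothing about how the data provider itself reads the variable.

```python
import os
from pathlib import Path


def resolve_seqrepo_dir(environ=None) -> Path:
    """Return the configured SeqRepo directory, or fail with a clear message."""
    environ = os.environ if environ is None else environ
    value = environ.get("HGVS_SEQREPO_DIR")
    if not value:
        raise RuntimeError(
            "HGVS_SEQREPO_DIR is not set; SeqRepo-based transcript resolution "
            "needs it to point at a local SeqRepo snapshot."
        )
    return Path(value)
```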

bencap · 2025-04-04T13:29:13Z

src/mavedb/lib/validation/dataframe/column.py

+        invalid_accessions = {v for v in variants if str(v).split(":")[0] not in targets}
+        if invalid_accessions:
+            raise ValidationError(
+                f"variant column '{column.name}' has invalid accession identifiers; "
+                "some accession identifiers present in the score file were not added as targets."
+                "Validation errors found:\n" + "\n".join(invalid_accessions))


Thanks for breaking these out, this will be really useful for users I think. I had two suggestions:

It's true that the prior code created sets, but I would use a list or tuple here now since our goals have changed a little. We'd like to be able to tell the user about every error that we find, and to do that we should preserve variants even when they aren't unique. It would be confusing for the user to receive an error like 'validation errors found: line 1, line 2', only for them to fix those errors and have identical errors in lines 3 and 4.

I would make use of the triggers part of the ValidationError class. Here is an example:

mavedb-api/src/mavedb/lib/validation/dataframe.py, lines 447 to 451 in cbacd63:

    # format and raise an error message that contains all invalid variants
    if len(invalid_variants) > 0:
        raise ValidationError(
            f"encountered {len(invalid_variants)} invalid variant strings.", triggers=invalid_variants
        )

This class will be used by the UI to display the triggers as a list of errors automatically. We can combine this with the enumerate function to provide granular feedback to the user:

    # a list (not a set) so that repeated errors on different rows are all reported
    invalid_accessions = [
        f"accession identifier {str(v).split(':')[0]} from row {idx}, variant {v} not found"
        for idx, v in enumerate(variants)
        if str(v).split(":")[0] not in targets
    ]
    if invalid_accessions:
        raise ValidationError(
            f"variant column '{column.name}' has invalid accession identifiers; "
            f"{len(invalid_accessions)} accession identifiers present in the score file were not added as targets.",
            triggers=invalid_accessions,
        )

Now, we provide a useful top level error message within the validation error itself in addition to helpful individual errors that will help the user fix the overall error. I only commented on this one, but the same principle should apply for the other validation checks as well.
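To make the behaviour concrete, here is a self-contained sketch of the pattern. The `ValidationError` class below is a stand-in that models only the `triggers` argument, not the project's actual class, and `check_accessions` is a hypothetical helper name.

```python
class ValidationError(ValueError):
    """Stand-in for the project's error type; only `triggers` is modelled here."""

    def __init__(self, message, triggers=None):
        super().__init__(message)
        self.triggers = list(triggers or [])


def check_accessions(variants, targets, column_name="hgvs_nt"):
    """Raise one error carrying a trigger for every row with an unknown accession."""
    triggers = []
    for idx, v in enumerate(variants):
        accession = str(v).split(":")[0]
        if accession not in targets:
            # Appending to a list keeps repeated problems on different rows.
            triggers.append(
                f"accession identifier {accession} from row {idx}, variant {v} not found"
            )
    if triggers:
        raise ValidationError(
            f"variant column '{column_name}' has invalid accession identifiers; "
            f"{len(triggers)} accession identifiers present in the score file "
            "were not added as targets.",
            triggers=triggers,
        )
```

If the same unknown variant appears on two rows, both rows get their own trigger, whereas a set of raw variant strings would have collapsed them into one entry.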

bencap · 2025-04-04T13:29:45Z

src/mavedb/lib/validation/dataframe/column.py

+                "Validation errors found:\n" + "\n".join(invalid_accessions))

    else:
        if len(set(v[:2] for v in variants)) > 1:


We should be able to do the same thing as above for these checks as well.
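For the prefix check, one possible shape is sketched below. Treating the most common prefix as the expected one is an assumption made for illustration, and `prefix_triggers` is a hypothetical helper name; the returned list would be passed as `triggers` to the ValidationError.

```python
from collections import Counter


def prefix_triggers(variants: list[str]) -> list[str]:
    """Describe each row whose two-character prefix differs from the most common one."""
    if not variants:
        return []
    prefixes = [str(v)[:2] for v in variants]
    # Assumption: the most frequent prefix is the one the user intended.
    expected, _ = Counter(prefixes).most_common(1)[0]
    return [
        f"row {idx}: variant {v} has prefix '{prefix}', expected '{expected}'"
        for idx, (v, prefix) in enumerate(zip(variants, prefixes))
        if prefix != expected
    ]
```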

bencap · 2025-05-14T22:42:14Z

Thanks Estelle, I looked at the changes you made and they look good. I'm not sure what happened to the branch though. Would you be able to rebase off of release-2025.2.0 when you get a chance? After that I can take a final look and we can finish it up.


EstelleDa · 2025-05-15T07:37:43Z

Thanks Ben! No idea why these weird conflicts happened. I have merged release-2025.2.0 into it.

bencap

Thanks Estelle, looks good besides the one suggestion I noted which was probably a rebasing issue.

tests/routers/test_score_set.py

Co-authored-by: Benjamin Capodanno <31941502+bencap@users.noreply.github.com>

EstelleDa · 2025-05-21T00:36:02Z

Thanks Ben! I think so.

bencap and others added 28 commits March 29, 2025 10:57

Fix Docker Casing Warnings

a3d4940

Refactor Dataframe Validation Logic

a7c39af

Refactors dataframe validation logic into 3 component files: column.py, dataframe.py, and variant.py. This simplifies the validation structure and logically separates validation functions based on the part of the df they operate on.

Bump Dependencies

b257447

Check for Nonetype Target Sequences to Silence MyPy Error

a2c28e1

Replace DataSet Columns Setter in Worker Variant Mocker

4a65b2d

Add Base Editor Column to Target Accessions Table

7c93e19

Validation logic and test cases for base editor data

833f6d9

Add isBaseEditor Flag to Remaining Accession Tests

0c74f1d

Add GUIDE_SEQUENCE_COLUMN constant to mave lib

e56d299

Use existing boolean flag for transgenic marker in prefix validation

5bc49de

Clarify error message for accession based variants with accessions missing from targets list

6fc41fe

Move guide sequence column to the end of the standard columns sorted list

9f7a88e

Add additional column validation tests

ddd6517

Fix sort order of dataframe test case columns

1ffb891

Use equality comparison over is operator for column name comparison

9cb16e4

Allow the Unix Domain Socket during test runs

d1f4f9e

This allows the use of the vs-code pytest extension but still prevents the use of external connections. Enabling this socket makes it easier to test within the code editor.

Multi-Variant Genomic Validation Tests

53e9065

Adds tests for multi-variant validation for accession based variants. As part of this change, an additional transcript was added to test genomic based protein variants in addition to just testing nucleotide based variants.

Logical names for git action checks

4afa8a7

Bump SeqRepo Version, Add Volume to Dev Containers

291fc7e

Add SeqRepo based seqfetcher to data provider

c56b374

Prior to this, we weren't really using SeqRepo to do transcript resolution (unintentionally). Note that to use SeqRepo in this manner, a new environment variable `HGVS_SEQREPO_DIR` should be set.

Add SeqFetcher MyPy type stub

8165685

Refactor fixes

db7f250

Use MaveHGVS to determine if variant is a multi-variant

47650a8

Fix tests for MaveHGVS parsing

cf8a95b

Rebase fixes (could fixup)

82813af

Show the variants that have problem in error message.

21fc0dc

EstelleDa requested a review from bencap April 3, 2025 07:48

EstelleDa linked an issue Apr 3, 2025 that may be closed by this pull request

Surface Validation Errors as Issues with Specific Lines #350

Closed

bencap reviewed Apr 4, 2025


Modify codes to triggers

633dcf5

jstone-dev force-pushed the feature/bencap/317/base-editor-data branch from 767b0a5 to 6226c93 Compare May 6, 2025 16:52

Base automatically changed from feature/bencap/317/base-editor-data to release-2025.2.0 May 8, 2025 23:09

Merge branch 'release-2025.2.0' into improve/estelle/350/surfaceValidationErrors

bc06144

bencap approved these changes May 20, 2025


Update tests/routers/test_score_set.py

489c59c

Co-authored-by: Benjamin Capodanno <31941502+bencap@users.noreply.github.com>

EstelleDa merged commit a16e7b3 into release-2025.2.0 May 21, 2025
14 of 16 checks passed

EstelleDa deleted the improve/estelle/350/surfaceValidationErrors branch May 21, 2025 01:40

bencap mentioned this pull request Jun 11, 2025

Release 2025.2.0 #455

Merged

EstelleDa merged 31 commits into release-2025.2.0 from improve/estelle/350/surfaceValidationErrors
