Undesired/undocumented behavior: a shorter match without errors is prioritized over a longer match within the error tolerance #49

@bentyeh

Initially, I thought this problem only appeared when using partial5 or partial3 specifications (see the previous issue #47). However, it also seems to affect splitcode with a plain distance error tolerance, even when the match is not at the 5' or 3' end of the sequence.

Expected behavior:

"The matching sequence is prioritized by length (with longest sequence getting the highest priority)." (splitcode docs FAQ)

  • This is not only the documented behavior, but the desired behavior in many applications.

Current behavior (splitcode version 0.31.4): a shorter match without errors is prioritized over a longer match within the error tolerance.
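The difference between the two behaviors can be sketched as two ranking rules over candidate matches. This is only an illustration: `Candidate`, `pick_documented`, and `pick_observed` are hypothetical names, and `pick_observed` is merely one ranking that would be consistent with the behavior reported here, not splitcode's actual implementation.

```python
from collections import namedtuple

# Hypothetical record for one candidate match that is within tolerance.
Candidate = namedtuple("Candidate", ["tag_length", "errors"])

def pick_documented(candidates):
    """Documented rule: the longest matching tag wins."""
    return max(candidates, key=lambda c: c.tag_length)

def pick_observed(candidates):
    """Assumed rule that would explain the report:
    fewer errors win first, and length only breaks ties."""
    return min(candidates, key=lambda c: (c.errors, -c.tag_length))

if __name__ == "__main__":
    # An exact 8 bp match competing with an 11 bp match with 1 mismatch.
    candidates = [Candidate(8, 0), Candidate(11, 1)]
    print("documented:", pick_documented(candidates))
    print("observed:  ", pick_observed(candidates))
```

With these two candidates, the documented rule selects the 11 bp match, while the assumed error-first rule selects the exact 8 bp match.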

Example

config.tsv

ids	tags	distance
tagT	TTT	0
tagT	TTTT	0
tagT	TTTTT	0
tagT	TTTTTT	0
tagT	TTTTTTT	0
tagT	TTTTTTTT	0
tagT	TTTTTTTTT	0
tagT	TTTTTTTTTT	1
tagT	TTTTTTTTTTT	1
tagT	TTTTTTTTTTTT	1
tagT	TTTTTTTTTTTTT	1
tagT	TTTTTTTTTTTTTT	1
tagT	TTTTTTTTTTTTTTT	1
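For convenience, the table above can be regenerated with a short script. This is a sketch; the function names and parameters (`config_rows`, `config_text`, `tolerant_from`) are made up for illustration and are not part of splitcode.

```python
def config_rows(min_len=3, max_len=15, tolerant_from=10):
    """Build the config rows: poly-T tags of length min_len..max_len.

    Tags shorter than tolerant_from require an exact match (distance 0);
    tags of that length or longer allow one mismatch (distance 1).
    """
    rows = [("ids", "tags", "distance")]
    for n in range(min_len, max_len + 1):
        distance = "1" if n >= tolerant_from else "0"
        rows.append(("tagT", "T" * n, distance))
    return rows

def config_text():
    """Render the rows as tab-separated text, one row per line."""
    return "\n".join("\t".join(row) for row in config_rows()) + "\n"

if __name__ == "__main__":
    # Print the table; redirect to config.tsv to recreate the file.
    print(config_text(), end="")
```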

input.fastq

@11_bp_no_mismatch
AAATTTTTTTTTTTAAA
+
;;;;;;;;;;;;;;;;;
@11_bp_1_internal_mismatch
AAATTTTTTTTTCTAAA
+
;;;;;;;;;;;;;;;;;
@11_bp_1_internal_mismatch2
AAATTTTTTTTCTTAAA
+
;;;;;;;;;;;;;;;;;
@11_bp_1_5prime_mismatch
AAACTTTTTTTTTTAAA
+
;;;;;;;;;;;;;;;;;
@11_bp_1_3prime_mismatch
AAATTTTTTTTTTCAAA
+
;;;;;;;;;;;;;;;;;

command

splitcode -c config.tsv --loc-names --out-fasta --nFastqs 1 --pipe input.fastq

manually annotated output

  • carets '^' mark the positions of splitcode's match
  • asterisks '*' mark the positions of the desired/expected match
  • the positions of the desired/expected match are also included in the read name

>11_bp_no_mismatch LX:Z:tagT:0,2-14 Expected:same
AAATTTTTTTTTTTAAA
  ^^^^^^^^^^^^
  ************

>11_bp_1_internal_mismatch LX:Z:tagT:0,2-12 Expected:same
AAATTTTTTTTTCTAAA
  ^^^^^^^^^^
  **********

>11_bp_1_internal_mismatch2 LX:Z:tagT:0,3-11 Expected:0,3-14
AAATTTTTTTTCTTAAA
   ^^^^^^^^
   ***********

>11_bp_1_5prime_mismatch LX:Z:tagT:0,3-14 Expected:same
AAACTTTTTTTTTTAAA
   ^^^^^^^^^^^
   ***********

>11_bp_1_3prime_mismatch LX:Z:tagT:0,2-13 Expected:same
AAATTTTTTTTTTCAAA
  ^^^^^^^^^^^
  ***********

This same example can be explored interactively in this Google Colab notebook: https://colab.research.google.com/drive/1uRaVpy57iKtk70xcb_8syfD0BrCmLfpX
