specifying substitution rate in partial5 can prevent identifying longest match #47

@bentyeh

For a partial5 match at the 5' end of a sequence, a shorter match without errors is currently prioritized over a longer match within the error tolerance.

config.tsv

ids	tags	distance	partial5
tagT	TTTTTTTTTT	0	3:0.2
tagA	AAAAAAAAAA	1	3:0.2
tagG	GGGGGGGGGG	1	-

input.fastq

@read1 5prime tagT
TTTTTTTTTT
+
;;;;;;;;;;
@read2 5prime tagT with 3prime substitution
TTTTTTTTTC
+
;;;;;;;;;;
@read3 5prime tagT with interior substitution
TTTCTTTTTT
+
;;;;;;;;;;
@read4 5prime tagT with interior substitution within first 3 bp
TTCTTTTTTT
+
;;;;;;;;;;
@read5 5prime tagT with interior substitution within first 3 bp
TCTTTTTTTT
+
;;;;;;;;;;
@read6 5prime tagT with 5prime substitution
CTTTTTTTTT
+
;;;;;;;;;;
@read7 5prime tagT with 5prime and 3prime substitutions
CTTTTTTTTC
+
;;;;;;;;;;
@read8 5prime tagT with 5prime and interior substitutions
CTTTTTTCTT
+
;;;;;;;;;;
@read9 5prime tagA with interior substitution; like read3
AAACAAAAAA
+
;;;;;;;;;;
@read10 5prime tagG with interior substitution; like read3
GGGCGGGGGG
+
;;;;;;;;;;

splitcode command:

splitcode -c config.tsv --loc-names --out-fasta --nFastqs 1 --pipe input.fastq

output:

>read1 LX:Z:tagT:0,0-10
TTTTTTTTTT
>read2 LX:Z:tagT:0,0-9
TTTTTTTTTC
>read3 LX:Z:tagT:0,0-3
TTTCTTTTTT
>read4 LX:Z:tagT:0,0-10
TTCTTTTTTT
>read5 LX:Z:tagT:0,0-10
TCTTTTTTTT
>read6 LX:Z:tagT:0,0-10
CTTTTTTTTT
>read7 LX:Z:tagT:0,0-10
CTTTTTTTTC
>read8 LX:Z:tagT:0,0-10
CTTTTTTCTT
>read9 LX:Z:tagA:0,0-3
AAACAAAAAA
>read10 LX:Z:tagG:0,0-10
GGGCGGGGGG
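To make the short matches easy to spot, here is a minimal sketch that scans the FASTA headers above and flags any read whose reported interval is shorter than the full 10 bp tag. It assumes the header layout is exactly as shown (`>read LX:Z:tag:N,start-end`, with the number before the comma taken to be a file index); `short_matches` and `OUTPUT` are illustrative names, not part of splitcode.

```python
import re

# Header layout assumed from the output shown in this issue:
#   >readN LX:Z:<tag>:<file index>,<start>-<end>
HEADER = re.compile(r"^>(\S+) LX:Z:([^:]+):(\d+),(\d+)-(\d+)$")

# Headers copied from the splitcode output above (sequence lines omitted).
OUTPUT = """\
>read1 LX:Z:tagT:0,0-10
>read2 LX:Z:tagT:0,0-9
>read3 LX:Z:tagT:0,0-3
>read4 LX:Z:tagT:0,0-10
>read5 LX:Z:tagT:0,0-10
>read6 LX:Z:tagT:0,0-10
>read7 LX:Z:tagT:0,0-10
>read8 LX:Z:tagT:0,0-10
>read9 LX:Z:tagA:0,0-3
>read10 LX:Z:tagG:0,0-10
"""


def short_matches(fasta_text, full_length=10):
    """Return (read, tag, start, end) for every match shorter than full_length."""
    flagged = []
    for line in fasta_text.splitlines():
        m = HEADER.match(line.strip())
        if m is None:
            continue  # sequence line or a header without an LX tag
        read, tag, _file_index, start, end = m.groups()
        if int(end) - int(start) < full_length:
            flagged.append((read, tag, int(start), int(end)))
    return flagged


for read, tag, start, end in short_matches(OUTPUT):
    print(f"{read}: {tag} matched only {start}-{end}")
```

On the output above, this flags read2, read3, and read9.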

The FAQ states:

The matching sequence is prioritized by length (with longest sequence getting the highest priority).

Based on this, I expected the longest match (i.e., LX:Z:tag[ATG]:0,0-10) to be reported in all cases. splitcode's behavior on read2, read3, and read9 therefore does not match that expectation.
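The expectation can be written down as a small reference function. This is only a sketch of the behavior I expected, not splitcode's actual algorithm. It assumes that partial5 `min:freq` means the read starts with at least `min` bases of the tag's 3' end, and that a candidate of length L is acceptable when mismatches / L <= freq. The rounding rule splitcode applies is not known to me, and `longest_partial5` is a hypothetical helper name.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def longest_partial5(read, tag, min_len, max_freq):
    """Length of the longest acceptable partial 5' match, or None.

    Assumption (not confirmed against splitcode): a candidate of length L
    compares read[:L] with the last L bases of the tag and is accepted
    when mismatches <= max_freq * L. Candidates are tried longest first,
    so a longer match within tolerance wins over a shorter exact one.
    """
    for length in range(min(len(read), len(tag)), min_len - 1, -1):
        if hamming(read[:length], tag[-length:]) <= max_freq * length:
            return length
    return None


tag_t = "TTTTTTTTTT"
print(longest_partial5("TTTTTTTTTC", tag_t, 3, 0.2))  # read2
print(longest_partial5("TTTCTTTTTT", tag_t, 3, 0.2))  # read3
print(longest_partial5("TTTCTTTTTT", tag_t, 3, 0.0))  # read3, no mismatches allowed
```

Under these assumptions, read2 and read3 both yield a 10 bp match at a 0.2 mismatch frequency, while setting the frequency to 0 on read3 yields 3, which is the interval splitcode currently reports (0-3).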

Furthermore, I naively expected that adding a partial5 specification to a tag would yield a superset of the matches possible without it. However, read9 shows that a partial5 specification can preclude a longer match that would have been found without it. read10 demonstrates this: tagG has no partial5 specification and matches the full 10 bp (0-10) despite the same interior substitution, whereas tagA, with partial5 set to 3:0.2, matches only 0-3.

Tested with splitcode version 0.31.4.

Executable Google Colab notebook: https://colab.research.google.com/drive/1hzVxHB_mWfawPmqbjAGcSiddmD8gtk1U
