Description
For a partial5 match at the 5' end of a sequence, a shorter match without errors is currently prioritized over a longer match within the error tolerance.
config.tsv:
ids tags distance partial5
tagT TTTTTTTTTT 0 3:0.2
tagA AAAAAAAAAA 1 3:0.2
tagG GGGGGGGGGG 1 -
input.fastq:
@read1 5prime tagT
TTTTTTTTTT
+
;;;;;;;;;;
@read2 5prime tagT with 3prime substitution
TTTTTTTTTC
+
;;;;;;;;;;
@read3 5prime tagT with interior substitution
TTTCTTTTTT
+
;;;;;;;;;;
@read4 5prime tagT with interior substitution within first 3 bp
TTCTTTTTTT
+
;;;;;;;;;;
@read5 5prime tagT with interior substitution within first 3 bp
TCTTTTTTTT
+
;;;;;;;;;;
@read6 5prime tagT with 5prime substitution
CTTTTTTTTT
+
;;;;;;;;;;
@read7 5prime tagT with 5prime and 3prime substitutions
CTTTTTTTTC
+
;;;;;;;;;;
@read8 5prime tagT with 5prime and interior substitutions
CTTTTTTCTT
+
;;;;;;;;;;
@read9 5prime tagA with interior substitution; like read3
AAACAAAAAA
+
;;;;;;;;;;
@read10 5prime tagG with interior substitution; like read3
GGGCGGGGGG
+
;;;;;;;;;;
splitcode command:
splitcode -c config.tsv --loc-names --out-fasta --nFastqs 1 --pipe input.fastq
output:
>read1 LX:Z:tagT:0,0-10
TTTTTTTTTT
>read2 LX:Z:tagT:0,0-9
TTTTTTTTTC
>read3 LX:Z:tagT:0,0-3
TTTCTTTTTT
>read4 LX:Z:tagT:0,0-10
TTCTTTTTTT
>read5 LX:Z:tagT:0,0-10
TCTTTTTTTT
>read6 LX:Z:tagT:0,0-10
CTTTTTTTTT
>read7 LX:Z:tagT:0,0-10
CTTTTTTTTC
>read8 LX:Z:tagT:0,0-10
CTTTTTTCTT
>read9 LX:Z:tagA:0,0-3
AAACAAAAAA
>read10 LX:Z:tagG:0,0-10
GGGCGGGGGG
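As a quick way to see which reads received a match shorter than the full 10 bp tag, the headers above can be parsed with a short Python sketch. This is only a convenience script for this report, not part of splitcode. The helper names (`parse_header`, `short_matches`) are made up for illustration, and the regular expression assumes the header layout shown above, with the number after the tag name taken to be a file index.

```python
import re

# FASTA headers copied verbatim from the splitcode output above.
HEADERS = """\
>read1 LX:Z:tagT:0,0-10
>read2 LX:Z:tagT:0,0-9
>read3 LX:Z:tagT:0,0-3
>read4 LX:Z:tagT:0,0-10
>read5 LX:Z:tagT:0,0-10
>read6 LX:Z:tagT:0,0-10
>read7 LX:Z:tagT:0,0-10
>read8 LX:Z:tagT:0,0-10
>read9 LX:Z:tagA:0,0-3
>read10 LX:Z:tagG:0,0-10
""".splitlines()

TAG_LENGTH = 10  # every tag in config.tsv is 10 bp long

# ASSUMPTION: headers look like ">read LX:Z:tag:file,start-end".
LX_PATTERN = re.compile(r"^>(\S+) LX:Z:([^:]+):(\d+),(\d+)-(\d+)$")


def parse_header(header):
    """Return (read, tag, start, end) for one --loc-names FASTA header."""
    match = LX_PATTERN.match(header)
    if match is None:
        raise ValueError(f"unexpected header format: {header!r}")
    read, tag, _file_index, start, end = match.groups()
    return read, tag, int(start), int(end)


def short_matches(headers, tag_length=TAG_LENGTH):
    """Return the names of reads whose matched span is shorter than the tag."""
    parsed = (parse_header(h) for h in headers)
    return [read for read, _tag, start, end in parsed if end - start < tag_length]


print(short_matches(HEADERS))  # ['read2', 'read3', 'read9']
```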
The FAQ states that
The matching sequence is prioritized by length (with longest sequence getting the highest priority).
Based on this, I would have expected the longest match (i.e., LX:Z:tag[ATG]:0,0-10) to be prioritized in all cases. splitcode's behavior on read2, read3, and read9 therefore does not match this expectation.
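To make the expectation concrete, here is a toy Python model of the two prioritization rules. It is not splitcode's implementation. It assumes that a partial5 value of `3:0.2` means a minimum match length of 3 and an allowed mismatch count of floor(length × 0.2), and that a 5'-truncated tag is compared against the start of the read. The function names are hypothetical, and the `exact_first` rule is only a guess that happens to reproduce the spans reported above.

```python
import math


def candidates(read, tag, min_len, mismatch_freq):
    """List (length, mismatches) for each 5'-truncated form of `tag` that
    matches the start of `read` within the assumed tolerance."""
    found = []
    for k in range(min_len, min(len(tag), len(read)) + 1):
        suffix = tag[len(tag) - k:]  # tag with its 5' end truncated to k bases
        mismatches = sum(a != b for a, b in zip(suffix, read[:k]))
        # ASSUMPTION: allowed mismatches = floor(length * frequency).
        if mismatches <= math.floor(k * mismatch_freq):
            found.append((k, mismatches))
    return found


def longest_within_tolerance(cands):
    """Expected rule: the longest candidate wins, with or without errors."""
    return max(k for k, _ in cands)


def exact_first(cands):
    """Hypothetical rule consistent with the observed output: any error-free
    candidate beats every candidate that contains a mismatch."""
    exact = [k for k, m in cands if m == 0]
    return max(exact) if exact else max(k for k, _ in cands)


READS = {"read2": "TTTTTTTTTC", "read3": "TTTCTTTTTT", "read4": "TTCTTTTTTT"}
for name, seq in READS.items():
    cands = candidates(seq, "TTTTTTTTTT", min_len=3, mismatch_freq=0.2)
    print(name, longest_within_tolerance(cands), exact_first(cands))
```

Under these assumptions the script prints `read2 10 9`, `read3 10 3`, and `read4 10 10`: the first number is the match length I expected, and the second agrees with the end coordinate splitcode reported for each of these reads.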
Furthermore, I would have naively expected that adding a partial5 specification to a tag would yield a superset of the matches possible without it. However, splitcode's behavior on read9 shows that a partial5 specification precludes a longer match, even though that longer match would have been found without the partial5 specification (as demonstrated by read10).
Tested with splitcode version 0.31.4.
Executable Google Colab notebook: https://colab.research.google.com/drive/1hzVxHB_mWfawPmqbjAGcSiddmD8gtk1U