Skip to content

Rewrite of CSV regex parsing to avoid exponential behavior#2513

Merged
Interkarma merged 5 commits intoInterkarma:masterfrom
petchema:csv-parsing-no-exponential-regex
Apr 27, 2023
Merged

Rewrite of CSV regex parsing to avoid exponential behavior#2513
Interkarma merged 5 commits intoInterkarma:masterfrom
petchema:csv-parsing-no-exponential-regex

Conversation

@petchema
Copy link
Copy Markdown
Collaborator

@petchema petchema commented Apr 27, 2023

Problem was discovered after DFU froze (busy-looping) while parsing a CSV with a faulty

1554,"%it, Le Protecteur

(missing closing quotes) line in Text/Internal_MagicItems.csv

Nested star operators - like here, (a*b*)* - can take exponential time under the right conditions.
Minimal fix is to replace this pattern with (a+|b+)*, assuming a and b share no common prefix.

I did not stop there and removed the intermediate matches array by directly parsing key and value columns together in the regex.

Another potential bug fix, .Trim("\"") may remove more than the quotes surrounding quoted strings, ex.

"""Approach"" said the King" -> Approach" said the King

so I replaced it with a more explicit removal of surrounding quotes.

Problem was discovered after DFU started busy-looping while parsing a
CSV with a faulty

    1554,"%it, Le Protecteur

(missing closing quotes) line in Text/Internal_MagicItems.csv

Nested star operators - like here, (a*b*)* - can take exponential time
under the right conditions.
Minimal fix is to replace this pattern with (a+|b+)*, assuming
a and b share no common prefix.

I did not stop there and removed the intermediate matches array by
directly parsing key and value columns together in the regex.

Another potential bug with .Trim("\""), it may remove more than the
quotes surrounding quoted strings, ex.

"""Approach"" said the King" -> Approach" said the King
@petchema petchema added the bug label Apr 27, 2023
@Interkarma
Copy link
Copy Markdown
Owner

Thanks for suggested fix Pango, I appreciate it. :)

I'll just need to regression test change against master CSV files. These are exported direct from string tables using Unity Editor (here), so that's always the baseline for compatibility and expected format of CSV input.

Pierre Etchemaite added 3 commits April 27, 2023 09:37
Also, return the lines as a List instead of an Array.

Rationale: Arrays must allocate exactly the number of elements, so
IEnumerable.ToArray() overallocates as it aggregates elements, then as a
last step does an extra copy to return the exact number of elements.

Since the result is short lived, extra allocated elements is not a
concern.

Source:
https://stackoverflow.com/questions/1105990/is-it-better-to-call-tolist-or-toarray-in-linq-queries

Another solution would have been to return an IEnumerable, but it could
generate parsing exceptions from unexpected places.
I feel previous style is more consistent with the rest of DFU coding
style. Both approaches have a O(n) complexity.
@Interkarma
Copy link
Copy Markdown
Owner

This is excellent! It handles all master CSVs perfectly, plus some edge cases I look out for. It stands up to bad input in value field much better than the previous. Thank you Pango!

@Interkarma Interkarma merged commit bf8094d into Interkarma:master Apr 27, 2023
@petchema
Copy link
Copy Markdown
Collaborator Author

@KABoissonneault @jefetienne
Thank you for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants