Rewrite of CSV regex parsing to avoid exponential behavior#2513
Merged
Interkarma merged 5 commits intoInterkarma:masterfrom Apr 27, 2023
Merged
Rewrite of CSV regex parsing to avoid exponential behavior#2513Interkarma merged 5 commits intoInterkarma:masterfrom
Interkarma merged 5 commits intoInterkarma:masterfrom
Conversation
Problem was discovered after DFU started busy-looping while parsing a
CSV with a faulty
1554,"%it, Le Protecteur
(missing closing quotes) line in Text/Internal_MagicItems.csv
Nested star operators - like here, (a*b*)* - can take exponential time
under the right conditions.
Minimal fix is to replace this pattern with (a+|b+)*, assuming
a and b share no common prefix.
I did not stop there and removed the intermediate matches array by
directly parsing key and value columns together in the regex.
Another potential bug with .Trim("\""), it may remove more than the
quotes surrounding quoted strings, ex.
"""Approach"" said the King" -> Approach" said the King
Owner
|
Thanks for suggested fix Pango, I appreciate it. :) I'll just need to regression test change against master CSV files. These are exported direct from string tables using Unity Editor (here), so that's always the baseline for compatibility and expected format of CSV input. |
jefetienne
reviewed
Apr 27, 2023
jefetienne
reviewed
Apr 27, 2023
added 3 commits
April 27, 2023 09:37
Also, return the lines as a List instead of an Array. Rationale: Arrays must allocate exactly the number of elements, so IEnumerable.ToArray() overallocates as it aggregates elements, then as a last step does an extra copy to return the exact number of elements. Since the result is short lived, extra allocated elements is not a concern. Source: https://stackoverflow.com/questions/1105990/is-it-better-to-call-tolist-or-toarray-in-linq-queries Another solution would have been to return an IEnumerable, but it could generate parsing exceptions from unexpected places.
I feel previous style is more consistent with the rest of DFU coding style. Both approaches have a O(n) complexity.
Owner
|
This is excellent! It handles all master CSVs perfectly, plus some edge cases I look out for. It stands up to bad input in value field much better than the previous. Thank you Pango! |
Collaborator
Author
|
@KABoissonneault @jefetienne |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem was discovered after DFU froze (busy-looping) while parsing a CSV with a faulty
(missing closing quotes) line in Text/Internal_MagicItems.csv
Nested star operators - like here,
(a*b*)*- can take exponential time under the right conditions.Minimal fix is to replace this pattern with
(a+|b+)*, assuming a and b share no common prefix.I did not stop there and removed the intermediate
matchesarray by directly parsing key and value columns together in the regex.Another potential bug fix,
.Trim("\"")may remove more than the quotes surrounding quoted strings, ex.so I replaced it with a more explicit removal of surrounding quotes.