Conversation
Catch and report all kinds of WTF-8 encoding errors in the source strings, including invalid leading bytes, invalid trailing bytes, unexpected ends of strings, and invalid surrogate sequences. Insert replacement characters into the output as necessary. Add a TODO about minimizing size by escaping only those code points mandated to be escaped by the JSON spec. Generally improve readability of the code.
|
Current dependencies on/for this PR: This stack of pull requests is managed by Graphite. |
| os << "\\b"; | ||
| continue; | ||
| case '\f': | ||
| os << "\\f"; |
There was a problem hiding this comment.
Are \b and \f necessary here? (What spec would I read that in?)
There was a problem hiding this comment.
(if you add them to the test without this PR, does the test fail?)
There was a problem hiding this comment.
See page 4 of the JSON spec: https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf. We could also use \uXXXX escapes, which is what the previous code would have done, but these are shorter.
There was a problem hiding this comment.
Got it, thanks.
Maybe worth adding those to the test?
| // Print.cpp would consider the contents unprintable, messing up our test. | ||
| bool isNaivelyPrintable = 32 <= u && u < 127; | ||
| if (isNaivelyPrintable) { | ||
| assert(u < 0x80 && "need additional logic to emit valid UTF-8"); |
There was a problem hiding this comment.
We only get here if u < 127, so the only difference with u < 0x80 = 128 is when u == 128? If so perhaps check that directly?
There was a problem hiding this comment.
This is meant to guard against future improvement where we would start emitting most code points directly as UTF-8 instead of escaping them. After the initial refactoring to allow most code points to hit this code path by default, this assertion will trigger if we don't add extra logic for code points 0x80 and greater. The assertion would be less robust in the context of that future change if it checked against 128 specifically.
|
I added tests for all the error cases. Will land, but PTAL if you're interested. |
Catch and report all kinds of WTF-8 encoding errors in the source strings, including invalid leading bytes, invalid trailing bytes, unexpected ends of strings, and invalid surrogate sequences. Insert replacement characters into the output as necessary. Add a TODO about minimizing size by escaping only those code points mandated to be escaped by the JSON spec. Generally improve readability of the code.

Catch and report all kinds of WTF-8 encoding errors in the source strings,
including invalid leading bytes, invalid trailing bytes, unexpected ends of
strings, and invalid surrogate sequences. Insert replacement characters into the
output as necessary. Add a TODO about minimizing size by escaping only those
code points mandated to be escaped by the JSON spec. Generally improve
readability of the code.