Fix OpenQASM 3 exporter to properly escape invalid identifiers #15498
base: main
The exporter was producing invalid OpenQASM 3 identifiers in two cases:

1. Register names starting with ASCII digits (e.g., "3qr")
2. Names containing Unicode number characters (e.g., "j²")

Replaced the regex-based validation with proper Unicode-aware functions that use unicodedata to correctly identify valid identifier characters per the OpenQASM 3 spec:

- First character: Unicode letter (category L*) or underscore
- Subsequent: Unicode letters, underscores, or ASCII digits 0-9

This also simplifies the escaping logic by only replacing invalid characters rather than always prepending an underscore.

Fixes Qiskit#15304, fixes Qiskit#15303
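The "replace only the invalid characters" idea described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the names `_is_valid_char` and `escape_identifier` are hypothetical, the choice to map an empty name to `_` is an assumption, and reserved keywords and name collisions are not handled.

```python
import unicodedata


def _is_valid_char(char: str, first: bool) -> bool:
    """Hypothetical helper: apply the rules listed in the PR description."""
    if char == "_":
        return True
    # ASCII digits are allowed everywhere except the first position.
    if not first and char in "0123456789":
        return True
    # Any Unicode letter category (Lu, Ll, Lt, Lm, Lo) is accepted.
    return unicodedata.category(char).startswith("L")


def escape_identifier(name: str) -> str:
    """Hypothetical helper: replace each invalid character with "_"."""
    if not name:
        # Assumption: an empty name is escaped to a lone underscore.
        return "_"
    return "".join(
        char if _is_valid_char(char, first=(index == 0)) else "_"
        for index, char in enumerate(name)
    )


for example in ("qr", "3qr", "j²", "t[0]"):
    print(example, "->", escape_identifier(example))
```

Names that are already valid pass through unchanged, so no leading underscore is added unless the first character itself is invalid.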
Thank you for opening a new pull request. Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient. While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone. One or more of the following people are relevant to this code:
import unicodedata


def _is_valid_identifier(name: str) -> bool:
    """Check if a name is a valid OpenQASM 3 identifier.

    Per the OpenQASM 3 spec, identifiers must:
    - Start with a Unicode letter (category L*) or underscore
    - Contain only Unicode letters, underscores, or ASCII digits (0-9)

    This excludes every Unicode number character (categories Nd, Nl, No) from
    the first position, and excludes all number characters other than the
    ASCII digits 0-9 from the remaining positions.
    """
    if not name:
        return False
    first = name[0]
    # First char must be letter (L*) or underscore, not any kind of number
    first_cat = unicodedata.category(first)
    if not (first_cat.startswith("L") or first == "_"):
        return False
    # Rest can be letters, underscore, or ASCII digits 0-9
    for char in name[1:]:
        if char in "0123456789" or char == "_":
            continue
        cat = unicodedata.category(char)
        if not cat.startswith("L"):
            return False
    return True
Current code uses startswith('L'). Could you confirm this is the intended behavior, or should we be more explicit (e.g. cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo'))?
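One way to compare the two spellings is to enumerate every category that `unicodedata.category()` actually returns with an `L` prefix. The sketch below is a quick check rather than part of the PR; the variable names are made up for illustration, and the result depends on the Unicode database bundled with the running Python.

```python
import sys
import unicodedata

# The explicit list proposed in the review comment.
EXPLICIT_LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

# Collect every general category with an "L" prefix that the bundled
# Unicode database reports, scanning the full code point range.
prefix_letter_categories = set()
for code_point in range(sys.maxunicode + 1):
    category = unicodedata.category(chr(code_point))
    if category.startswith("L"):
        prefix_letter_categories.add(category)

print(sorted(prefix_letter_categories))
print(prefix_letter_categories == EXPLICIT_LETTER_CATEGORIES)
```

If the two sets match, the prefix test and the explicit tuple accept exactly the same characters, and the choice between them is about readability rather than behaviour.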
Hiya - thanks for the PR, but #15305 was already open and has just been waiting on a final merge for a while now (it got a little lost, apparently); it addresses #15303 and #15304 too. This proposed PR allows more identifiers through without escaping, but at the cost of iterating manually in Python space through every character of every identifier and checking the Unicode database, and I'm worried about the performance cost of that. I'd rather keep things simpler and faster on export than try to be "perfect": if we involve too much Unicode, we end up in nasty situations where we have to make decisions about identifier normalisation and so on, whereas restricting our export to a much simpler character set avoids all that and lets us be a bit faster.
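For comparison, a restricted-character-set approach along the lines described in this comment might look like the sketch below. It is not the code from #15305: the pattern names and `escape_ascii` are hypothetical, and the ASCII-only rule is an assumption chosen to show how precompiled regexes avoid per-character Python iteration and Unicode database lookups.

```python
import re

# Assumption: only ASCII letters, ASCII digits and underscore are emitted
# unescaped, and the first character must not be a digit.
_VALID_ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_INVALID_CHAR = re.compile(r"[^A-Za-z0-9_]")


def escape_ascii(name: str) -> str:
    """Hypothetical helper: return valid names unchanged, escape the rest."""
    if _VALID_ASCII_IDENTIFIER.fullmatch(name):
        return name
    # Prefix an underscore and replace everything outside the ASCII set,
    # so no Unicode normalisation question ever arises.
    return "_" + _INVALID_CHAR.sub("_", name)


for example in ("qr", "3qr", "j²", "t[0]"):
    print(example, "->", escape_ascii(example))
```

The trade-off is that valid non-ASCII names are escaped even though the spec would accept them, in exchange for a single regex match per identifier.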
Summary
Fixes #15304 and #15303
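The OpenQASM 3 exporter was producing invalid identifiers in two cases:
- Register names starting with ASCII digits (e.g., 3qr → should be escaped)
- Names containing Unicode number characters (e.g., j² → should be escaped)

The root cause was the regex [\w], which matches digits, so names like 3qr were incorrectly considered valid.

Solution
Replaced the regex-based validation with proper Unicode-aware functions using unicodedata.category() to correctly identify valid identifier characters per the OpenQASM 3 spec:
- First character: Unicode letter (category L*) or underscore
- Subsequent characters: Unicode letters, underscores, or ASCII digits 0-9

This also simplifies the escaping by only replacing invalid characters rather than always prepending an underscore.

Examples
| Name | Before | After |
|---|---|---|
| 3qr | 3qr | _qr |
| j² | j² | j_ |
| t[0] | _t_0_ | t_0_ |

Test plan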