@ddri ddri commented Jan 4, 2026

Summary

Fixes #15304 and #15303

The OpenQASM 3 exporter was producing invalid identifiers in two cases:

  1. Register names starting with ASCII digits (e.g., 3qr → should be escaped)
  2. Names containing Unicode number characters (e.g., j² → should be escaped)

The root cause was the regex character class [\w], which matches digits (including Unicode number characters such as ²), so names like 3qr were incorrectly considered valid.
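The behaviour described above can be reproduced with a few lines of Python. This is only a sketch: `old_is_valid` and its pattern are hypothetical stand-ins, since the exporter's actual regex is not quoted in this PR.

```python
import re

# Hypothetical stand-in for the old check; the real pattern is not shown in the PR.
_OLD_PATTERN = re.compile(r"[\w]+")


def old_is_valid(name: str) -> bool:
    """Return True if every character of `name` is matched by \\w."""
    return _OLD_PATTERN.fullmatch(name) is not None


# \w covers letters, underscore, ASCII digits and other Unicode
# alphanumerics, so all three names are accepted by this check.
for name in ["qr", "3qr", "j²"]:
    print(name, old_is_valid(name))
```

Because `\w` makes no distinction between the first character and the rest, a leading digit passes the same test as a letter.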

Solution

Replaced the regex-based validation with proper Unicode-aware functions using unicodedata.category() to correctly identify valid identifier characters per the OpenQASM 3 spec:

  • First character: Unicode letter (category L*) or underscore
  • Subsequent: Unicode letters, underscores, or ASCII digits 0-9

This also simplifies the escaping by only replacing invalid characters rather than always prepending an underscore.
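A minimal sketch of the "replace only the invalid characters" approach follows. The names `_is_valid_char` and `escape_identifier` are illustrative and are not taken from the diff, and the handling of the empty string is an assumption.

```python
import unicodedata

_ASCII_DIGITS = "0123456789"


def _is_valid_char(char: str, is_first: bool) -> bool:
    """Apply the per-position rule from the bullets above (hypothetical helper)."""
    if char == "_" or unicodedata.category(char).startswith("L"):
        return True
    # ASCII digits are allowed everywhere except the first position.
    return not is_first and char in _ASCII_DIGITS


def escape_identifier(name: str) -> str:
    """Replace each invalid character with an underscore (hypothetical helper)."""
    if not name:
        return "_"  # assumption: an empty name becomes a lone underscore
    return "".join(
        char if _is_valid_char(char, index == 0) else "_"
        for index, char in enumerate(name)
    )


print(escape_identifier("3qr"))   # → _qr
print(escape_identifier("j²"))    # → j_
print(escape_identifier("t[0]"))  # → t_0_
```

The sketch does not deal with collisions: two different inputs, such as 3qr and _qr, can escape to the same identifier, so uniqueness would have to be handled separately.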

Examples

| Input | Before (invalid) | After (valid) |
|-------|------------------|---------------|
| 3qr   | 3qr              | _qr           |
| j²    | j²               | j_            |
| t[0]  | _t_0_            | t_0_          |

Test plan

  • Added 4 new tests for identifier escaping
  • All 119 existing QASM3 export tests pass
  • Manual verification of escaping behavior

The exporter was producing invalid OpenQASM 3 identifiers in two cases:
1. Register names starting with ASCII digits (e.g., "3qr")
2. Names containing Unicode number characters (e.g., "j²")

Replaced the regex-based validation with proper Unicode-aware functions
that use unicodedata to correctly identify valid identifier characters
per the OpenQASM 3 spec:
- First character: Unicode letter (category L*) or underscore
- Subsequent: Unicode letters, underscores, or ASCII digits 0-9

This also simplifies the escaping logic by only replacing invalid
characters rather than always prepending an underscore.

Fixes Qiskit#15304, fixes Qiskit#15303
@ddri ddri requested a review from a team as a code owner January 4, 2026 09:47
@qiskit-bot qiskit-bot added the Community PR PRs from contributors that are not 'members' of the Qiskit repo label Jan 4, 2026
@qiskit-bot
Collaborator

Thank you for opening a new pull request.

Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient.

While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone.

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

Comment on lines +181 to +206
def _is_valid_identifier(name: str) -> bool:
    """Check if a name is a valid OpenQASM 3 identifier.

    Per the OpenQASM 3 spec, identifiers must:
    - Start with a Unicode letter (category L*) or underscore
    - Contain only Unicode letters, underscores, or ASCII digits (0-9)

    This excludes Unicode digit/number characters (categories Nd, Nl, No) from
    the first position, and excludes non-ASCII digit characters (Nl, No) from
    all positions.
    """
    if not name:
        return False
    first = name[0]
    # First char must be letter (L*) or underscore, not any kind of number
    first_cat = unicodedata.category(first)
    if not (first_cat.startswith("L") or first == "_"):
        return False
    # Rest can be letters, underscore, or ASCII digits 0-9
    for char in name[1:]:
        if char in "0123456789" or char == "_":
            continue
        cat = unicodedata.category(char)
        if not cat.startswith("L"):
            return False
    return True
Contributor

The current code uses startswith('L'). Could you confirm this is the intended behavior, or should we be more explicit (e.g. cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo'))?
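One way to check whether the two spellings differ is to enumerate the categories that `unicodedata.category()` actually returns. The sketch below assumes the Unicode database bundled with the running Python; `letter_categories` is a hypothetical helper name.

```python
import sys
import unicodedata


def letter_categories() -> set[str]:
    """Collect every general category starting with 'L' across all code points."""
    return {
        category
        for code_point in range(sys.maxunicode + 1)
        if (category := unicodedata.category(chr(code_point))).startswith("L")
    }


explicit = {"Lu", "Ll", "Lt", "Lm", "Lo"}
print(letter_categories() == explicit)
```

If the comparison holds, the prefix test and the explicit tuple accept the same characters, and the choice between them is a matter of readability.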

@jakelishman
Member

Hiya - thanks for the PR, but #15305 was already open and has just needed a final merge for a while now (it got a little lost, apparently), and it addresses #15303 and #15304 too. This proposed PR allows more identifiers through without escaping, but at the cost of manual Python-space iteration through every character of each identifier and a lookup in the Unicode database, and I'm worried about the performance cost of that. I'd potentially prefer to keep things simpler and faster on export rather than trying to be "perfect": if we involve too much Unicode, we end up in nasty situations where we have to make decisions about identifier normalisation and so on, whereas restricting our export to a much simpler character set avoids all that and lets us be a bit faster.
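For comparison, a restricted-character-set approach along the lines described above might look like the sketch below. It is an assumption for illustration only and is not taken from #15305; the names `escape_ascii`, `_VALID_ASCII` and `_INVALID_ASCII` are hypothetical.

```python
import re

# Precompiled ASCII-only patterns; matching runs inside the regex engine
# rather than in a per-character Python loop.
_VALID_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_INVALID_ASCII = re.compile(r"[^A-Za-z0-9_]")


def escape_ascii(name: str) -> str:
    """Escape `name` to an ASCII-only identifier (hypothetical helper)."""
    if _VALID_ASCII.fullmatch(name):
        return name
    escaped = _INVALID_ASCII.sub("_", name)
    # Prefix an underscore if the result is empty or starts with a digit.
    if not escaped or escaped[0] in "0123456789":
        escaped = "_" + escaped
    return escaped


print(escape_ascii("3qr"))   # → _3qr
print(escape_ascii("j²"))    # → j_
print(escape_ascii("t[0]"))  # → t_0_
```

The trade-off is that every non-ASCII letter is escaped even where the spec would allow it, in exchange for a simpler rule and no normalisation questions.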

Labels

Community PR PRs from contributors that are not 'members' of the Qiskit repo

Development

Successfully merging this pull request may close these issues.

OpenQASM 3 exporter does not escape digits at the start of identifiers

4 participants