
Conversation

@sergv commented Dec 28, 2025

As discussed in #17.

The #type, #peek, #poke, #ptr, #offset, #size, and #alignment directives are implemented more or less completely as far as I can tell. The #enum is omitted since it's used very rarely - I failed to find any uses. It could be added later relatively easily judging from its grammar description.

The only directive that is implemented partially and requires some discussion is #const (and its identical #const_str cousin). The problem is that the #const directive accepts any valid C expression.

I've supported only C expressions that refer to a single name (e.g. (#const TCP_CONNECTIONTIMEOUT)). This already handles most uses of #const, but there are cases in the wild that cannot currently be parsed, e.g.:

(#const sizeof(struct linger))
(#const offsetof(struct {char x__; struct timeval (y__); }, y__))
#const sizeof(struct addrinfo)
#const sizeof(((struct sockaddr_un *)NULL)->sun_path)
#{const __builtin_offsetof (KEY_EVENT_RECORD, wRepeatCount)}
(#{const GENERIC_WRITE | GENERIC_READ})
(#const CONTEXT_FULL|CONTEXT_DEBUG_REGISTERS|CONTEXT_FLOATING_POINT)
(#const SDL_WINDOWPOS_CENTERED_DISPLAY(15))

It would be good to support these, but it's not clear what the best way to do that is. One option would be to re-implement the C grammar, but it's pretty large and relatively complex for a feature of this magnitude, so that's likely not a good solution.

Maybe there's a way to re-use existing C grammar - add it as a submodule and re-use some of its rules?

Another possibility would be to support arbitrary combinations of identifiers, binary operators, function calls like f(...), and nested brackets. We don't really care about the syntax tree of the C expression, just that the hsc grammar accepts anything that looks like #{const ...} provided the nesting of parens, brackets, and braces within ... is balanced. Do you think this is feasible?

@tek (Owner) left a comment

Awesome! I don't have any requirements for the grammar, so aside from the pragma stuff in the scanner, all my comments are just recommendations.

Maybe there's a way to re-use existing C grammar - add it as a submodule and re-use some of its rules?

The usual approach for this is to use injection queries, but for this case it probably won't work – injections don't allow you to specify which rule to use as the root, so they would only succeed if the directive was followed by a top level declaration.

It's definitely possible to import the C grammar as a node package though! We're already doing that with the Haskell grammar, but I have no clue how difficult that would be to integrate into non-npm package ecosystems, like nvim-treesitter or nixpkgs. You're welcome to experiment with that, would be interesting!

One more high-level suggestion: Since HSC appears to reserve the # operator exclusively, I would recommend changing the scanner and grammar so that it doesn't return LHash in HSC mode. I'd expect that to reduce friction quite a bit.

hsc/grammar.js Outdated
precedences: ($, previous) => previous,
// Hook up hsc directives into Haskell grammar depending
// on their return type.
_stringly: ($, previous) => choice(
@tek (Owner) commented Dec 29, 2025

A general guideline for TS grammars is that we don't want to validate or typecheck the language, but rather recognize it leniently.

Here it seems like you're providing specialized nodes based on what type they have.

I understand that the arguments of the directive differ depending on the keyword, so that may be a justification for this approach.
If you care about that, I have no objection, but if you only want to avoid parse errors in HSC files, it may be easier to just parse (in the scanner) the basic structure mentioned in the docs 🙂:

They extend up to the nearest unmatched ), ] or }, or to the end of line if it occurs outside any () [] {} '' "" /**/ and is not preceded by a backslash.
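
As a feasibility check, the quoted rule can be sketched as a small stand-alone recognizer. This is an illustration only, not the real hsc2hs lexer or the scanner in this PR; the function name hsc_body_len and its exact edge-case handling are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: length of a directive body starting at s, per the rule
 * above -- stop at the nearest unmatched ')', ']' or '}', or at a
 * newline that occurs outside all brackets, literals and comments and
 * is not preceded by a backslash. */
static size_t hsc_body_len(const char *s) {
    int depth = 0;
    size_t i = 0;
    while (s[i]) {
        char c = s[i];
        if (c == '(' || c == '[' || c == '{') {
            depth++;
        } else if (c == ')' || c == ']' || c == '}') {
            if (depth == 0) return i;  /* unmatched closer ends the body */
            depth--;
        } else if (c == '\'' || c == '"') {
            char quote = c;            /* skip '' or "" literal, honoring escapes */
            for (i++; s[i] && s[i] != quote; i++)
                if (s[i] == '\\' && s[i + 1]) i++;
            if (!s[i]) return i;
        } else if (c == '/' && s[i + 1] == '*') {
            for (i += 2; s[i] && !(s[i] == '*' && s[i + 1] == '/'); i++) {}
            if (!s[i]) return i;       /* unterminated comment */
            i++;                       /* land on the closing '/' */
        } else if (c == '\\' && s[i + 1] == '\n') {
            i++;                       /* escaped newline: body continues */
        } else if (c == '\n' && depth == 0) {
            return i;                  /* bare newline ends the body */
        }
        i++;
    }
    return i;
}
```

For example, on "sizeof(struct linger)) rest" it stops at the second ')', the unmatched closer of a surrounding (#const ...); note that a newline inside brackets does not end the body, matching the "outside any () [] {}" clause.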

@sergv (Author) commented Dec 30, 2025

It would be good to at least know the name of a directive, but that's not strictly necessary for me. I'll try to check how it could be done in the scanner.

Going via the scanner looks like a pretty good idea, because it will allow supporting #{const} nodes that contain arbitrary C expressions without having to parse them.

It would be very useful if you could point me to the place in the scanner where it should go.

@tek (Owner):

I need to clarify that I initially thought you could avoid hooking into expression and type and so on, but I realize that that's probably necessary – the alternative I imagined was to use extras (which can occur anywhere), because my assumption is that HSC allows these constructs to begin at any point in the file. But that may be a misconception, and importantly, it would break parsing since extras don't replace the node that would be expected in their place, like the rhs expression of a binding.

Though I think my suggestion of eagerly matching in the scanner still holds for the purpose of avoiding the C syntax nodes, if that's desirable.

Also I would assume that _stringly and _number are unnecessary, since they only occur as children of the others?

@sergv (Author):

I went with _stringly and _number mostly to avoid enumerating all the top-level contexts where an hsc rule may occur, e.g. expression, pattern, type, and maybe others.

Once _number is extended, it flows into all constructs that may contain numbers automatically. That was the primary reason for introducing the complication of classifying directives by their return type.

hsc/grammar.js Outdated
rules: {

supertypes: ($, previous) => previous,
// Collect all possible hsc directives under this node but don’t use it since
@tek (Owner):
What's the purpose of this?

@sergv (Author):

This is so that the different directives will alias under the same hsc node, for easier pattern matching by users.

@tek (Owner):

As far as I can tell, this is neither possible nor necessary! Check out this branch I just pushed: https://github.com/tek/tree-sitter-haskell/tree/tek/hsc

I added two query tests to demonstrate. You can run them with:

test/query/run.bash hsc-good
test/query/run.bash hsc-bad

(Make sure to regenerate beforehand!)

I renamed the hsc node to hsc_stuff and queried for both in those tests.
You can see in hsc/src/node_types.json line 3178 that there's still a node type named hsc generated from the aliases, and the query for (hsc_stuff) fails.


Aside: Have you noticed that the parser (hsc.so) is 5.9MB, compared to 3.7MB for the base parser? I wonder if that's because you're injecting the new nodes into those large choices, which are also supertypes... probably difficult to debug.

@tek (Owner):

Ah no, it's probably because of the unicode category regexes in hsc_c_identifier – these are insanely expensive.

@sergv (Author):

Very interesting – these regexes were just copy-pasted from the C grammar definition. I think Unicode support is something that can be left for the next PR.

@tek (Owner):

FYI you can probably just use something like choice($.variable, $.name) – that should be close enough to C to cover most cases!

@sergv (Author):

Actually, simplifying the regex to token(/[a-zA-Z_$][0-9a-zA-Z_$]*/) does not reduce the size for me.

@sergv (Author) commented Dec 30, 2025

It's definitely possible to import the C grammar as a node package though! We're already doing that with the Haskell grammar

As far as I understand, with Haskell we're sort of inheriting from it: we get all the Haskell rules and even override some of them with new definitions that take effect in the unmodified parts of the Haskell grammar.

With C I need only parts of its expression syntax, not the whole grammar. However, C's expression parsing is spread over a lot of rules. I can probably import C's grammar and refer to its rules, but when I import a rule I'll also need to add its dependencies to the grammar. For example, C's expression will be something like expression: $ => choice(variable, unary, binary, cast, ...), so when I refer to C.expression the grammar also needs definitions for variable, unary, binary, cast, etc.

It seems I would have to trace rule dependencies manually and enumerate all the rules C expressions depend on by hand. Since there's only one namespace for grammar rules, there's a possibility of name clashes, so imported C rules would need to be renamed. Taken together, these spell a bad story for grammar maintenance.

Maybe there's no low-maintenance way to do what I'm after. Or perhaps fully parsing C expressions in those rare directives is overkill.

One more high-level suggestion: Since HSC appears to reserve the # operator exclusively, I would recommend changing the scanner and grammar so that it doesn't return LHash in HSC mode. I'd expect that to reduce friction quite a bit.

Never returning LHash would break parsing of the # operator in regular Haskell, I'd expect. Ideally regular Haskell should parse as before.

I managed to avoid conflicts in the current version because only the following new tokens were introduced: #{, #peek, #poke, and so on for each supported command. The scanner seems to be able to distinguish them from a regular #.

@tek (Owner) commented Dec 30, 2025

It seems I would have to trace rule dependencies manually and enumerate all the rules C expressions depend on by hand. Since there's only one namespace for grammar rules, there's a possibility of name clashes, so imported C rules would need to be renamed. Taken together, these spell a bad story for grammar maintenance.

My naive assumption is that something like this could work:

const haskell = require("../grammar.js")
const c = require("????")

module.exports = grammar(haskell, {
  name: 'hsc',

  rules: {
    ...c.rules,

    expression: ($, previous) => choice(
      // blah
    ),
  }
})

But of course this might horribly break because there are name clashes that would need to be handled manually.
Note that (just like hsc_stuff I addressed earlier) TS omits any orphan rules from the parser, so you wouldn't get any useless C nodes from merging everything.

I could also imagine that you could just reference c.rules.type in expression if you hack the $ thing used by grammar – we could investigate how it works.

Luckily, the C grammar does not have a scanner, which I assume would be a showstopper 😅

Never returning LHash would break parsing of the # operator in regular Haskell, I'd expect. Ideally regular Haskell should parse as before.

The HSC docs say that the # operator needs to be escaped as ##, so it wouldn't break in HSC; and of course you'd only skip LHash when HSC_EXT is defined!

I managed to avoid conflicts in the current version because only the following new tokens were introduced: #{, #peek, #poke, and so on for each supported command. The scanner seems to be able to distinguish them from a regular #.

The reason #peek etc. work is likely the precedence mechanic in TS – you're using a static string node (like '#const') while the conflicting label node uses token(seq('#', /[A-Z].../)), and longer string nodes have priority over shorter ones and over regexes (here I assume it's competing with '#' only, but token nodes may have some more subtle differences).
LHash causes the scanner to fail instead of emitting VARSYM, because hashes are very special (unboxed, labels), so they are handled purely by _operator_hash_head, which also matches on '#'.


Regarding the scanner approach for consuming entire HSC fragments:

My impulse would be to introduce a new cond symbol (see grammar/externals.js for an explanation) and emit that in HSC mode when encountering LHash. Then you can write:

{
  hsc_no_brace: $ => seq($._hsc_hash /* this is the cond */, '#', $.hsc_kind, $.hsc_c)
}

hsc_c would then either be parsed entirely in the scanner or in the grammar, though I'm unsure which would be easier or more efficient.

For the scanner variant, hsc_c would be an external cmd symbol that is checked at the start of pre_ws_commands and then consumes everything up to the next newline or closing bracket, which requires some state management like constraint_lookahead.

For the grammar variant, you could let TS figure out brackets, but you'd have to check whether extras (like comments) need to be blocked in the scanner to avoid breaking it. Escaped newlines may also be quite annoying, but you could definitely use an auxiliary scanner rule to detect some things that are too hard in the grammar. This will definitely result in pretty messy (and potentially expensive) parse trees, though.
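
A minimal mock of what that cond symbol might look like on the scanner side (illustration only: MockLexer stands in for tree-sitter's real TSLexer from tree_sitter/parser.h, and HSC_HASH / hsc_mode / scan_hsc_hash are hypothetical names, not from this PR):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustration only: a stand-in for tree-sitter's TSLexer. */
typedef struct {
    const char *input;   /* remaining input */
    int result_symbol;   /* symbol reported back to the parser */
} MockLexer;

enum { HSC_HASH = 0 };   /* hypothetical external cond symbol */

/* In HSC mode, when the parser asks for the cond symbol and the next
 * character is '#', report it as a zero-width marker without consuming
 * input; a grammar rule like seq($._hsc_hash, '#', ...) then consumes
 * the actual hash. */
static bool scan_hsc_hash(MockLexer *lexer, const bool *valid_symbols,
                          bool hsc_mode) {
    if (valid_symbols[HSC_HASH] && hsc_mode && lexer->input[0] == '#') {
        lexer->result_symbol = HSC_HASH;
        return true;
    }
    return false;
}
```

The key property is that the cond symbol is zero-width: it gates the rule in HSC mode without moving the lexer, so plain Haskell input is unaffected.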

Maybe the C grammar approach is easier, not sure. Feel free to ask more questions!

@tek (Owner) commented Dec 30, 2025

Oh btw we can also just merge this preliminary implementation and defer the more comprehensive one until later, if this is too much of an effort right now!

@sergv (Author) commented Jan 1, 2026

Yes, on second thought, let's defer the more general lexer approach. I'm a bit torn on whether I want structure or not. Some structure would definitely be cool, but supporting all forms of the #{const ...} directive would also be cool.

I think for now let's go with the current version; I'll revisit it once more complex #{const ...} constructs in the wild start breaking it.

@tek (Owner) commented Jan 1, 2026

Looks like you didn't remove enough parens there 🙂

Also remember to remove the useless hsc rule!

Finally, you'll probably want to duplicate the two upload steps in .github/workflows/assets.yml.

@sergv (Author) commented Jan 4, 2026

Upon further consideration I decided to actually implement a scanner that skips C expressions. It all went according to your directions and is now much more general, so I'm quite happy with the end result.

There may still be room for review suggestions, though, and I'm ready to address any you have. It took a while to fix all the tests involving hashes by duplicating all the string literals in all places; most of the challenge was identifying the relevant places.

The C scanner also got its share of #ifdefs, so if you have any ideas on what to improve there, I'd be glad to hear them.

@sergv (Author) commented Jan 4, 2026

NB this version still has some issues; namely, the following does not parse:

expType t = case t of
  Read  -> #{const FOO}
  Write -> 2

The closing brace ends the case layout, so the second alternative is not parsed correctly. I haven't figured this one out yet, but I will before the final submission.

Also, if you would prefer to create a new PR for new version then I can do that.
