Add grammar for hsc2hs dialect #20
Conversation
tek left a comment
Awesome! I don't have any requirements for the grammar, so aside from the pragma stuff in the scanner, all my comments are just recommendations.
> Maybe there's a way to re-use the existing C grammar - add it as a submodule and re-use some of its rules?
The usual approach for this is to use injection queries, but for this case it probably won't work – injections don't allow you to specify which rule to use as the root, so they would only succeed if the directive was followed by a top-level declaration.
It's definitely possible to import the C grammar as a node package though! We're already doing that with the Haskell grammar, but I have no clue how difficult that would be to integrate into non-npm package ecosystems, like nvim-treesitter or nixpkgs. You're welcome to experiment with that, would be interesting!
One more high-level suggestion: Since HSC appears to reserve the # operator exclusively, I would recommend changing the scanner and grammar so that it doesn't return LHash in HSC mode. I'd expect that to reduce friction quite a bit.
hsc/grammar.js
```javascript
precedences: ($, previous) => previous,
// Hook up hsc directives into Haskell grammar depending
// on their return type.
_stringly: ($, previous) => choice(
```
A general guideline for TS grammars is that we don't want to validate or typecheck the language, but rather recognize it leniently.
Here it seems like you're providing specialized nodes based on what type they have.
I understand that the arguments of the directive differ depending on the keyword, so that may be a justification for this approach.
If you care about that, I have no objection, but if you only want to avoid parse errors in HSC files, it may be easier to just parse (in the scanner) the basic structure mentioned in the docs 🙂:
> They extend up to the nearest unmatched ), ] or }, or to the end of line if it occurs outside any () [] {} '' "" /**/ and is not preceded by a backslash.
It would be good to at least know the name of a directive, but that's not strictly necessary for me. I'll try to check how it could be done in the scanner.
Going via the scanner looks like a pretty good idea because it will allow supporting #{const} nodes that contain arbitrary C expressions without having to parse them.
It would be very useful if you could point me to the place in the scanner where it should go.
I need to clarify that I initially thought you could avoid hooking into expression and type and so on, but I realize that that's probably necessary – the alternative I imagined was to use extras (which can occur anywhere), because my assumption is that HSC allows these constructs to begin at any point in the file. But that may be a misconception, and importantly, it would break parsing since extras don't replace the node that would be expected in their place, like the rhs expression of a binding.
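For reference, a minimal sketch of what that rejected alternative would have looked like (hsc_directive is an illustrative name, not a rule from the PR):

```javascript
// Hypothetical sketch of the rejected idea: adding the directive to
// `extras` lets it appear anywhere, but an extra never stands in for an
// expected node, so e.g. a binding's right-hand side would still error.
extras: ($, previous) => [...previous, $.hsc_directive],
```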
Though I think my suggestion of eagerly matching in the scanner still holds for the purpose of avoiding the C syntax nodes, if that's desirable.
Also I would assume that _stringly and _number are unnecessary, since they only occur as children of the others?
I went after _stringly and _number mostly to avoid enumerating all the top-level contexts where an hsc rule may occur, e.g. expression, pattern, type, maybe something else.
Once _number is extended, it automatically flows into all constructs that may contain numbers. That was the primary reason for introducing the complication of classifying directives by their return type.
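To make that mechanism concrete, a minimal sketch, assuming directive rule names like hsc_const and hsc_size (the actual names in the PR may differ):

```javascript
// Hypothetical sketch: extending the base grammar's hidden `_number` rule.
// Every construct that may contain a number now also accepts the numeric
// HSC directives, without enumerating those contexts by hand.
_number: ($, previous) => choice(
  previous,
  $.hsc_const, // e.g. (#const TCP_CONNECTIONTIMEOUT)
  $.hsc_size,  // e.g. (#size struct foo)
),
```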
hsc/grammar.js
```javascript
rules: {
supertypes: ($, previous) => previous,
// Collect all possible hsc directives under this node but don’t use it since
```
What's the purpose of this?
This is so that the different directives will be aliased under the same hsc node, for easier pattern matching by users.
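For concreteness, a minimal sketch of that aliasing, with illustrative directive names:

```javascript
// Hypothetical sketch: `alias(rule, $.hsc)` makes each directive appear in
// the tree as a named `hsc` node, so a query for (hsc) matches any of them.
_hsc_directive: $ => choice(
  alias($.hsc_type, $.hsc),
  alias($.hsc_peek, $.hsc),
  alias($.hsc_poke, $.hsc),
),
```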
As far as I can tell, this is neither possible nor necessary! Check out this branch I just pushed: https://github.com/tek/tree-sitter-haskell/tree/tek/hsc
I added two query tests to demonstrate. You can run them with:
```
test/query/run.bash hsc-good
test/query/run.bash hsc-bad
```
(Make sure to regenerate beforehand!)
I renamed the hsc node to hsc_stuff and queried for both in those tests.
You can see in hsc/src/node_types.json line 3178 that there's still a node type named hsc generated from the aliases, and the query for (hsc_stuff) fails.
Aside: have you noticed that the parser (hsc.so) is 5.9 MB, compared to 3.7 MB for the base parser? I wonder if that's because you're injecting the new nodes into those large choices, which are also supertypes... probably difficult to debug.
Ah no, it's probably because of the Unicode category regexes in hsc_c_identifier – these are insanely expensive.
Very interesting – these regexes were just copy-pasted from the C grammar definition. I think Unicode support is something that can be left for the next PR.
FYI you can probably just use something like choice($.variable, $.name), that should be close enough to C to cover most cases!
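Spelled out, that suggestion amounts to something like this (a sketch; it reuses the Haskell grammar's own identifier rules rather than C's):

```javascript
// Hypothetical sketch: approximate C identifiers with the Haskell
// grammar's existing identifier rules instead of Unicode-category regexes.
hsc_c_identifier: $ => choice($.variable, $.name),
```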
Actually, simplifying the regex to token(/[a-zA-Z_$][0-9a-zA-Z_$]*/) does not reduce the size for me.
As far as I understand, with Haskell we're sort of inheriting from it: we get all Haskell rules and can even override some of them with new stuff so that the overrides take effect in the unmodified parts of the Haskell grammar. With C I need only parts of its expression syntax, not the whole grammar. However, C's expression parsing is spread over a lot of rules. I can probably import C's grammar and refer to its rules, but when I import a rule I'll also need to add its dependencies to the grammar. For example, C's expression will reference something like …, so it seems I will have to trace rule dependencies manually and enumerate all the rules C expressions depend on by hand. Since there's only one namespace for grammar rules, there's also a possibility of name clashes, so imported C rules would need to be renamed. Taken together, these spell a bad story for grammar maintenance. Maybe there's no low-maintenance way to do what I'm after. Or perhaps fully parsing C expressions in those rare directives is overkill.

Never returning LHash would break parsing of the …. I managed to avoid conflicts in the current version because only the following new tokens were introduced: …
My naive assumption is that something like this could work:

```javascript
const haskell = require("../grammar.js")
const c = require("????")

module.exports = grammar(haskell, {
  name: 'hsc',
  rules: {
    ...c.rules,
    expression: ($, previous) => choice(
      // blah
    ),
  }
})
```

But of course this might horribly break because there are name clashes that would need to be handled manually. I could also imagine that you could just reference individual C rules. Luckily, the C grammar does not have a scanner, which I assume would otherwise be a showstopper 😅
The HSC docs say that the …

The reason for …

Regarding the scanner approach for consuming entire HSC fragments: my impulse would be to introduce a new rule along these lines:

```javascript
{
  hsc_no_brace: $ => seq($._hsc_hash /* this is the cond */, '#', $.hsc_kind, $.hsc_c)
}
```

For the scanner variant, …. For the grammar variant, you could let TS figure out brackets, but you'd have to check whether extras (like comments) need to be blocked in the scanner so they don't break it. Escaped newlines may also be quite annoying, but you could definitely use an auxiliary scanner rule to detect some stuff that's too hard in the grammar. This will definitely result in pretty messy (and potentially expensive) parse trees, though. Maybe the C grammar approach is easier, not sure. Feel free to ask more questions!
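As a sketch of that grammar variant (illustrative names; extras and escaped newlines left aside, per the caveats above):

```javascript
// Hypothetical sketch: let tree-sitter match balanced brackets in the
// directive body instead of parsing full C expressions.
hsc_c: $ => repeat1(choice(
  /[^()\[\]{}\n]+/,                  // any non-bracket, non-newline run
  seq('(', optional($.hsc_c), ')'),  // balanced parentheses
  seq('[', optional($.hsc_c), ']'),  // balanced square brackets
  seq('{', optional($.hsc_c), '}'),  // balanced braces
)),
```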
|
Oh btw we can also just merge this preliminary implementation and defer the more comprehensive one until later, if this is too much of an effort right now!
|
Yes, on second thought let's defer doing the more general thing in the lexer. I'm a bit split on whether I want to have structure or not. Some structure would definitely be cool, but supporting all forms of …. I think for now let's go with the current version and get back to it once more complex … show up in the wild.
|
Looks like you didn't remove enough parens there 🙂 Also remember to remove the useless …. Finally, you'll probably want to duplicate the two upload steps in the CI config.
|
Upon further consideration I decided to actually implement a scanner that skips C expressions. It all went according to your directions and is now much more general, so I'm quite happy with the end result. There may be some review suggestions, though, so I'm ready to fix things if you have any. It took a while to fix all the tests involving hashes by duplicating all the string literals in all places; most of the challenge was identifying the relevant places. The C scanner also got its share of ….
|
NB this version still has some issues; namely, the following does not parse: …. The end brace ends the case's layout and the second alternative is not parsed correctly. I haven't figured this one out yet but will before the final submission. Also, if you would prefer that I create a new PR for the new version, I can do that.
As discussed in #17.
The #type, #peek, #poke, #ptr, #offset, #size, and #alignment directives are implemented more or less completely, as far as I can tell. #enum is omitted since it's used very rarely – I failed to find any uses – but it could be added later relatively easily, judging from its grammar description.

The only directive that is implemented partially and requires some discussion is #const (and its identical #const_str cousin). The problem is that the #const directive takes any valid C expression. I supported only C expressions that refer to a name (e.g. (#const TCP_CONNECTIONTIMEOUT)); this already handles most uses of #const, but there are cases in the wild that cannot be parsed currently, e.g.: …

It would be good to support these, but it's not clear what the best way to do that is. One way would be to re-implement the C grammar, but it's pretty large and relatively complex for a feature of this magnitude, so that's likely not a good solution.
Maybe there's a way to re-use the existing C grammar - add it as a submodule and re-use some of its rules?
Another possibility would be to support arbitrary combinations of identifiers, binary operators, function calls like f(...), and nested brackets. We don't really care about the syntax tree of the C expression, just that the hsc grammar accepts anything that looks like #{const ...}, provided the nesting of parens, brackets, and braces within ... is OK. Do you think this is feasible?