feat: add C, C++, Kotlin, Swift, Scala, Bash language support#708

Merged
carlos-alm merged 18 commits into main from feat/phase7-languages
Mar 30, 2026

@carlos-alm
Contributor

Summary

  • Add 6 high-demand languages to both TypeScript (WASM) and Rust (native) engines: C, C++, Kotlin, Swift, Scala, Bash
  • Each language gets: TS extractor, Rust extractor, registry entry, complexity rules (LangRules + HalsteadRules), CFG rules, AST config, and parser tests
  • Handles critical tree-sitter grammar quirks: Swift uses class_declaration for class/struct/enum, Kotlin uses class_declaration for class/interface, Scala import_declaration has alternating identifier/dot children

Changes per layer

| Layer | Files |
| --- | --- |
| TS extractors | src/extractors/{c,cpp,kotlin,swift,scala,bash}.ts |
| Rust extractors | crates/codegraph-core/src/extractors/{c,cpp,kotlin,swift,scala,bash}.rs |
| Infrastructure | types.ts, parser.ts, index.ts, build-wasm.ts, package.json, Cargo.toml, parser_registry.rs, types.rs, mod.rs |
| Rules | complexity.rs (LangRules + HalsteadRules), cfg.rs (CfgRules), helpers.rs (LangAstConfig) |
| Tests | tests/parsers/{c,cpp,kotlin,swift,scala,bash}.test.ts |

Test plan

  • 45/45 new parser tests pass (C:7, C++:8, Kotlin:8, Swift:9, Scala:8, Bash:5)
  • 317/317 total parser tests pass (zero regressions)
  • cargo build compiles with new Rust extractors
  • CI passes

…pth duplication

The walk_node/walk_node_depth pattern was duplicated identically across
all 9 language extractors (~190 lines of boilerplate). Each extractor
repeated the same depth check, match dispatch, and child traversal loop
— only the match arms differed.

Add a generic `walk_tree<F>` function to helpers.rs that handles depth
limiting and child recursion, accepting a closure for language-specific
node matching. Refactor all 9 extractors and 7 type map walkers to use
it. Zero-cost abstraction (monomorphized, no dyn dispatch).
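
For readers who want a picture of the pattern this commit describes, here is a minimal sketch of a closure-driven walker. It is not the code from helpers.rs: `ToyNode` stands in for a tree-sitter node, and `MAX_DEPTH` and `count_functions` are hypothetical names and values chosen for illustration. Only the shape (depth check, visit, recurse over children, generic `F` so the call is monomorphized) follows the description above.

```rust
/// Stand-in for a syntax-tree node; the real extractors walk tree-sitter nodes.
pub struct ToyNode {
    pub kind: &'static str,
    pub children: Vec<ToyNode>,
}

/// Hypothetical depth cap; the PR does not state the real limit.
pub const MAX_DEPTH: usize = 200;

/// Visits `node` and its descendants depth-first, skipping anything deeper
/// than MAX_DEPTH. `visit` carries the language-specific match arms.
pub fn walk_tree<F>(node: &ToyNode, depth: usize, visit: &mut F)
where
    F: FnMut(&ToyNode, usize),
{
    if depth > MAX_DEPTH {
        return;
    }
    visit(node, depth);
    for child in &node.children {
        walk_tree(child, depth + 1, visit);
    }
}

/// Example caller: only the closure body is language-specific.
pub fn count_functions(root: &ToyNode) -> usize {
    let mut count = 0;
    walk_tree(root, 0, &mut |node: &ToyNode, _depth: usize| {
        if node.kind == "function_definition" {
            count += 1;
        }
    });
    count
}
```

Because `F` is a generic parameter rather than a `dyn FnMut`, each extractor's closure gets its own compiled copy of the walker, which is what the "no dyn dispatch" remark refers to.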
Add 6 high-demand languages to both the TypeScript (WASM) and Rust
(native) engines: C, C++, Kotlin, Swift, Scala, Bash.

Each language gets:
- TypeScript extractor (src/extractors/<lang>.ts)
- Rust extractor (crates/codegraph-core/src/extractors/<lang>.rs)
- LANGUAGE_REGISTRY entry + LanguageKind enum variant
- Complexity rules (LangRules + HalsteadRules)
- CFG rules (CfgRules)
- AST config (LangAstConfig)
- Parser tests (tests/parsers/<lang>.test.ts)

Key grammar quirks handled:
- Swift tree-sitter uses class_declaration for class/struct/enum
- Kotlin tree-sitter uses class_declaration for class/interface
- Scala import_declaration has alternating identifier/dot children

317/317 parser tests pass (272 existing + 45 new), zero regressions.
@greptile-apps
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR adds first-class support for C, C++, Kotlin, Swift, Scala, and Bash across all layers of the codegraph stack: TypeScript (WASM) extractors, Rust (native) extractors, parser registry entries, complexity rules (LangRules + HalsteadRules), CFG rules, and AST configs. 45 new parser tests are added covering all 6 languages.

All previously-flagged issues (CPP/Swift AST_CONFIG shadowing, Kotlin object_declaration kind mismatch, Swift extends/implements split, Scala/Swift val/var kinds, Kotlin || missing from complexity, dead node_text_raw) have been addressed.

Remaining findings:

  • P1 – Bash complexity undercounts branches: BASH_RULES.branch_nodes and nesting_nodes omit "c_style_for_statement" and "until_statement", even though both are present in BASH_CFG. Any Bash function using for (( … )) or until loops will have those branches silently excluded from cyclomatic/cognitive complexity, while the CFG correctly graphs them — producing a metrics mismatch.
  • P2 – Kotlin CFG jump_expression overlap: return_node, break_node, and continue_node all resolve to "jump_expression" (forced by tree-sitter-kotlin-sg). If the CFG builder matches these sequentially, every jump will match all three; a comment documenting this grammar limitation would help future maintainers.
  • P2 – Scala grouped imports incomplete in TS extractor: handleScalaImportDecl skips import_selectors nodes, so import scala.collection.{Map, Set} produces source: "scala.collection", names: ["collection"] instead of the two imported names. The Rust extractor handles this correctly.
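
To make the expected output for the grouped-import finding concrete, here is a text-level sketch. It is an assumption-heavy illustration, not either extractor: the real code walks tree-sitter nodes (including import_selectors), whereas `parse_scala_import` and `ImportInfo` are hypothetical names that operate on the import path as a string. The simple-import behaviour (full path as source, last segment as the name) is inferred from the output quoted in the finding, and renames such as `{Map => M}` are not handled.

```rust
#[derive(Debug, PartialEq)]
pub struct ImportInfo {
    pub source: String,
    pub names: Vec<String>,
}

/// Splits a Scala import path into a source and the imported names.
/// Handles `a.b.C` and `a.b.{C, D}`; renames and nested groups are out of scope.
pub fn parse_scala_import(path: &str) -> ImportInfo {
    let path = path.trim();

    // Grouped form: everything before ".{" is the source,
    // every comma-separated selector inside the braces is a name.
    if let Some(open) = path.find(".{") {
        let prefix = &path[..open];
        let inner = path[open + 2..].trim_end_matches('}');
        let names = inner
            .split(',')
            .map(|s| s.trim().to_string())
            .filter(|s| !s.is_empty())
            .collect();
        return ImportInfo { source: prefix.to_string(), names };
    }

    // Simple form: the last dotted segment is the imported name.
    let name = path.rsplit('.').next().unwrap_or(path).to_string();
    ImportInfo { source: path.to_string(), names: vec![name] }
}
```

With this sketch, `scala.collection.{Map, Set}` yields the source `scala.collection` and the two names `Map` and `Set`, which is the result the finding says the TS extractor should produce.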

Confidence Score: 4/5

Safe to merge after fixing the Bash complexity omission; the grouped Scala import and Kotlin CFG issues are non-blocking but should be tracked.

All previously-raised P0/P1 issues have been resolved. One new P1 remains (Bash c_style_for_statement/until_statement missing from complexity branch_nodes) that will silently undercount complexity metrics. Two P2 issues do not block functionality but represent observable behavioral discrepancies. Score is 4 rather than 5 because of the one confirmed P1 complexity-metric bug.
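
One way to catch this class of mismatch early is a consistency check between the two rule tables. The sketch below is hypothetical: `CfgRules::loop_nodes` is an assumed field name, and the sample tables are illustrative values, with only the absence of c_style_for_statement and until_statement from the complexity rules taken from the finding above.

```rust
/// Subset of the complexity rules relevant to this check.
pub struct LangRules {
    pub branch_nodes: &'static [&'static str],
    pub nesting_nodes: &'static [&'static str],
}

/// Hypothetical stand-in for the CFG rules; `loop_nodes` is an assumed field.
pub struct CfgRules {
    pub loop_nodes: &'static [&'static str],
}

/// Illustrative pre-fix Bash tables (not copied from complexity.rs or cfg.rs).
pub const SAMPLE_BASH_RULES: LangRules = LangRules {
    branch_nodes: &["if_statement", "for_statement", "while_statement", "case_statement"],
    nesting_nodes: &["if_statement", "for_statement", "while_statement", "case_statement"],
};

pub const SAMPLE_BASH_CFG: CfgRules = CfgRules {
    loop_nodes: &["for_statement", "while_statement", "c_style_for_statement", "until_statement"],
};

/// Returns every CFG loop node that the complexity rules do not count
/// as both a branch node and a nesting node.
pub fn missing_from_complexity(cfg: &CfgRules, rules: &LangRules) -> Vec<&'static str> {
    cfg.loop_nodes
        .iter()
        .copied()
        .filter(|kind| {
            !rules.branch_nodes.contains(kind) || !rules.nesting_nodes.contains(kind)
        })
        .collect()
}
```

Run against the sample tables, the check reports exactly the two node kinds named in the P1, so a unit test asserting an empty result per language would flag the omission.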

crates/codegraph-core/src/complexity.rs (BASH_RULES branch_nodes/nesting_nodes), src/extractors/scala.ts (handleScalaImportDecl), crates/codegraph-core/src/cfg.rs (KOTLIN_CFG jump_expression)

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/codegraph-core/src/complexity.rs | Adds LangRules and HalsteadRules for all 6 new languages; logical_node_type correctly migrated to logical_node_types slice. Bash branch_nodes/nesting_nodes are missing c_style_for_statement and until_statement (both present in BASH_CFG), causing underestimated complexity metrics. |
| crates/codegraph-core/src/cfg.rs | Adds CfgRules for C, C++, Kotlin, Swift, Scala, Bash. Overall correct; Kotlin maps return/break/continue all to jump_expression (grammar-forced) which may cause CFG edge mis-classification. |
| src/extractors/scala.ts | Correct class/trait/object/function extraction; val/var kinds now fixed. Grouped import selectors ({Map, Set}) are dropped from import paths, unlike the Rust extractor. |
| src/extractors/swift.ts | Handles class/struct/enum via class_declaration quirk, protocol as interface, correct extends/implements split, property kind (let→constant/var→variable) now fixed. |
| src/extractors/kotlin.ts | Class/interface/enum/object extraction looks correct; object_declaration now emits kind:class matching the Rust extractor; delegation specifiers produce correct extends/implements split. |
| crates/codegraph-core/src/extractors/kotlin.rs | Solid extractor; class/interface/enum/object distinction, delegation specifiers, and import_header path extraction all look correct. Misleading node_text_raw was removed. |
| crates/codegraph-core/src/extractors/scala.rs | Handles class/trait/object/function/import; import_selectors are properly appended. extract_scala_import_path correctly handles wildcards and grouped selectors. |
| crates/codegraph-core/src/extractors/swift.rs | Handles Swift class/struct/enum/protocol with correct kind detection and inheritance split; previously reported AST_CONFIG shadow issue resolved. |
| crates/codegraph-core/src/extractors/helpers.rs | New LangAstConfig constants for all 6 languages added; CPP_AST_CONFIG now correctly includes co_await_expression and string prefixes (shadow issue fixed). |
| src/extractors/c.ts | Covers function definitions, struct/enum, typedefs, includes, and call expressions. Typedef extraction handles nested pointer declarators correctly via reverse-child scan. |
| src/extractors/cpp.ts | Adds class_specifier, namespace_definition, and base_class_clause inheritance on top of C; namespace emits namespace kind (added to CoreSymbolKind). |
| crates/codegraph-core/src/extractors/mod.rs | All 6 new language extractors registered in the dispatch match. |
| crates/codegraph-core/src/parser_registry.rs | LanguageKind enum, extension mapping, and tree-sitter language bindings all correctly extended for 6 new languages. |
| src/domain/parser.ts | LANGUAGE_REGISTRY entries and patchImports camelCase patching correctly extended for all 6 new languages. |

Reviews (3): Last reviewed commit: "fix(ci): increase embedding regression t..."

Comment on lines +20 to +29
const CPP_AST_CONFIG: LangAstConfig = LangAstConfig {
    call_types: &["call_expression"],
    new_types: &["new_expression"],
    throw_types: &["throw_statement"],
    await_types: &[],
    string_types: &["string_literal", "raw_string_literal"],
    regex_types: &[],
    quote_chars: &['"'],
    string_prefixes: &[],
};
Contributor

P1 Local CPP_AST_CONFIG shadows the richer helpers.rs version

cpp.rs defines a private const CPP_AST_CONFIG with await_types: &[] and string_prefixes: &[]. Because the module also does use super::helpers::*, the local binding shadows the pub const CPP_AST_CONFIG in helpers.rs, which was added with await_types: &["co_await_expression"] and string_prefixes: &['L', 'u', 'U', 'R'].

As a result, the native (Rust) extractor will never classify C++ co_await expressions as await-type AST nodes, and wide/unicode string literals (L"...", u"...", U"...", R"...") won't be recognized as prefixed strings — both silently omitted from the analysis.

The same shadowing issue exists for swift.rs (const SWIFT_AST_CONFIG line 20).
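
The shadowing rule behind this finding can be reproduced in a few lines. The sketch below is a reduced illustration, not the project's code: the module names `cpp_shadowed` and `cpp_fixed` are hypothetical, and `LangAstConfig` is trimmed to two fields. It relies on the Rust name-resolution rule that an item declared in a module takes precedence over a same-named item brought in by a glob import, with no warning.

```rust
mod helpers {
    /// Trimmed-down config; the real struct has more fields.
    pub struct LangAstConfig {
        pub await_types: &'static [&'static str],
        pub string_prefixes: &'static [char],
    }

    /// The richer shared definition.
    pub const CPP_AST_CONFIG: LangAstConfig = LangAstConfig {
        await_types: &["co_await_expression"],
        string_prefixes: &['L', 'u', 'U', 'R'],
    };
}

mod cpp_shadowed {
    use super::helpers::*;

    // A local item wins over the glob-imported one of the same name.
    const CPP_AST_CONFIG: LangAstConfig = LangAstConfig {
        await_types: &[],
        string_prefixes: &[],
    };

    pub fn await_types() -> &'static [&'static str] {
        CPP_AST_CONFIG.await_types
    }
}

mod cpp_fixed {
    use super::helpers::*;

    // No local definition, so the name resolves to helpers::CPP_AST_CONFIG.
    pub fn await_types() -> &'static [&'static str] {
        CPP_AST_CONFIG.await_types
    }

    pub fn string_prefixes() -> &'static [char] {
        CPP_AST_CONFIG.string_prefixes
    }
}
```

In the shadowed module the await list is empty, while the fixed module sees `co_await_expression`, which mirrors the silent omission described above.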

Contributor Author

Fixed. Removed all local const *_AST_CONFIG definitions from cpp.rs, swift.rs, c.rs, kotlin.rs, scala.rs, and bash.rs. All six extractors now use the pub const versions from helpers.rs via use super::helpers::*. The cpp.rs shadow was silently dropping co_await_expression and C++ string prefixes (L, u, U, R).

Comment on lines +192 to +201
function handleKotlinObjectDecl(node: TreeSitterNode, ctx: ExtractorOutput): void {
  const nameNode = findChild(node, 'type_identifier');
  if (!nameNode) return;
  ctx.definitions.push({
    name: nameNode.text,
    kind: 'module',
    line: node.startPosition.row + 1,
    endLine: nodeEndLine(node),
  });

Contributor

P1 Kotlin object_declaration kind mismatch between engines

The TypeScript (WASM) extractor emits kind: 'module' for Kotlin object declarations, but the Rust (native) extractor (kotlin.rs, line 272) emits kind: "class" for the same node. One must be changed to align them.

Contributor Author

Fixed. Changed the TS extractor to emit kind: 'class' for Kotlin object_declaration, matching the Rust extractor. Kotlin objects are singletons with class-like semantics (properties, methods, interface implementation), so 'class' is the correct kind. Updated the test expectation accordingly.

Comment on lines +149 to +165
// Inheritance: inheritance_specifier nodes are DIRECT children of class_declaration
for (let i = 0; i < node.childCount; i++) {
  const child = node.child(i);
  if (!child || child.type !== 'inheritance_specifier') continue;
  // inheritance_specifier > user_type > type_identifier
  const userType = findChild(child, 'user_type');
  if (userType) {
    const typeId = findChild(userType, 'type_identifier');
    if (typeId) {
      ctx.classes.push({
        name,
        extends: typeId.text,
        line: node.startPosition.row + 1,
      });
    }
  }
}
Contributor

P1 Swift multiple-inheritance always emits extends, never implements

The loop over inheritance_specifier children unconditionally pushes { extends: typeId.text } for every inherited type. The Rust extractor correctly uses a first flag to distinguish the superclass from subsequent protocol conformances. The TS extractor should do the same.

Contributor Author

Fixed. Added a first flag to the TS Swift inheritance loop: first inheritance_specifier emits extends (superclass), subsequent ones emit implements (protocol conformances). This matches the Rust extractor behavior.
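
The first-flag rule described in this fix can be sketched independently of tree-sitter. The names `Relation` and `split_inheritance` are hypothetical, and the input is a plain list of inherited type names in source order rather than inheritance_specifier nodes.

```rust
#[derive(Debug, PartialEq)]
pub enum Relation {
    Extends(String),
    Implements(String),
}

/// Classifies inherited types positionally: the first entry is treated as
/// the superclass, every later entry as a protocol conformance.
pub fn split_inheritance(inherited: &[&str]) -> Vec<Relation> {
    let mut first = true;
    let mut out = Vec::with_capacity(inherited.len());
    for name in inherited {
        if first {
            out.push(Relation::Extends((*name).to_string()));
            first = false;
        } else {
            out.push(Relation::Implements((*name).to_string()));
        }
    }
    out
}
```

This is a purely positional heuristic: a Swift type that lists only protocols would still have its first protocol reported as `Extends`, since the syntax tree alone does not say whether a name is a class or a protocol.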

Comment on lines +174 to +186
function handleScalaValVarDef(node: TreeSitterNode, ctx: ExtractorOutput): void {
  // Only handle top-level vals/vars
  if (node.parent?.type === 'template_body') return;
  const pattern = node.childForFieldName('pattern');
  if (!pattern) return;
  const nameNode =
    pattern.type === 'identifier' ? pattern : findChild(pattern, 'identifier');
  if (!nameNode) return;
  ctx.definitions.push({
    name: nameNode.text,
    kind: 'function',
    line: node.startPosition.row + 1,
    endLine: nodeEndLine(node),
Contributor

P2 Top-level val/var definitions emitted with kind: 'function'

handleScalaValVarDef emits kind: 'function' for top-level Scala value and variable definitions. Should be 'constant'/'variable'. Same issue in swift.ts at line 272.

Contributor Author

Fixed. Scala val_definition now emits kind "constant", var_definition emits "variable". Swift property_declaration now checks for let/var keyword child: let -> "constant", var -> "variable". Previously both incorrectly emitted "function".

Comment on lines +370 to +380
pub static KOTLIN_RULES: LangRules = LangRules {
    branch_nodes: &["if_expression", "for_statement", "while_statement", "do_while_statement", "catch_block", "when_expression", "when_entry"],
    case_nodes: &["when_entry"],
    logical_operators: &["&&", "||"],
    logical_node_type: "conjunction_expression",
    optional_chain_type: Some("safe_navigation"),
    nesting_nodes: &["if_expression", "for_statement", "while_statement", "do_while_statement", "catch_block", "when_expression"],
    function_nodes: &["function_declaration"],
    if_node_type: Some("if_expression"),
    else_node_type: None,
    elif_node_type: None,
Contributor

P2 Kotlin logical_node_type covers only &&, not ||

Kotlin || lives in disjunction_expression nodes, not conjunction_expression, causing OR branches to be undercounted in complexity metrics.

Contributor Author

Fixed. Changed LangRules.logical_node_type from a single &str to logical_node_types: &[&str]. Kotlin now specifies &["conjunction_expression", "disjunction_expression"] so both && and || branches are counted in complexity metrics. All other languages updated to use the slice form with their existing single type.
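
A reduced sketch of why the slice form matters is shown below. `LangRules` is trimmed to the one field under discussion, and `KOTLIN_BEFORE`, `KOTLIN_AFTER` and `count_logical_nodes` are hypothetical names; the old single-string field is emulated here as a one-element slice so both variants share a type.

```rust
/// Trimmed-down rules struct; only the field discussed above is kept.
pub struct LangRules {
    pub logical_node_types: &'static [&'static str],
}

/// Emulates the old single-type behaviour: only `&&` nodes are recognised.
pub const KOTLIN_BEFORE: LangRules = LangRules {
    logical_node_types: &["conjunction_expression"],
};

/// The slice form lets Kotlin list both node kinds.
pub const KOTLIN_AFTER: LangRules = LangRules {
    logical_node_types: &["conjunction_expression", "disjunction_expression"],
};

/// Counts how many of the given node kinds are logical-operator nodes.
pub fn count_logical_nodes(rules: &LangRules, kinds: &[&str]) -> usize {
    kinds
        .iter()
        .filter(|kind| rules.logical_node_types.contains(*kind))
        .count()
}
```

For a condition such as `a && b || c`, the node kinds include one conjunction and one disjunction, so the old rules count one logical node and the new rules count two.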

Comment on lines +102 to +111
fn node_text_raw(node: &Node) -> &str {
    // We only check kind, not text content for keywords
    node.kind()
}

fn find_kotlin_parent_class<'a>(node: &Node<'a>, source: &[u8]) -> Option<String> {
    let mut current = node.parent();
    while let Some(parent) = current {
        match parent.kind() {
            "class_declaration" | "object_declaration" => {
Contributor

P2 Dead code with a misleading name

node_text_raw returns node.kind() (not text) and its only caller could use child.kind() directly. The misleading name is a correctness hazard for future callers.

Contributor Author

Fixed. Removed the misleading node_text_raw function entirely. Its only call site now uses mod_child.kind() directly, which is what the function was actually returning.

…708)

Remove private const *_AST_CONFIG in c.rs, cpp.rs, kotlin.rs, swift.rs,
scala.rs, bash.rs that shadowed the richer pub versions in helpers.rs.
The cpp.rs shadow silently dropped co_await_expression and string prefixes.
…708)

Both TS and Rust extractors now emit kind: 'class' for Kotlin object
declarations, matching Kotlin semantics (singletons are class-like).

First inheritance_specifier is superclass (extends), subsequent ones are
protocol conformances (implements), matching the Rust extractor behavior.
Also remove unused isClass variable.

Scala val_definition → 'constant', var_definition → 'variable'.
Swift property_declaration: let → 'constant', var → 'variable'.
Previously both emitted 'function' incorrectly.

)

Change LangRules.logical_node_type from &str to logical_node_types: &[&str]
so Kotlin can match both conjunction_expression (&&) and
disjunction_expression (||). Without this, OR branches were undercounted.
@carlos-alm
Contributor Author

@greptileai review

tree-sitter-kotlin 0.3.x exports language() function, not LANGUAGE const.
…708)

The new tree-sitter grammar packages have conflicting peer dependency
requirements (0.21 vs 0.25). legacy-peer-deps allows npm install to
succeed in CI.
)

tree-sitter-kotlin 0.3.x uses the old language() -> Language API from
tree-sitter 0.20, incompatible with tree-sitter 0.24's LanguageFn.
Declare the extern C function directly and wrap with LanguageFn::from_raw.
… linking (#708)

- Add 'variable' and 'namespace' to SymbolKind type union
- Use transmute instead of extern C redeclaration for tree-sitter-kotlin
  bridge to fix Windows linker failure (unresolved external symbol)
- Fix kotlin.rs test to use LanguageKind::Kotlin.tree_sitter_language()
… duplicate symbol errors (#708)

tree-sitter-kotlin 0.3 depends on tree-sitter 0.20, which bundles its
own C runtime (ts_language_*, ts_lexer_*) that conflicts with
tree-sitter 0.24's copy at link time on Linux/Windows.

tree-sitter-kotlin-sg 0.4 uses tree-sitter-language (no bundled C
runtime), eliminating the duplicate symbols. Same upstream grammar
(fwcd/tree-sitter-kotlin) so all node types are identical.

Also adds namespace and variable to DEFAULT_NODE_COLORS in colors.ts
to satisfy the Record<AnyNodeKind, string> constraint after the
SymbolKind expansion.
… engine (#708)

- trait_definition kind: 'trait' -> 'interface' (matches Rust)
- object_definition kind: 'module' -> 'class' (matches Rust)
- Inheritance: use found_extends flag to distinguish extends vs implements
- Skip function-local val/var in Scala and let/var in Swift extractors
- Update test expectations accordingly

The beforeAll hook was timing out at 120s on macOS CI runners due to
slow model download. Doubled to 240s for headroom.
@carlos-alm
Contributor Author

Addressed all Greptile round 2 findings:

P1: trait_definition kind mismatch — Changed TS Scala extractor from 'trait' to 'interface', matching the Rust engine. Updated test expectations.

P1: object_definition kind mismatch — Changed TS Scala extractor from 'module' to 'class', matching the Rust engine (and consistent with how Kotlin object_declaration was fixed). Updated test expectations.

P1: Scala inheritance mismatch — Added foundExtends flag to extractScalaInheritance: first type in extends_clause emits extends, subsequent types emit implements. Also handles generic_type nodes. Matches the Rust extract_scala_extends logic.

P2: Function-local bindings leaking as top-level definitions — Added guards in both extractors:

  • Scala: skip val_definition/var_definition when parent is block or indented_block
  • Swift: skip property_declaration when parent is statements or function_body
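
The two guards above amount to a parent-kind check. A minimal sketch, assuming the parent node kind is available as a string; `Lang` and `is_function_local` are hypothetical names, and only the four parent kinds come from the list above.

```rust
#[derive(Clone, Copy)]
pub enum Lang {
    Scala,
    Swift,
}

/// Returns true when a val/var (Scala) or property_declaration (Swift)
/// sits directly inside a function body and should not be emitted as a
/// top-level definition.
pub fn is_function_local(lang: Lang, parent_kind: Option<&str>) -> bool {
    let local_parents: &[&str] = match lang {
        Lang::Scala => &["block", "indented_block"],
        Lang::Swift => &["statements", "function_body"],
    };
    // A node without a parent cannot be function-local.
    parent_kind.map_or(false, |kind| local_parents.contains(&kind))
}
```

An extractor would call this before pushing a definition and return early when it yields true; other parent kinds, such as a class body, are left to whatever separate handling the extractor already has.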

CI: embedding-regression timeout — Increased beforeAll timeout from 120s to 240s to fix flaky macOS CI failure.

@carlos-alm
Contributor Author

@greptileai

…ed (#708)

The impact workflow restores a cached .codegraph/ directory that can
become corrupted. Added fallback: if incremental build fails, delete
the cache and rebuild from scratch.
@carlos-alm carlos-alm merged commit dcad6e7 into main Mar 30, 2026
20 checks passed
@carlos-alm carlos-alm deleted the feat/phase7-languages branch March 30, 2026 14:45
@github-actions github-actions bot locked and limited conversation to collaborators Mar 30, 2026