
Conversation


@alpe alpe commented Jan 6, 2026

Replaces #2836

alpe added 28 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...

github-actions bot commented Jan 6, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Jan 15, 2026, 5:00 PM


claude bot commented Jan 6, 2026

Claude finished @alpe's task in 3m 3s.


Comprehensive Code Review: RAFT High Availability Implementation

I've completed a thorough code review of this PR introducing RAFT consensus for high availability. This is a significant architectural change (80 files, 4270 additions, 588 deletions) that adds leader election, state replication, and failover capabilities. Below is my detailed feedback organized by severity and category.


🔴 Critical Issues

1. Bootstrap Logic Prevents Non-Bootstrap Nodes from Starting ⚠️

Location: pkg/raft/node.go:112-115

The current implementation requires ALL nodes to be in bootstrap mode, which breaks the ability to add new nodes to an existing cluster:

if !n.config.Bootstrap {
    return fmt.Errorf("raft cluster requires bootstrap mode")
}

Issue: A node joining an existing cluster must start with Bootstrap=false, yet this check rejects exactly that configuration, so the new node fails to start.

Impact: This makes the cluster non-extensible and breaks standard RAFT cluster expansion patterns.

Fix: Remove this check or make it conditional. Bootstrap logic should only execute when Bootstrap=true, and the function should return nil for non-bootstrap nodes to allow them to start and join via AddPeer.
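
A minimal sketch of that shape, assuming a hashicorp/raft handle on the node and a servers slice built from the configured peers (both assumptions, not the actual code):

if n.config.Bootstrap {
    n.logger.Info().Msg("bootstrapping raft cluster")
    cfg := raft.Configuration{Servers: servers} // built from config.Peers
    if err := n.raft.BootstrapCluster(cfg).Error(); err != nil {
        return fmt.Errorf("bootstrap raft cluster: %w", err)
    }
}
// Non-bootstrap nodes start normally and join later via AddPeer/AddVoter.
return nil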



2. Resource Leak in Executor Block Production ⚠️

Location: block/internal/executing/executor.go:351-353

Early returns in produceBlock() don't assign the function-scoped err variable, so the deferred batch-discard logic never runs for those paths:

if e.raftNode != nil && !e.raftNode.IsLeader() {
    return errors.New("not raft leader")  // err variable not set!
}

The deferred cleanup checks the named err variable, but this return leaves it nil, so the batch is never discarded and leaks.

Impact: Over time, this could lead to resource exhaustion or database corruption due to undiscarded batches.

Fix: Assign to err before returning:

if e.raftNode != nil && !e.raftNode.IsLeader() {
    err = errors.New("not raft leader")
    return err
}

This applies to ALL early returns in lines 351-517 that don't use the function-scoped err variable.
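
For context, a minimal sketch of the pattern the deferred cleanup relies on (an assumed shape of produceBlock, not the actual executor code): the discard only fires when the named return value err is non-nil.

func (e *Executor) produceBlock() (err error) {
    batch := e.store.NewBatch() // assumed helper
    defer func() {
        if err != nil {
            batch.Discard() // skipped for any early return that leaves err nil
        }
    }()
    if e.raftNode != nil && !e.raftNode.IsLeader() {
        err = errors.New("not raft leader") // assign so the defer sees it
        return err
    }
    // ... remaining block production ...
    return nil
}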



3. Error Propagation Failures in Node Initialization ⚠️

Location: node/full.go:113-119

Naked return statements swallow errors from newSingleRoleElector:

if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
    return  // Returns (nil, nil) - ERROR IS LOST!
}

Impact: Initialization errors are silently masked, making debugging extremely difficult. The function returns nil, nil instead of propagating the error.

Fix:

if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
    return nil, err
}

This issue exists at both line 113 and line 117.



4. Inconsistent Error Aggregation in Config Validation ⚠️

Location: pkg/config/config.go:298-300

The NodeID check assigns directly to multiErr instead of joining, overwriting any error already collected:

if c.NodeID == "" {
    multiErr = fmt.Errorf("node ID is required")  // Direct assignment!
}
if c.RaftAddr == "" {
    multiErr = errors.Join(multiErr, fmt.Errorf(...))  // Join subsequent errors
}

Impact: Any validation error already collected in multiErr before this check is silently overwritten, so callers may see an incomplete set of failures.

Fix: Use errors.Join consistently for all validations:

if c.NodeID == "" {
    multiErr = errors.Join(multiErr, fmt.Errorf("node ID is required"))
}
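
A self-contained illustration of the joined-error pattern (a hypothetical helper, not the actual config code): errors.Join discards nil arguments, so it is safe to chain for every check.

func validateRaft(nodeID, raftAddr string) error {
    var multiErr error
    if nodeID == "" {
        multiErr = errors.Join(multiErr, errors.New("node ID is required"))
    }
    if raftAddr == "" {
        multiErr = errors.Join(multiErr, errors.New("raft address is required"))
    }
    return multiErr // nil only when every check passed
}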



🟡 High Priority Issues

5. Race Condition in Leadership Transition ⚠️

Location: pkg/raft/election.go:98-126

The leadership transition logic has a potential race between checking waitForMsgsLanded and checking leaderID:

raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if d.node.leaderID() != d.node.NodeID() {
    d.logger.Info().Msg("lost leadership during sync wait")
    continue
}

Issue: Leadership can change again after the leaderID check passes but before the node acts on the result, so the node may proceed as leader with stale state.

Impact: Could lead to double-signing if the node proceeds as leader with stale state after actually losing leadership.

Recommendation: Add an additional leadership check immediately before attempting recovery or leadership transfer. Consider making the entire check-and-transition atomic.
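
A hedged sketch of that extra check, following the names in the snippet above (the surrounding loop structure is an assumption):

raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
// Re-check leadership immediately before acting on the synced state so a
// change during the wait cannot promote a stale leader.
if !raftSynced || d.node.leaderID() != d.node.NodeID() {
    d.logger.Info().Msg("lost leadership or raft sync incomplete, retrying")
    continue
}
// proceed with recovery / leadership transfer while still the leader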


6. Channel Buffer Size Could Cause Blocking ⚠️

Location: block/internal/syncing/syncer.go:181 and pkg/raft/node.go:62

The heightInCh has a buffer of 1,000 events, and the raft applyCh has buffer of 1. Under high load or slow processing, these could fill up.

heightInCh: make(chan common.DAHeightEvent, 1_000),

Issue: When full, the raft FSM drops messages with only a warning log (node.go:298). This is by design for "slow consumers," but could mask serious problems.

Recommendation:

  • Add metrics for dropped RAFT messages
  • Consider alerting when drop rate exceeds threshold
  • Document the backpressure strategy clearly
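
A sketch of the first recommendation, assuming the Prometheus client library (github.com/prometheus/client_golang/prometheus) is available; the metric name and the surrounding select are illustrative, not the actual node.go code:

var raftDroppedMsgs = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "evnode_raft_apply_dropped_total",
    Help: "RAFT FSM messages dropped because the apply channel was full.",
})
// register via prometheus.MustRegister(raftDroppedMsgs) at startup

// In the FSM apply path, count drops instead of only logging them:
select {
case n.applyCh <- msg:
default:
    raftDroppedMsgs.Inc()
    n.logger.Warn().Msg("dropping raft message: slow consumer")
}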

7. Typo in Bootstrap Log Message

Location: pkg/raft/node.go:122

n.logger.Info().Msg("Boostrap raft cluster")  // "Boostrap" should be "Bootstrap"

Minor but worth fixing for professionalism.



🟢 Medium Priority Issues

8. Naming Convention Violation for Initialisms

Location: pkg/rpc/server/http.go:143-146

Go convention requires initialisms like "ID" to be uppercase:

NodeId string `json:"node_id"`  // Should be NodeID

Fix:

NodeID string `json:"node_id"`

9. Debug Log at Wrong Level

Location: block/internal/syncing/syncer.go:191

s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight)...
    .Msg("received raft block state")

The +++ prefix pattern typically indicates debugging output that shouldn't be at Info level. Good that it's Debug, but remove the +++ prefix for consistency.


10. Misleading Method Name

Location: block/internal/cache/pending_headers.go:69-71

The method GetLastSubmittedDataHeight() is misleading because it's part of the PendingHeaders struct:

func (ph *PendingHeaders) GetLastSubmittedDataHeight() uint64 {
    return ph.base.getLastSubmittedHeight()
}

Fix: Rename to GetLastSubmittedHeaderHeight() for clarity. Update call sites accordingly.


11. Redundant Type Conversion

Location: block/internal/executing/executor.go:570-572

for _, tx := range batchData.Transactions {
    data.Txs = append(data.Txs, types.Tx(tx))
}

Since types.Tx is an alias for []byte, this could be simplified to:

data.Txs = append(data.Txs, batchData.Transactions...)

More efficient and clearer.


📊 Test Coverage Concerns

The Codecov report shows 37.78% patch coverage with 634 lines missing coverage. Key areas:

  1. pkg/raft/node.go: 12.12% coverage (174 missing lines) - Core RAFT node logic barely tested
  2. pkg/raft/node_mock.go: 45.40% coverage - Even mocks lack coverage
  3. block/internal/syncing/syncer.go: 13.08% coverage (93 missing lines) - Sync logic undertested
  4. block/internal/syncing/raft_retriever.go: 0% coverage (63 missing lines) - NO TESTS!

Critical Gaps:

  • No tests for RAFT block state recovery (Recover methods)
  • Insufficient testing of leadership transition edge cases
  • Missing tests for error paths in RAFT message application
  • No tests for concurrent leadership changes during block production

Recommendation: Prioritize adding tests for:

  1. RAFT retriever loop and error handling
  2. Leadership transition during block production
  3. Recovery from split-brain scenarios
  4. Snapshot restore functionality

🏗️ Architecture & Design

Positive Aspects:

  1. Clean separation of concerns: RAFT logic isolated in pkg/raft/
  2. Interface-based design: Runnable, Recoverable, RaftNode interfaces promote testability
  3. Graceful degradation: Non-RAFT mode preserved via singleRoleElector
  4. E2E testing: failover_e2e_test.go demonstrates end-to-end validation
  5. State synchronization: RAFT state includes DA submission heights for consistency

Design Concerns:

  1. ⚠️ Bootstrap-only limitation: Current code only supports bootstrap mode
  2. ⚠️ Split-brain risk: Recovery logic needs more edge case handling
  3. ⚠️ Resource management: Several resource leak opportunities in error paths
  4. ⚠️ Observability gaps: Limited metrics for RAFT operations (dropped messages, leadership changes, sync lag)

🔐 Security Considerations

No Critical Security Issues Found, but note:

  1. Peer validation: splitPeerAddr() skips address validation (line 369 in node.go: "we can skip address validation as they come from a local configuration"). This is acceptable for local config but ensure configuration is properly validated at ingestion.

  2. RAFT communication: Uses TCP transport without TLS. For production:

    • Document that RAFT addresses should be on private networks
    • Consider adding TLS support for cross-region deployments
  3. Signature validation: RAFT-received blocks validate proposer address (raft_retriever.go:119-122), which is good. Maintain this for security.


🔧 Error Handling & Resilience

Good Practices Observed:

  • ✅ Context cancellation properly propagated
  • ✅ Graceful shutdown with timeouts (failover.go:177-185)
  • ✅ Error wrapping for context (fmt.Errorf("...: %w", err))

Issues:


🔄 Concurrency & Resource Management

Concerns:

  1. Goroutine lifecycle: election.go:78-86 spawns workers with proper cleanup, but verify no goroutine leaks under error conditions

  2. Channel buffer sizes: Document rationale for buffer sizes (1,000 for heightInCh seems arbitrary)

  3. Atomic operations: Good use of atomic.Pointer and atomic.Bool for lock-free state access

  4. Lock contention: Minimal mutex usage - good! Uses channels and atomics appropriately.

  5. Resource cleanup: Deferred cleanup in failoverState.Run() is well-structured


📝 Configuration & Documentation

Configuration:

  • ✅ Well-structured RaftConfig with validation
  • ✅ Sensible defaults in pkg/config/defaults.go
  • ⚠️ Missing documentation on operational requirements (network topology, disk space for logs, etc.)

Code Documentation:

  • ✅ Good function-level comments
  • ✅ Interface contracts documented
  • ⚠️ Missing package-level documentation for pkg/raft
  • ⚠️ Could benefit from more inline comments in complex leader election logic

🎯 Priority Recommendations

Must Fix Before Merge:

  1. Fix bootstrap logic to allow non-bootstrap nodes (Critical #1)
  2. Fix resource leaks in executor (Critical #2)
  3. Fix error propagation in node initialization (Critical #3)
  4. Fix error aggregation in config validation (Critical #4)

Should Fix:

  1. Address leadership transition race condition (High #5)
  2. Add metrics for RAFT message drops (High #6)
  3. Improve test coverage, especially for raft_retriever (Test Coverage)

Nice to Have:

  1. Fix minor naming issues (Medium #8-11)
  2. Add package-level documentation
  3. Document operational requirements and runbooks

✅ Strengths of This Implementation

Despite the issues identified, this is a well-architected RAFT integration:

  • Modular design with clear interfaces
  • Comprehensive E2E testing demonstrates real-world failover scenarios
  • Proper state synchronization including DA heights
  • Graceful degradation to non-RAFT mode
  • Production-ready error handling patterns in most places
  • Good separation of leader vs. follower responsibilities

The critical issues are fixable and don't represent fundamental design flaws.


📌 Summary

This PR successfully implements RAFT-based high availability for ev-node. The architecture is sound, but there are 4 critical issues that must be fixed before merge, primarily around:

  • Bootstrap mode restrictions
  • Resource management
  • Error propagation

Test coverage (37.78%) is concerning for such critical infrastructure. Prioritize adding tests for error paths and edge cases in RAFT operations.

Overall Assessment: Approve with required changes. The foundation is excellent, but the critical issues need resolution before production deployment.


Tasks

  • Read repository guidelines (CLAUDE.md)
  • Review core RAFT implementation (pkg/raft/*)
  • Review node failover mechanism (node/failover.go, node/full.go)
  • Review block processing changes for RAFT integration
  • Review error handling and resilience patterns
  • Review concurrency and resource management
  • Review security considerations
  • Review test coverage and quality
  • Review configuration and observability
  • Address issues from previous reviews (Gemini feedback)
  • Compile final review feedback


Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Rollkit network's resilience and fault tolerance by integrating the Raft consensus algorithm. The changes enable a robust failover mechanism, allowing nodes to seamlessly transition between leader and follower roles. This ensures continuous operation and consistent state across the network, even if individual nodes experience failures. The update touches various core components, from dependency management and block processing to RPC services and node lifecycle management, all aimed at achieving high availability.

Highlights

  • Raft Consensus Integration: Introduced Raft consensus for high availability, enabling dynamic leader election and state replication. This involves new dependencies on github.com/hashicorp/raft and github.com/hashicorp/raft-boltdb.
  • Node Architecture Refactoring: The node's core logic has been refactored to support a failover mechanism, allowing nodes to dynamically switch between aggregator (leader) and sync (follower) modes based on Raft's leadership status. This includes changes to how P2P clients are initialized and passed to node startup commands.
  • Block Processing with Raft: The block executor now checks Raft leadership before producing new blocks and proposes the block state to the Raft cluster after execution, ensuring state consistency across the cluster.
  • Cache and State Management Enhancements: Added methods to retrieve and persist the last submitted DA header and data heights from the cache. Logic was introduced to skip in-flight (temporary) headers and data during cache retrieval to prevent processing incomplete blocks.
  • New RPC Endpoints and Configuration: Added new RPC endpoints for querying Raft node status (e.g., /raft/node) and introduced new configuration flags for Raft-specific settings like node_id, raft_addr, bootstrap, and peers.
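
For illustration, querying the new status endpoint from Go might look like the sketch below (the base URL and the is_leader JSON tag are assumptions; the /raft/node path and node_id field come from this PR; standard net/http and encoding/json imports assumed):

resp, err := http.Get("http://localhost:7331/raft/node") // host/port assumed
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()
var status struct {
    IsLeader bool   `json:"is_leader"` // tag assumed
    NodeID   string `json:"node_id"`
}
if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
    log.Fatal(err)
}
fmt.Printf("leader=%v node=%s\n", status.IsLeader, status.NodeID)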



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces high availability via RAFT consensus, which is a significant architectural change. The implementation is extensive, touching many parts of the system from node startup to block processing and syncing. Overall, the changes are well-structured, introducing clear interfaces for RAFT integration and refactoring the node components to support dynamic role switching between leader and follower. I've identified a few critical issues related to error handling and startup logic that need to be addressed, along with some medium-severity issues for improving code clarity and maintainability. Great work on this complex feature.

I am having trouble creating individual review comments. Click here to see my feedback.

node/full.go (113-115)

high

The naked return here will not propagate the error from newSingleRoleElector. This will cause the function to return (nil, nil), masking the underlying error. The error should be returned to the caller.

		if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
			return nil, err
		}

node/full.go (117-119)

high

Similar to the previous case, the naked return here will swallow the error from newSingleRoleElector. The error should be propagated up to the caller.

		if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
			return nil, err
		}

pkg/raft/node.go (111-113)

high

This check prevents a node from starting if Bootstrap is false, which is problematic for nodes joining an existing cluster. A new node attempting to join will fail to start. The bootstrap logic should only execute if n.config.Bootstrap is true, and the function should return nil otherwise, allowing non-bootstrap nodes to start and join a cluster.

block/internal/cache/pending_headers.go (69-71)

medium

The method name GetLastSubmittedDataHeight is misleading as it's part of the PendingHeaders struct. For clarity and consistency, it should be renamed to GetLastSubmittedHeaderHeight.

This change will also require updating the call site in block/internal/cache/manager.go.

func (ph *PendingHeaders) GetLastSubmittedHeaderHeight() uint64 {
	return ph.base.getLastSubmittedHeight()
}

block/internal/executing/executor.go (570-572)

medium

The explicit type conversion types.Tx(tx) is redundant since types.Tx is an alias for []byte, and tx is already of type []byte. The change to a direct assignment is good, but it seems this loop could be replaced with a single, more efficient append call.

	data.Txs = append(data.Txs, batchData.Transactions...)

block/internal/syncing/syncer.go (184)

medium

This log message seems to be for debugging purposes, indicated by the +++ prefix. It should be logged at the Debug level instead of Info to avoid cluttering the logs in a production environment.

			s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight).Uint64("data_height", state.LastSubmittedDADataHeight).Msg("received raft block state")

pkg/rpc/server/http.go (143-146)

medium

To adhere to Go's naming conventions for initialisms, the struct field NodeId should be renamed to NodeID.

				NodeID   string `json:"node_id"`
			}{
				IsLeader: raftNode.IsLeader(),
				NodeID:   raftNode.NodeID(),


codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 37.65978% with 634 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.49%. Comparing base (52825bf) to head (dfa2204).

Files with missing lines Patch % Lines
pkg/raft/node.go 12.12% 174 Missing ⚠️
pkg/raft/node_mock.go 45.40% 74 Missing and 21 partials ⚠️
block/internal/syncing/syncer.go 11.42% 91 Missing and 2 partials ⚠️
block/internal/syncing/raft_retriever.go 0.00% 63 Missing ⚠️
node/full.go 32.30% 37 Missing and 7 partials ⚠️
block/internal/executing/executor.go 4.54% 38 Missing and 4 partials ⚠️
node/failover.go 72.34% 26 Missing and 13 partials ⚠️
pkg/raft/election.go 78.72% 14 Missing and 6 partials ⚠️
pkg/rpc/server/http.go 6.66% 13 Missing and 1 partial ⚠️
block/components.go 27.27% 7 Missing and 1 partial ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2954      +/-   ##
==========================================
- Coverage   58.77%   56.49%   -2.28%     
==========================================
  Files         101      108       +7     
  Lines        9685    10527     +842     
==========================================
+ Hits         5692     5947     +255     
- Misses       3381     3930     +549     
- Partials      612      650      +38     
Flag Coverage Δ
combined 56.49% <37.65%> (-2.28%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

alpe added 9 commits January 7, 2026 16:11
* main:
  feat(tracing): add tracing to EngineClient (#2959)
  chore: inject W3C headers into engine client and eth client (#2958)
  feat: adding tracing for Executor and added initial configuration (#2957)
* main:
  fix: Fail fast when executor ahead (#2966)
  feat(block): async epoch fetching (#2952)
  perf: tune badger defaults and add db bench (#2950)

alpe commented Jan 13, 2026

@gemini-code-assist review the latest code


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces high availability through RAFT consensus, which is a significant and complex change. The overall architecture, including the use of a leader election mechanism and the separation of roles into leader (aggregator) and follower (sync) modes, is well-designed. The code is modular, leveraging interfaces and factories which is great for testability and maintenance. I've identified a critical issue related to resource management and a medium-severity issue in configuration validation. My feedback focuses on ensuring correctness and robustness, especially in error handling paths.

Comment on lines +337 to +339
if e.raftNode != nil && !e.raftNode.IsLeader() {
return errors.New("not raft leader")
}

high

This early return, and others in produceBlock, can cause a resource leak. The function uses a defer to discard a database batch on error, but this return statement doesn't set the err variable that the defer closure checks. To ensure the batch is discarded, you should assign the error to the function-scoped err variable before returning. This applies to all early returns in this function, including the new one for raft broadcast failure.

Suggested change

- if e.raftNode != nil && !e.raftNode.IsLeader() {
-     return errors.New("not raft leader")
- }
+ if e.raftNode != nil && !e.raftNode.IsLeader() {
+     err = errors.New("not raft leader")
+     return err
+ }

Comment on lines +298 to +300
if c.NodeID == "" {
multiErr = fmt.Errorf("node ID is required")
}

medium

The first validation error is assigned directly to multiErr, while subsequent errors are joined using errors.Join. This can lead to the first error being overwritten if it's not the only one. For consistency and to ensure all validation errors are collected and reported correctly, you should use errors.Join for all error aggregations.

Suggested change

- if c.NodeID == "" {
-     multiErr = fmt.Errorf("node ID is required")
- }
+ if c.NodeID == "" {
+     multiErr = errors.Join(multiErr, fmt.Errorf("node ID is required"))
+ }

* main:
  chore: adding syncing tracing (#2981)
  feat(tracing): adding block production tracing (#2980)
  feat(tracing): Add Store, P2P and Config tracing (#2972)
  chore: fix upgrade test (#2979)
  build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974)
  feat(tracing): adding tracing to DA client (#2968)
  chore: create onboarding skill  (#2971)
  test: add e2e tests for force inclusion (part 2) (#2970)
  feat(tracing): adding eth client tracing (#2960)
  test: add e2e tests for force inclusion (#2964)
  build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)