feat: Implement vector search and vectorization features for course transcripts #48

Merged
webdevcody merged 5 commits into main from vector-embeddings on Jan 1, 2026
Conversation

Owner

@webdevcody webdevcody commented Jan 1, 2026

feat: Implement vector search and vectorization features for course transcripts

  • Updated Docker configuration to use pgvector for PostgreSQL.
  • Added new endpoints and UI components for vector search and vectorization in the admin panel.
  • Implemented chunking and embedding generation for video transcripts.
  • Created database schema for storing transcript chunks and embeddings.
  • Developed use cases for vectorizing segments and searching transcripts using embeddings.
  • Introduced admin-only access for vector search functionality.
  • Enhanced pricing configuration and updated related components for consistency.

Note

Introduces vector support and schema for semantic transcript search.

  • Switches Postgres images in docker-compose.yml and GitHub Actions to pgvector/pgvector:pg17
  • Migration enables vector extension and creates app_transcript_chunk with vector(1536) embeddings, indexes, and FK to app_segment
  • Follow-up migration standardizes many FK columns to integer and relaxes select nullability constraints

Written by Cursor Bugbot for commit e22e0f1. This will update automatically on new commits.

Summary by CodeRabbit

  • New Features

    • Admin semantic transcript search and Vectorization dashboard to generate, monitor, and run embeddings
    • Dynamic pricing display updated (current price now $249; original price shown as $299)
  • Chores

    • Added embedding/tokenizer and OpenAI integration with batching/retries, transcript chunking, and API key handling
    • Database migrations and pgvector-enabled DB image to store and search vectorized transcript chunks
  • Other

    • New admin routes and UI for vector search and vectorization controls


@coderabbitai

coderabbitai bot commented Jan 1, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds pgvector-backed transcript chunk storage and migrations, embedding generation and chunking utilities, data-access and use-case orchestration for vectorization/search, admin routes and UIs with server functions, env and dependency additions, and minor pricing config updates.

Changes

Cohort / File(s) Summary
Database & Migrations
docker-compose.yml, .github/workflows/test.yml, drizzle/0050_glossy_infant_terrible.sql, drizzle/0051_lowly_songbird.sql, drizzle/meta/_journal.json, src/db/schema.ts
Switched Postgres images to pgvector/pgvector:pg17; added pgvector extension and new transcript_chunk table (vector(1536) embedding) with indexes and relations; large ALTER TABLE migration converting many serial FK columns to integer and updating relations; journal entries appended.
Chunking & Embeddings
src/lib/chunking.ts, src/lib/openai.ts, src/utils/env.ts, package.json, src/utils/openai.ts
New transcript chunking (TARGET_CHUNK_SIZE=500, OVERLAP_SIZE=50); OpenAI embedding helpers with batching, retries, and rich errors; env OPENAI_API_KEY added and client switched to use it; dependencies gpt-tokenizer and openai added.
Schema & Data Access
src/db/schema.ts, src/data-access/transcript-chunks.ts
New transcriptChunks table/type and relations exported; new data-access module with create/delete/get/count/search (pgvector similarity) and SearchResult type.
Use Cases & Server Functions
src/use-cases/vector-search.ts, src/fn/vector-search.ts
New use-case layer: vectorize per-segment and bulk, embedding-backed search, and status reporting; admin-protected server functions wired to these use-cases.
Admin UI, Routes & Nav
src/routes/admin/vectorization.tsx, src/routes/admin/vector-search.tsx, src/routes/admin/-components/admin-nav.tsx, src/routeTree.gen.ts
New admin pages/routes for Vectorization and Vector Search, route tree updated, and admin nav extended with entries for both tools.
Frontend Pricing
src/config.ts, src/routes/-components/pricing.tsx
PRICING_CONFIG values updated (CURRENT_PRICE 199→249, ORIGINAL_PRICE 297→299); pricing UI now reads from config.
Routing / Add-segment Flow
src/routes/learn/-components/add-segment/server-functions.ts, src/routes/learn/-components/add-segment/use-add-segment.ts
Added validateSlugFn (admin middleware) and invoked it pre-upload in add-segment flow to ensure slug uniqueness.
Docs & Prompt
vector-search-prompt.md
New design/implementation prompt describing vector search architecture, migrations, chunking, and admin UI requirements.
Tests / CI
.github/workflows/test.yml
Postgres test container image switched to pgvector/pgvector:pg17.
Minor UI Logic
src/routes/learn/-components/module-panel.tsx, src/routes/learn/-components/module-accordion-header.tsx
Segment completion logic updated to consult local segment context, prioritizing locally uncompleted segments.
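The transcript chunking described in the table above (TARGET_CHUNK_SIZE=500 tokens with OVERLAP_SIZE=50) boils down to a sliding token window. A minimal sketch under stated assumptions, not the PR's actual code: the real src/lib/chunking.ts tokenizes with gpt-tokenizer's encode/decode, while here tokens are modeled as plain numbers and chunkTokens is an invented name.

```typescript
// Hypothetical sketch of overlap chunking; illustrative only.
const TARGET_CHUNK_SIZE = 500; // tokens per chunk (from the walkthrough)
const OVERLAP_SIZE = 50;       // tokens shared between adjacent chunks

function chunkTokens(
  tokens: number[],
  chunkSize: number = TARGET_CHUNK_SIZE,
  overlap: number = OVERLAP_SIZE
): number[][] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunk size");
  if (tokens.length === 0) return [];
  const chunks: number[][] = [];
  const step = chunkSize - overlap; // how far the window advances each iteration
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // final window reached the end
  }
  return chunks;
}
```

Each chunk would then be decoded back to text and embedded; the overlap keeps sentences that straddle a chunk boundary searchable from either side.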

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Admin as Admin UI
    participant SF as Server Fn
    participant UC as Use Case
    participant DA as Data Access
    participant OpenAI as OpenAI API
    participant DB as PostgreSQL (pgvector)

    Admin->>SF: POST /vectorize { segmentId }
    SF->>UC: vectorizeSegmentUseCase(segmentId)
    UC->>DA: fetch segment + transcript
    DA->>DB: SELECT segment + transcript
    DB-->>DA: segment record
    DA-->>UC: segment + transcript

    alt transcript exists
        UC->>UC: chunkTranscript()
        UC->>OpenAI: generateEmbeddings(batch of chunk texts)
        OpenAI-->>UC: embeddings[]
        UC->>DA: deleteChunksBySegmentId(segmentId)
        DA->>DB: DELETE FROM transcript_chunk WHERE segmentId=...
        UC->>DA: createTranscriptChunks(chunks with embeddings)
        DA->>DB: INSERT ... embedding vector
        DB-->>DA: insert result
        DA-->>UC: created count
        UC-->>SF: success { chunkCount }
    else no transcript
        UC-->>SF: error
    end
    SF-->>Admin: response
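The generateEmbeddings step in the diagram above sends chunk texts in batches and retries on failure, per the walkthrough's description of src/lib/openai.ts. A hedged sketch of that pattern with the OpenAI client abstracted behind an embed callback; all names and retry parameters here are illustrative assumptions, not the PR's actual code:

```typescript
// Hypothetical batching-with-retries sketch; the `embed` callback stands in
// for openai.embeddings.create. Names and constants are invented.
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 500;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function embedBatchWithRetry(
  texts: string[],
  embed: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      const embeddings = await embed(texts);
      // Guard against the provider returning fewer vectors than inputs.
      if (embeddings.length !== texts.length) {
        throw new Error(
          `Embedding count mismatch: expected ${texts.length}, got ${embeddings.length}`
        );
      }
      return embeddings;
    } catch (error) {
      lastError = error;
      await sleep(BASE_DELAY_MS * 2 ** attempt); // exponential backoff
    }
  }
  throw lastError;
}
```

The length check mirrors the defensive check suggested in the review comments below; the use case would call a helper like this once per batch-sized slice of chunks.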
sequenceDiagram
    autonumber
    participant Admin as Admin UI
    participant SF as Server Fn
    participant UC as Use Case
    participant OpenAI as OpenAI API
    participant DA as Data Access
    participant DB as PostgreSQL (pgvector)

    Admin->>SF: GET /search?q=...
    SF->>UC: searchTranscriptsUseCase(query, limit)
    UC->>OpenAI: generateEmbedding(query)
    OpenAI-->>UC: embedding
    UC->>DA: searchByEmbedding(embedding, limit)
    DA->>DB: SELECT ... ORDER BY embedding <=> query_embedding LIMIT N
    rect rgb(220,240,220)
        note over DB: pgvector cosine distance search
    end
    DB-->>DA: ranked rows with segment/module
    DA-->>UC: SearchResult[]
    UC-->>SF: results
    SF-->>Admin: results (snippets + scores)
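The ORDER BY step above is where pgvector ranks chunks by distance to the query embedding (note: in pgvector, `<=>` is cosine distance and `<->` is Euclidean). An in-memory TypeScript equivalent of that ranking, with invented names, purely to illustrate what the database computes:

```typescript
// Hypothetical in-memory illustration of cosine-distance ranking; the PR's
// data-access layer pushes this into SQL so the vector index can be used.
type Row = { id: number; embedding: number[] };

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // cosine distance = 1 - cosine similarity
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function searchByEmbeddingInMemory(rows: Row[], query: number[], limit: number): Row[] {
  return [...rows]
    .sort((r1, r2) => cosineDistance(r1.embedding, query) - cosineDistance(r2.embedding, query))
    .slice(0, limit); // smallest distance = most similar
}
```

Scanning rows in application code like this defeats the point of pgvector; it only shows the semantics of `ORDER BY embedding <=> query LIMIT N`.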

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐇 I hop through chunks of token light,

embeddings hum and vectors bite,
postgres cradles each little piece,
admins click to make searches cease,
I nibble answers — bright and neat. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: The title clearly summarizes the main change: implementing vector search and vectorization features for course transcripts, which is the core focus of this PR.



@cursor cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


Comment thread vector-search-prompt.md

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (6)
src/routes/learn/-components/add-segment/server-functions.ts (1)

93-103: Consider documenting the intentional validation duplication.

The slug validation logic is duplicated between validateSlugFn (lines 97-100) and createSegmentFn (lines 30-34). While this duplication serves a valid purpose—validateSlugFn provides early feedback before expensive video uploads, and createSegmentFn guards against race conditions—it would be helpful to add comments explaining why both checks are necessary.

🔎 Suggested documentation
 export const validateSlugFn = createServerFn()
   .middleware([adminMiddleware])
   .inputValidator(z.object({ slug: z.string() }))
   .handler(async ({ data }) => {
+    // Early validation to provide fast feedback before expensive video upload.
+    // Note: createSegmentFn re-validates to prevent race conditions.
     if (await isSlugInUse(data.slug)) {
       throw new Error(
         `The slug "${data.slug}" is already in use. Please choose a different slug.`
       );
     }
     return { valid: true };
   });
src/lib/openai.ts (1)

10-16: Consider adding input validation.

Adding validation for empty strings and excessively long text could prevent unnecessary API calls and provide clearer error messages.

🔎 Suggested input validation
 export async function generateEmbedding(text: string): Promise<number[]> {
+  if (!text || text.trim().length === 0) {
+    throw new Error('Text input cannot be empty');
+  }
+  
   try {
     const response = await openai.embeddings.create({
       model: EMBEDDING_MODEL,
       input: text,
     });
     return response.data[0].embedding;
   } catch (error) {
     console.error('Failed to generate embedding:', error);
     throw new Error(
       `Embedding generation failed: ${error instanceof Error ? error.message : 'Unknown error'}`
     );
   }
 }
src/lib/chunking.ts (1)

12-57: Consider adding error handling for tokenization.

The encode and decode functions from gpt-tokenizer could potentially fail on malformed input. Adding try-catch would make the function more robust.

🔎 Suggested error handling
 export function chunkTranscript(transcript: string): Chunk[] {
   if (!transcript || transcript.trim().length === 0) {
     return [];
   }
 
+  try {
     const tokens = encode(transcript);
     const chunks: Chunk[] = [];
 
     // ... rest of the logic ...
 
     return chunks;
+  } catch (error) {
+    console.error('Failed to chunk transcript:', error);
+    throw new Error(
+      `Transcript chunking failed: ${error instanceof Error ? error.message : 'Unknown error'}`
+    );
+  }
 }
src/routes/admin/vectorization.tsx (1)

137-145: Consider optimizing module lookup with a Map.

The current implementation uses .find() for each segment to locate its module by title, resulting in O(n×m) complexity. For larger datasets, consider building a Map<string, number> for O(1) lookups.

🔎 Proposed optimization
+  const moduleTitleToId = new Map(modules.map((m) => [m.title, m.id]));
+
   segments.forEach((segment) => {
-    const moduleId = modules.find((m) => m.title === segment.moduleTitle)?.id;
+    const moduleId = moduleTitleToId.get(segment.moduleTitle);
     if (moduleId) {
       const moduleData = segmentsByModule.get(moduleId);
       if (moduleData) {
         moduleData.segments.push(segment);
       }
     }
   });
src/use-cases/vector-search.ts (1)

47-61: Consider defensive check for embeddings array length.

The code assumes generateEmbeddings(texts) returns an array with the same length as texts. If the OpenAI API returns fewer embeddings (e.g., due to filtering or errors), accessing embeddings[j] could yield undefined, resulting in chunks stored with invalid embeddings.

🔎 Proposed defensive check
   for (let i = 0; i < chunks.length; i += EMBEDDING_BATCH_SIZE) {
     const batch = chunks.slice(i, i + EMBEDDING_BATCH_SIZE);
     const texts = batch.map((c) => c.text);
     const embeddings = await generateEmbeddings(texts);

+    if (embeddings.length !== batch.length) {
+      throw new Error(
+        `Embedding count mismatch: expected ${batch.length}, got ${embeddings.length}`
+      );
+    }
+
     for (let j = 0; j < batch.length; j++) {
       allChunksWithEmbeddings.push({
src/data-access/transcript-chunks.ts (1)

70-86: Hardcoded table names may cause maintenance issues.

The raw SQL uses hardcoded table names (app_transcript_chunk, app_segment, app_module) which are derived from the PREFIX constant in schema.ts. If the prefix changes, this query will break silently.

🔎 Consider using template literals with table references

You could extract table name references or use Drizzle's getTableName() utility to keep names in sync:

import { getTableName } from "drizzle-orm";

// Then in the query:
const tcTable = getTableName(transcriptChunks);
const sTable = getTableName(segments);
const mTable = getTableName(modules);

Alternatively, document the dependency on the prefix or add a test that verifies the table names match.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50079b2 and 78c76ef.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (21)
  • docker-compose.yml
  • drizzle/0050_glossy_infant_terrible.sql
  • drizzle/meta/0050_snapshot.json
  • drizzle/meta/_journal.json
  • package.json
  • src/config.ts
  • src/data-access/transcript-chunks.ts
  • src/db/schema.ts
  • src/fn/vector-search.ts
  • src/lib/chunking.ts
  • src/lib/openai.ts
  • src/routeTree.gen.ts
  • src/routes/-components/pricing.tsx
  • src/routes/admin/-components/admin-nav.tsx
  • src/routes/admin/vector-search.tsx
  • src/routes/admin/vectorization.tsx
  • src/routes/learn/-components/add-segment/server-functions.ts
  • src/routes/learn/-components/add-segment/use-add-segment.ts
  • src/use-cases/vector-search.ts
  • src/utils/env.ts
  • vector-search-prompt.md
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: When invoking a TanStack server function, pass an object with a data property containing parameters
All card components should use the shadcn Card component with CardTitle, CardDescription, and other shadcn Card subcomponents
All server-side operations should use TanStack Start server functions created with createServerFn instead of direct backend logic
Form handling should use React Hook Form with Zod validation

**/*.{ts,tsx}: Never hard code magic numbers into code; consolidate them to the top of the file or in /src/config/index.ts to keep the code clean
Never allow a file to exceed 1,000 lines of code; split overly large files into smaller modular components to maintain code maintainability

Files:

  • src/utils/env.ts
  • src/routes/learn/-components/add-segment/server-functions.ts
  • src/routes/admin/-components/admin-nav.tsx
  • src/fn/vector-search.ts
  • src/lib/openai.ts
  • src/routes/admin/vectorization.tsx
  • src/use-cases/vector-search.ts
  • src/data-access/transcript-chunks.ts
  • src/routes/admin/vector-search.tsx
  • src/config.ts
  • src/routes/-components/pricing.tsx
  • src/routeTree.gen.ts
  • src/routes/learn/-components/add-segment/use-add-segment.ts
  • src/lib/chunking.ts
  • src/db/schema.ts
**/{components,routes}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Component styling should follow shadcn/ui patterns with Tailwind CSS v4

Files:

  • src/routes/learn/-components/add-segment/server-functions.ts
  • src/routes/admin/-components/admin-nav.tsx
  • src/routes/admin/vectorization.tsx
  • src/routes/admin/vector-search.tsx
  • src/routes/-components/pricing.tsx
  • src/routes/learn/-components/add-segment/use-add-segment.ts
**/routes/admin/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

When making an admin page, protect it using assertIsAdminFn() in the beforeLoad hook with createFileRoute

Files:

  • src/routes/admin/-components/admin-nav.tsx
  • src/routes/admin/vectorization.tsx
  • src/routes/admin/vector-search.tsx
**/routes/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

**/routes/**/*.tsx: Pages should use the Page component and PageHeader component when possible
Implement error boundaries at the route level using DefaultCatchBoundary

Files:

  • src/routes/admin/-components/admin-nav.tsx
  • src/routes/admin/vectorization.tsx
  • src/routes/admin/vector-search.tsx
  • src/routes/-components/pricing.tsx
**/{fn,functions}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/{fn,functions}/**/*.{ts,tsx}: When making a TanStack server function that requires authentication, use .middleware([authenticatedMiddleware])
When making a TanStack server function that requires admin-only permission, use .middleware([adminMiddleware])
When making a TanStack server function with optional authentication, use .middleware([unauthenticatedMiddleware])

Files:

  • src/fn/vector-search.ts
src/fn/**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/tanstack-server-functions.mdc)

src/fn/**/*.ts: When creating a tanstack start server function, always include a middleware function call. Existing middleware functions are available in src/lib/auth.ts
When using a server function, always attach a validator to it
When using server functions, never import drizzle related objects inside the function handler. Server functions should always invoke use cases based on the layered architecture pattern. Reference docs/technical/layered-architecture.md for layered architecture guidelines

Files:

  • src/fn/vector-search.ts
**/data-access/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use the Repository pattern (defined in /src/data-access/) for database query functions with Drizzle ORM

Files:

  • src/data-access/transcript-chunks.ts
🧠 Learnings (11)
📚 Learning: 2025-12-22T03:59:58.018Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: .cursor/rules/tanstack-server-functions.mdc:0-0
Timestamp: 2025-12-22T03:59:58.018Z
Learning: Applies to src/fn/**/*.ts : When using a server function, always attach a validator to it

Applied to files:

  • src/routes/learn/-components/add-segment/server-functions.ts
  • src/fn/vector-search.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/{fn,functions}/**/*.{ts,tsx} : When making a TanStack server function that requires admin-only permission, use `.middleware([adminMiddleware])`

Applied to files:

  • src/routes/learn/-components/add-segment/server-functions.ts
  • src/fn/vector-search.ts
📚 Learning: 2025-12-22T03:59:58.018Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: .cursor/rules/tanstack-server-functions.mdc:0-0
Timestamp: 2025-12-22T03:59:58.018Z
Learning: Applies to src/fn/**/*.ts : When creating a tanstack start server function, always include a middleware function call. Existing middleware functions are available in `src/lib/auth.ts`

Applied to files:

  • src/routes/learn/-components/add-segment/server-functions.ts
  • src/fn/vector-search.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/*.{ts,tsx} : All server-side operations should use TanStack Start server functions created with `createServerFn` instead of direct backend logic

Applied to files:

  • src/fn/vector-search.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/{fn,functions}/**/*.{ts,tsx} : When making a TanStack server function with optional authentication, use `.middleware([unauthenticatedMiddleware])`

Applied to files:

  • src/fn/vector-search.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/{fn,functions}/**/*.{ts,tsx} : When making a TanStack server function that requires authentication, use `.middleware([authenticatedMiddleware])`

Applied to files:

  • src/fn/vector-search.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/admin/**/*.tsx : When making an admin page, protect it using `assertIsAdminFn()` in the `beforeLoad` hook with `createFileRoute`

Applied to files:

  • src/routes/admin/vectorization.tsx
  • src/routes/admin/vector-search.tsx
  • src/routeTree.gen.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/**/*.tsx : Pages should use the `Page` component and `PageHeader` component when possible

Applied to files:

  • src/routes/admin/vectorization.tsx
  • src/routes/admin/vector-search.tsx
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Routes are auto-generated to `/src/routeTree.gen.ts` and should never be edited manually

Applied to files:

  • src/routes/admin/vector-search.tsx
  • src/routeTree.gen.ts
📚 Learning: 2025-12-19T06:04:47.031Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: .cursor/rules/clean-code.mdc:0-0
Timestamp: 2025-12-19T06:04:47.031Z
Learning: Applies to **/*.{ts,tsx} : Never hard code magic numbers into code; consolidate them to the top of the file or in `/src/config/index.ts` to keep the code clean

Applied to files:

  • src/config.ts
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/*.{ts,tsx} : Form handling should use React Hook Form with Zod validation

Applied to files:

  • src/routes/learn/-components/add-segment/use-add-segment.ts
🧬 Code graph analysis (10)
src/routes/learn/-components/add-segment/server-functions.ts (1)
src/data-access/segments.ts (1)
  • isSlugInUse (43-51)
src/routes/admin/-components/admin-nav.tsx (1)
src/db/index.ts (1)
  • Database (31-31)
src/fn/vector-search.ts (1)
src/use-cases/vector-search.ts (4)
  • vectorizeSegmentUseCase (16-70)
  • vectorizeAllSegmentsUseCase (72-98)
  • searchTranscriptsUseCase (100-110)
  • getVectorizationStatusUseCase (136-171)
src/lib/openai.ts (1)
src/utils/env.ts (1)
  • env (27-47)
src/routes/admin/vectorization.tsx (2)
src/fn/vector-search.ts (3)
  • getVectorizationStatusFn (31-35)
  • vectorizeAllSegmentsFn (18-22)
  • vectorizeSegmentFn (11-16)
src/db/schema.ts (2)
  • modules (115-126)
  • segments (128-153)
src/use-cases/vector-search.ts (5)
src/data-access/segments.ts (2)
  • getSegmentById (52-59)
  • getSegments (6-8)
src/data-access/transcript-chunks.ts (6)
  • deleteChunksBySegmentId (11-16)
  • createTranscriptChunks (6-9)
  • SearchResult (51-60)
  • searchByEmbedding (62-89)
  • getChunkCountBySegmentIds (26-39)
  • getTotalChunkCount (41-49)
src/lib/chunking.ts (1)
  • chunkTranscript (12-57)
src/lib/openai.ts (2)
  • generateEmbeddings (18-27)
  • generateEmbedding (10-16)
src/use-cases/modules.ts (1)
  • getModulesWithSegmentsUseCase (37-45)
src/data-access/transcript-chunks.ts (2)
src/db/schema.ts (2)
  • TranscriptChunkCreate (1202-1202)
  • transcriptChunks (176-197)
src/db/index.ts (1)
  • database (41-41)
src/routes/admin/vector-search.tsx (5)
src/fn/vector-search.ts (1)
  • searchTranscriptsFn (24-29)
src/routes/admin/-components/page.tsx (1)
  • Page (7-21)
src/components/ui/button.tsx (1)
  • Button (59-59)
src/components/ui/card.tsx (5)
  • Card (85-85)
  • CardHeader (86-86)
  • CardTitle (88-88)
  • CardDescription (90-90)
  • CardContent (91-91)
src/components/ui/badge.tsx (1)
  • Badge (38-38)
src/routes/-components/pricing.tsx (1)
src/config.ts (1)
  • PRICING_CONFIG (4-18)
src/routes/learn/-components/add-segment/use-add-segment.ts (1)
src/routes/learn/-components/add-segment/server-functions.ts (1)
  • validateSlugFn (93-103)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Playwright Tests
🔇 Additional comments (26)
src/routes/learn/-components/add-segment/use-add-segment.ts (1)

25-27: Good UX improvement with early slug validation.

Validating the slug before initiating the video upload is a smart optimization that prevents users from wasting time on large uploads only to discover a slug conflict. The server-side validation in createSegmentFn still guards against race conditions, making this a safe change.

Optional: Enhance the comment to mention race condition handling
-      // Validate slug before uploading video to avoid wasting time on large uploads
+      // Validate slug before uploading video to avoid wasting time on large uploads.
+      // createSegmentFn re-validates to handle race conditions.
       await validateSlugFn({ data: { slug: values.slug } });
package.json (1)

78-78: The packages specified are already at their latest versions: gpt-tokenizer@3.4.0 and openai@6.15.0 are the current releases with no known vulnerabilities. No action needed.

Likely an incorrect or invalid review comment.

docker-compose.yml (1)

3-3: pgvector:pg17 is a stable, officially supported choice compatible with the current stack.

The pgvector/pgvector:pg17 image (v0.8.1) is an official Docker tag. Drizzle-orm 0.44.3 explicitly supports the pgvector extension with PostgreSQL 17, and pg 8.16.3 is compatible. No changes needed. Note: pgvector does not publish alpine variants, so pg17-alpine is not available.

Also applies to: 15-15

Likely an incorrect or invalid review comment.

src/routes/admin/-components/admin-nav.tsx (2)

19-20: LGTM! Icons imported correctly.

The Database and Search icons are appropriate for the vectorization and vector search features.


60-71: Add assertIsAdminFn() protection to admin routes.

The routes /admin/vectorization and /admin/vector-search are missing the required beforeLoad hook with assertIsAdminFn(). Update both route files to include:

export const Route = createFileRoute("/admin/vectorization")({
  beforeLoad: assertIsAdminFn(),
  component: AdminVectorization,
});
⛔ Skipped due to learnings
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/admin/**/*.tsx : When making an admin page, protect it using `assertIsAdminFn()` in the `beforeLoad` hook with `createFileRoute`
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/{fn,functions}/**/*.{ts,tsx} : When making a TanStack server function that requires admin-only permission, use `.middleware([adminMiddleware])`
Learnt from: amoscicki
Repo: webdevcody/agentic-jumpstart PR: 46
File: src/fn/dev/middleware.ts:3-8
Timestamp: 2025-12-26T04:51:53.750Z
Learning: In this codebase, dev-mode protection uses different strategies by context: server functions (using DevGuardMiddleware) throw errors that are caught by error boundaries, while route-level protection (like assertDevModeFn in beforeLoad) redirects to homepage for better UX. This prevents users from seeing error pages when accidentally accessing dev routes in production.
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/**/*.tsx : Pages should use the `Page` component and `PageHeader` component when possible
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/**/*.tsx : Implement error boundaries at the route level using `DefaultCatchBoundary`
src/routes/-components/pricing.tsx (2)

9-9: LGTM! Centralized pricing configuration imported.

This follows the coding guideline to consolidate magic numbers into configuration files.


71-86: Excellent refactoring to eliminate hardcoded pricing values.

The implementation correctly uses:

  • FORMATTED_ORIGINAL_PRICE for the strike-through display
  • DISCOUNT_PERCENTAGE for the savings label
  • FORMATTED_CURRENT_PRICE for the main price

This improves maintainability by centralizing all pricing logic in ~/config.ts, and the discount percentage will automatically update when prices change.

src/config.ts (1)

5-6: Note the significant pricing and discount changes.

The price update reflects:

  • Current price increased from $199 to $249 (+25%)
  • Original price increased from $297 to $299
  • Discount reduced from 33% to 17%

This is a major pricing change that will be reflected across all pricing displays. Ensure this aligns with business objectives and that stakeholders are aware of the reduced discount percentage.

drizzle/0050_glossy_infant_terrible.sql (3)

1-2: LGTM! pgvector extension enabled correctly.

The pgvector extension is properly enabled with IF NOT EXISTS to support vector similarity search.


9-9: LGTM! Correct embedding dimension for OpenAI.

The vector(1536) dimension matches OpenAI's text-embedding-ada-002 and text-embedding-3-small models. The nullable constraint allows for asynchronous embedding generation.


15-17: LGTM! Appropriate constraints and indexes.

The foreign key with ON DELETE cascade ensures data integrity, and the indexes on (segmentId) and (segmentId, chunkIndex) will optimize common query patterns for segment-based lookups and ordered chunk retrieval.

vector-search-prompt.md (1)

1-63: LGTM! Well-structured implementation prompt.

This documentation clearly outlines the vector search implementation requirements and aligns well with the actual implementation in this PR. The prompt covers all key aspects: architecture, chunking strategy, admin workflows, and search functionality.

src/lib/chunking.ts (1)

3-4: LGTM! Constants properly defined.

The chunk size and overlap constants are appropriately defined at the top of the file, following the coding guidelines to avoid magic numbers.

Based on coding guidelines: "Never hard code magic numbers into code; consolidate them to the top of the file."

src/fn/vector-search.ts (1)

11-35: LGTM! Server functions properly implemented.

All four server functions correctly follow the coding guidelines:

  • ✓ Use createServerFn for server-side operations
  • ✓ Apply adminMiddleware for admin-only access
  • ✓ Include input validators where needed
  • ✓ Delegate to use-cases following layered architecture
  • ✓ Access parameters through data property

Based on coding guidelines and learnings: Server functions properly use middleware, validators, and delegate to use-cases.

src/routeTree.gen.ts (1)

1-1677: Auto-generated file – no review required.

This file is automatically generated by TanStack Router and should not be manually edited or reviewed for code quality issues.

Based on learnings: "Routes are auto-generated to /src/routeTree.gen.ts and should never be edited manually."

src/lib/openai.ts (1)

8-8: The choice of text-embedding-3-small is appropriate and well-documented.

OpenAI's text-embedding-3-small (1536 dimensions, 8,192 token context) is actively maintained and recommended as the optimal cost-to-quality choice for production embedding workloads. Dimensions are standard, rate limits are tiered and documented, and there are no deprecation concerns. Consider text-embedding-3-large only if your use case requires maximum accuracy and your budget allows for the higher cost ($0.13 vs $0.02 per 1M tokens).


src/routes/admin/vector-search.tsx (1)

29-172: Well-structured component implementation.

The component correctly uses Page and PageHeader components, follows shadcn/ui patterns with proper Card composition, handles loading/empty states gracefully, and correctly invokes the server function with the { data: {...} } pattern.

src/routes/admin/vectorization.tsx (1)

44-267: Solid implementation with proper patterns.

The component correctly uses shadcn Card components, handles mutations with toast feedback, invalidates queries appropriately, and manages processing state for individual segments. Good use of Page and PageHeader components.

src/use-cases/vector-search.ts (3)

16-45: Clean segment vectorization implementation.

The use-case properly validates the segment, deletes existing chunks for idempotency, chunks the transcript, and handles the empty chunks case. Good separation of concerns.


72-98: Good error isolation in bulk vectorization.

The bulk operation correctly isolates errors per segment, allowing successful segments to complete while collecting errors for reporting. This is a resilient pattern for batch operations.
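The per-segment isolation pattern described here can be sketched as follows; the names (vectorizeAll, SegmentError) are illustrative, not the PR's actual API.

```typescript
// Collect per-item errors instead of aborting the whole batch on first failure.
interface SegmentError {
  segmentId: number;
  message: string;
}

async function vectorizeAll(
  segmentIds: number[],
  vectorizeSegment: (id: number) => Promise<void>
): Promise<{ succeeded: number[]; errors: SegmentError[] }> {
  const succeeded: number[] = [];
  const errors: SegmentError[] = [];
  for (const segmentId of segmentIds) {
    try {
      await vectorizeSegment(segmentId); // one failure must not abort the rest
      succeeded.push(segmentId);
    } catch (e) {
      errors.push({
        segmentId,
        message: e instanceof Error ? e.message : String(e),
      });
    }
  }
  return { succeeded, errors };
}
```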


100-110: Simple and effective search implementation.

The early return for empty queries and delegation to data-access layer follows good patterns.
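For context, similarity search over pgvector embeddings typically ranks by cosine distance (pgvector's `<=>` operator, which equals 1 − cosine similarity). A minimal sketch of the underlying metric:

```typescript
// Cosine similarity between two embedding vectors; pgvector's `<=>` operator
// computes the corresponding cosine distance (1 - similarity) in SQL.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```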

src/db/schema.ts (3)

17-29: Well-implemented custom vector type for pgvector.

The custom type correctly handles serialization to/from the pgvector string format. Using JSON.parse for the fromDriver conversion is appropriate since pgvector's output format [1,2,3,...] is valid JSON.
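The serialization behavior described above can be sketched in isolation; the actual Drizzle customType lives in src/db/schema.ts.

```typescript
// pgvector accepts and emits "[1,2,3]"-style literals, which are valid JSON.
function toDriver(value: number[]): string {
  return `[${value.join(",")}]`;
}

function fromDriver(value: string): number[] {
  return JSON.parse(value);
}
```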


176-197: Appropriate schema design for transcript chunks.

The table structure includes proper foreign key with cascade delete, useful indexes for segment-based queries and ordering by chunk index. The nullable embedding column allows storing chunks before embeddings are generated.


792-803: Relations correctly defined.

The bidirectional relationship between segments and transcript chunks is properly established.

src/data-access/transcript-chunks.ts (2)

6-49: Clean repository pattern implementation.

The CRUD functions handle edge cases (empty inputs), use appropriate Drizzle ORM methods, and return useful data types (Map for count lookups). Good adherence to the data-access layer patterns.
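The Map-based count lookup mentioned above can be sketched as below; the row shape is hypothetical, simulating a GROUP BY result from the data-access layer.

```typescript
// Build a segmentId -> chunk count Map from grouped query rows, so callers
// can do O(1) lookups with a sensible default: counts.get(id) ?? 0.
interface ChunkCountRow {
  segmentId: number;
  count: number;
}

function toCountMap(rows: ChunkCountRow[]): Map<number, number> {
  const map = new Map<number, number>();
  for (const row of rows) {
    map.set(row.segmentId, row.count);
  }
  return map;
}
```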


51-60: Well-defined SearchResult interface.

The interface provides a clear contract for search results with all necessary fields for the UI layer.

Comment on lines +4 to +13
CREATE TABLE "app_transcript_chunk" (
"id" serial PRIMARY KEY NOT NULL,
"segmentId" serial NOT NULL,
"chunkIndex" integer NOT NULL,
"chunkText" text NOT NULL,
"embedding" vector(1536),
"tokenCount" integer NOT NULL,
"created_at" timestamp DEFAULT now() NOT NULL,
"updated_at" timestamp DEFAULT now() NOT NULL
);

⚠️ Potential issue | 🔴 Critical

Critical: Fix segmentId type from serial to integer.

Line 6 defines segmentId as serial, but this column is a foreign key (Line 15) referencing app_segment(id). The serial type creates an auto-incrementing sequence, which is incorrect for a foreign key column.

🔎 Proposed fix
 CREATE TABLE "app_transcript_chunk" (
 	"id" serial PRIMARY KEY NOT NULL,
-	"segmentId" serial NOT NULL,
+	"segmentId" integer NOT NULL,
 	"chunkIndex" integer NOT NULL,
 	"chunkText" text NOT NULL,
 	"embedding" vector(1536),
 	"tokenCount" integer NOT NULL,
 	"created_at" timestamp DEFAULT now() NOT NULL,
 	"updated_at" timestamp DEFAULT now() NOT NULL
 );
🤖 Prompt for AI Agents
In drizzle/0050_glossy_infant_terrible.sql around lines 4 to 13, change the
"segmentId" column from type serial to integer NOT NULL so it does not create an
auto-increment sequence and correctly matches the referenced app_segment(id)
foreign key; update the migration SQL to use integer for segmentId, ensure any
sequence creation for that column is removed or not executed, and keep the NOT
NULL and foreign key constraint intact.

…t IDs

- Added new SQL migration file to alter multiple tables, changing user and segment ID columns from serial to integer types.
- Updated schema definitions in TypeScript to reflect these changes, ensuring consistency across the application.
- Introduced a new snapshot for the database schema to capture the latest structure.
--> statement-breakpoint
CREATE TABLE "app_transcript_chunk" (
"id" serial PRIMARY KEY NOT NULL,
"segmentId" serial NOT NULL,

Foreign key column incorrectly defined as serial type

The segmentId column in app_transcript_chunk is defined as serial but it's a foreign key referencing app_segment.id. Using serial creates an auto-increment sequence, which is incorrect for a foreign key column. While migration 0051 changes the type to integer, PostgreSQL's SET DATA TYPE doesn't remove the DEFAULT value created by serial, potentially leaving auto-increment behavior on a FK column.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/db/schema.ts (1)

17-29: Extract the embedding dimension to a config constant.

The dimension 1536 is a magic number tied to OpenAI's embedding model. Per coding guidelines, consolidate this to /src/config/index.ts for maintainability and easier updates if you switch embedding models.

🔎 Suggested refactor

In src/config/index.ts:

export const EMBEDDING_DIMENSION = 1536;

Then in schema:

+import { EMBEDDING_DIMENSION } from "@/config";
+
 // Custom type for pgvector
 const vector = customType<{ data: number[]; driverData: string }>({
   dataType() {
-    return "vector(1536)";
+    return `vector(${EMBEDDING_DIMENSION})`;
   },
   // ...
 });
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78c76ef and d054462.

📒 Files selected for processing (4)
  • drizzle/0051_lowly_songbird.sql
  • drizzle/meta/0051_snapshot.json
  • drizzle/meta/_journal.json
  • src/db/schema.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • drizzle/meta/_journal.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: When invoking a TanStack server function, pass an object with a data property containing parameters
All card components should use the shadcn Card component with CardTitle, CardDescription, and other shadcn Card subcomponents
All server-side operations should use TanStack Start server functions created with createServerFn instead of direct backend logic
Form handling should use React Hook Form with Zod validation

**/*.{ts,tsx}: Never hard code magic numbers into code; consolidate them to the top of the file or in /src/config/index.ts to keep the code clean
Never allow a file to exceed 1,000 lines of code; split overly large files into smaller modular components to maintain code maintainability

Files:

  • src/db/schema.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Playwright Tests
🔇 Additional comments (6)
drizzle/0051_lowly_songbird.sql (1)

1-34: Large schema migration with column type changes.

This migration modifies many foreign key columns from serial to integer types across the database. A few points to verify:

  1. Data compatibility: Ensure existing data in these columns is compatible with the integer type (should be fine since serial produces integers).
  2. Lines 20 and 27: These drop NOT NULL constraints on category_id and segmentId respectively—confirm this is intentional for your data model.
  3. Missing newline at EOF: Add a trailing newline for POSIX compliance.
src/db/schema.ts (5)

176-197: LGTM!

The transcriptChunks table is well-designed:

  • Nullable embedding allows a two-phase insert (chunk first, embed later)
  • Proper cascade delete on the segment FK
  • Good index coverage for segment lookups and ordered retrieval

239-241: Verify nullable segmentId is intentional.

The segmentId foreign key is now nullable, which allows progress records without a specific segment. Ensure this aligns with your business logic for tracking progress.


797-808: LGTM!

The relations are properly defined, enabling bi-directional navigation between segments and their transcript chunks.


1206-1207: LGTM!

Type exports follow the established pattern and provide proper type inference for the new entity.


468-481: LGTM!

The enum and table definitions are correctly structured. The multiline formatting improves readability.

- Changed PostgreSQL service image in GitHub Actions to use pgvector for improved vector operations.
- Refactored chunking logic to ensure proper handling of token processing.
- Introduced retry logic for embedding generation with error handling for API responses.
- Added validation for input text in embedding functions to ensure non-empty strings.
- Implemented admin-only access for vector search and vectorization routes, enhancing security.
- Updated OpenAI API key retrieval to use environment variables for better configuration management.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
src/lib/openai.ts (2)

8-11: Consider centralizing configuration constants.

While these constants are properly placed at the top of the file, consider moving them to /src/config/index.ts for consistency with other application configuration values (as suggested in the coding guidelines). This would make it easier to adjust embedding parameters across the application.

Example centralized configuration

In /src/config/index.ts:

export const OPENAI_CONFIG = {
  EMBEDDING_MODEL: "text-embedding-3-small",
  MAX_RETRIES: 3,
  INITIAL_RETRY_DELAY_MS: 1000,
  MAX_BATCH_SIZE: 100,
} as const;

Then import in this file:

import { OPENAI_CONFIG } from "~/config";

25-27: Extract duplicated sleep utility to a shared module.

This sleep function is duplicated in src/lib/video-processing-worker.ts (lines 353-355). Extract it to a shared utility module (e.g., src/utils/async.ts or src/utils/helpers.ts) to avoid code duplication and improve maintainability.

🔎 Proposed refactor

Create a new file src/utils/async.ts:

export async function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

Then import in both files:

import { sleep } from "~/utils/async";

And update src/lib/video-processing-worker.ts to use the shared utility instead of its private method.

src/routes/admin/vectorization.tsx (1)

139-147: Consider handling unmatched segments.

Segments without a matching module title are silently skipped. While this appears intentional, you could improve observability by logging or displaying a warning when segments can't be matched to modules.

💡 Optional enhancement
 segments.forEach((segment) => {
   const moduleId = modules.find((m) => m.title === segment.moduleTitle)?.id;
   if (moduleId) {
     const moduleData = segmentsByModule.get(moduleId);
     if (moduleData) {
       moduleData.segments.push(segment);
     }
+  } else {
+    console.warn(`Segment "${segment.title}" has no matching module: "${segment.moduleTitle}"`);
   }
 });
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d054462 and 488e2d3.

📒 Files selected for processing (6)
  • .github/workflows/test.yml
  • src/lib/chunking.ts
  • src/lib/openai.ts
  • src/routes/admin/vector-search.tsx
  • src/routes/admin/vectorization.tsx
  • src/utils/openai.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/chunking.ts
  • src/routes/admin/vector-search.tsx
🧰 Additional context used
📓 Path-based instructions (4)
**/routes/admin/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

When making an admin page, protect it using assertIsAdminFn() in the beforeLoad hook with createFileRoute

Files:

  • src/routes/admin/vectorization.tsx
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: When invoking a TanStack server function, pass an object with a data property containing parameters
All card components should use the shadcn Card component with CardTitle, CardDescription, and other shadcn Card subcomponents
All server-side operations should use TanStack Start server functions created with createServerFn instead of direct backend logic
Form handling should use React Hook Form with Zod validation

**/*.{ts,tsx}: Never hard code magic numbers into code; consolidate them to the top of the file or in /src/config/index.ts to keep the code clean
Never allow a file to exceed 1,000 lines of code; split overly large files into smaller modular components to maintain code maintainability

Files:

  • src/routes/admin/vectorization.tsx
  • src/utils/openai.ts
  • src/lib/openai.ts
**/routes/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

**/routes/**/*.tsx: Pages should use the Page component and PageHeader component when possible
Implement error boundaries at the route level using DefaultCatchBoundary

Files:

  • src/routes/admin/vectorization.tsx
**/{components,routes}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Component styling should follow shadcn/ui patterns with Tailwind CSS v4

Files:

  • src/routes/admin/vectorization.tsx
🧠 Learnings (5)
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/admin/**/*.tsx : When making an admin page, protect it using `assertIsAdminFn()` in the `beforeLoad` hook with `createFileRoute`

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Routes are auto-generated to `/src/routeTree.gen.ts` and should never be edited manually

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-26T04:51:53.750Z
Learnt from: amoscicki
Repo: webdevcody/agentic-jumpstart PR: 46
File: src/fn/dev/middleware.ts:3-8
Timestamp: 2025-12-26T04:51:53.750Z
Learning: In this codebase, dev-mode protection uses different strategies by context: server functions (using DevGuardMiddleware) throw errors that are caught by error boundaries, while route-level protection (like assertDevModeFn in beforeLoad) redirects to homepage for better UX. This prevents users from seeing error pages when accidentally accessing dev routes in production.

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/**/*.tsx : Pages should use the `Page` component and `PageHeader` component when possible

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-26T04:53:33.715Z
Learnt from: amoscicki
Repo: webdevcody/agentic-jumpstart PR: 46
File: .github/workflows/test.yml:40-40
Timestamp: 2025-12-26T04:53:33.715Z
Learning: In CI workflow files under .github/workflows, if there is a documented platform-specific npm bug (e.g., npm bug #4828 with optional dependencies), use npm install with rm -f package-lock.json to resolve platform binaries as described in repository comments. This is a deliberate trade-off against npm ci. Ensure the rationale and any caveats (potential longer install times and possible lockfile drift) are documented in the workflow comments, and apply this pattern only where the workaround is validated by maintainers.

Applied to files:

  • .github/workflows/test.yml
🧬 Code graph analysis (2)
src/utils/openai.ts (1)
src/utils/env.ts (1)
  • env (27-47)
src/lib/openai.ts (2)
src/utils/env.ts (1)
  • env (27-47)
src/lib/video-processing-worker.ts (1)
  • sleep (354-356)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Playwright Tests
🔇 Additional comments (7)
src/utils/openai.ts (1)

6-8: LGTM! Centralizing environment variable access.

The change to use the env utility instead of direct process.env access is a good practice that centralizes environment variable management and improves testability with fallback values.

src/lib/openai.ts (5)

1-6: LGTM: Clean integration with centralized environment configuration.

The OpenAI client initialization correctly uses the centralized env module, which provides proper environment variable handling with test fallbacks.


13-23: LGTM: Well-designed custom error class.

The EmbeddingError class provides rich contextual information (code, status, context) that will be valuable for debugging and error handling in upstream code.


29-76: LGTM: Robust retry mechanism with exponential backoff.

The withRetry function implements a well-designed retry strategy:

  • Handles transient OpenAI API errors (rate limits, server errors)
  • Exponential backoff prevents overwhelming the API
  • Rich error context for debugging
  • Type-safe generic implementation
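The retry strategy described above can be sketched as follows. This is a simplified version; the real withRetry in src/lib/openai.ts additionally inspects OpenAI error codes before deciding to retry.

```typescript
// Generic retry with exponential backoff: delays of 1s, 2s, 4s, ...
const MAX_RETRIES = 3;
const INITIAL_RETRY_DELAY_MS = 1000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = MAX_RETRIES,
  initialDelayMs = INITIAL_RETRY_DELAY_MS
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // out of attempts
      await sleep(initialDelayMs * 2 ** attempt); // exponential backoff
    }
  }
  throw lastError;
}
```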

78-108: Excellent implementation with comprehensive error handling.

The function now includes all the error handling that was flagged in previous reviews:

  • Thorough input validation (type, empty, whitespace-only)
  • Retry logic via withRetry wrapper for transient API failures
  • Response validation to ensure embedding data exists
  • Rich contextual error information for debugging

The previous review concerns have been fully addressed.


110-170: Excellent batch processing implementation with thorough validation.

The function comprehensively addresses all concerns from previous reviews:

  • Array and per-item input validation with precise error indices
  • Proper batch processing respecting MAX_BATCH_SIZE (100) to avoid API limits
  • Retry logic for each batch via withRetry wrapper
  • Multi-level response validation (batch length and individual embeddings)
  • Rich error context including batch indices for debugging

Sequential batch processing is appropriate here—parallel requests could trigger rate limits. The implementation is robust and production-ready.

src/routes/admin/vectorization.tsx (1)

32-35: Admin protection correctly implemented.

The route now includes the required assertIsAdminFn() call in the beforeLoad hook, addressing the previous review concern.

Based on coding guidelines and past review comments.

 services:
   postgres-test:
-    image: postgres:17
+    image: pgvector/pgvector:pg17

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify pgvector/pgvector:pg17 image availability and check for vulnerabilities

# Check if the Docker image exists on Docker Hub
echo "=== Checking Docker Hub for pgvector/pgvector image tags ==="
curl -s "https://hub.docker.com/v2/repositories/pgvector/pgvector/tags?page_size=100" | \
  jq -r '.results[] | select(.name == "pg17" or (.name | startswith("17"))) | .name' | head -20

Repository: webdevcody/agentic-jumpstart

Length of output: 211


🌐 Web query:

pgvector/pgvector:pg17 Docker image security advisories vulnerabilities

💡 Result:

I found public security scan results showing the Docker tag pgvector/pgvector:pg17 contains multiple OS-package vulnerabilities (several High and a few Critical/Severe) and at least some identified CVEs. Sources:

  • Official pgvector repo (shows pg17 tag exists and recommended tags). [1]
  • Trivy scan results for docker.io/pgvector/pgvector:pg17 listing many findings (example CVE entries: CVE-2023-45853, CVE-2024-26462, CVE-2023-2953). [2]
  • Mirror/image metadata page with the Trivy summary (counts: dozens of Low/Medium, multiple High, several Severe). [3]
  • Snyk entry for a pgvector-related CVE (package-level advisory). [4]

Recommendation (concise): avoid running the pg17 image in production until you (a) pull a newer tag or rebuild from source with updated base packages, (b) scan the exact image you plan to use (Trivy/Clair/etc.), and (c) apply fixes/updates for the listed CVEs.

Sources:

  1. pgvector GitHub (image tags). [1]
  2. Trivy scan details for docker.io/pgvector/pgvector:pg17. [2]
  3. Docker image metadata/scan summary for pgvector:pg17. [3]
  4. Snyk advisory for a pgvector package CVE. [4]

🌐 Web query:

pgvector Docker Hub official image maintenance

💡 Result:

Short answer: The pgvector project publishes and maintains Docker images from its GitHub repository (pgvector/pgvector) — those are the project’s “official” images — and there are also multiple third‑party/community pgvector images on Docker Hub. See the project README (Docker tags and docker pull instructions) and examples of community images on Docker Hub. [1][2]

Sources:

  • pgvector GitHub (Docker usage, supported tags). [1]
  • Example community Docker Hub pgvector repositories. [2]

LGTM—but verify image security before merging.

The update from postgres:17 to pgvector/pgvector:pg17 is appropriate for testing vector search functionality. The image exists and is officially maintained by the pgvector GitHub repository.

Note: Security scans show the pg17 tag contains multiple OS-package vulnerabilities (High and Critical severity), including CVE-2023-45853, CVE-2024-26462, and CVE-2023-2953. While this is a test environment (lower risk), consider:

  • Using a newer pgvector tag if available (e.g., latest or a more recent version)
  • Documenting that this image is for testing only and updating when patches are released

The pgvector extension initialization is properly handled in migrations as mentioned in the PR summary.

🤖 Prompt for AI Agents
In .github/workflows/test.yml around line 17, the workflow switches the test DB
image to pgvector/pgvector:pg17 which is known to contain High/Critical
OS-package CVEs; update the image to a newer, patched pgvector tag (or explicit
digest) that fixes those CVEs, validate compatibility by running the test matrix
locally/CI, then update the workflow line to that tag/digest; additionally add a
short comment in the workflow noting this image is test-only and create or
enable periodic image security scans (or Dependabot) to track future security
fixes.

- Modified the `handleVectorizeAll` function to pass an empty data object to the mutation, ensuring proper data structure for the vectorization process.
- Enhanced the `isSegmentCompleted` function in `ModulePanel` to prioritize local uncompleted segment states, improving immediate UI feedback for segment completion status.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/routes/admin/vectorization.tsx (1)

139-147: Minor: Title-based module lookup.

The segment grouping uses module.title matching (line 140) which is O(n×m) and could be fragile if titles aren't unique. If the API can return moduleId directly on segments, that would be more robust. For an admin page with modest data volumes, this works fine.
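The Map-based alternative this comment alludes to can be sketched as below; the field names mirror the snippet above, but the types are hypothetical simplifications of the actual route code.

```typescript
// Group segments by module id via a title -> id Map, replacing the O(n*m)
// modules.find(...) call per segment with an O(1) lookup.
interface Module { id: number; title: string; }
interface Segment { title: string; moduleTitle: string; }

function groupSegmentsByModule(
  modules: Module[],
  segments: Segment[]
): Map<number, Segment[]> {
  const idByTitle = new Map(modules.map((m) => [m.title, m.id]));
  const grouped = new Map<number, Segment[]>(modules.map((m) => [m.id, []]));
  for (const segment of segments) {
    const moduleId = idByTitle.get(segment.moduleTitle);
    if (moduleId !== undefined) {
      grouped.get(moduleId)!.push(segment); // unmatched segments are skipped
    }
  }
  return grouped;
}
```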

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 488e2d3 and f77046c.

📒 Files selected for processing (2)
  • src/routes/admin/vectorization.tsx
  • src/routes/learn/-components/module-panel.tsx
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: When invoking a TanStack server function, pass an object with a data property containing parameters
All card components should use the shadcn Card component with CardTitle, CardDescription, and other shadcn Card subcomponents
All server-side operations should use TanStack Start server functions created with createServerFn instead of direct backend logic
Form handling should use React Hook Form with Zod validation

**/*.{ts,tsx}: Never hard code magic numbers into code; consolidate them to the top of the file or in /src/config/index.ts to keep the code clean
Never allow a file to exceed 1,000 lines of code; split overly large files into smaller modular components to maintain code maintainability

Files:

  • src/routes/learn/-components/module-panel.tsx
  • src/routes/admin/vectorization.tsx
**/routes/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

**/routes/**/*.tsx: Pages should use the Page component and PageHeader component when possible
Implement error boundaries at the route level using DefaultCatchBoundary

Files:

  • src/routes/learn/-components/module-panel.tsx
  • src/routes/admin/vectorization.tsx
**/{components,routes}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Component styling should follow shadcn/ui patterns with Tailwind CSS v4

Files:

  • src/routes/learn/-components/module-panel.tsx
  • src/routes/admin/vectorization.tsx
**/routes/admin/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

When making an admin page, protect it using assertIsAdminFn() in the beforeLoad hook with createFileRoute

Files:

  • src/routes/admin/vectorization.tsx
🧠 Learnings (4)
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/admin/**/*.tsx : When making an admin page, protect it using `assertIsAdminFn()` in the `beforeLoad` hook with `createFileRoute`

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Routes are auto-generated to `/src/routeTree.gen.ts` and should never be edited manually

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-26T04:51:53.750Z
Learnt from: amoscicki
Repo: webdevcody/agentic-jumpstart PR: 46
File: src/fn/dev/middleware.ts:3-8
Timestamp: 2025-12-26T04:51:53.750Z
Learning: In this codebase, dev-mode protection uses different strategies by context: server functions (using DevGuardMiddleware) throw errors that are caught by error boundaries, while route-level protection (like assertDevModeFn in beforeLoad) redirects to homepage for better UX. This prevents users from seeing error pages when accidentally accessing dev routes in production.

Applied to files:

  • src/routes/admin/vectorization.tsx
📚 Learning: 2025-12-19T06:04:43.442Z
Learnt from: CR
Repo: webdevcody/agentic-jumpstart PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T06:04:43.442Z
Learning: Applies to **/routes/**/*.tsx : Pages should use the `Page` component and `PageHeader` component when possible

Applied to files:

  • src/routes/admin/vectorization.tsx
🧬 Code graph analysis (2)
src/routes/learn/-components/module-panel.tsx (2)
src/routes/learn/-components/segment-context.tsx (1)
  • useSegment (58-64)
src/db/schema.ts (1)
  • progress (232-250)
src/routes/admin/vectorization.tsx (3)
src/fn/vector-search.ts (3)
  • getVectorizationStatusFn (31-35)
  • vectorizeAllSegmentsFn (18-22)
  • vectorizeSegmentFn (11-16)
src/routes/admin/-components/page.tsx (1)
  • Page (7-21)
src/db/schema.ts (2)
  • modules (115-126)
  • segments (128-153)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Playwright Tests
🔇 Additional comments (8)
src/routes/learn/-components/module-panel.tsx (1)

7-7: LGTM! Clean implementation of optimistic UI updates.

The integration of local segment state via the useSegment hook is well-implemented. The precedence logic in isSegmentCompleted correctly prioritizes local uncompleted state for immediate UI feedback, then falls back to combining server progress with local completed state. The comments clearly explain the behavior, and the implementation provides a smooth user experience by avoiding UI lag when users mark segments as complete or incomplete.

Also applies to: 38-51

src/routes/admin/vectorization.tsx (7)

32-35: Admin protection correctly implemented.

The route properly uses assertIsAdminFn() in the beforeLoad hook to protect this admin page, satisfying the coding guidelines. Based on learnings and coding guidelines.


37-44: LGTM!

Clean query options setup with proper type inference using Awaited<ReturnType<...>> pattern.


53-102: Mutations follow the correct invocation pattern.

Both mutation handlers properly use the { data: ... } pattern as per coding guidelines (lines 96, 101). Error handling correctly clears the processing state to prevent UI from getting stuck. Query invalidation properly refreshes the status data.


104-122: Loading and error states look good.

The loading and error UI properly wraps content in the Page component. Consider whether you need to add errorComponent: DefaultCatchBoundary to the route definition for consistent error handling across the admin routes.
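
If that suggestion is adopted, the change would be a one-line addition to the route options, roughly (names assumed):

```diff
  export const Route = createFileRoute("/admin/vectorization")({
    beforeLoad: () => assertIsAdminFn(),
+   errorComponent: DefaultCatchBoundary,
    component: RouteComponent,
  });
```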


149-176: Good use of PageHeader with conditional actions.

Proper use of Page and PageHeader components per coding guidelines. The button correctly disables based on mutation state and data conditions (lines 158-160).


178-234: Statistics cards follow shadcn patterns.

All five statistic cards properly use Card, CardHeader, CardTitle, and CardContent subcomponents as per coding guidelines. Color coding (green for vectorized, amber for needs work) provides clear visual feedback.


271-343: Well-structured SegmentRow component.

Clean separation with typed props interface. The conditional rendering covers all segment states clearly (needs vectorization, completed, no transcript) with appropriate visual indicators. The processing state properly disables the button and shows feedback.

- Introduced local state management for segment completion status by integrating the useSegment context.
- Updated isSegmentCompleted function to prioritize local uncompleted segments for immediate UI feedback.
- Adjusted moduleProgress calculation to include locally completed and uncompleted segment IDs, improving accuracy in progress representation.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/routes/learn/-components/module-accordion-header.tsx (1)

70-80: Consider memoizing or inlining the isSegmentCompleted helper.

The isSegmentCompleted function is defined in the component body but not wrapped in useCallback, which means it's recreated on every render. While the logic is correct and will work as intended (since all dependencies are properly tracked in the useMemo at line 103), ESLint's exhaustive-deps rule will likely warn that the function itself is missing from the dependency array.

🔎 Refactor options

Option 1: Define the helper inside the useMemo (recommended for this use case):

  const moduleProgress = useMemo(() => {
+   const isSegmentCompleted = (segmentId: number) => {
+     if (locallyUncompletedSegmentIds.has(segmentId)) {
+       return false;
+     }
+     return (
+       progress.some((p) => p.segmentId === segmentId) ||
+       locallyCompletedSegmentIds.has(segmentId)
+     );
+   };
+
    if (!module.segments || !Array.isArray(module.segments)) {
      return {
        completed: 0,
        total: 0,
        percentage: 0,
      };
    }

    const completedSegments = module.segments.filter((segment) =>
      isSegmentCompleted(segment.id)
    ).length;
    return {
      completed: completedSegments,
      total: module.segments.length,
      percentage:
        module.segments.length > 0
          ? (completedSegments / module.segments.length) * 100
          : 0,
    };
  }, [module.segments, progress, locallyCompletedSegmentIds, locallyUncompletedSegmentIds]);
-
- const isSegmentCompleted = (segmentId: number) => {
-   // Check local uncompleted state first (takes precedence)
-   if (locallyUncompletedSegmentIds.has(segmentId)) {
-     return false;
-   }
-   // Check both server progress and locally completed segments (for immediate UI feedback)
-   return (
-     progress.some((p) => p.segmentId === segmentId) ||
-     locallyCompletedSegmentIds.has(segmentId)
-   );
- };

Option 2: Wrap in useCallback if needed elsewhere:

+ const isSegmentCompleted = useCallback(
+   (segmentId: number) => {
+     if (locallyUncompletedSegmentIds.has(segmentId)) {
+       return false;
+     }
+     return (
+       progress.some((p) => p.segmentId === segmentId) ||
+       locallyCompletedSegmentIds.has(segmentId)
+     );
+   },
+   [progress, locallyCompletedSegmentIds, locallyUncompletedSegmentIds]
+ );

  const moduleProgress = useMemo(() => {
    // ... rest of the code
- }, [module.segments, progress, locallyCompletedSegmentIds, locallyUncompletedSegmentIds]);
+ }, [module.segments, progress, locallyCompletedSegmentIds, locallyUncompletedSegmentIds, isSegmentCompleted]);

Also applies to: 93-93, 103-103

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f77046c and e22e0f1.

📒 Files selected for processing (1)
  • src/routes/learn/-components/module-accordion-header.tsx
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: When invoking a TanStack server function, pass an object with a data property containing parameters
All card components should use the shadcn Card component with CardTitle, CardDescription, and other shadcn Card subcomponents
All server-side operations should use TanStack Start server functions created with createServerFn instead of direct backend logic
Form handling should use React Hook Form with Zod validation

**/*.{ts,tsx}: Never hard code magic numbers into code; consolidate them to the top of the file or in /src/config/index.ts to keep the code clean
Never allow a file to exceed 1,000 lines of code; split overly large files into smaller modular components to maintain code maintainability

Files:

  • src/routes/learn/-components/module-accordion-header.tsx
**/routes/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

**/routes/**/*.tsx: Pages should use the Page component and PageHeader component when possible
Implement error boundaries at the route level using DefaultCatchBoundary

Files:

  • src/routes/learn/-components/module-accordion-header.tsx
**/{components,routes}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Component styling should follow shadcn/ui patterns with Tailwind CSS v4

Files:

  • src/routes/learn/-components/module-accordion-header.tsx
🧬 Code graph analysis (1)
src/routes/learn/-components/module-accordion-header.tsx (2)
src/routes/learn/-components/segment-context.tsx (1)
  • useSegment (58-64)
src/db/schema.ts (1)
  • progress (232-250)
🔇 Additional comments (1)
src/routes/learn/-components/module-accordion-header.tsx (1)

34-34: LGTM! Clean integration of segment context.

The import and usage of the useSegment hook follows React patterns correctly and enables the desired local state tracking for immediate UI feedback.

Also applies to: 62-63

@webdevcody webdevcody merged commit dc66287 into main Jan 1, 2026
3 checks passed