Async Chunked Export for Large Databases #132

Open · suletetes wants to merge 14 commits into outerbase:main
…rt processing, added createExportJob/getExportJob RPC methods, stored R2 bucket reference
Summary
This PR introduces an asynchronous, chunked export system that enables StarbaseDB to export databases of any size up to the planned 10GB Durable Object SQLite limit without hitting the 30-second Cloudflare Workers request timeout or blocking the Durable Object from serving other requests.
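As a sketch of what this enables from a client's perspective, the snippet below starts an async export and polls until it finishes. The helper names (`startAndWaitForExport`, the injected `request` function) are hypothetical and not part of this PR; the actual endpoints and payloads are documented below.

```typescript
// Hypothetical client sketch (not part of this PR). `request` stands in for
// fetch against a StarbaseDB instance, injected so the flow is testable.
type Json = Record<string, unknown>

async function startAndWaitForExport(
    request: (method: string, path: string, body?: Json) => Promise<Json>,
    pollMs = 1000
): Promise<string> {
    // 1. Kick off the background job; the server replies 202 with a jobId.
    const start = (await request('POST', '/export/dump', {
        async: true,
        format: 'sql',
    })) as { result: { jobId: string; statusUrl: string } }

    // 2. Poll the status endpoint until the job reaches a terminal state.
    for (;;) {
        const job = (await request('GET', start.result.statusUrl)) as {
            result: { status: string; errorMessage: string | null }
        }
        if (job.result.status === 'completed') {
            // 3. The finished file is served from R2 via the download route.
            return `/export/jobs/${start.result.jobId}/download`
        }
        if (job.result.status === 'failed') {
            throw new Error(job.result.errorMessage ?? 'export failed')
        }
        await new Promise((resolve) => setTimeout(resolve, pollMs))
    }
}
```

In real use `request` would wrap `fetch` with the instance's base URL and auth headers.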
Problem
The existing export endpoints (`GET /export/dump`, `GET /export/json/:tableName`, `GET /export/csv/:tableName`) load the entire dataset into memory and return it in a single synchronous response. This fails for large databases because the full result set must fit in memory, the response must complete within the 30-second Workers request timeout, and the Durable Object is blocked from serving other requests while the export runs.

Reproduction

On a database large enough that the export exceeds 30 seconds, `GET /export/dump` returns a network error instead of a dump file.
Solution
A two-tier approach that preserves full backward compatibility:
Tier 1: Synchronous (unchanged)

For small databases that complete within the 30-second window, the existing `GET /export/dump`, `GET /export/json/:tableName`, and `GET /export/csv/:tableName` endpoints continue to work exactly as before. Zero changes for existing users.

Tier 2: Asynchronous Chunked Export (new)

For large databases, a new `POST /export/dump` endpoint initiates a background export job that:

- returns `202 Accepted` immediately with a `jobId` and `statusUrl`
- processes the export in chunks across Durable Object alarm cycles, streaming output to R2
- tracks progress in a `tmp_export_jobs` internal table (current table, offset, bytes written)

New API Endpoints
`POST /export/dump` (Start Async Export)

Request:

```json
{
    "async": true,
    "format": "sql",
    "callbackUrl": "https://your-webhook.com/export-done"
}
```

- `async`: set to `true` to use the async export path
- `format`: `sql` (default), `json`, or `csv`
- `callbackUrl`: optional webhook URL, POSTed to on completion or failure

Response (202 Accepted):

```json
{
    "result": {
        "jobId": "export_20240101-170000_abc123",
        "status": "pending",
        "statusUrl": "/export/jobs/export_20240101-170000_abc123",
        "estimatedTables": 15
    }
}
```

`GET /export/jobs/:jobId` (Check Job Status)

Response (200):

```json
{
    "result": {
        "jobId": "export_20240101-170000_abc123",
        "status": "completed",
        "format": "sql",
        "completedTables": 15,
        "totalTables": 15,
        "bytesWritten": 524288000,
        "createdAt": "2024-01-01 17:00:00",
        "completedAt": "2024-01-01 17:02:34",
        "errorMessage": null
    }
}
```

Job statuses: `pending` → `in_progress` → `completed` | `failed`

`GET /export/jobs/:jobId/download` (Download Export File)

Returns the completed export file from R2 with appropriate `Content-Type` and `Content-Disposition` headers. Returns `400` if the job is not yet completed, `404` if the job doesn't exist.

Webhook Callback
When a `callbackUrl` is provided, the system POSTs to it on completion or failure.

On completion:

```json
{
    "jobId": "export_20240101-170000_abc123",
    "status": "completed",
    "downloadUrl": "/export/jobs/export_20240101-170000_abc123/download"
}
```

On failure:

```json
{
    "jobId": "export_20240101-170000_abc123",
    "status": "failed",
    "error_message": "Export job timed out after 10 minutes"
}
```

Infrastructure Requirements
R2 Bucket
This feature requires a Cloudflare R2 bucket. The binding is already configured in `wrangler.toml`. Users must create the bucket before using async exports.
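For illustration, the binding looks like the following; the binding name `EXPORT_BUCKET` comes from this PR, but the `bucket_name` value is a placeholder, not the PR's actual bucket. The bucket itself can be created with `wrangler r2 bucket create <name>`.

```toml
# wrangler.toml: R2 binding sketch (bucket_name is illustrative)
[[r2_buckets]]
binding = "EXPORT_BUCKET"
bucket_name = "starbasedb-exports"
```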
If the `EXPORT_BUCKET` binding is not configured, the async export endpoint returns a clear `400` error:

```json
{ "error": "Async exports require the EXPORT_BUCKET R2 binding to be configured" }
```

The synchronous `GET` export endpoints are completely unaffected by whether R2 is configured.

Files Changed
New Files
- `src/export/job.ts`: `createExportJob`, `processExportChunk`, `completeExportJob`, `failExportJob`, `getExportJob`, `deliverCallback`, `generateJobId`
- `src/export/format.ts`: `formatChunkAsSQL`, `formatChunkAsJSON`, `formatChunkAsCSV`
- `src/export/job.test.ts`
- `src/export/dump.async.test.ts`
- `src/export/preservation.test.ts`

Modified Files

- `wrangler.toml`: added a `[[r2_buckets]]` binding for `EXPORT_BUCKET`
- `worker-configuration.d.ts`: added `EXPORT_BUCKET: R2Bucket` to the `Env` interface
- `src/index.ts`: added `EXPORT_BUCKET?: R2Bucket` to the `Env` interface; passes it to `DataSource` as `r2ExportBucket`
- `src/types.ts`: added `r2ExportBucket?: R2Bucket` to the `DataSource` type
- `src/do.ts`: added `tmp_export_jobs` table creation; extended the `alarm()` handler for export job processing with stuck-job detection and retry logic; stored the R2 bucket reference; exposed `createExportJob` and `getExportJob` via the `init()` RPC
- `src/export/dump.ts`: added `asyncDumpDatabaseRoute`, `getExportJobRoute`, `downloadExportJobRoute` handlers; the existing `dumpDatabaseRoute` is unchanged
- `src/handler.ts`: added `POST /export/dump`, `GET /export/jobs/:jobId`, `GET /export/jobs/:jobId/download` routes with `isInternalSource` middleware
- `src/rls/index.test.ts`
- `README.md`

Internal Table Schema
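The DDL is not reproduced in this description, so the following is a hedged reconstruction of what `tmp_export_jobs` plausibly stores, inferred from the job-status response and progress tracking described above; the column names and types are assumptions, not the PR's actual schema.

```sql
-- Hypothetical sketch of tmp_export_jobs (columns inferred, not actual DDL)
CREATE TABLE IF NOT EXISTS tmp_export_jobs (
    job_id           TEXT PRIMARY KEY,   -- e.g. export_20240101-170000_abc123
    status           TEXT NOT NULL,      -- pending | in_progress | completed | failed
    format           TEXT NOT NULL,      -- sql | json | csv
    current_table    TEXT,               -- table currently being exported
    current_offset   INTEGER DEFAULT 0,  -- row offset within current_table
    completed_tables INTEGER DEFAULT 0,
    total_tables     INTEGER,
    bytes_written    INTEGER DEFAULT 0,
    callback_url     TEXT,
    error_message    TEXT,
    created_at       TEXT,
    completed_at     TEXT
);
```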
Breathing Strategy
Each Durable Object alarm cycle follows this pattern to prevent blocking:

1. Read the job state from `tmp_export_jobs`
2. Process one bounded chunk of rows and upload it to R2
3. Write updated progress (current table, offset, bytes written) back to `tmp_export_jobs`
4. Schedule the next alarm ~100ms in the future

The 100ms gap between alarm cycles allows the Durable Object to process any queued requests (WebSocket messages, RPC calls, other HTTP requests), keeping the system responsive during long exports.
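The pattern can be simulated with a plain async loop; this is a simplified sketch with assumed shapes, not the PR's actual `alarm()` handler in `src/do.ts`, and the yield between chunks stands in for the 100ms alarm reschedule.

```typescript
// Simplified simulation of the alarm-cycle "breathing" pattern. Assumption:
// the real version reschedules via ctx.storage.setAlarm(Date.now() + 100).
async function runExportJob(
    rows: string[],
    chunkSize: number,
    writeChunk: (chunk: string[]) => Promise<void> // stands in for an R2 multipart part upload
): Promise<number> {
    let cycles = 0
    for (let offset = 0; offset < rows.length; offset += chunkSize) {
        await writeChunk(rows.slice(offset, offset + chunkSize))
        cycles++
        // Yield to the event loop so queued work can run between chunks;
        // in the Durable Object this gap is the 100ms alarm delay.
        await new Promise((resolve) => setTimeout(resolve, 0))
    }
    return cycles
}
```

Because each cycle is bounded, no single invocation can approach the request timeout regardless of total database size.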
Error Handling
- `EXPORT_BUCKET` not configured: `POST /export/dump` returns `400` with a descriptive error message
- Job stuck `in_progress` for more than 10 minutes: marked `failed` with a timeout message
- On failure the job is marked `failed` and the R2 multipart upload is aborted
- `GET /export/jobs/:id` for an unknown job returns `404`
- `GET /export/jobs/:id/download` for a job that is not yet completed returns `400`

Backward Compatibility
This change is fully backward compatible:
- Existing `GET /export/*` endpoints work identically
- The new `POST /export/dump` route does not conflict with the existing `GET /export/dump` route (different HTTP methods)
- The `EXPORT_BUCKET` R2 binding is optional; if it is not configured, only the async export returns an error, and all other functionality is unaffected
- Changes to the `DataSource` type do not break existing code (the new `r2ExportBucket` field is optional)

Test Results
Test Coverage Breakdown
- `job.test.ts`: `generateJobId` uniqueness; `createExportJob` (R2 init, job store, missing bucket, target table, callback); `processExportChunk` (state transitions, R2 upload, progress tracking, all 3 formats); `completeExportJob` (R2 finalize, empty export handling); `failExportJob` (status + R2 abort); `getExportJob` (found/not found); `deliverCallback` (success/failure/null/network error); full lifecycle integration for SQL/JSON/CSV
- `dump.async.test.ts`
- `preservation.test.ts`

Live Verification
The feature was tested end-to-end on a running `wrangler dev` instance with a real R2 bucket:

- Test data created via `POST /query`
- `GET /export/dump` still returns the full SQL dump ✓
- `POST /export/dump` with `{"async": true, "format": "sql"}` received `202` with a `jobId` ✓
- `GET /export/jobs/:jobId` status progressed from `pending` to `completed` in ~1 second ✓
- `GET /export/jobs/:jobId/download` returned the complete SQL dump with CREATE TABLE and INSERT statements ✓

Related Issues
Fixes the database export timeout issue for large databases as described in the original bug report.
Screenshot
The screenshot below shows the full test suite running with `npx vitest --run`: all 199 tests across 22 test files pass, including the new async chunked export tests (bug-condition exploration, preservation property tests, and 36 comprehensive job lifecycle tests covering SQL, JSON, and CSV exports through the complete create → process → complete flow).

/claim #59