# AI Video Editor

`03 · ai-video-editor · R&D`

Local AI video editor as an MCP-style substrate — file-level tool API with hash-verified diffs and audit trails. Timeline, scripts and async TTS as code, Gemini as orchestrator.

**Scope:** Solo · 3 weeks  
**Role:** Local Remotion IDE

**Video:** [YouTube](https://www.youtube.com/watch?v=0FoXG0a_KBw) · [RuTube](https://rutube.ru/video/private/53d47519e665a3a008181653123bfa56/?p=HQuscnTnLSXaLZbs-RyJ8Q)

## Video walkthrough

Local AI video editor built as a substrate for agent-driven workflows — timeline, previews, scripts and async jobs all live in code, so an agent reads, edits and reverts them line by line through hash-verified changesets. Custom timeline with frame-to-hours zoom, four-pane CodeMirror script editor, async TTS pipeline; everything runs locally.

Local AI video editor — built as a substrate for agent-driven workflows. Timeline, previews, scripts and async jobs all live in code, so an agent can read, edit and revert them line by line.

The timeline is custom — drag, blade, magnet snap, and zoom levels from frame to hours.

The script editor opens any markdown or JSON file from the project tree in up to four CodeMirror panes side by side.

Ask the agent to rewrite a paragraph and the editor updates live. Every edit is a hash-verified changeset — apply, revert, or open the diff before touching the disk. Ask it to cut and re-arrange the timeline and the moves play back step by step.

Audio generation is the first async pipeline wired up — the clip shows its progress as the job runs.

Everything runs locally. The substrate is in place — the rest stacks on top.

---

## Context

> Timeline in code — otherwise the agent stays a chat window.

The surface for driving a video timeline programmatically from outside a commercial editor is thin. Premiere Pro’s ExtendScript exposes import, export and basic timeline operations, but the editing surface is thinner than what an agent orchestrating generation needs — documentation thins out fast, and the parts a real workflow has to lean on sit outside the scriptable area. DaVinci’s scripting is clean and well-documented, but only in the paid edition; the rest of the market is closed or prototype-grade. For an agent that orchestrates generation across a timeline, "use the existing editor’s plugin layer" stops being an option early.

Scripted video, increasingly, is code. Remotion compositions are TSX. Scenario lines and timings live in markdown and JSON. Where editing means moving lines and re-cutting timing, an AI coauthor can use the same tools a developer would — read a slice, replace a line range, diff, revert — if the editor itself treats the script as a file rather than a chat transcript. And the timeline itself has to live in code too, or the agent has nothing to write into.

## Facts

| | |
|---|---|
| **Scope** | 21 days solo |
| **Surfaces** | Browser editor + local FastAPI media-service · monorepo + docker-compose · runs locally |
| **Timeline** | Zoom 0.01×–50× · magnet snap · blade · undo/redo |
| **Composition** | Remotion 4.0 · Babel-standalone on-demand compile · per-clip error overlay |
| **AI agent** | Gemini 3 + 2.5 · audit-first changesets · sha256 hash-verified apply/revert · SSE |
| **Audio** | Async TTS pipeline · queue + state machine on the clip · two-speaker dialogue · scenario versioning |
| **Status** | Substrate built end-to-end · script edits and async TTS live · agent-orchestrator and export — next slices |

## Architecture

### AI edit lifecycle

```text
 1  User                          Type message + Enter
        │
 2  Frontend
        │  POST /ai/chat                       [threadId, model, system_instr]
        ▼
 3  media_service  (FastAPI)
        │  insert thread + message + run[pending]
        │  ThreadPoolExecutor.submit(run_chat_job)
        ▼
 4  run_chat_job  (worker thread)
        │  client.generate_stream(model, contents, tools, config)
        ▼
 5  Gemini  ──►  text  /  thought  /  functionCall
                │
                ▼
              tool_call  (file_read_slices, text_replace_lines, …)
                │
                ▼
              text_tools  ──►  audit  {before, after, meta}
        │
 6  agent_changes_v2
        │  append_file_change · sha256 base/after · unified diff
        ▼
 7  Frontend  ◄── SSE /ai/chat/{run_id}/stream      [streaming|tools|done]
        │
 8  apply_changeset(forward)
        │  verify sha256(file) == base_hash  →  write OR 409 hash_mismatch
        ▼
 9  File written  ·  changeset.status = applied
```

**Audit-payload always.** Every tool call returns an audit payload (before / after / meta) even when apply=false. Backend persists before_text, after_text, sha256 base/after, unified diff and byte size to SQLite before the file is touched. This gives a manual-review preview mode for free.

**Hash-verification.** Forward-apply compares current sha256(file) with base_hash; reverse-apply with after_hash. On mismatch — 409 with {path, expected, actual, direction}. force=true skips the check explicitly when the user accepts the clobber.
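
A minimal sketch of what this looks like from the frontend side, assuming a hypothetical `POST /api/agent/changesets/{id}/apply` endpoint; the 409 body fields and the `force` flag follow the description above, everything else is illustrative.

```ts
// Hypothetical apply call: the 409 body shape ({path, expected, actual, direction})
// and the force flag come from the text above; the endpoint path is assumed.
type HashMismatch = {
  path: string;
  expected: string; // base_hash recorded when the changeset was generated
  actual: string;   // sha256 of the file on disk right now
  direction: "forward" | "reverse";
};

async function applyChangeset(
  changesetId: string,
  opts: { force?: boolean } = {},
): Promise<"applied" | HashMismatch> {
  const res = await fetch(`/api/agent/changesets/${changesetId}/apply`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ direction: "forward", force: opts.force ?? false }),
  });
  if (res.status === 409) {
    // File drifted between generation and apply: surface expected vs actual
    // and let the user decide whether to clobber.
    return (await res.json()) as HashMismatch;
  }
  if (!res.ok) throw new Error(`apply failed: ${res.status}`);
  return "applied";
}

// Re-run with force=true only after the user confirms the override.
async function applyWithConfirm(
  id: string,
  confirmClobber: (m: HashMismatch) => Promise<boolean>,
) {
  const result = await applyChangeset(id);
  if (result === "applied") return;
  if (await confirmClobber(result)) await applyChangeset(id, { force: true });
}
```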

**SSE, not WebSocket.** Single SSE endpoint streams events {running, streaming, tools, retrying, complete, error}. sessionStorage holds the active run_id so F5 reattaches without losing the stream. Cancel is a separate POST.
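
A sketch of the reattach behaviour, assuming the `/ai/chat/{run_id}/stream` endpoint above is proxied under `/api`; the sessionStorage key and the event payload shape are assumptions.

```ts
// Keep the active run_id in sessionStorage so an F5 can reattach to the stream.
const RUN_KEY = "ai.activeRunId"; // hypothetical key name

function attachRunStream(
  runId: string,
  onEvent: (e: { type: string; data: unknown }) => void,
) {
  sessionStorage.setItem(RUN_KEY, runId);
  const es = new EventSource(`/api/ai/chat/${runId}/stream`);
  es.onmessage = (msg) => {
    const event = JSON.parse(msg.data) as { type: string; data: unknown };
    onEvent(event);
    // Terminal events close the stream and clear the stored run_id.
    if (event.type === "complete" || event.type === "error") {
      sessionStorage.removeItem(RUN_KEY);
      es.close();
    }
  };
  return es;
}

// On page load, resume any run that was still streaming before the refresh.
const pending = sessionStorage.getItem(RUN_KEY);
if (pending) attachRunStream(pending, (e) => console.log("resumed", e.type));
```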

### Component layout

```text
   Browser                       Local services            Storage
   ───────                       ──────────────            ───────
   Frontend  (Vite · React)      media_service             /projects/<id>/
        │                        (FastAPI :8000)              │
        Zustand · PlaybackStore        │                      ├ project.json
        Remotion Player                ▼                      ├ assets/<sha>.<ext>
        CodeMirror             ┌──────────────────┐           ├ previews/  proxies/
        │                      │  routes (~50)    │           ├ remotion/<aid>/
        │  HTTPS  /api/*       │  ThreadPool(5)   │           │     manifest.json
        ├─────────────────────►│  asyncio.Sem.    │           └ scripts/
        ◄──── SSE stream ──────┤  ffmpeg subproc  │             ├ *.md  /  *.json
                               │  google-genai    │             ├ workflows/*.json
                               └────┬─────────────┘             └ Scenario_TTS.json
                                    │
                                    ▼
                          SQLite (WAL · 11 tables)         External
                          ────────────────────────         ────────
                          threads · messages · runs        Gemini API
                          usage_records · tool_events      google-cloud-speech
                          changesets_v2 + file_changes_v2
                          tts_jobs · stt_jobs
```

**Two services, no broker.** Frontend + media_service in one docker-compose, plus a /projects volume. No Redis, no Celery — SQLite WAL + ThreadPool plays the queue role. See decision 05 below for the fork.

**Project lives on the FS.** project.json is the single source of truth. Assets are content-addressable (sha256 as filename). On restart media_service rebuilds runtime state from FS + SQLite, no warm-up cache to invalidate.
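
A sketch of the content-addressing scheme — the asset filename is the sha256 of its bytes, so re-importing the same file resolves to the same path. Shown browser-side with Web Crypto for illustration; where the project actually computes the hash is not specified here.

```ts
// Content-addressed asset name: sha256(bytes) + original extension.
async function contentAddressedName(file: File): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", await file.arrayBuffer());
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  const ext = file.name.split(".").pop() ?? "bin";
  return `${hex}.${ext}`; // stored under /projects/<id>/assets/
}
```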

**External — provider-agnostic at the call site.** Current substrate uses google-genai (chat + thinking + tools + TTS) and google-cloud-speech (STT) — swapping a model is adding a worker, not rewriting the substrate. Pricing per modality is computed locally from usage_metadata before persisting.
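
A rough sketch of the local cost computation, with placeholder rate values (not real Gemini pricing) and field names mirroring the usage_records columns.

```ts
// Per-run cost computed locally from token counts before persisting.
type Usage = { prompt: number; output: number; cached: number; thought: number };
type Rates = { prompt: number; output: number; cached: number; thought: number }; // USD per 1M tokens

const PLACEHOLDER_RATES: Record<string, Rates> = {
  // Filled in per model from the provider's price list; zeros here are placeholders.
  "gemini-2.5-flash": { prompt: 0, output: 0, cached: 0, thought: 0 },
};

function costUsd(model: string, u: Usage): number {
  const r = PLACEHOLDER_RATES[model];
  if (!r) return 0;
  const perTok = (n: number, ratePerM: number) => (n / 1_000_000) * ratePerM;
  return (
    perTok(u.prompt, r.prompt) +
    perTok(u.output, r.output) +
    perTok(u.cached, r.cached) +
    perTok(u.thought, r.thought)
  );
}
```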

### Project + state model

```text
PROJECT (project.json)             RUNTIME STATE (in-memory)
──────────────────────             ─────────────────────────
Project                            Zustand store
  ├── Asset (×N)                     ├── project        (committed)
  │     · type:                      ├── editorState    (committed playhead)
  │       video|audio|image|         ├── history.past   (≤20 snapshots)
  │       remotion|text|callout      └── openTextContents
  │     · hash · originalPath
  │     · audio.{generationStage,    PlaybackStore  (mini-store, not Zustand)
  │              progress, error}      ├── status   playing|paused|buffering
  │                                    ├── timeSeconds  (live, every frame)
  ├── Folder (×N)                      ├── frame        (live)
  │                                    └── lastUpdateTs  (poller fallback)
  ├── Track (×M)
  │     ├── kind  Video | Audio
  │     └── Clip (×K)
  │           · type  Media | Text | Remotion
  │           · trackId · start · duration
  │           · linkedClipId   (V↔A pair)
  │           · transform      (overlay state)
  │
  └── EditorState
         · playhead    (committed, paused position)
         · zoom · tool · selectedClipIds


SQLITE  (usage.sqlite · WAL · 11 tables)
─────────────────────────────────────────
threads          messages          runs               usage_records
   └── usage           └── status       └── thinking_     └── prompt/output/
       totals              run_id           preset            cached/thought

agent_changesets_v2 ──┬── agent_file_changes_v2
                      │      · seq · path
                      │      · base_hash · after_hash
                      │      · before_text · after_text · patch_text
                      │      · status  pending|ready|applied|reverted
                      │
tool_events           tts_jobs           stt_jobs
   · run_id · seq        · status           · status
   · payload_json        · result_json      · result_json
```

**Two stores, two cadences.** Zustand holds committed state (paused playhead, undo stack, selectedClipIds). PlaybackStore holds live time. Live updates never reach Zustand — the parent tree around the Remotion Player does not re-render at frame rate.

**linkedClipId — one model for V↔A.** When a video with audio is dropped, the importer demuxes a separate audio asset and pairs the clips through mutual linkedClipId. Cut, move and delete cascade across the pair; a blade cut splits both halves cleanly.
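
A simplified sketch of the cascade on a blade cut: splitting one clip of a linked pair splits its partner at the same timeline position and re-links the two new right halves. Field names follow the state model above; ID generation and the exact store shape are assumptions.

```ts
type Clip = {
  id: string;
  trackId: string;
  start: number;    // timeline position, in frames
  duration: number;
  linkedClipId?: string;
};

function bladeCut(clips: Clip[], clipId: string, cutAt: number): Clip[] {
  // Split one clip at cutAt; the left half keeps the original id, the right half
  // gets a fresh id and (optionally) a link to a given partner.
  const splitOne = (all: Clip[], id: string, linkTo?: string) => {
    const clip = all.find((c) => c.id === id);
    if (!clip || cutAt <= clip.start || cutAt >= clip.start + clip.duration) return { all };
    const right: Clip = {
      ...clip,
      id: crypto.randomUUID(),
      start: cutAt,
      duration: clip.start + clip.duration - cutAt,
      linkedClipId: linkTo,
    };
    const left: Clip = { ...clip, duration: cutAt - clip.start };
    return { all: all.flatMap((c) => (c.id === id ? [left, right] : [c])), rightId: right.id };
  };

  const target = clips.find((c) => c.id === clipId);
  if (!target) return clips;
  // Split the clip, then its linked partner, then pair the two right halves.
  const first = splitOne(clips, clipId);
  if (!target.linkedClipId || !first.rightId) return first.all;
  const second = splitOne(first.all, target.linkedClipId, first.rightId);
  if (!second.rightId) return second.all;
  return second.all.map((c) =>
    c.id === first.rightId ? { ...c, linkedClipId: second.rightId } : c,
  );
}
```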

**Audit-trail in SQLite.** Every AI tool call writes to changesets_v2 + file_changes_v2 before the file is touched. The status column (pending|ready|applied|reverted) is the edit timeline of the project — replayable forward and reverse.

## Key engineering decisions

### 01 · Audit-first changesets with hash-verification

**Decision.** Every AI tool call returns an audit payload (before / after / meta) even when apply=false. Backend persists before_text, after_text, sha256 base/after, unified diff and byte size in SQLite. apply_changeset compares the current sha256(file) against base_hash and rejects with 409 hash_mismatch when the file has drifted.

**Why.** An AI agent writing to files is a race condition by default. A user editing the same file in CodeMirror between generation and apply silently clobbers one side or the other. Hash verification makes the race visible — UI shows expected vs actual and a force-override button — and the persisted before/after/diff doubles as a manual-preview mode for prompts you do not yet trust.

**Cost.** Every tool call does extra work (sha256 + diff + persist) even when the patch is dropped. Schema grew by two tables. apply is a transaction over N files with rollback on the first mismatch — more code than a straight write, more edge cases (force, partial apply, reverse direction).
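
The apply path itself lives in the Python media_service; the sketch below only illustrates the per-file record it persists, using node:crypto and the `diff` npm package to produce the same fields as agent_file_changes_v2.

```ts
import { createHash } from "node:crypto";
import { createTwoFilesPatch } from "diff";

const sha256 = (text: string) => createHash("sha256").update(text, "utf8").digest("hex");

// One audit record per touched file, written before anything hits the disk.
function buildFileChange(path: string, beforeText: string, afterText: string) {
  return {
    path,
    base_hash: sha256(beforeText),  // checked on forward apply
    after_hash: sha256(afterText),  // checked on reverse apply (revert)
    before_text: beforeText,
    after_text: afterText,
    patch_text: createTwoFilesPatch(path, path, beforeText, afterText), // unified diff
    bytes: Buffer.byteLength(afterText, "utf8"),
    status: "pending" as const,     // pending → ready → applied → reverted
  };
}
```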

### 02 · Committed playhead in the store, live playback in a side ticker

**Decision.** editorState.playhead in Zustand holds only the paused/scrub position. Live playback time lives in a separate PlaybackStore (a mini external store, not Zustand), updated every frame via onFrameUpdate plus a 200 ms poller fallback. PlaybackController is the single integration point with Remotion PlayerRef.

**Why.** A naive design keeps a single playhead in Zustand and updates it every frame. That re-renders the parent tree around the Remotion Player 60 times a second; on long timelines under load the Player picks up periodic micro-resyncs and stutters. Splitting committed and live state is the only way to keep the timecode UI live without re-rendering the Player.

**Cost.** Two sources of truth for time. PlaybackController has to mediate every play / pause / seek, buffer pendingSeek and pendingPlay until the player attaches, and run a fallback poller for missed onFrameUpdate events. UI subscribers ride a custom subscribe pattern, not Zustand selectors — extra glue on top of an already custom store.
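
A minimal sketch of the split, assuming a hand-rolled external store read through React's useSyncExternalStore; names are illustrative, not the project's actual PlaybackStore API.

```ts
import { useSyncExternalStore } from "react";

type PlaybackSnapshot = {
  status: "playing" | "paused" | "buffering";
  frame: number;
  timeSeconds: number;
  lastUpdateTs: number;
};

let snapshot: PlaybackSnapshot = { status: "paused", frame: 0, timeSeconds: 0, lastUpdateTs: 0 };
const listeners = new Set<() => void>();

export const playbackStore = {
  subscribe(cb: () => void) {
    listeners.add(cb);
    return () => listeners.delete(cb);
  },
  getSnapshot: () => snapshot,
  // Fed from the player's frame callback (and a poller fallback); never routed
  // through Zustand, so the tree around the Player does not re-render per frame.
  setFrame(frame: number, fps: number) {
    snapshot = { ...snapshot, frame, timeSeconds: frame / fps, lastUpdateTs: Date.now() };
    listeners.forEach((cb) => cb());
  },
};

// Only leaf components that render live time subscribe at frame rate.
export function useLiveTime(): number {
  return useSyncExternalStore(
    playbackStore.subscribe,
    () => playbackStore.getSnapshot().timeSeconds,
  );
}
```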

### 03 · Babel-standalone on-demand for Remotion clips

**Decision.** A user-authored Remotion clip is stored as code in a manifest.json next to the asset and compiled at runtime through @babel/standalone (react preset + transform-modules-commonjs plugin). The output is wrapped via new Function with React and a minimal Remotion API injected. Cache key is assetId + hash(code) + length. Compile and runtime errors are caught and rendered as a per-clip error overlay — they do not break the rest of the composition.

**Why.** The alternatives — pre-compile via Vite, or run a TS runtime in a Web Worker — both push compilation off the editing path. With Babel-standalone the user edits code in CodeMirror, hits save, and the next preview frame reflects it. Babel ships as a separate lazy chunk loaded only when the first Remotion clip mounts; for an R&D tool where author and user are the same person, new Function isolation is enough.

**Cost.** Babel-standalone is a chunky dependency, lazy-loaded with a visible "Loading Remotion compiler…" indicator on first use. The PRELUDE injects only a minimal Remotion subset (AbsoluteFill, Sequence, Audio, Video, Img, useCurrentFrame, useVideoConfig, interpolate, Easing, spring) — anything else has to be added explicitly. new Function is enough for an R&D tool; multi-tenant production would move compile into a Worker.
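
A sketch of the compile-and-wrap step under those assumptions — lazy @babel/standalone, a require shim resolving only react and the minimal Remotion subset, and the compiled component handed back for the next preview frame. Caching and the per-clip error overlay are omitted.

```ts
import * as React from "react";
import {
  AbsoluteFill, Sequence, Audio, Video, Img,
  useCurrentFrame, useVideoConfig, interpolate, Easing, spring,
} from "remotion";

// The minimal Remotion surface injected into user clips (the PRELUDE list above).
const RemotionSubset = {
  AbsoluteFill, Sequence, Audio, Video, Img,
  useCurrentFrame, useVideoConfig, interpolate, Easing, spring,
};

async function compileClip(code: string): Promise<React.ComponentType> {
  const Babel = await import("@babel/standalone"); // lazy: first Remotion clip mount
  const compiled =
    Babel.transform(code, {
      presets: ["react"],
      plugins: ["transform-modules-commonjs"],
    }).code ?? "";
  // CommonJS output expects module / exports / require in scope.
  const shimRequire = (name: string) => {
    if (name === "react") return React;
    if (name === "remotion") return RemotionSubset;
    throw new Error(`module not available to clips: ${name}`);
  };
  const moduleShim = { exports: {} as Record<string, unknown> };
  new Function("module", "exports", "require", compiled)(moduleShim, moduleShim.exports, shimRequire);
  const Comp = (moduleShim.exports.default ?? moduleShim.exports) as React.ComponentType;
  if (typeof Comp !== "function") throw new Error("clip code did not export a component");
  return Comp;
}
```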

### 04 · System prompts as live project files

**Decision.** System prompts live in scripts/text-editor/workflows/*.json inside the project tree, edited through the same CodeMirror surface as everything else. The frontend re-reads the selected workflow file before each request and threads its system_instruction into generation_config; runs.generation_config_json and usage_records.meta_json.system_prompt persist {id, path, name} so every run remembers which prompt was active at the time.

**Why.** Prompts are AI-side artifacts, but the rest of the project already treats AI artifacts as files (audit-first changesets work the same way). Putting prompts on the same plane lets them version with the code, edit through DnD-import, change without restart, and show up in usage analytics next to model and token cost. Most production AI tools never close that loop.

**Cost.** Prompt file is re-read from disk on every request — fine while the file stays small. No schema validation on the workflow JSON; a malformed file fails at runtime. Prompt history lives in git plus per-run snapshots; there is no dedicated prompt-history UI.
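
A sketch of the per-request re-read: fetch the selected workflow file, thread its system prompt into the chat request, and send the {id, path, name} triple along so the run records which prompt was active. The workflow JSON shape and both endpoints are assumptions; the scripts/text-editor/workflows/ path comes from the text above.

```ts
type Workflow = { id: string; name: string; system_instruction: string };

async function sendChat(projectId: string, workflowPath: string, message: string, model: string) {
  // Re-read from disk on every request, so prompt edits apply to the next run
  // without a restart.
  const wfRes = await fetch(
    `/api/projects/${projectId}/files?path=${encodeURIComponent(workflowPath)}`,
  );
  const workflow = (await wfRes.json()) as Workflow;

  return fetch(`/api/ai/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      message,
      system_instruction: workflow.system_instruction,
      // Persisted with the run so usage analytics can attribute the prompt.
      workflow: { id: workflow.id, path: workflowPath, name: workflow.name },
    }),
  });
}
```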

### 05 · SQLite + ThreadPool as the queue, not Celery + Redis

**Decision.** AI runs, TTS jobs and STT jobs all live as rows in SQLite tables (runs, tts_jobs, stt_jobs) with an atomic claim_next_*_job() — a SELECT followed by an UPDATE setting status to running. Workers are a ThreadPoolExecutor(max_workers=5) for chat runs and asyncio.Semaphore(10/6) for TTS/STT inside the FastAPI event loop. SQLite runs in WAL mode. Pending jobs survive a media_service restart and get re-claimed on boot.

**Why.** The default ML-stack reflex is Celery + Redis + Flower. This product is single-machine — media_service runs alongside the frontend in docker-compose, peak load is ~5 concurrent AI runs and ~10 TTS jobs. A broker would add two services to compose, an extra failure mode, and operational weight for load that does not exist.

**Cost.** Does not scale horizontally — multiple worker machines cannot share one SQLite. Visibility is hand-rolled — no Flower dashboard, you read usage_records and tool_events directly. The frontend long-polls /tts/jobs/{id} until success/error instead of getting a push event. Schema migrations are absent (rebuilt at startup) — adding a column means deleting usage.sqlite.
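
The service is Python; the better-sqlite3 sketch below only illustrates the claim shape — a SELECT for the oldest pending row, then an UPDATE guarded by `status = 'pending'` so two workers can never claim the same job.

```ts
import Database from "better-sqlite3";

const db = new Database("usage.sqlite");
db.pragma("journal_mode = WAL");

// Atomic claim: select the oldest pending job, then flip it to running only if
// it is still pending. Returns the claimed id, or undefined if nothing to do.
const claimNextTtsJob = db.transaction((): number | undefined => {
  const row = db
    .prepare("SELECT id FROM tts_jobs WHERE status = 'pending' ORDER BY id LIMIT 1")
    .get() as { id: number } | undefined;
  if (!row) return undefined;
  const res = db
    .prepare("UPDATE tts_jobs SET status = 'running' WHERE id = ? AND status = 'pending'")
    .run(row.id);
  return res.changes === 1 ? row.id : undefined; // another worker won the race
});

const jobId = claimNextTtsJob();
if (jobId !== undefined) {
  // run the TTS work, then mark success/error on the same row
}
```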

## Stack

| | |
|---|---|
| **Frontend** | React 18 · Vite · TypeScript · Zustand+Immer · Remotion 4 · CodeMirror · Tailwind · shadcn/ui |
| **Backend** | Python · FastAPI · pydantic · sqlite3 (raw SQL · WAL) · ffmpeg subprocess |
| **AI** | Gemini 3 (pro/flash preview) + 2.5 · google-genai 1.55 · streaming · thinking · tools · TTS · STT (google-cloud-speech) |
| **Composition** | Remotion 4.0 · @babel/standalone (lazy) · per-clip error overlay |
| **Concurrency** | ThreadPoolExecutor(5) · asyncio.Semaphore(10/6) · ffmpeg sem (global=6, per-asset=3) · LRU caches |
| **Scale** | ~30K LOC TS — 3K timeline + 2.4K AI panel + 2K media + 2.3K properties (each a domain-rich UI surface) · ~8K LOC Py · 16 docs · ~50 routes · 11 SQLite tables |

## Lessons & status

### Carry forward

- Audit-first changesets — every broken AI prompt rolled back in one click without losing earlier work. The one thing I would carry into any next AI IDE unchanged.
- Committed vs live playhead split — the pattern scales to any heavy-parent player. Without it, Remotion Player picks up periodic micro-resyncs under load that no amount of memoization fixes.
- Geometry in a shared utils module — normalizeTransform / resolveFittedBox live in src/utils/geometry.ts and feed both EditorComposition and the transform-overlay. Zero drift between preview and the on-canvas bbox.
- SQLite + ThreadPool as the queue — three weeks of daily use on a single-machine product, never once regretted not having Celery. The brokerless shape kept compose small and the failure modes legible.
- Bias toward open code surfaces — Remotion compositions are TSX, scenarios are markdown + JSON, workflow prompts are JSON files. The inverse of plugin-layer editors: the agent can author every layer because every layer was put on the keyboard from day one.

### Would change

- Vitest was dropped early — would not do that again. Babel-on-the-fly clip compilation and hash-verified changesets both need a unit harness; lint and build do not catch regressions in compiled-clip output or in apply/revert hash math. I would bring it back from day one.
- The schema is rebuilt at startup, no migrations. Fine for solo R&D (change the schema, delete usage.sqlite); a real handoff blocker, and friction when picking the project back up a month later. Alembic is a day of work that saves time every day after.
- Frontend long-polls /tts/jobs/{id} to success or error; chat runs ride SSE. Two transports where one would do — re-using the chat SSE channel for any background job from day one would have made the second async pipeline (TTS) free, and the third (image / video) free again.

R&D · substrate built end-to-end · runs locally in Docker. Script edits and async TTS live; agent-as-orchestrator and export integration are the next slices on the same audit-tracked substrate.

---

Source: https://ilyadev.xyz/cases/ai-video-editor (HTML) · /cases/ai-video-editor.md (this file)
Previous: 02 — Restaurant Stock AI Agent → https://ilyadev.xyz/cases/ai-warehouse.md
Up next: 04 — Bullet Reign · Roblox → https://ilyadev.xyz/cases/roblox-game.md
Index: https://ilyadev.xyz/llms.txt — full case-study list
Author: Ilya Kazantsev — https://ilyadev.xyz/index.md
