Developer Tutorials

Add Subtitles to Video with JSON: API Tutorial

Add subtitles to video with JSON using Zvid's API. Generate timed captions, style subtitles, sync audio, and render captioned videos.

Published May 14, 2026

Add Subtitles to Video with JSON: API Tutorial

Add Subtitles to Video with JSON: API Tutorial

To add subtitles to a video with JSON, define caption segments with start, end, text, and words, add optional subtitle styles, then send the full video payload to a rendering API. In Zvid, subtitles live in the top-level subtitle object of the render payload, so captions can be generated, edited, versioned, translated, and rendered alongside the rest of the video scene.

The practical flow is simple: create the video layout, add timed caption data, submit the render job to POST https://api.zvid.io/api/render/api-key, and poll GET https://api.zvid.io/api/jobs/{id} until the captioned output is ready. If you start with an already recorded video or audio file, transcribe it upstream with a speech-to-text service, then convert that subtitle output into Zvid's captions and words structure before rendering. Keep the Subtitle reference, Caption reference, Word reference, and SubtitleStyles reference nearby while you implement.

Hero image for adding subtitles to video with JSON in the Zvid API

Subtitles become easier to automate when transcript output becomes structured caption data.

If you are new to Zvid's overall render model, start with the JSON Structure overview, then compare this subtitle-focused workflow with How to Generate a Video from JSON and the broader JSON to Video API guide.

Here is the API loop most developers need:

curl -X POST https://api.zvid.io/api/render/api-key \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d @captioned-render.json

curl -X GET https://api.zvid.io/api/jobs/$JOB_ID \
  -H "x-api-key: YOUR_API_KEY"

The JSON subtitle model

Zvid subtitles are not a separate file upload in the common API pattern. They are part of the render payload. That makes them useful for applications that already produce transcripts, voiceover scripts, product narration, or short-form social captions as data.

The top-level subtitle object contains:

  • captions: an array of timed caption segments.
  • styles: optional typography, placement, and display mode settings.

Each caption segment contains:

  • start: when the caption appears, in seconds.
  • end: when the caption disappears, in seconds.
  • text: the full caption text for that segment.
  • words: an array of word-level timings.

Each word item contains start, end, and text. Word timings power interactive-looking subtitle modes such as one-word, karaoke, and progressive captions. They also make your data easier to inspect when a caption feels early, late, or hard to read.

This is different from uploading an external WebVTT, SRT, or closed captioning file after export. With Zvid, the caption text, timing metadata, font, position, and display mode can travel with the same JSON to video template that controls the visual scene. That gives your app full control over how subtitles appear in the final video, instead of merging captions manually in a separate editor.

Copy-paste Zvid caption payload

The example below renders a short captioned explainer card. The visual layout is intentionally simple so the subtitle data is the main thing to study.

{
  "name": "add-subtitles-json-demo",
  "resolution": "hd",
  "duration": 8,
  "frameRate": 30,
  "outputFormat": "mp4",
  "backgroundColor": "#101623",
  "visuals": [
    {
      "type": "SVG",
      "width": 1280,
      "height": 720,
      "svg": "<svg width='1280' height='720' viewBox='0 0 1280 720' xmlns='http://www.w3.org/2000/svg'><defs><linearGradient id='bg' x1='0' y1='0' x2='1' y2='1'><stop offset='0' stop-color='#101623'/><stop offset='1' stop-color='#243047'/></linearGradient><linearGradient id='accent' x1='0' y1='0' x2='1' y2='0'><stop offset='0' stop-color='#2DD4BF'/><stop offset='1' stop-color='#FADD46'/></linearGradient></defs><rect width='1280' height='720' fill='url(#bg)'/><rect x='54' y='54' width='1172' height='612' rx='30' fill='rgba(255,255,255,0.045)' stroke='rgba(255,255,255,0.12)'/><rect x='96' y='92' width='186' height='38' rx='19' fill='rgba(45,212,191,0.14)' stroke='rgba(45,212,191,0.42)'/><text x='189' y='117' text-anchor='middle' fill='#A7FFF6' font-family='Arial' font-size='14' font-weight='700'>ZVID SUBTITLES</text><rect x='112' y='168' width='420' height='160' rx='24' fill='rgba(8,13,26,0.62)' stroke='rgba(255,255,255,0.12)'/><text x='152' y='215' fill='#FFFFFF' font-family='Arial' font-size='25' font-weight='800'>Caption data</text><text x='152' y='255' fill='#A7FFF6' font-family='Arial' font-size='18' font-weight='700'>{ start, end, text }</text><rect x='152' y='282' width='290' height='14' rx='7' fill='rgba(255,255,255,0.18)'/><rect x='152' y='308' width='224' height='14' rx='7' fill='rgba(255,255,255,0.13)'/><rect x='748' y='168' width='420' height='160' rx='24' fill='rgba(8,13,26,0.62)' stroke='rgba(255,255,255,0.12)'/><text x='788' y='215' fill='#FFFFFF' font-family='Arial' font-size='25' font-weight='800'>Rendered video</text><rect x='788' y='238' width='300' height='66' rx='18' fill='rgba(255,255,255,0.08)' stroke='rgba(255,255,255,0.13)'/><rect x='818' y='261' width='232' height='18' rx='9' fill='url(#accent)' opacity='0.9'/><rect x='818' y='288' width='190' height='11' rx='6' fill='rgba(255,255,255,0.18)'/><line x1='582' y1='252' x2='700' y2='252' stroke='#FADD46' stroke-width='7' stroke-linecap='round'/><polygon points='700,252 680,239 680,265' fill='#FADD46'/><text x='641' y='229' text-anchor='middle' fill='#FADD46' font-family='Arial' font-size='18' font-weight='700'>render</text></svg>"
    },
    {
      "type": "TEXT",
      "x": 640,
      "y": 116,
      "width": 840,
      "anchor": "center-center",
      "track": 4,
      "enterBegin": 0.2,
      "enterEnd": 0.8,
      "enterAnimation": "fade",
      "exitBegin": 7.2,
      "exitEnd": 7.7,
      "exitAnimation": "fade",
      "html": "<div style='text-align:center; color:#ffffff; font-size:40px; font-weight:800; line-height:1.12;'>JSON subtitles that render</div>"
    },
    {
      "type": "TEXT",
      "x": 640,
      "y": 566,
      "width": 920,
      "anchor": "center-center",
      "track": 5,
      "enterBegin": 0.6,
      "enterEnd": 1.1,
      "enterAnimation": "fade",
      "exitBegin": 7.1,
      "exitEnd": 7.6,
      "exitAnimation": "fade",
      "html": "<div style='text-align:center; color:#D7DEF6; font-size:24px; line-height:1.35;'>Keep captions, word timing, styles, and render output in one API payload.</div>"
    }
  ],
  "subtitle": {
    "captions": [
      {
        "start": 1,
        "end": 3.2,
        "text": "Build captions as JSON data",
        "words": [
          { "start": 1, "end": 1.45, "text": "Build" },
          { "start": 1.45, "end": 1.95, "text": "captions" },
          { "start": 1.95, "end": 2.3, "text": "as" },
          { "start": 2.3, "end": 2.75, "text": "JSON" },
          { "start": 2.75, "end": 3.2, "text": "data" }
        ]
      },
      {
        "start": 3.4,
        "end": 5.6,
        "text": "Control every word on the timeline",
        "words": [
          { "start": 3.4, "end": 3.8, "text": "Control" },
          { "start": 3.8, "end": 4.15, "text": "every" },
          { "start": 4.15, "end": 4.55, "text": "word" },
          { "start": 4.55, "end": 4.85, "text": "on" },
          { "start": 4.85, "end": 5.1, "text": "the" },
          { "start": 5.1, "end": 5.6, "text": "timeline" }
        ]
      },
      {
        "start": 5.8,
        "end": 7.4,
        "text": "Render the captioned video with Zvid",
        "words": [
          { "start": 5.8, "end": 6.1, "text": "Render" },
          { "start": 6.1, "end": 6.35, "text": "the" },
          { "start": 6.35, "end": 6.8, "text": "captioned" },
          { "start": 6.8, "end": 7.1, "text": "video" },
          { "start": 7.1, "end": 7.4, "text": "with Zvid" }
        ]
      }
    ],
    "styles": {
      "color": "#FFFFFF",
      "background": "#000000CC",
      "isBold": true,
      "fontSize": 42,
      "fontFamily": "Montserrat",
      "position": "center-center",
      "marginV": 0,
      "marginH": 0,
      "mode": "karaoke",
      "activeWord": {
        "color": "#FADD46"
      }
    }
  }
}

Zvid subtitle JSON payload visual generated from the article example

The payload visual is generated from the real renderable subtitle example in this tutorial.

For a public API request, wrap that project inside a top-level payload field:

{
  "payload": {
    "name": "add-subtitles-json-demo",
    "resolution": "hd",
    "duration": 8,
    "frameRate": 30,
    "outputFormat": "mp4",
    "visuals": [],
    "subtitle": {
      "captions": [
        {
          "start": 1,
          "end": 3.2,
          "text": "Build captions as JSON data",
          "words": [
            { "start": 1, "end": 1.45, "text": "Build" }
          ]
        }
      ]
    }
  }
}

How the subtitle workflow works

The workflow is easiest when you treat subtitle generation as a data pipeline, not as a manual editing step.

Workflow diagram for automated video captions with JSON and the Zvid API

Keep transcript creation, subtitle format conversion, render submission, and job polling as separate steps.

A reliable implementation usually looks like this:

  1. Generate or import a transcript from a speech-to-text service.
  2. Convert the service output into Zvid caption segments with start, end, text, and words.
  3. Preserve word-level timings from the transcription output where available.
  4. Choose subtitle style settings for placement, color, background, and mode.
  5. Add the subtitle object to the same payload that defines the video scene.
  6. Submit the render job and poll for completion.

This separation matters because captions often need independent review. A designer may care about line length and placement. A localization team may care about translation quality. A developer may care about timestamp accuracy and retry behavior. JSON keeps those concerns explicit.

If an AI transcription service or speech-to-text tool creates the first draft, treat that output as source data rather than the final subtitle layer. Your app should map the provider's response into Zvid's subtitle format, including caption-level start, end, and text fields plus each caption's words array. Review proper nouns, numbers, speaker phrasing, punctuation, language, and line breaks before rendering.

Subtitle display modes and styles

Subtitle style settings control how the captions look and where they appear. Zvid supports common style fields such as color, background, fontSize, fontFamily, bold and italic flags, position, margins, display mode, and an activeWord color for word-aware modes.

Diagram showing how caption text, word timing, style settings, and display mode combine into Zvid subtitles

A subtitle object combines text, timing, display mode, and style rules.

Use a solid or semi-transparent background when the underlying video is busy. Use center placement when the subtitle is the main teaching element, and move captions to the top or side when lower-third graphics, product details, or call-to-action buttons need that space.

For most API-driven workflows, start with one brand-safe caption preset:

{
  "styles": {
    "color": "#FFFFFF",
    "background": "#000000CC",
    "fontSize": 42,
    "fontFamily": "Montserrat",
    "isBold": true,
    "position": "center-center",
    "marginV": 0,
    "marginH": 0,
    "mode": "karaoke",
    "activeWord": {
      "color": "#FADD46"
    }
  }
}

Once that preset works, reuse it across templates. Change it only when the output channel, language, aspect ratio, or visual layout requires a different caption treatment.

For translated or localized subtitles, expect text length to change. A short English caption may become longer in another language, which can affect font size, line breaks, and position. Test the longest expected language before you generate subtitles automatically across many clips.

API workflow for rendering captions

After the payload is ready, submit it through the same render endpoint used for other Zvid videos. The Submit render job reference documents POST /api/render/api-key, and the Get render job status reference documents GET /api/jobs/{id}.

curl -X POST https://api.zvid.io/api/render/api-key \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d @captioned-render.json

Save the returned job ID, then poll for status:

curl -X GET https://api.zvid.io/api/jobs/$JOB_ID \
  -H "x-api-key: YOUR_API_KEY"

The important application behavior is to treat rendering as asynchronous. Submit the job, store the payload version, poll the job state, and only publish or notify the user after the completed result is available.

This is also where Zvid fits well with repeatable content systems. The same caption template can support social clips, onboarding videos, course snippets, translated product demos, and programmatic explainers. If the surrounding workflow starts from feeds or spreadsheet data, the pattern is similar to How to Create Product Videos from a CSV or Product Feed, except the dynamic fields include captions and word timings.

Adding subtitles to an existing video file

If your input is an already recorded video file, the JSON workflow still applies. Use a public media URL for the source video, add it as a VIDEO visual, then add the subtitle object with synchronized subtitles for the spoken audio. Zvid can create a video output with the captions rendered into the final frame.

Your app can automatically generate the first transcript with an AI or speech-to-text service, but the render payload should contain the reviewed result in Zvid's subtitle structure: clean text, caption segments, and word arrays that match the audio track. That is the difference between generating raw transcription and generating video subtitles that are readable in the finished clip.

For teams that need accessible video content, keep the source transcript and caption metadata even after export. The rendered subtitles help viewers watch the video, while a separate platform-specific caption file may still be useful when a publishing destination supports external accessibility tracks.

Common mistakes

The most common mistake is modeling subtitles as plain text instead of timed data. Plain transcript text is useful, but it does not tell the renderer when each caption should appear or how word-level modes should behave.

Other mistakes show up often:

  • Putting word timings outside the parent caption's time range.
  • Creating captions that are too long for the available screen time.
  • Using small font sizes on mobile-first outputs.
  • Leaving captions over important lower-third graphics.
  • Forgetting that render jobs are asynchronous.
  • Treating translated captions as a drop-in replacement without checking line length.
  • Using a busy video background without a readable caption background or outline.
  • Assuming automatic transcription is already perfect without editing the generated captions.
  • Confusing embedded visual subtitles with a separate accessibility caption file.
  • Forgetting to adjust subtitle animation, font size, or default placement for a new output format.

Comparison chart showing manual subtitle editing versus JSON-driven video subtitles through an API

JSON-driven subtitles are strongest when captions are generated, reviewed, and rendered repeatedly.

The fix is to make the subtitle object part of your template contract. Store the caption data, the source transcript, the payload version, and the render job ID together. That makes subtitle bugs much easier to reproduce.

When to use Zvid

Use Zvid when your application needs to render captioned videos from structured data instead of manually editing every output. It is a strong fit for developer teams, short-form video tools, agencies, course platforms, ecommerce workflows, and automation systems that already know what the caption text should say.

Use-case illustration for teams generating captioned videos from JSON with the Zvid API

Caption data can travel with the same JSON payload that controls the rest of the video.

Zvid is especially useful when you need:

  • Repeatable subtitle styling across many videos.
  • Timed captions generated from transcripts, scripts, or localization workflows.
  • Word-level subtitle modes for short-form videos.
  • A hosted API workflow for render submission and job polling.
  • One JSON payload that contains scene layout, media, captions, and output settings.

If your team only needs one hand-edited video, a manual editor may be enough. If you need captioned videos generated from product data, scripts, transcripts, lessons, or campaign templates, JSON subtitles are the better system boundary.

Start with one short payload, render it through Zvid, and inspect the subtitle output before you scale the same template across more videos.

FAQs

What fields are required for Zvid subtitles?

A subtitle object needs a captions array. Each caption needs start, end, text, and words. Each word needs start, end, and text.

Do I need word-level timings?

Yes, for Zvid subtitle captions that use the documented subtitle shape. Word timings are especially important for one-word, karaoke, and progressive display modes.

Can I style subtitles with JSON?

Yes. The styles object can define color, background, font size, font family, bold or italic text, placement, margins, display mode, and active word color.

Where does the subtitle object go?

Place subtitle at the top level of the Zvid render payload, alongside fields such as name, resolution, duration, frameRate, outputFormat, visuals, and audios.

Can I use subtitles with generated or feed-based videos?

Yes. Caption text and timing can be generated from transcripts, scripts, product data, localization workflows, or other structured inputs before you submit the render payload.

How should I validate subtitle JSON?

Check that every caption has start, end, text, and words, every word fits inside its parent caption, and the caption position does not cover important visual content.

How do I add subtitles to an already recorded video?

Transcribe the recorded audio first, convert the speech-to-text output into Zvid caption segments and word arrays, then add that subtitle object to the same Zvid payload that references the video or recreates the scene.

Can AI tools generate subtitles for this workflow?

Yes. AI transcription or speech-to-text tools can generate the first subtitle draft, but your app should still convert the result into Zvid's subtitle structure and review text accuracy, language, punctuation, and line length before sending the JSON payload to Zvid.

Are subtitles the same as closed captions?

Not always. In this article, subtitles are visual text rendered into the video. Closed captioning can also mean a separate accessibility track or file, depending on the publishing platform.

Share