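Making Video Subtitles

Some Background

Containers vs. codecs: common extensions such as .mp4 (the most widely supported), .mkv (a flexible container), .mov (common in the Apple ecosystem), and .webm (usually paired with VP9/Opus) are containers. A container can hold different video codecs (e.g. H.264/AVC, H.265/HEVC, VP9, AV1) and audio codecs (e.g. AAC, Opus, FLAC). The same file extension does not imply the same codec; for subtitle work, what usually matters is only whether the audio track can be extracted losslessly or at least reliably. See the FFmpeg documentation on container formats and Wikipedia's "Comparison of video container formats".
FFmpeg: an open-source multimedia framework. From the command line it can inspect stream information, extract or transcode audio and video, concatenate and cut clips, and more. In this workflow it is mainly used to export a 16 kHz mono WAV from the video for recognition. Website and manual: FFmpeg, ffmpeg-all.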
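To see which audio codec a given container actually holds, the stream information can be queried with ffprobe, which is installed alongside ffmpeg. The following is a minimal sketch; the function name list_audio_streams is made up for this example, and the choice of fields is only one reasonable selection.

```shell
# List the audio streams of a media file as CSV rows
# (stream index, codec name, sample rate, channel count).
# Requires ffprobe, which ships with ffmpeg.
list_audio_streams() {
    ffprobe -v error \
        -select_streams a \
        -show_entries stream=index,codec_name,sample_rate,channels \
        -of csv=p=0 \
        "$1"
}

# Usage:
#   list_audio_streams "/path/to/video.mkv"
```

If the command prints nothing, the file probably has no audio stream, and there is nothing for Whisper to transcribe.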

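This article documents how to create video subtitles with Whisper on macOS, using the command line for the whole "extract audio → transcribe → write subtitles → (optional) translate" workflow. Whisper is OpenAI's open-source speech recognition model. The implementation used here is whisper.cpp (whisper-cli); the official Python package openai-whisper is an alternative, and the approach is the same with either one.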

Environment Setup

brew install ffmpeg whisper-cpp

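ffmpeg: extracts the audio from the video as a 16 kHz, mono, 16-bit PCM WAV, which Whisper handles well.
whisper-cpp: provides whisper-cli (and the optional whisper-server). If you only use the CLI, there is no need to start the server.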

Downloading a Model

Models are binary files such as ggml-small.bin and ggml-large-v3.bin.

The following script can be used:

sh bin/download-ggml-model.sh small /path/to/models/dir

Script contents:

#!/bin/sh

# This script downloads Whisper model files that have already been converted to ggml format.
# This way you don't have to convert them yourself.

#src="https://ggml.ggerganov.com"
#pfx="ggml-model-whisper"

src="https://huggingface.co/ggerganov/whisper.cpp"
pfx="resolve/main/ggml"

BOLD="\033[1m"
RESET='\033[0m'

# get the path of this script
get_script_path() {
    if [ -x "$(command -v realpath)" ]; then
        dirname "$(realpath "$0")"
    else
        _ret="$(cd -- "$(dirname "$0")" >/dev/null 2>&1 || exit ; pwd -P)"
        echo "$_ret"
    fi
}

script_path="$(get_script_path)"

# Check if the script is inside a /bin/ directory
case "$script_path" in
    */bin) default_download_path="$PWD" ;;  # Use current directory as default download path if in /bin/
    *) default_download_path="$script_path" ;;  # Otherwise, use script directory
esac

models_path="${2:-$default_download_path}"

# Whisper models
models="tiny
tiny.en
tiny-q5_1
tiny.en-q5_1
tiny-q8_0
base
base.en
base-q5_1
base.en-q5_1
base-q8_0
small
small.en
small.en-tdrz
small-q5_1
small.en-q5_1
small-q8_0
medium
medium.en
medium-q5_0
medium.en-q5_0
medium-q8_0
large-v1
large-v2
large-v2-q5_0
large-v2-q8_0
large-v3
large-v3-q5_0
large-v3-turbo
large-v3-turbo-q5_0
large-v3-turbo-q8_0"

# list available models
list_models() {
    printf "\n"
    printf "Available models:"
    model_class=""
    for model in $models; do
        this_model_class="${model%%[.-]*}"
        if [ "$this_model_class" != "$model_class" ]; then
            printf "\n "
            model_class=$this_model_class
        fi
        printf " %s" "$model"
    done
    printf "\n\n"
}

if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
    printf "Usage: %s <model> [models_path]\n" "$0"
    list_models
    printf "___________________________________________________________\n"
    printf "${BOLD}.en${RESET} = english-only ${BOLD}-q5_[01]${RESET} = quantized ${BOLD}-tdrz${RESET} = tinydiarize\n"

    exit 1
fi

model=$1

if ! echo "$models" | grep -q -w "$model"; then
    printf "Invalid model: %s\n" "$model"
    list_models

    exit 1
fi

# check if model contains `tdrz` and update the src and pfx accordingly
if echo "$model" | grep -q "tdrz"; then
    src="https://huggingface.co/akashmjn/tinydiarize-whisper.cpp"
    pfx="resolve/main/ggml"
fi

echo "$model" | grep -q '^"tdrz"*$'

# download ggml model

printf "Downloading ggml model %s from '%s' ...\n" "$model" "$src"

cd "$models_path" || exit

if [ -f "ggml-$model.bin" ]; then
    printf "Model %s already exists. Skipping download.\n" "$model"
    exit 0
fi

if [ -x "$(command -v wget2)" ]; then
    wget2 --no-config --progress bar -O ggml-"$model".bin $src/$pfx-"$model".bin
elif [ -x "$(command -v wget)" ]; then
    wget --no-config --quiet --show-progress -O ggml-"$model".bin $src/$pfx-"$model".bin
elif [ -x "$(command -v curl)" ]; then
    curl -L --output ggml-"$model".bin $src/$pfx-"$model".bin
else
    printf "Either wget or curl is required to download models.\n"
    exit 1
fi

if [ $? -ne 0 ]; then
    printf "Failed to download ggml model %s \n" "$model"
    printf "Please try again later or download the original Whisper model files and convert them yourself.\n"
    exit 1
fi

# Check if 'whisper-cli' is available in the system PATH
if command -v whisper-cli >/dev/null 2>&1; then
    # If found, use 'whisper-cli' (relying on PATH resolution)
    whisper_cmd="whisper-cli"
else
    # If not found, use the local build version
    whisper_cmd="./build/bin/whisper-cli"
fi

printf "Done! Model '%s' saved in '%s/ggml-%s.bin'\n" "$model" "$models_path" "$model"
printf "You can now use it like this:\n\n"
printf "  $ %s -m %s/ggml-%s.bin -f samples/jfk.wav\n" "$whisper_cmd" "$models_path" "$model"
printf "\n"

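Alternatively, download the matching ggml-*.bin manually from the Hugging Face repository ggerganov/whisper.cpp into a fixed directory (e.g. ~/whisper-models/).

Larger models are generally more accurate and slower; tiny / base are good for testing the workflow end to end, while small and above suit real content.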
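For the manual download, the URL can be assembled the same way the script does it, from its src and pfx values. The sketch below assumes curl is available; the function names model_url and fetch_model are invented for this example, and the default directory is simply the ~/whisper-models example from the text. It does not cover the tdrz model, which the script fetches from a different repository.

```shell
# Build the download URL for a ggml model, using the same base URL
# and "resolve/main/ggml" prefix as the download script.
model_url() {
    printf 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
}

# Download one model into a directory (assumed default: ~/whisper-models).
fetch_model() {
    model="$1"
    dir="${2:-$HOME/whisper-models}"
    mkdir -p "$dir" || return 1
    # -f makes curl fail on HTTP errors instead of saving an error page
    # under the model's file name; -L follows redirects.
    curl -fL --output "$dir/ggml-$model.bin" "$(model_url "$model")"
}

# Usage:
#   fetch_model small
#   fetch_model base "/path/to/models/dir"
```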

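Overall Workflow

Video file
   │  ffmpeg (resample to 16 kHz mono PCM WAV)
   ▼
Temporary WAV
   │  whisper-cli (specify model and output format)
   ▼
Subtitle file (e.g. .srt)
   │  (optional) translation tool / API
   ▼
Chinese subtitles or an SRT in another language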

Step 1: Extract the Audio with ffmpeg

Do not pass an MKV/MP4 file to whisper-cli as if it were a WAV; it tends to fail with "failed to read audio data as wav". Export a standard WAV first:

ffmpeg -y -i "/path/to/video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/whisper_audio.wav"

Parameters:

-ar 16000: sample rate of 16 kHz (matching the audio Whisper was trained on).
-ac 1: mono.
-c:a pcm_s16le: 16-bit little-endian PCM.
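When several videos need processing, the command above can be wrapped in a small helper. This is only a sketch: the function names wav_path_for and extract_wav are hypothetical, the /tmp default follows the example path above, and the extra flags -vn (ignore video) and -loglevel error (quieter output) are additions that the original command does not use.

```shell
# Derive a WAV path from a video path, e.g. /a/b/talk.mp4 -> /tmp/talk.wav.
# The output directory defaults to /tmp; pass a second argument to change it.
wav_path_for() {
    name="$(basename "$1")"
    printf '%s/%s.wav\n' "${2:-/tmp}" "${name%.*}"
}

# Extract 16 kHz mono 16-bit PCM WAV and print the path of the result.
extract_wav() {
    out="$(wav_path_for "$1" "$2")"
    ffmpeg -y -loglevel error -i "$1" -vn \
        -ar 16000 -ac 1 -c:a pcm_s16le "$out" || return 1
    printf '%s\n' "$out"
}

# Usage:
#   wav="$(extract_wav "/path/to/video.mp4")"
```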

Step 2: Generate Subtitles with whisper-cli

whisper-cli \
  -m "/path/to/ggml-small.bin" \
  -f "/tmp/whisper_audio.wav" \
  -osrt \
  -of "/tmp/out_basename"

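Notes:

-m: path to the ggml model (an absolute path is recommended).
-f: the WAV generated in the previous step.
-osrt: write SRT output (check whisper-cli --help for the exact options; they may differ slightly between versions).
-of: output file path prefix, without the extension; the file actually written would be e.g. out_basename.srt.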
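It can be worth confirming that the run really produced a usable SRT before moving on. The sketch below counts timing lines of the form "00:00:01,000 --> 00:00:02,500" and treats an empty or missing file as a failure. The function names srt_cue_count and transcribe_to_srt are made up for this example, and it assumes the output lands at the -of prefix plus ".srt", as described above.

```shell
# Count the cues in an SRT file by counting its timing lines.
srt_cue_count() {
    grep -c -E '^[0-9]{2,}:[0-9]{2}:[0-9]{2},[0-9]{3} --> ' "$1"
}

# Run whisper-cli, then fail unless a non-empty SRT with at least one cue exists.
# Arguments: model path, WAV path, output prefix (without extension).
transcribe_to_srt() {
    model="$1"; wav="$2"; base="$3"
    whisper-cli -m "$model" -f "$wav" -osrt -of "$base" || return 1
    [ -s "$base.srt" ] && [ "$(srt_cue_count "$base.srt")" -gt 0 ]
}

# Usage:
#   transcribe_to_srt "/path/to/ggml-small.bin" "/tmp/whisper_audio.wav" "/tmp/out_basename" \
#       && echo "subtitles written to /tmp/out_basename.srt"
```

If the audio is not in English, whisper-cli may also need a language option; check whisper-cli --help for the exact flag in your version.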

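Optional: Translate into English (the Whisper translate task)

If your whisper-cli supports translation mode (commonly --translate, or the equivalent option in its documentation), it can output English text for non-English speech. This is a built-in Whisper capability that only targets English; it is not arbitrary translation between, say, Chinese and English.

# Example; check whisper-cli --help on your machine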
whisper-cli -m "/path/to/model.bin" -f "/tmp/whisper_audio.wav" --translate -osrt -of "/tmp/out_en"

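Step 3 (optional): English Subtitles → Chinese Subtitles

Whisper / whisper-cli does not translate subtitles from English into Chinese. Once you have the English SRT, you can:

use the auto-translate plugin of a tool such as Subtitle Edit;
or use a trustworthy online SRT translator (mind privacy and terms of service);
or use a translation API (DeepL, Azure, OpenAI, etc.) with your own script that translates cue by cue and writes the text back onto the original timeline.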
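For the do-it-yourself option, the key point is to translate only the text lines and leave cue numbers and timing lines untouched. The following sketch walks an SRT with a small state machine, assuming the standard layout (cue number, timing line, one or more text lines, blank line) and LF line endings. The function translate_line is a hypothetical placeholder that returns its input unchanged; replace its body with a call to whichever translation tool or API you use.

```shell
# Placeholder translator: reads one line on stdin, writes the result to stdout.
# Hypothetical; replace the body with a real translation call.
translate_line() {
    cat
}

# Print a copy of an SRT in which only subtitle text goes through
# translate_line; cue numbers, timing lines and blank lines are kept as-is.
translate_srt() {
    state=index
    while IFS= read -r line || [ -n "$line" ]; do
        case "$state" in
            index)
                # Skip over extra blank lines; the first non-blank line is the cue number.
                if [ -n "$line" ]; then state=timing; fi
                printf '%s\n' "$line" ;;
            timing)
                state=text
                printf '%s\n' "$line" ;;
            text)
                if [ -z "$line" ]; then
                    state=index
                    printf '\n'
                else
                    printf '%s\n' "$line" | translate_line
                fi ;;
        esac
    done < "$1"
}

# Usage:
#   translate_srt "/tmp/out_en.srt" > "/tmp/out_zh.srt"
```

Calling the translator once per line is simple but slow, and it loses context across lines of the same cue; batching several cues per request is a likely refinement.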

When finished, load the external subtitle file in a player such as IINA.

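FAQ

failed to read audio data as wav
First generate a 16 kHz mono PCM WAV with ffmpeg as described in Step 1, then pass that file to whisper-cli.
Model path
Avoid relative paths such as ./@data/ that are easy to get wrong; use absolute paths for both the model and the WAV.
GPU
If the log says "no GPU found", whisper-cli falls back to the CPU. It is slower but usually still finishes.
Differences from the OpenAI cloud API
This article focuses on local whisper.cpp. The OpenAI Audio Transcriptions API instead uploads audio over the network and is billed; the workflow is different and needs an API key plus a suitable client or curl, so it is not covered here.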
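For the first item above, a quick way to tell whether a file is a WAV at all, rather than a renamed MP4 or MKV, is to look at its first 12 bytes: a WAV file starts with "RIFF", four size bytes, then "WAVE". The sketch below checks only that header; it does not verify the sample rate or channel count, and the function name is_riff_wave is invented for this example.

```shell
# Return success if the file starts with a RIFF/WAVE header.
# The first 12 bytes are compared as hex so that binary data is handled safely.
is_riff_wave() {
    hex="$(head -c 12 "$1" | od -An -tx1 | tr -d ' \n')"
    case "$hex" in
        # "RIFF" + 4 size bytes (8 hex digits) + "WAVE"
        52494646????????57415645) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   is_riff_wave "/tmp/whisper_audio.wav" || echo "not a WAV file; re-run the ffmpeg step"
```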
