
How to transcribe video to text with FFmpeg and faster-whisper

MorenaTech • Technical users and small businesses working with recordings • Practical • about 10 min

Published: June 11, 2026

If you have a meeting recording, call, webinar, or training video and want to turn it into text, you could play it back, pause every few seconds, and type it manually. It makes more sense to automate it.

Before you read

This is a technical article. It is useful if you care about architecture, integrations, or the implementation layer of AI solutions. The guide-style versions are in the “For small business” section.

In this guide we extract the audio from a video with FFmpeg, run the audio through a local Whisper model, and save the result to a text file with timestamps. The example is prepared for Windows and PowerShell, but the same pattern works on macOS and Linux with small changes to the installation and activation commands.

Why do transcription at all?

Transcription is useful anywhere a recording contains knowledge, decisions, or agreements. In a small business this often means client meetings, training sessions, webinars, consultations, and working recordings.

You can use it for:

preparing meeting notes
writing up a client conversation
developing webinar material
creating an article draft from a recording
finding specific fragments of a conversation
preparing a task list after a meeting
archiving knowledge in the company

The key point is that transcription does not need to be the end of the process. It can be the start of automation: a recording becomes text, the text becomes a summary, the summary becomes a task list, and the task list becomes a concrete workflow in the company.

What do you need?

For this example we will use three things:

FFmpeg, a tool for working with audio and video
Python
the faster-whisper library, which lets you run a Whisper model locally

We assume you have a file called film.mp4 in your working directory. In the same folder we will create a Python environment, an audio file, and the transcription script.

Technical sources

This article is based on the Microsoft Learn documentation for winget install, the official FFmpeg documentation, and the README of the faster-whisper project.

Step 1. Install the tools

On Windows, FFmpeg can be installed with Windows Package Manager (winget). Microsoft documents the winget install command, the --id option, and -e / --exact for exact package matching.

winget install --id Gyan.FFmpeg -e
python -m venv .venv
.venv\Scripts\activate
pip install faster-whisper

What is happening here?

winget install --id Gyan.FFmpeg -e installs FFmpeg with the exact package identifier
python -m venv .venv creates a local Python environment
.venv\Scripts\activate activates the environment in PowerShell
pip install faster-whisper installs the transcription library
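Before moving on, it can help to confirm that both tools are visible from the activated environment. The sketch below is one possible check using only the standard library; the function name check_environment is made up for this example and is not part of FFmpeg or faster-whisper.

```python
import importlib.util
import shutil


def check_environment() -> dict[str, bool]:
    """Report whether the tools from Step 1 are visible to this Python process."""
    return {
        # shutil.which searches PATH for an executable, much like the shell does.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec locates the package without importing it or loading a model.
        "faster_whisper": importlib.util.find_spec("faster_whisper") is not None,
    }


if __name__ == "__main__":
    for tool, found in check_environment().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

If ffmpeg is reported as missing right after installation, reopening the terminal is usually the first thing to try, because PATH may not have been refreshed yet.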

Step 2. Convert the video to audio

Transcription does not need the video track. The model recognizes speech, so first we prepare a simple audio file.

ffmpeg -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"

This command takes film.mp4 and creates audio.wav. Order matters: FFmpeg applies each option to the next file named on the command line, so options placed before -i affect the input, and options placed before the output filename affect the output.

Fragment	Meaning
ffmpeg	runs the FFmpeg program
-i "film.mp4"	points to the input file
-vn	skips video and keeps audio only
-ac 1	sets one audio channel, so mono
-ar 16000	sets the sample rate to 16 kHz
-c:a pcm_s16le	encodes audio as uncompressed 16-bit little-endian PCM (the WAV container comes from the .wav extension)
"audio.wav"	sets the output filename

Why mono and 16 kHz? Speech recognition usually needs neither stereo nor music-grade quality, and Whisper models work on 16 kHz mono audio internally. What matters more is that the voice is clean and the file is simple to process.
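To confirm that the conversion produced what was intended, the header of the WAV file can be read with Python's standard wave module. This is a minimal sketch; describe_wav and is_speech_ready are names invented for this example.

```python
import wave
from pathlib import Path


def describe_wav(path: Path) -> dict[str, int]:
    """Read the basic parameters from the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            # getsampwidth() returns bytes per sample, so 2 means 16-bit audio.
            "bits_per_sample": wav.getsampwidth() * 8,
        }


def is_speech_ready(path: Path) -> bool:
    """True if the file is mono, 16 kHz, 16-bit, matching the FFmpeg options above."""
    expected = {"channels": 1, "sample_rate": 16000, "bits_per_sample": 16}
    return describe_wav(path) == expected
```

Calling is_speech_ready(Path("audio.wav")) after Step 2 should return True if the command ran with the options from the table.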

Optional: start from a specific moment in the video

Sometimes a recording has a few minutes of intro, silence, a title screen, or microphone setup. In that case you do not need to process the whole file from the beginning.

You can use the -ss flag before the input:

ffmpeg -ss 00:05:39 -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"

In this example FFmpeg starts at 00:05:39. This is optional. For standard full-video transcription, use the version without -ss.
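If you prefer to drive FFmpeg from Python instead of typing the command, the same arguments can be assembled as a list and passed to subprocess. The sketch below assumes ffmpeg is on PATH; build_ffmpeg_command and extract_audio are hypothetical helper names, while the flags are exactly the ones used above.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_ffmpeg_command(video: Path, audio: Path, start: Optional[str] = None) -> list[str]:
    """Assemble the argument list for the audio extraction command."""
    command = ["ffmpeg"]
    if start is not None:
        # -ss placed before -i is an input option: FFmpeg seeks in the input first.
        command += ["-ss", start]
    command += ["-i", str(video)]
    # Output options: no video, mono, 16 kHz, 16-bit PCM.
    command += ["-vn", "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", str(audio)]
    return command


def extract_audio(video: Path, audio: Path, start: Optional[str] = None) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(video, audio, start), check=True)
```

Passing the arguments as a list avoids shell quoting problems with filenames that contain spaces. For example, extract_audio(Path("film.mp4"), Path("audio.wav"), start="00:05:39") corresponds to the optional command with -ss.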

Step 3. Python script for transcription

Create a file called transkrybuj.py in the same folder as audio.wav.

from pathlib import Path

from faster_whisper import WhisperModel

AUDIO_FILE = Path("audio.wav")
OUTPUT_FILE = Path("transkrypcja.txt")
MODEL_NAME = "large-v3"

# Fail early with a clear message if Step 2 has not been run yet.
if not AUDIO_FILE.exists():
    raise FileNotFoundError(f"File not found: {AUDIO_FILE}")

# The first run downloads the model, so this line may take a while.
model = WhisperModel(MODEL_NAME, device="auto", compute_type="auto")

# transcribe() returns a lazy generator of segments plus metadata about the audio.
segments, info = model.transcribe(
    str(AUDIO_FILE),
    language="pl",
    vad_filter=True,
)

print(f"Language: {info.language}, probability: {info.language_probability:.2f}")

# The transcription itself runs while this loop iterates over the generator.
with OUTPUT_FILE.open("w", encoding="utf-8") as file:
    for segment in segments:
        text = segment.text.strip()
        if not text:
            continue
        line = f"[{segment.start:.2f} - {segment.end:.2f}] {text}"
        print(line)
        file.write(line + "\n")

print(f"Saved: {OUTPUT_FILE}")

After running the script, the result will be saved to transkrypcja.txt. Each line has a simple format:

[12.34 - 18.90] Recognized speech text.

This gives you timestamps instead of one wall of text, which is convenient when you need to return to a specific fragment of the recording.
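Raw seconds are harder to match against a video player for long recordings. As an optional variation, the start and end times could be formatted as HH:MM:SS.mmm before writing each line. The helpers below are a sketch with invented names (format_timestamp, format_line), not something faster-whisper provides.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to HH:MM:SS.mmm."""
    # Work in whole milliseconds to avoid floating-point surprises.
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_line(start: float, end: float, text: str) -> str:
    """Build one transcript line in the same bracketed layout as the script."""
    return f"[{format_timestamp(start)} - {format_timestamp(end)}] {text.strip()}"
```

In the script, the line that builds the output string could then call format_line(segment.start, segment.end, segment.text) instead of formatting the floats directly.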

What do the settings in the script mean?

The key line creates the model:

model = WhisperModel(MODEL_NAME, device="auto", compute_type="auto")

MODEL_NAME = "large-v3" means we use a large Whisper model. It is more accurate, but it can be heavy for a weaker machine. The first run may also take time because the model has to be downloaded.

If your computer struggles or transcription takes too long, start with a smaller model:

MODEL_NAME = "medium"

or:

MODEL_NAME = "small"

Smaller models usually run faster, but they may make more mistakes, especially with poor audio, multiple speakers, or specialist vocabulary.

device="auto" lets the library choose how to run the model. If a compatible GPU is available and configured correctly, transcription may run faster. Otherwise the script will use the CPU.

compute_type="auto" allows the computation mode to be selected automatically. In practice this affects performance, memory usage, and hardware compatibility.
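If the automatic choice does not behave as expected, the settings can also be set explicitly. The sketch below uses two combinations shown in the faster-whisper README (int8 on CPU, float16 on a CUDA GPU); the helper names model_settings and load_model are assumptions made for this example, and whether the GPU variant works depends on your CUDA setup.

```python
def model_settings(use_gpu: bool) -> dict[str, str]:
    """Return explicit WhisperModel keyword arguments instead of "auto"."""
    if use_gpu:
        # Half precision on an NVIDIA GPU; requires a working CUDA installation.
        return {"device": "cuda", "compute_type": "float16"}
    # 8-bit quantization reduces memory use when running on the CPU.
    return {"device": "cpu", "compute_type": "int8"}


def load_model(model_name: str = "small", use_gpu: bool = False):
    """Create the model with explicit settings."""
    # Imported here so the settings helper can be used without the library installed.
    from faster_whisper import WhisperModel

    return WhisperModel(model_name, **model_settings(use_gpu))
```

Starting with load_model("small") on the CPU is a reasonable way to check that everything works before switching to a larger model.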

Transcription and the segments generator

In the transcription step we have:

segments, info = model.transcribe(
    str(AUDIO_FILE),
    language="pl",
    vad_filter=True,
)

language="pl" tells the model that the recording is in Polish, so it does not need to guess the language.

vad_filter=True enables voice activity detection (VAD). faster-whisper uses the Silero VAD model to skip fragments without speech, such as silence or longer pauses.

segments contains the recognized text fragments. Each segment includes, among other things, the start time, end time, and spoken text.

Important technical detail

In faster-whisper, segments is a generator. That means the real transcription happens only when you iterate over the segments in a for loop. That is why this script writes the output file inside the loop.
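The lazy behavior is easy to see with a plain Python generator. The example below does not use faster-whisper at all; fake_transcribe is a stand-in written for this illustration, and it only mimics the fact that work happens during iteration.

```python
from typing import Iterator

calls: list[str] = []


def fake_transcribe(chunks: list[str]) -> Iterator[str]:
    """Stand-in for model.transcribe: a generator that does its work lazily."""
    for chunk in chunks:
        calls.append(chunk)  # pretend this is the expensive decoding step
        yield chunk.upper()


segments = fake_transcribe(["first", "second"])
work_before = len(calls)      # nothing has run yet
results = list(segments)      # iterating triggers the work
work_after = len(calls)       # both chunks have now been processed
second_pass = list(segments)  # a generator can be consumed only once

print(work_before, results, work_after, second_pass)
```

The same applies to the real segments object: if you need the segments more than once, convert them with segments = list(segments) first, which runs the whole transcription at that point.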

How do you run the full process?

If you have film.mp4 and transkrybuj.py in the folder, run:

ffmpeg -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"
python transkrybuj.py

After it finishes, you should get:

transkrypcja.txt

What can go wrong?

The ffmpeg command does not work

First check whether FFmpeg is installed:

ffmpeg -version

If PowerShell does not recognize the command, close and reopen the terminal. Sometimes the system has to refresh PATH after installation.

Python cannot find the faster_whisper library

Make sure the .venv environment is activated:

.venv\Scripts\activate

Only then install the library:

pip install faster-whisper

Transcription takes a very long time

That is normal for large models and long recordings. Try changing MODEL_NAME = "large-v3" to "medium" or "small".

The result has errors

Automatic transcription is not a perfect verbatim record. The result depends heavily on the quality and content of the recording. Typical sources of errors are:

background noise
echo
multiple people speaking at once
a quiet microphone
music under speech
specialist terminology
names of people and companies

For important material, the transcript should still be reviewed manually.
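Manual review can be focused on the fragments the model was least sure about. Each faster-whisper segment carries avg_logprob and no_speech_prob values; the sketch below flags segments where these look weak. The thresholds are assumptions to tune on your own recordings, and SegmentInfo is a hypothetical stand-in used only so the example runs without a model.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    """Hypothetical stand-in with the fields read from a faster-whisper segment."""
    start: float
    end: float
    text: str
    avg_logprob: float
    no_speech_prob: float


def needs_review(segment, min_logprob: float = -1.0, max_no_speech: float = 0.6) -> bool:
    """Flag a segment whose confidence scores look weak (thresholds are assumptions)."""
    return segment.avg_logprob < min_logprob or segment.no_speech_prob > max_no_speech


examples = [
    SegmentInfo(0.0, 4.2, "Clear sentence.", avg_logprob=-0.2, no_speech_prob=0.01),
    SegmentInfo(4.2, 9.8, "Mumbled name.", avg_logprob=-1.4, no_speech_prob=0.10),
]
flagged = [s.text for s in examples if needs_review(s)]
print(flagged)
```

In the transcription loop, a marker such as "[CHECK]" could be appended to flagged lines so that a reviewer knows where to listen again. A low score does not prove an error, and a high score does not rule one out.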

What about company data?

If you transcribe company meetings, client calls, or recordings containing personal data, you need to think about file access and where the data is processed.

Running the model locally has one clear advantage: the audio file does not need to be uploaded to an external service. That still does not remove the need to think about security. Audio files and transcripts should live in an organized location with access only for the people who should actually see them.

What can you do with this next?

The transkrypcja.txt file is useful on its own, but the bigger value starts one step later.

You can extend the process, for example, to:

automatically pull recordings from a selected folder
transcribe new files
generate a short meeting summary
extract decisions and tasks
save the result to Google Drive
create an entry in a sheet or CRM
email the note to the team
build a knowledge base from recordings

At that point transcription stops being a one-off script. It becomes part of a process.
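The first items on that list, picking up recordings from a folder and transcribing only the new ones, can be sketched with the standard library alone. The example assumes a convention that is not part of the original script: each recording gets a transcript with the same name and a .txt extension next to it. The function name and the list of extensions are also assumptions.

```python
from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".mkv", ".mov"}  # assumption: adjust to your recordings


def find_untranscribed(folder: Path) -> list[Path]:
    """Return video files in the folder that have no .txt transcript beside them."""
    pending = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in VIDEO_SUFFIXES:
            # film.mp4 counts as done once film.txt exists in the same folder.
            if not path.with_suffix(".txt").exists():
                pending.append(path)
    return pending
```

A scheduled job could call find_untranscribed on the recordings folder and run the FFmpeg and transcription steps for each returned file.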

Summary

For a basic video transcription flow, three steps are enough:

extract audio from the video with FFmpeg
run the audio through faster-whisper
save the result to a text file

This is a simple example, but it shows the broader automation pattern well: turn a recording into text, text into knowledge, and knowledge into an organized process.

