
How to transcribe video to text with FFmpeg and faster-whisper

MorenaTech | Technical users and small businesses working with recordings | Practical | about 10 min

If you have a meeting recording, call, webinar, or training video and want to turn it into text, you could play it back, pause every few seconds, and type it manually. It makes more sense to automate it.

Before you read

This is a technical article. It is useful if you care about architecture, integrations, or the implementation layer of AI solutions. The guide-style versions are in the “For small business” section.
In this guide we extract the audio from a video with FFmpeg, run it through a local Whisper model, and save the result to a text file with timestamps. The example is written for Windows and PowerShell, but the same steps apply on macOS and Linux with different install and activation commands.

Why do transcription at all?

Transcription is useful anywhere a recording contains knowledge, decisions, or agreements. In a small business this often means client meetings, training sessions, webinars, consultations, and working recordings.

You can use it for:

  • preparing meeting notes
  • writing up a client conversation
  • developing webinar material
  • creating an article draft from a recording
  • finding specific fragments of a conversation
  • preparing a task list after a meeting
  • archiving knowledge in the company

The key point is that transcription does not need to be the end of the process. It can be the start of automation: a recording becomes text, the text becomes a summary, the summary becomes a task list, and the task list becomes a concrete workflow in the company.

What do you need?

For this example we will use three things:

  • FFmpeg, a tool for working with audio and video
  • Python
  • the faster-whisper library, which lets you run a Whisper model locally

We assume you have a file called film.mp4 in your working directory. In the same folder we will create a Python environment, an audio file, and the transcription script.

Technical sources

This article is based on the Microsoft Learn documentation for winget install, the official FFmpeg documentation, and the README of the faster-whisper project.

Step 1. Install the tools

On Windows, FFmpeg can be installed with Windows Package Manager (winget). Microsoft documents the winget install command, the --id option, and the -e / --exact option for exact package matching.

winget install --id Gyan.FFmpeg -e
python -m venv .venv
.venv\Scripts\activate
pip install faster-whisper

What is happening here?

  • winget install --id Gyan.FFmpeg -e installs FFmpeg with the exact package identifier
  • python -m venv .venv creates a local Python environment
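  • .venv\Scripts\activate activates the environment in PowerShell by running Activate.ps1 (if PowerShell refuses to run the script, its execution policy has to allow local scripts for your user)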
  • pip install faster-whisper installs the transcription library
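
Before moving on, it can help to confirm that both tools are actually visible from the activated environment. The following is a minimal sketch, not part of either tool: check_environment and missing_tools are hypothetical helper names, and the check only looks for the ffmpeg executable on PATH and for an installed faster_whisper package.

```python
import importlib.util
import shutil


def check_environment() -> dict[str, bool]:
    """Report whether the tools installed in this step can be found."""
    return {
        # shutil.which looks the executable up on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec locates the package without importing it
        "faster_whisper": importlib.util.find_spec("faster_whisper") is not None,
    }


def missing_tools(status: dict[str, bool]) -> list[str]:
    """Return the names of tools that were not found, sorted alphabetically."""
    return sorted(name for name, found in status.items() if not found)
```

Calling print(missing_tools(check_environment())) inside the activated environment should print an empty list when everything is in place.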

Step 2. Convert the video to audio

We do not need the video track for transcription. The model recognizes speech, so first we prepare a simple audio file.

ffmpeg -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"

This command takes film.mp4 and creates audio.wav. Order matters: FFmpeg applies each option to the next file named on the command line, so options placed before -i affect the input and options placed after it affect the output.

Fragment          Meaning
ffmpeg            runs the FFmpeg program
-i "film.mp4"     points to the input file
-vn               skips video and keeps audio only
-ac 1             sets one audio channel, so mono
-ar 16000         sets the sample rate to 16 kHz
-c:a pcm_s16le    saves audio as 16-bit PCM WAV
"audio.wav"       sets the output filename

Why mono and 16 kHz? Whisper models work on 16 kHz mono audio internally, so stereo or music-grade quality adds nothing for speech recognition. It matters more that the voice is clean and the file is simple to process.
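
If you prefer to drive this step from Python rather than typing the command, the same call can be assembled as an argument list. This is a sketch under the assumptions of this guide (film.mp4 in, audio.wav out); build_extract_command and extract_audio are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_extract_command(video: Path, audio: Path) -> list[str]:
    """Build the FFmpeg call from this step as an argument list."""
    return [
        "ffmpeg",
        "-i", str(video),     # input file
        "-vn",                # drop the video stream
        "-ac", "1",           # one channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit PCM audio
        str(audio),           # output file
    ]


def extract_audio(video: Path, audio: Path) -> None:
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_extract_command(video, audio), check=True)
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces. Note that FFmpeg asks for confirmation before overwriting an existing output file.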

Optional: start from a specific moment in the video

Sometimes a recording has a few minutes of intro, silence, a title screen, or microphone setup. In that case you do not need to process the whole file from the beginning.

You can use the -ss flag before the input:

ffmpeg -ss 00:05:39 -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"

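In this example FFmpeg starts at 00:05:39. Keep in mind that the timestamps in the transcript are then counted from that point, not from the start of the video. This is optional. For standard full-video transcription, use the version without -ss.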
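
When -ss is placed before the input, the extracted audio starts at zero, so a time of 12.34 in the transcript means 12.34 seconds after the cut point. A minimal sketch of shifting such times back onto the video timeline follows; parse_timestamp and to_video_time are hypothetical helpers, and the parser only handles the HH:MM:SS form used in this example.

```python
def parse_timestamp(value: str) -> float:
    """Convert an 'HH:MM:SS' value, as passed to -ss here, to seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def to_video_time(segment_seconds: float, start: str) -> float:
    """Shift a time measured in the trimmed audio back onto the video timeline."""
    return segment_seconds + parse_timestamp(start)


offset = parse_timestamp("00:05:39")  # 339.0 seconds
```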

Step 3. Python script for transcription

Create a file called transkrybuj.py in the same folder as audio.wav.

from pathlib import Path
from faster_whisper import WhisperModel

AUDIO_FILE = Path("audio.wav")
OUTPUT_FILE = Path("transkrypcja.txt")
MODEL_NAME = "large-v3"

# Fail early with a clear message if the audio from Step 2 is missing
if not AUDIO_FILE.exists():
    raise FileNotFoundError(f"File not found: {AUDIO_FILE}")

# The first run downloads the model, so it can take a while
model = WhisperModel(MODEL_NAME, device="auto", compute_type="auto")

segments, info = model.transcribe(
    str(AUDIO_FILE),
    language="pl",
    vad_filter=True,
)

print(f"Language: {info.language}, probability: {info.language_probability:.2f}")

# segments is a generator, so the transcription runs while this loop iterates
with OUTPUT_FILE.open("w", encoding="utf-8") as file:
    for segment in segments:
        text = segment.text.strip()
        if not text:
            continue
        line = f"[{segment.start:.2f} - {segment.end:.2f}] {text}"
        print(line)
        file.write(line + "\n")

print(f"Saved: {OUTPUT_FILE}")

After running the script, the result will be saved to transkrypcja.txt. Each line has a simple format:

[12.34 - 18.90] Recognized speech text.

This gives you timestamps instead of one wall of text, which is convenient when you need to return to a specific fragment of the recording.
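
Because every line follows the same pattern, the file is also easy to search programmatically. The sketch below assumes exactly the line format shown above; parse_line and find_fragments are hypothetical helper names.

```python
import re

# Matches lines such as "[12.34 - 18.90] Recognized speech text."
LINE_PATTERN = re.compile(r"^\[(\d+\.\d+) - (\d+\.\d+)\] (.*)$")


def parse_line(line: str):
    """Split one transcript line into (start, end, text), or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    start, end, text = match.groups()
    return float(start), float(end), text


def find_fragments(lines, keyword: str):
    """Return parsed entries whose text contains the keyword, ignoring case."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and keyword.lower() in entry[2].lower()]
```

For example, find_fragments(open("transkrypcja.txt", encoding="utf-8"), "budget") would return the start time, end time, and text of every line that mentions the word.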

What do the settings in the script mean?

The key line creates the model:

model = WhisperModel(MODEL_NAME, device="auto", compute_type="auto")

MODEL_NAME = "large-v3" means we use a large Whisper model. It is more accurate, but it can be heavy for a weaker machine. The first run may also take time because the model has to be downloaded.

If your computer struggles or transcription takes too long, start with a smaller model:

MODEL_NAME = "medium"

or:

MODEL_NAME = "small"

Smaller models usually run faster, but they may make more mistakes, especially with poor audio, multiple speakers, or specialist vocabulary.

device="auto" lets the library choose how to run the model. If a compatible GPU is available and configured correctly (in practice an NVIDIA card with the required CUDA libraries), transcription may run faster. Otherwise the script will use the CPU.

compute_type="auto" lets the library pick a numeric precision that the selected device supports, for example float16 on a GPU or int8 on a CPU. In practice this affects performance, memory usage, and hardware compatibility.
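
If you would rather not rely on "auto", the settings can be spelled out. The two combinations below are the ones shown in the faster-whisper README; model_options, load_model, and the use_gpu flag are hypothetical names introduced for this sketch.

```python
def model_options(use_gpu: bool) -> dict[str, str]:
    """Return explicit WhisperModel keyword arguments instead of "auto"."""
    if use_gpu:
        # float16 on a CUDA GPU
        return {"device": "cuda", "compute_type": "float16"}
    # int8 quantization on the CPU
    return {"device": "cpu", "compute_type": "int8"}


def load_model(model_name: str, use_gpu: bool):
    """Create the model with explicit settings; faster_whisper is imported on first use."""
    from faster_whisper import WhisperModel

    return WhisperModel(model_name, **model_options(use_gpu))
```

With this in place, model = load_model("small", use_gpu=False) would be a reasonable starting point on a laptop without a dedicated GPU.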

Transcription and the segments generator

In the transcription step we have:

segments, info = model.transcribe(
    str(AUDIO_FILE),
    language="pl",
    vad_filter=True,
)

language="pl" tells the model that the recording is in Polish, so it does not need to guess the language.

vad_filter=True enables voice activity detection. The tool tries to skip fragments without speech, such as silence or longer pauses.
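
The voice activity detection can be tuned through the vad_parameters argument of transcribe. The sketch below uses min_silence_duration_ms with the 500 ms value from the faster-whisper README example; transcribe_options is a hypothetical helper, and the best value for your recordings is something to test rather than assume.

```python
def transcribe_options(language: str, min_silence_ms: int = 500) -> dict:
    """Keyword arguments for model.transcribe with an adjusted VAD setting."""
    return {
        "language": language,
        "vad_filter": True,
        # Silence shorter than this many milliseconds is kept in the audio
        "vad_parameters": {"min_silence_duration_ms": min_silence_ms},
    }
```

It would be used as segments, info = model.transcribe(str(AUDIO_FILE), **transcribe_options("pl")).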

segments contains the recognized text fragments. Each segment includes, among other things, the start time, end time, and spoken text.

Important technical detail

In faster-whisper, segments is a generator. That means the real transcription happens only when you iterate over the segments in a for loop. That is why this script writes the output file inside the loop.
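
The behavior is easy to see with an ordinary Python generator. The example below does not use faster-whisper at all: fake_segments is a stand-in that records when its "expensive" step runs, which mirrors how the real segments object defers work until iteration.

```python
def fake_segments(texts, log):
    """Stand-in for the segments generator: work happens only when iterated."""
    for index, text in enumerate(texts):
        log.append(f"decoded segment {index}")  # simulates the expensive step
        yield text


log = []
segments = fake_segments(["first", "second"], log)
work_before_iteration = list(log)  # still empty: nothing has run yet

collected = list(segments)         # iterating triggers the work
work_after_iteration = list(log)

second_pass = list(segments)       # empty: a generator can be consumed only once
```

The same applies to the real library: if you need to go over the results more than once, collect them first with segments = list(segments), at the cost of waiting for the whole transcription to finish.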

How do you run the full process?

If you have film.mp4 and transkrybuj.py in the folder, run:

ffmpeg -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"
python transkrybuj.py

After it finishes, you should get:

transkrypcja.txt

What can go wrong?

The ffmpeg command does not work

First check whether FFmpeg is installed:

ffmpeg -version

If PowerShell does not recognize the command, close and reopen the terminal. Sometimes the system has to refresh PATH after installation.

Python cannot find the faster_whisper library

Make sure the .venv environment is activated:

.venv\Scripts\activate

Only then install the library:

pip install faster-whisper

Transcription takes a very long time

That is normal for large models and long recordings. Try changing MODEL_NAME = "large-v3" to "medium" or "small".

The result has errors

Automatic transcription is not a perfect verbatim record. The result depends heavily on the quality and content of the recording. Typical sources of errors:

  • background noise
  • echo
  • multiple people speaking at once
  • a quiet microphone
  • music under speech
  • specialist terminology
  • names of people and companies

For important material, the transcript should still be reviewed manually.
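
To focus a manual review, you can mark segments where the model itself was unsure. Each faster-whisper segment carries avg_logprob and no_speech_prob values; the sketch below flags segments based on them. The thresholds and the helper names needs_review and review_marker are illustrative assumptions, and the two SimpleNamespace objects only stand in for real segments.

```python
from types import SimpleNamespace


def needs_review(segment, min_avg_logprob: float = -1.0, max_no_speech_prob: float = 0.6) -> bool:
    """Flag a segment whose confidence scores look weak (thresholds are illustrative)."""
    return segment.avg_logprob < min_avg_logprob or segment.no_speech_prob > max_no_speech_prob


def review_marker(segment) -> str:
    """Prefix for a transcript line: '?' marks segments worth checking by hand."""
    return "? " if needs_review(segment) else "  "


# Stand-in objects with the two attributes used above
clear = SimpleNamespace(avg_logprob=-0.2, no_speech_prob=0.01)
doubtful = SimpleNamespace(avg_logprob=-1.4, no_speech_prob=0.10)
```

In the transcription loop, writing review_marker(segment) + line instead of line would put a question mark in front of the fragments to check first. A low score does not prove an error, and a high score does not rule one out.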

What about company data?

If you transcribe company meetings, client calls, or recordings containing personal data, you need to think about file access and where the data is processed.

Running the model locally has one clear advantage: the audio file does not need to be uploaded to an external service. That still does not remove the need to think about security. Audio files and transcripts should live in an organized location with access only for the people who should actually see them.

What can you do with this next?

The transkrypcja.txt file is useful on its own, but the bigger value starts one step later.

You can extend the process, for example, to:

  • automatically pull recordings from a selected folder
  • transcribe new files
  • generate a short meeting summary
  • extract decisions and tasks
  • save the result to Google Drive
  • create an entry in a sheet or CRM
  • email the note to the team
  • build a knowledge base from recordings

At that point transcription stops being a one-off script. It becomes part of a process.
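
The first two items on that list can start very small. The sketch below finds recordings in a folder that do not yet have a transcript. It assumes a naming convention that differs from the single-file example in this article: each transcript is stored next to its video with the same name and a .txt extension. find_untranscribed and VIDEO_SUFFIXES are hypothetical names.

```python
from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".mkv", ".mov"}  # assumed set of formats, adjust as needed


def find_untranscribed(folder: Path) -> list[Path]:
    """List videos in the folder that have no .txt transcript with the same name yet."""
    videos = (
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in VIDEO_SUFFIXES
    )
    return sorted(path for path in videos if not path.with_suffix(".txt").exists())
```

A scheduled task could call this function and run the FFmpeg and transcription steps for each returned file.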

Summary

For a basic video transcription flow, three steps are enough:

  1. extract audio from the video with FFmpeg
  2. run the audio through faster-whisper
  3. save the result to a text file

This is a simple example, but it shows the broader automation pattern well: turn a recording into text, text into knowledge, and knowledge into an organized process.

Want to turn recordings into useful notes and tasks?

MorenaTech helps build practical AI workflows: recording transcription, meeting summaries, extracting decisions, task lists, and an organized knowledge base from company materials.
