How to transcribe video to text with FFmpeg and faster-whisper
If you have a meeting recording, call, webinar, or training video and want to turn it into text, you could play it back, pause every few seconds, and type it manually. It makes more sense to automate it.
Before you read
Why do transcription at all?
Transcription is useful anywhere a recording contains knowledge, decisions, or agreements. In a small business this often means client meetings, training sessions, webinars, consultations, and working recordings.
You can use it for:
- preparing meeting notes
- writing up a client conversation
- developing webinar material
- creating an article draft from a recording
- finding specific fragments of a conversation
- preparing a task list after a meeting
- archiving knowledge in the company
The key point is that transcription does not need to be the end of the process. It can be the start of automation: a recording becomes text, the text becomes a summary, the summary becomes a task list, and the task list becomes a concrete workflow in the company.
What do you need?
For this example we will use three things:
- FFmpeg, a tool for working with audio and video
- Python
- the faster-whisper library, which lets you run a Whisper model locally
We assume you have a file called film.mp4 in your working directory. In the same folder we will create a Python environment, an audio file, and the transcription script.
Technical sources
This article is based on Microsoft Learn for winget install, the official FFmpeg documentation and the README of the faster-whisper project.
Step 1. Install the tools
On Windows, FFmpeg can be installed with Windows Package Manager (winget). Microsoft documents the winget install command, the --id option, and -e / --exact for exact package matching.
winget install --id Gyan.FFmpeg -e
python -m venv .venv
.venv\Scripts\activate
pip install faster-whisper
What is happening here?
- winget install --id Gyan.FFmpeg -e installs FFmpeg with the exact package identifier
- python -m venv .venv creates a local Python environment
- .venv\Scripts\activate activates the environment in PowerShell
- pip install faster-whisper installs the transcription library
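To confirm that the installation steps above worked, a small check can be run inside the activated environment. This is a minimal sketch using only the Python standard library; check_tools is a hypothetical helper name, not part of FFmpeg or faster-whisper.

```python
import importlib.util
import shutil


def check_tools() -> dict[str, bool]:
    """Report whether the tools installed in this step can be found."""
    return {
        # shutil.which looks for the executable on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the package is not installed
        "faster_whisper": importlib.util.find_spec("faster_whisper") is not None,
    }


for name, found in check_tools().items():
    print(f"{name}: {'OK' if found else 'missing'}")
```

If ffmpeg shows as missing right after installation, opening a new terminal is usually the first thing to try, because PATH may not be refreshed yet.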
Step 2. Convert the video to audio
We do not need the image for transcription. The model recognizes speech, so first we prepare a simple audio file.
ffmpeg -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"
This command takes film.mp4 and creates audio.wav. FFmpeg interprets input and output options depending on where they appear in the command, so order matters.
| Fragment | Meaning |
|---|---|
| ffmpeg | runs the FFmpeg program |
| -i "film.mp4" | points to the input file |
| -vn | skips video and keeps audio only |
| -ac 1 | sets one audio channel, so mono |
| -ar 16000 | sets the sample rate to 16 kHz |
| -c:a pcm_s16le | encodes audio as uncompressed 16-bit PCM; the .wav extension of the output selects the WAV container |
| "audio.wav" | sets the output filename |
Why mono and 16 kHz? For speech recognition you usually do not need stereo or music-grade quality. It matters more that the voice is clean and the file is simple to process.
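If you would rather start the conversion from Python than type it in the terminal, the same command can be passed to subprocess as a list of arguments. This is a sketch under that assumption; build_ffmpeg_command and extract_audio are hypothetical helper names, and the flags are exactly those from the table above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video: Path, audio: Path) -> list[str]:
    """Build the FFmpeg command from this step as an argument list."""
    return [
        "ffmpeg",
        "-i", str(video),      # input file
        "-vn",                 # skip the video stream
        "-ac", "1",            # one audio channel (mono)
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit PCM audio
        str(audio),            # output file
    ]


def extract_audio(video: Path, audio: Path) -> None:
    """Run FFmpeg; check=True raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_ffmpeg_command(video, audio), check=True)


# Usage (requires FFmpeg on PATH):
# extract_audio(Path("film.mp4"), Path("audio.wav"))
```

Passing a list instead of one long string avoids quoting problems when a filename contains spaces.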
Optional: start from a specific moment in the video
Sometimes a recording has a few minutes of intro, silence, a title screen, or microphone setup. In that case you do not need to process the whole file from the beginning.
You can use the -ss flag before the input:
ffmpeg -ss 00:05:39 -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"
In this example FFmpeg starts at 00:05:39. This is optional. For standard full-video transcription, use the version without -ss.
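One side effect of -ss is worth keeping in mind: audio.wav then begins at the cut point, so times in the transcript are counted from 00:05:39 of the video, not from its start. A minimal sketch of converting them back is shown here; hms_to_seconds and to_video_time are hypothetical helper names, not part of FFmpeg or faster-whisper.

```python
def hms_to_seconds(value: str) -> float:
    """Convert "HH:MM:SS" (the form passed to -ss above) to seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def to_video_time(segment_time: float, ss: str) -> float:
    """Shift a time measured in the trimmed audio back onto the original video."""
    return segment_time + hms_to_seconds(ss)


print(hms_to_seconds("00:05:39"))                  # 339.0
print(f"{to_video_time(12.34, '00:05:39'):.2f}")   # 351.34
```

So a segment that the transcript places at 12.34 s is found at roughly 351 s in film.mp4.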
Step 3. Python script for transcription
Create a file called transkrybuj.py in the same folder as audio.wav.
from pathlib import Path
from faster_whisper import WhisperModel
# Input, output, and model size
AUDIO_FILE = Path("audio.wav")
OUTPUT_FILE = Path("transkrypcja.txt")
MODEL_NAME = "large-v3"
# Stop early with a clear message if the audio file is missing
if not AUDIO_FILE.exists():
    raise FileNotFoundError(f"File not found: {AUDIO_FILE}")
model = WhisperModel(MODEL_NAME, device="auto", compute_type="auto")
segments, info = model.transcribe(
    str(AUDIO_FILE),
    language="pl",
    vad_filter=True,
)
print(f"Language: {info.language}, probability: {info.language_probability:.2f}")
# Write each segment as it is produced, one line per segment
with OUTPUT_FILE.open("w", encoding="utf-8") as file:
    for segment in segments:
        text = segment.text.strip()
        if not text:
            continue
        line = f"[{segment.start:.2f} - {segment.end:.2f}] {text}"
        print(line)
        file.write(line + "\n")
print(f"Saved: {OUTPUT_FILE}")
After running the script, the result will be saved to transkrypcja.txt. Each line has a simple format:
[12.34 - 18.90] Recognized speech text.
This gives you timestamps instead of one wall of text, which is convenient when you need to return to a specific fragment of the recording.
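Because every line follows the same pattern, the file is also easy to read back in code, for example to find the fragments where a given word was spoken. This is a sketch that assumes the exact line format shown above; parse_line and find_fragments are hypothetical helper names.

```python
import re
from typing import Optional

# Matches lines such as "[12.34 - 18.90] Recognized speech text."
LINE_PATTERN = re.compile(r"^\[(\d+\.\d+) - (\d+\.\d+)\] (.*)$")


def parse_line(line: str) -> Optional[tuple[float, float, str]]:
    """Return (start, end, text), or None if the line has a different format."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    start, end, text = match.groups()
    return float(start), float(end), text


def find_fragments(lines, keyword: str):
    """Yield parsed lines whose text contains the keyword (case-insensitive)."""
    for line in lines:
        parsed = parse_line(line)
        if parsed and keyword.lower() in parsed[2].lower():
            yield parsed


# Usage with the file written by the script:
# with open("transkrypcja.txt", encoding="utf-8") as file:
#     for start, end, text in find_fragments(file, "invoice"):
#         print(start, end, text)
```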
What do the settings in the script mean?
The key line creates the model:
model = WhisperModel(MODEL_NAME, device="auto", compute_type="auto")
MODEL_NAME = "large-v3" means we use a large Whisper model. It is more accurate, but it can be heavy for a weaker machine. The first run may also take time because the model has to be downloaded.
If your computer struggles or transcription takes too long, start with a smaller model:
MODEL_NAME = "medium"
or:
MODEL_NAME = "small"
Smaller models usually run faster, but they may make more mistakes, especially with poor audio, multiple speakers, or specialist vocabulary.
device="auto" lets the library choose how to run the model. If a compatible GPU is available and configured correctly, transcription may run faster. Otherwise the script will use the CPU.
compute_type="auto" allows the computation mode to be selected automatically. In practice this affects performance, memory usage, and hardware compatibility.
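If "auto" does not pick what you expect, both values can be set explicitly. The combinations below (cuda with float16, cpu with int8) are the ones shown in the faster-whisper README; the helpers model_settings and load_model are hypothetical names used only for this sketch, and which combination is fastest on your machine is something to measure, not assume.

```python
def model_settings(use_gpu: bool) -> dict[str, str]:
    """Return explicit WhisperModel keyword arguments instead of "auto"."""
    if use_gpu:
        # GPU with 16-bit floating point
        return {"device": "cuda", "compute_type": "float16"}
    # CPU with 8-bit integer quantization, which lowers memory use
    return {"device": "cpu", "compute_type": "int8"}


def load_model(model_name: str, use_gpu: bool):
    """Create the model; the import is local so model_settings works without the library."""
    from faster_whisper import WhisperModel

    return WhisperModel(model_name, **model_settings(use_gpu))


# Usage (downloads the model on first run):
# model = load_model("small", use_gpu=False)
```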
Transcription and the segments generator
In the transcription step we have:
segments, info = model.transcribe(
    str(AUDIO_FILE),
    language="pl",
    vad_filter=True,
)
language="pl" tells the model that the recording is in Polish, so it does not need to guess the language.
vad_filter=True enables voice activity detection. The tool tries to skip fragments without speech, such as silence or longer pauses.
segments contains the recognized text fragments. Each segment includes, among other things, the start time, end time, and spoken text.
Important technical detail
In faster-whisper, segments is a generator. That means the real transcription happens only when you iterate over the segments in a for loop. That is why this script writes the output file inside the loop.
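The generator behaviour can be seen without loading any model. The sketch below uses a stand-in function, fake_transcribe, which is purely illustrative and not part of faster-whisper; it only mimics the fact that work happens during iteration.

```python
def fake_transcribe(chunks, log):
    """Stand-in for model.transcribe(): a generator that works only when iterated."""
    for chunk in chunks:
        log.append(f"processed {chunk}")  # the "expensive" work happens here
        yield chunk.upper()


log = []
segments = fake_transcribe(["first", "second"], log)
print(log)                 # [] because nothing has run yet

results = list(segments)   # iterating triggers the work
print(log)                 # ['processed first', 'processed second']
print(list(segments))      # [] because a generator can be consumed only once
```

The practical consequence for the real script: if you need the segments more than once, for example to print them and later build a summary, collect them first with segments = list(segments).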
How do you run the full process?
If you have film.mp4 and transkrybuj.py in the folder, run:
ffmpeg -i "film.mp4" -vn -ac 1 -ar 16000 -c:a pcm_s16le "audio.wav"
python transkrybuj.py
After it finishes, you should get:
transkrypcja.txt
What can go wrong?
The ffmpeg command does not work
First check whether FFmpeg is installed:
ffmpeg -version
If PowerShell does not recognize the command, close and reopen the terminal. Sometimes the system has to refresh PATH after installation.
Python cannot find the faster_whisper library
Make sure the .venv environment is activated:
.venv\Scripts\activate
Only then install the library:
pip install faster-whisper
Transcription takes a very long time
That is normal for large models and long recordings. Try changing MODEL_NAME = "large-v3" to "medium" or "small".
The result has errors
Automatic transcription does not produce a perfect verbatim record. The result depends heavily on the quality and content of the recording. Typical sources of errors are:
- background noise
- echo
- multiple people speaking at once
- a quiet microphone
- music under speech
- specialist terminology
- names of people and companies
For important material, the transcript should still be reviewed manually.
What about company data?
If you transcribe company meetings, client calls, or recordings containing personal data, you need to think about file access and where the data is processed.
Running the model locally has one clear advantage: the audio file does not need to be uploaded to an external service. That still does not remove the need to think about security. Audio files and transcripts should live in an organized location with access only for the people who should actually see them.
What can you do with this next?
The transkrypcja.txt file is useful on its own, but the bigger value starts one step later.
You can extend the process, for example, to:
- automatically pull recordings from a selected folder
- transcribe new files
- generate a short meeting summary
- extract decisions and tasks
- save the result to Google Drive
- create an entry in a sheet or CRM
- email the note to the team
- build a knowledge base from recordings
At that point transcription stops being a one-off script. It becomes part of a process.
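The first two items on that list, picking up recordings from a folder and transcribing only the new ones, can be sketched with the standard library alone. The assumptions here are loud ones: each transcript is saved next to its video under the same name with a .txt extension (unlike the fixed transkrypcja.txt used earlier), and VIDEO_SUFFIXES and find_untranscribed are hypothetical names.

```python
from pathlib import Path

# Assumption: adjust to the formats your recordings actually use
VIDEO_SUFFIXES = {".mp4", ".mkv", ".mov"}


def find_untranscribed(folder: Path) -> list[Path]:
    """Return video files in the folder that have no matching .txt transcript yet."""
    pending = []
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() not in VIDEO_SUFFIXES:
            continue  # not a recording
        if path.with_suffix(".txt").exists():
            continue  # already transcribed
        pending.append(path)
    return pending


# Usage:
# for video in find_untranscribed(Path("recordings")):
#     print("To transcribe:", video.name)
```

Each returned file would then go through the same two steps as before: FFmpeg to audio, then faster-whisper to text.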
Summary
For a basic video transcription flow, three steps are enough:
- extract audio from the video with FFmpeg
- run the audio through faster-whisper
- save the result to a text file
This is a simple example, but it shows the broader automation pattern well: turn a recording into text, text into knowledge, and knowledge into an organized process.
Want to turn recordings into useful notes and tasks?
MorenaTech helps build practical AI workflows: recording transcription, meeting summaries, extracting decisions, task lists, and an organized knowledge base from company materials.