Transcribe and Correct with AI
Reading time: approx. 10 min
What You Will Learn
This is the core of the process. We will use two different AI models. First, KB-Whisper performs the actual speech-to-text conversion on our audio files. Then a text correction model automatically adds punctuation (periods, commas) and corrects capitalization, which makes the text much more readable.
The Basics: Two-Stage Rocket
- Transcription: KBLab/kb-whisper-large listens to the audio and writes down the words it hears. The result is raw, unpunctuated text with timestamps.
- Punctuation: sdadas/byt5-text-correction reads the raw text and uses its understanding of grammar and sentence structure to add periods, commas, and capital letters.
How We Do It: The Scripts That Do the Job
Step 1: Transcribe the Segments (transcribe.py)
Create the file transcribe.py in your project folder and paste in the code below. The code loads the KB-Whisper model into your graphics card memory (if you have one) and then feeds it one audio file at a time.
import os, warnings
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

warnings.filterwarnings("ignore", message=".*deprecated.*")

# --- Settings ---
MODEL = "KBLab/kb-whisper-large"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
DIR = "chunks"
OUTFILE = "transcript_with_timestamps.txt"
CHUNK_DURATION_S = 30

# Load the model and processor
print(f"Loading model {MODEL} to {DEVICE}...")
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL, torch_dtype=DTYPE, use_safetensors=True).to(DEVICE)
processor = AutoProcessor.from_pretrained(MODEL)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=DEVICE,
    torch_dtype=DTYPE
)

# Helper function to format time
def format_time(seconds):
    h, r = divmod(seconds, 3600)
    m, s = divmod(r, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d}"

# Fetch and sort the audio files
files = sorted([f for f in os.listdir(DIR) if f.endswith(".wav")])

# Open a file to write the result to
with open(OUTFILE, "w", encoding="utf-8") as out:
    for filename in files:
        chunk_index = int(filename.split("_")[1].split(".")[0])
        start_time = chunk_index * CHUNK_DURATION_S
        end_time = start_time + CHUNK_DURATION_S
        filepath = os.path.join(DIR, filename)
        print(f"Transcribing {filename}...")
        result = asr_pipeline(filepath, generate_kwargs={"language": "sv"})
        text = result["text"].strip()
        out.write(f"[{format_time(start_time)} - {format_time(end_time)}]\n{text}\n\n")

print(f"Transcription complete! Result saved in: {OUTFILE}")
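The timestamp arithmetic in transcribe.py can be tried on its own, without loading the model. The sketch below assumes chunk files are named like chunk_004.wav (the prefix_index.wav pattern the script parses) and that every chunk is exactly 30 seconds long. The helper names chunk_index and chunk_time_range are hypothetical and only serve the illustration.

```python
CHUNK_DURATION_S = 30  # assumed fixed chunk length, as in transcribe.py


def format_time(seconds):
    """Format a number of seconds as HH:MM:SS."""
    h, r = divmod(int(seconds), 3600)
    m, s = divmod(r, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def chunk_index(filename):
    """Extract the numeric index from a name like 'chunk_004.wav'."""
    return int(filename.split("_")[1].split(".")[0])


def chunk_time_range(filename):
    """Return the '[start - end]' header for one chunk file (hypothetical helper)."""
    start = chunk_index(filename) * CHUNK_DURATION_S
    return f"[{format_time(start)} - {format_time(start + CHUNK_DURATION_S)}]"


if __name__ == "__main__":
    names = ["chunk_10.wav", "chunk_2.wav", "chunk_0.wav"]
    # Sorting by the parsed index keeps chunks in order even without zero padding.
    for name in sorted(names, key=chunk_index):
        print(name, chunk_time_range(name))
```

A plain sorted() call orders names alphabetically, so unpadded names such as chunk_10.wav would come before chunk_2.wav. Sorting on the parsed index avoids that if your chunk names are not zero-padded.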
Step 2: Add Punctuation and Capitalization (punctuate.py)
Create the file punctuate.py. This script reads the raw text file, sends the text blocks to the correction model, and then writes a new, cleaner text file.
import re
import torch
from transformers import pipeline

# --- Settings ---
INPUT_FILE = "transcript_with_timestamps.txt"
OUTPUT_FILE = "transcript_punctuated.txt"
MODEL = "sdadas/byt5-text-correction"
BATCH_SIZE = 8  # Process 8 text blocks at a time for efficiency
DEVICE = 0 if torch.cuda.is_available() else -1  # First GPU if present, otherwise CPU
# Assumption: ByT5 works on bytes, so a 30-second block needs a generous
# output limit. Raise this value if the corrected text is cut off.
MAX_LENGTH = 512

# Function to ensure capital letter after period
def capitalize_sentences(text):
    parts = re.split(r'([.?!]\s*)', text)
    # Uppercase only the first character of each part. str.capitalize() would
    # also lowercase the rest and flatten proper nouns in mid-sentence.
    return "".join(p[:1].upper() + p[1:] for p in parts)

print(f"Loading punctuation model {MODEL}...")
punctuation_pipeline = pipeline(
    "text2text-generation",
    model=MODEL,
    tokenizer=MODEL,
    device=DEVICE,
    batch_size=BATCH_SIZE
)

# Read in the raw text
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

# Separate timestamps from text to be processed
timestamps, texts_to_process = [], []
for line in lines:
    if line.startswith("[") or not line.strip():
        timestamps.append(line)
    else:
        timestamps.append(None)  # Marker for text
        texts_to_process.append(line)

print(f"Correcting {len(texts_to_process)} text segments...")
corrected_texts = punctuation_pipeline(texts_to_process, max_length=MAX_LENGTH)

# Assemble the final text
output_lines = []
text_index = 0
for ts in timestamps:
    if ts is not None:
        output_lines.append(ts)
    else:
        # Fetch corrected text and apply capitalization
        corrected_text = corrected_texts[text_index]['generated_text'].strip()
        final_text = capitalize_sentences(corrected_text)
        output_lines.append(final_text)
        text_index += 1

with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    f.write("\n".join(output_lines))

print(f"Punctuation complete! Result saved in: {OUTPUT_FILE}")
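One detail of sentence capitalization is worth trying in isolation: Python's str.capitalize() uppercases the first character but also lowercases everything after it, which would turn a name like "Stockholm" in the middle of a sentence into "stockholm". A minimal sketch of a helper that only touches the first letter of each sentence follows. The name capitalize_sentence_starts and the sample sentence are made up for illustration.

```python
import re


def capitalize_sentence_starts(text):
    """Uppercase the first letter of each sentence, leaving the rest untouched."""
    # Split on sentence-ending punctuation, keeping the delimiters in the list.
    parts = re.split(r"([.?!]\s*)", text)
    return "".join(p[:1].upper() + p[1:] for p in parts)


if __name__ == "__main__":
    sample = "hej. jag bor i Stockholm."  # illustrative input, not model output
    # str.capitalize() on each part lowercases the city name:
    print("".join(p.capitalize() for p in re.split(r"([.?!]\s*)", sample)))
    # prints: Hej. Jag bor i stockholm.
    print(capitalize_sentence_starts(sample))
    # prints: Hej. Jag bor i Stockholm.
```

Note that either approach also capitalizes after abbreviations that contain periods, so a quick read-through of the result is still worthwhile.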
How to Run the Scripts
Make sure your virtual environment is active (source .venv/bin/activate). Then run the scripts in order:
Run the transcription:
python transcribe.py
This can take a while depending on your computer's power and the length of the audio file.
Run the punctuation:
python punctuate.py
This usually goes much faster.
You now have a file called transcript_punctuated.txt with a clean and timestamped transcript.
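If you want to process the result further in Python, the layout the scripts write (a [HH:MM:SS - HH:MM:SS] line, then the text, then a blank line) can be read back into structured records. This is a minimal sketch that assumes exactly that layout. The function name parse_transcript is hypothetical and the sample text is invented.

```python
import re

# Matches a header line such as "[00:00:30 - 00:01:00]".
HEADER = re.compile(r"^\[(\d{2}:\d{2}:\d{2}) - (\d{2}:\d{2}:\d{2})\]$")


def parse_transcript(content):
    """Return a list of (start, end, text) tuples from the transcript text."""
    segments = []
    current = None  # [start, end, list of text lines] for the open segment
    for line in content.splitlines():
        match = HEADER.match(line.strip())
        if match:
            # A new header closes the previous segment, if there is one.
            if current:
                segments.append((current[0], current[1], " ".join(current[2])))
            current = [match.group(1), match.group(2), []]
        elif line.strip() and current:
            current[2].append(line.strip())
    if current:
        segments.append((current[0], current[1], " ".join(current[2])))
    return segments


if __name__ == "__main__":
    # Illustrative content only, not real model output.
    sample = (
        "[00:00:00 - 00:00:30]\nHej och välkomna.\n\n"
        "[00:00:30 - 00:01:00]\nIdag pratar vi om AI.\n"
    )
    for start, end, text in parse_transcript(sample):
        print(start, end, text)
```

To use it on the real file, read transcript_punctuated.txt with encoding="utf-8" and pass the contents to parse_transcript.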
Next Step
The text is ready! But a .txt file is not always the most useful format. In the final step we will learn to convert our text into nice documents in formats like Markdown, Word (.docx), and HTML.

