MLB Broadcast Transcripts
Research · not official
Auto-transcribed MLB game broadcasts plus a derived word-mention dataset for 21 tracked terms modeled on Kalshi NEWMENTION markets. About 1,100 games and growing.
Files, columns & methodology
What's in the bundle
Three CSV files joined on a short gameid key — a per-game
transcript table (with game-boundary markers), a per-game×term summary,
and a per-mention detail table. Row counts are approximate and update
from the latest snapshot.
term_game_indicators.csv
One row per (game × term). Whether each of the 21 terms was said, how many times, and where in the game.
Columns & sample rows
- gameid: short stable game key (join key across all files)
- date, away_team, home_team, season: game identity
- term: one of the 21 tracked words/phrases
- said: 1 if said in-game, else 0
- count: number of in-game mentions
- first_char_index, first_word_index: position of first mention in the transcript
- first_game_pos, last_game_pos: normalized position (0 = first pitch, 1 = final out)
- boundary_confidence: high / medium / low / none (filterable)

| gameid | date | away | home | term | said | count | first_game_pos | last_game_pos | boundary_confidence |
|---|---|---|---|---|---|---|---|---|---|
| cc7acdb0d5b9 | 2025-10-29 | Toronto Blue Jays | Los Angeles Dodgers | Bullpen | 1 | 11 | 0.1984 | 0.9553 | low |
| cc7acdb0d5b9 | 2025-10-29 | Toronto Blue Jays | Los Angeles Dodgers | Double Play | 1 | 6 | 0.0464 | 0.6895 | low |
| cc7acdb0d5b9 | 2025-10-29 | Toronto Blue Jays | Los Angeles Dodgers | Perfect Game | 0 | 0 | | | low |
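
Because boundary_confidence is filterable, a common first step is to drop low-confidence games before computing how often each term is said. The following is a minimal sketch using only the Python standard library. The function name `said_rate_by_term`, the choice to keep only high and medium rows, and the in-memory sample rows are all illustrative assumptions, not part of the bundle.

```python
import csv
import io
from collections import defaultdict

def said_rate_by_term(rows, keep=("high", "medium")):
    """Fraction of games in which each term was said, using only rows
    whose boundary_confidence is in `keep` (hypothetical helper)."""
    said = defaultdict(int)
    games = defaultdict(int)
    for row in rows:
        if row["boundary_confidence"] not in keep:
            continue  # skip games whose in-game span is less reliable
        games[row["term"]] += 1
        said[row["term"]] += int(row["said"])
    return {term: said[term] / games[term] for term in games}

# Made-up stand-in for term_game_indicators.csv (subset of columns).
SAMPLE = """gameid,term,said,count,boundary_confidence
g1,Bullpen,1,11,high
g2,Bullpen,0,0,high
g3,Bullpen,1,4,low
g1,Double Play,1,6,high
"""

rates = said_rate_by_term(csv.DictReader(io.StringIO(SAMPLE)))
print(rates)  # {'Bullpen': 0.5, 'Double Play': 1.0}
```

For the real file, pass `csv.DictReader(open("term_game_indicators.csv", newline="", encoding="utf-8"))` instead of the in-memory sample.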
term_occurrences.csv
One row per individual mention, granular and locatable. Each row maps back to the exact spot in the transcript.
Columns & sample rows
- gameid: short stable game key (join key across all files)
- date, away_team, home_team, season: game identity
- term: the tracked word/phrase
- mention_index: 1-based index of this mention within the game
- char_index, word_index: offset of the mention in the full transcript
- game_pos: normalized position (0 = first pitch, 1 = final out)
- matched_text: the exact text that matched
- snippet: surrounding transcript text for context

| gameid | date | term | mention_index | char_index | word_index | game_pos | matched_text | snippet |
|---|---|---|---|---|---|---|---|---|
| cc7acdb0d5b9 | 2025-10-29 | Bullpen | 1 | 26144 | 4956 | 0.1984 | bullpen | …going to mechanics in the bullpen, and you just get that… |
| cc7acdb0d5b9 | 2025-10-29 | Bullpen | 2 | 61230 | 11581 | 0.5487 | bullpen | …the Blue Jays have had the edge. The bullpen you figured coming in… |
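
To check that an occurrence row really points at the right place, one can slice the transcript at char_index and compare the result with matched_text. The sketch below assumes char_index is a 0-based character offset into the transcript column of mlb_transcripts.csv; if the offsets turn out to be 1-based or relative to the in-game span, the slice needs adjusting. The helper name `locate_mention` and the sample transcript are hypothetical.

```python
def locate_mention(transcript, char_index, matched_text, context=30):
    """Return (ok, snippet).

    ok is True when the transcript text starting at char_index equals
    matched_text; snippet is that span plus `context` characters on
    either side. Assumes a 0-based offset (an assumption, see above).
    """
    end = char_index + len(matched_text)
    ok = transcript[char_index:end] == matched_text
    start = max(0, char_index - context)
    return ok, transcript[start:end + context]

# Made-up transcript fragment for illustration only.
transcript = "He was going to mechanics in the bullpen, and you just get that feel."
idx = transcript.index("bullpen")

ok, snippet = locate_mention(transcript, idx, "bullpen", context=12)
print(ok, repr(snippet))
```

A False result for many rows would suggest the offset convention differs from the one assumed here.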
mlb_transcripts.csv
The raw asset — one row per game with the full auto-transcribed
broadcast (~100k chars each). Needed to resolve
char_index / word_index back to text.
Game-boundary markers are folded in as columns.
Columns & sample row
- gameid: short stable game key (join key across all files)
- game_url: source game URL (human reference)
- date, away_team, home_team, season: game identity
- transcript: full transcribed text
- Game-boundary markers (folded in):
  - boundary_start_char, boundary_end_char: character span of live in-game commentary
  - first_pitch_quote, final_out_quote: the detected boundary quotes
  - boundary_confidence: high / medium / low (blank if none)
  - boundary_notes: detection notes

| gameid | date | away | home | transcript | boundary_confidence |
|---|---|---|---|---|---|
| cc7acdb0d5b9 | 2025-10-29 | Toronto Blue Jays | Los Angeles Dodgers | Welcome to Dodger Stadium, and welcome to the 2025 World Series… | low |
Same gameid as the indicator and occurrence rows above — that's how you join.
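
The join on gameid can be sketched with the standard library alone. The helper names `index_by_gameid` and `attach_game` are hypothetical, and the in-memory rows are made up and carry only a subset of the real columns. Note that transcripts of roughly 100k characters sit close to the csv module's default per-field limit of 131072, so the sketch raises it first.

```python
import csv
import io

# Long transcript fields can exceed the csv module's default field limit.
csv.field_size_limit(10_000_000)

def index_by_gameid(rows):
    """Map gameid -> row for a one-row-per-game table."""
    return {row["gameid"]: row for row in rows}

def attach_game(occurrences, games_by_id):
    """Inner join: yield each occurrence merged with its game's row.
    Occurrences whose gameid has no transcript row are dropped."""
    for occ in occurrences:
        game = games_by_id.get(occ["gameid"])
        if game is not None:
            yield {**game, **occ}

# Made-up stand-ins for mlb_transcripts.csv and term_occurrences.csv.
TRANSCRIPTS = """gameid,date,away_team,home_team,transcript,boundary_confidence
g1,2025-10-29,Toronto Blue Jays,Los Angeles Dodgers,"Welcome to the ballpark. The bullpen is quiet.",low
"""
OCCURRENCES = """gameid,term,mention_index,char_index,matched_text
g1,Bullpen,1,29,bullpen
g2,Bullpen,1,5,bullpen
"""

games = index_by_gameid(csv.DictReader(io.StringIO(TRANSCRIPTS)))
joined = list(attach_game(csv.DictReader(io.StringIO(OCCURRENCES)), games))
for row in joined:
    print(row["gameid"], row["home_team"], row["term"], row["char_index"])
```

For the real files, replace the `io.StringIO(...)` objects with `open(path, newline="", encoding="utf-8")`. Indexing the per-game table first keeps the join to a single pass over the much larger occurrence table.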
The 21 tracked terms
Baseball-broadcast words/phrases modeled on Kalshi NEWMENTION markets.
Methodology
- Transcription: audio from publicly available MLB game-replay broadcasts, transcribed with Deepgram nova-2.
- Game boundaries: Claude Haiku detects first-pitch and final-out markers so counting is scoped to live in-game commentary (pregame/postgame trimmed).
- Term matching: follows Kalshi NEWMENTION resolution rules. The exact word or phrase counts, as do its plural and possessive forms; other tenses and derived forms do not. Open and hyphenated compounds count; fused compounds do not. The matcher is unit-tested.
- Reproducible & idempotent: the whole pipeline re-runs cleanly and is refreshed nightly as new games are added.
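
The matching rules above can be approximated with a regular expression. This is only a sketch of one possible reading, not the pipeline's actual matcher and not an authoritative statement of Kalshi's rules. The names `term_pattern` and `count_mentions` are hypothetical, and several choices are assumptions: plurals are limited to "s"/"es" suffixes, multi-word terms may be separated by spaces or hyphens, and matching is case-insensitive.

```python
import re

def term_pattern(term):
    """Build a regex for `term` under the assumptions stated above."""
    words = [re.escape(w) for w in term.split()]
    core = r"[\s-]+".join(words)  # assumption: space or hyphen between words
    return re.compile(
        r"(?<![A-Za-z0-9])"       # not fused onto a preceding word
        + core
        + r"(?:e?s)?"             # optional simple plural
        + r"(?:['’]s?)?"          # optional possessive
        + r"(?![A-Za-z0-9])",     # not fused onto a following word
        re.IGNORECASE,
    )

def count_mentions(term, text):
    """Count non-overlapping matches of `term` in `text`."""
    return len(term_pattern(term).findall(text))

text = "The bullpen is busy. Both bullpens. The bullpen's ERA. A bullpen-heavy plan."
print(count_mentions("Bullpen", text))                            # 4
print(count_mentions("Bullpen", "They are bullpenning tonight."))  # 0
print(count_mentions("Double Play", "A double play! Two double plays."))  # 2
```

The lookarounds treat a hyphen as a word break, so hyphenated compounds match while fused forms such as "bullpenning" do not. Irregular plurals (for example "y" to "ies") are not handled in this sketch.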
Known limitations — please read
- AI-generated transcripts. ASR will contain errors. Not verbatim.
- Non-announcer / commercial audio included. Pregame and postgame are trimmed via boundaries; in-game ad reads are not.
- Boundary confidence. Most games were detected well; a minority are flagged low/medium where the start/end may be slightly off, and some have none and fall back to the full transcript. The boundary_confidence column lets you filter.
- Source feed not guaranteed. Transcripts come from whichever broadcast feed was publicly available, which is not guaranteed to be any specific official feed.
- Phonetic / name-spelling variants. Matching is exact-spelling on the transcript. Mispronunciations and ASR variants of names may be missed.
- Not official / not affiliated. Independent research product. Not affiliated with, endorsed by, or sourced from Kalshi, MLB, or any broadcaster. Not a substitute for any official market resolution or official statistics.
- Coverage is partial & growing. Not every MLB game is included; coverage depends on broadcast availability.