Datasets

Independent research datasets for prediction-market analysis. CSV, UTF-8, updated as new data comes in.

MLB Broadcast Transcripts

Research · not official

Auto-transcribed MLB game broadcasts plus a derived word-mention dataset for 21 tracked terms modeled on Kalshi NEWMENTION markets — ~1,100 games and growing.

Loading catalog…
Not sure yet? Download a free 10-game sample — same four files, same columns.
Files, columns & methodology

What's in the bundle

Three CSV files joined on a short gameid key — a per-game transcript table (with game-boundary markers), a per-game×term summary, and a per-mention detail table. Row counts are approximate and update from the latest snapshot.

Headline

term_game_indicators.csv

One row per (game × term). Whether each of the 21 terms was said, how many times, and where in the game.

23,079rows · full
10,437rows · 2026
Columns & sample rows
  • gameid — short stable game key (join key across all files)
  • date, away_team, home_team, season — game identity
  • term — one of the 21 tracked words/phrases
  • said — 1 if said in-game, else 0
  • count — number of in-game mentions
  • first_char_index, first_word_index — position of first mention in the transcript
  • first_game_pos, last_game_pos — normalized position (0 = first pitch, 1 = final out)
  • boundary_confidence — high / medium / low / none (filterable)
gameiddateawayhometermsaidcountfirst_game_poslast_game_posboundary_confidence
cc7acdb0d5b92025-10-29Toronto Blue JaysLos Angeles DodgersBullpen1110.19840.9553low
cc7acdb0d5b92025-10-29Toronto Blue JaysLos Angeles DodgersDouble Play160.04640.6895low
cc7acdb0d5b92025-10-29Toronto Blue JaysLos Angeles DodgersPerfect Game00low

term_occurrences.csv

One row per individual mention — granular and locatable. Each row reverse-engineers back to the exact spot in the transcript.

46,183rows · full
20,057rows · 2026
Columns & sample rows
  • gameid — short stable game key (join key across all files)
  • date, away_team, home_team, season — game identity
  • term — the tracked word/phrase
  • mention_index — 1-based index of this mention within the game
  • char_index, word_index — offset of the mention in the full transcript
  • game_pos — normalized position (0 = first pitch, 1 = final out)
  • matched_text — the exact text that matched
  • snippet — surrounding transcript text for context
gameiddatetermmention_indexchar_indexword_indexgame_posmatched_textsnippet
cc7acdb0d5b92025-10-29Bullpen12614449560.1984bullpen…going to mechanics in the bullpen, and you just get that…
cc7acdb0d5b92025-10-29Bullpen261230115810.5487bullpen…the Blue Jays have had the edge. The bullpen you figured coming in…

mlb_transcripts.csv

The raw asset — one row per game with the full auto-transcribed broadcast (~100k chars each). Needed to resolve char_index / word_index back to text. Game-boundary markers are folded in as columns.

1,173games · full
497games · 2026
Columns & sample row
  • gameid — short stable game key (join key across all files)
  • game_url — source game URL (human reference)
  • date, away_team, home_team, season — game identity
  • transcript — full transcribed text
  • Game-boundary markers (folded in):
  • boundary_start_char, boundary_end_char — character span of live in-game commentary
  • first_pitch_quote, final_out_quote — the detected boundary quotes
  • boundary_confidence — high / medium / low (blank if none)
  • boundary_notes — detection notes
gameiddateawayhometranscriptboundary_confidence
cc7acdb0d5b92025-10-29Toronto Blue JaysLos Angeles DodgersWelcome to Dodger Stadium, and welcome to the 2025 World Series…low

Same gameid as the indicator and occurrence rows above — that's how you join.

The 21 tracked terms

Baseball-broadcast words/phrases modeled on Kalshi NEWMENTION markets.

Bases Loaded Bullpen Bunt / Bunted Dodger Stadium Double Play Error Extra Inning Fenway Park Grand Slam MVP Ohtani Oriole Park PNC Park Perfect Game Pinch Hitter / Pinch Hit Pitch Clock Trade / Traded Triple Walk Off What a Catch Wild Pitch

Methodology

  • Transcription: audio from publicly available MLB game-replay broadcasts, transcribed with Deepgram nova-2.
  • Game boundaries: Claude Haiku detects first-pitch and final-out markers so counting is scoped to live in-game commentary (pregame/postgame trimmed).
  • Term matching: follows Kalshi NEWMENTION resolution rules — exact word/phrase + plural + possessive count; tense/derivation do not; open & hyphenated compounds count, fused compounds do not. Unit-tested.
  • Reproducible & idempotent: the whole pipeline re-runs cleanly and is refreshed nightly as new games are added.

Known limitations — please read

  1. AI-generated transcripts. ASR will contain errors. Not verbatim.
  2. Non-announcer / commercial audio included. Pregame and postgame are trimmed via boundaries; in-game ad reads are not.
  3. Boundary confidence. Most games detected well; a minority are flagged low/medium where the start/end may be slightly off, and some have none and fall back to the full transcript. The boundary_confidence column lets you filter.
  4. Source feed not guaranteed. Whichever broadcast feed was publicly available — not guaranteed to be any specific official feed.
  5. Phonetic / name-spelling variants. Matching is exact-spelling on the transcript. Mispronunciations and ASR variants of names may be missed.
  6. Not official / not affiliated. Independent research product. Not affiliated with, endorsed by, or sourced from Kalshi, MLB, or any broadcaster. Not a substitute for any official market resolution or official statistics.
  7. Coverage is partial & growing. Not every MLB game is included; coverage depends on broadcast availability.