notnews/nbc_transcripts

NBC transcripts

NBC used to provide transcripts of some of its shows at http://www.nbcnews.com/id/3719710, which is now defunct. An archived copy of the page is available at https://web.archive.org/web/20170601234403/http://www.nbcnews.com/id/3719710.

nbc_crawl.py crawls all the links to news transcripts and produces a list of those links. nbc_extract.py then downloads and parses the transcripts, appends some metadata, and writes the result to a CSV file.
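As a rough illustration of the link-collection step, here is a minimal sketch that pulls every anchor href out of an HTML page and resolves it against the page URL. It is not taken from nbc_crawl.py; the names LinkCollector and extract_links are hypothetical, and the real script may filter or follow links differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all links found in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The HTML for a page would have to be fetched separately (for example from the archive.org copy) and passed in as a string; keeping fetching apart from parsing makes the parsing step easy to test offline.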

The raw HTML files and the final CSV can be downloaded from http://dx.doi.org/10.7910/DVN/ND1TCV.

For a list of all the links, along with the show title and date, see here.

Here's the yearly breakdown of the final dataset (5,369 rows):

2008 2009 2010 2011 2012 2013 2014 
  76  434  752 1042 1164 1177  724 

Notes

  • The scripts date from 2014.
  • Some news transcripts had typos in the date string, e.g. 'Thusday', 'Februrary', etc. For those transcripts, the script failed to fill in the date column.
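
One possible way to recover dates with misspelled day or month names, as described in the note above, is to snap each word to the closest known name before parsing. The sketch below uses the standard library's difflib for this. It is not part of the original scripts; the function names, the 0.8 similarity cutoff, and the date format "%A, %B %d, %Y" are all assumptions for illustration, and the format in the actual transcripts may differ.

```python
import difflib
import re
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
VOCAB = DAYS + MONTHS


def fix_date_typos(text, cutoff=0.8):
    """Replace near-miss day/month names with their correct spelling."""
    def repair(match):
        word = match.group(0)
        if word in VOCAB:
            return word
        close = difflib.get_close_matches(
            word.capitalize(), VOCAB, n=1, cutoff=cutoff)
        # Leave the word alone if nothing is similar enough.
        return close[0] if close else word

    return re.sub(r"[A-Za-z]+", repair, text)


def parse_date(text, fmt="%A, %B %d, %Y"):
    """Parse a date string after typo repair; return None on failure.

    The default format is an assumed example, not the scripts' real one.
    """
    try:
        return datetime.strptime(fix_date_typos(text), fmt)
    except ValueError:
        return None
```

For example, parse_date("Thusday, Februrary 6, 2014") repairs both words and parses the result, while a string that still does not match the format yields None, so the row can be flagged for manual review.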