omscs-scraper

A web scraper that grabs video transcripts from Georgia Tech's OMSCS videos on Canvas (Kaltura) and YouTube.

Kaltura (Canvas) Scraper

Requires Python 3, pip, and yt-dlp.

1. Start playing the Kaltura video and double click to 'View Frame' (this option will not show up if the video is not playing). Copy the link from the 'View Frame' tab.
2. Run canvas_scraper.py with the link copied in step 1. This downloads an SRT file.
3. Run srttotext.py to convert the SRT file into a new text file.
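
To illustrate what the SRT-to-text conversion in the last step involves, here is a minimal sketch. It is not the repository's script; the function name `srt_to_text` and the `TIMING` pattern are assumptions made for this example. It drops each cue's index and timing line and joins the remaining caption text with spaces.

```python
import re

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Return the caption text of an SRT document as one space-joined string.

    Hypothetical helper for illustration; the repository's converter may
    behave differently.
    """
    kept = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Drop the numeric cue index if present, then the timing line.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if lines and TIMING.match(lines[0]):
            lines = lines[1:]
        kept.extend(lines)
    return " ".join(kept)
```

For example, passing the contents of a downloaded SRT file (read with `encoding="utf-8-sig"` so a leading byte-order mark does not hide the first cue index) returns the transcript as a single line of text that can be written to a .txt file.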

Some of the OMSCS videos are also available on YouTube.

1. Run npm install to install all dependencies.
2. Get the ID of the video. For a YouTube video, the ID follows the /watch?v= part of the URL (e.g. for https://www.youtube.com/watch?v=Apmks_6b584 the video ID is Apmks_6b584).
3. Edit youtube_scraper.js. The file contains a vids array listing the video IDs to scrape. NOTE: If you only want to scrape one video, delete all other IDs from the array.
4. Run node youtube_scraper.js. The script writes its result to a file called Output.txt.
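
If you have many URLs to process, the video ID lookup described above can be automated. The following is a small sketch in Python (chosen to match the repository's Python tooling; the scraper itself is JavaScript). The helper name `youtube_video_id` is hypothetical, and it only handles the /watch?v= URL form mentioned in these steps.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> str:
    """Return the value of the `v` query parameter from a /watch URL.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    query = urlparse(url).query
    values = parse_qs(query).get("v")
    if not values:
        raise ValueError(f"no video ID found in {url!r}")
    # parse_qs returns a list per parameter; the first entry is the ID.
    return values[0]
```

Calling `youtube_video_id("https://www.youtube.com/watch?v=Apmks_6b584")` returns `Apmks_6b584`, which can then be pasted into the vids array.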
