⚡ Turn hours of YouTube videos into clean, structured text in minutes.
A Python tool for quickly fetching thousands of videos from a YouTube channel, along with structured transcripts and additional metadata. Export data easily as CSV, TXT, or JSON.
- Installation
- Quick CLI Usage
- Features
- Basic Usage (Python API)
- Using Different Fetchers
- Retrieve Different Languages
- Fetching Only Manually Created Transcripts
- Exporting
- Other Methods
- Proxy Configuration
- Advanced HTTP Configuration (Optional)
- CLI (Advanced)
- Contributing
- Running Tests
- Related Projects
- License
- Contributors
Install from PyPI:
pip install ytfetcher

Fetch 50 video transcripts + metadata from a channel and save as JSON:

ytfetcher from_channel -c TheOffice -m 50 -f json

YTFetcher comes with a simple CLI so you can fetch data directly from your terminal.

ytfetcher -h

usage: ytfetcher [-h] {from_channel,from_video_ids} ...
Fetch YouTube transcripts for a channel
positional arguments:
{from_channel,from_video_ids}
from_channel Fetch data from channel handle with max_results.
from_playlist_id Fetch data from a specific playlist id.
from_video_ids Fetch data from your custom video ids.
options:
-h, --help show this help message and exit

- Fetch full transcripts from a YouTube channel.
- Get video metadata: title, description, thumbnails, published date.
- Async support for high performance.
- Export fetched data as txt, csv or json.
- CLI support.
Note: When specifying the channel, provide the exact channel handle without the @ symbol. Do not pass the channel URL or the display name.
For example, use TheOffice instead of @TheOffice or https://www.youtube.com/c/TheOffice.
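The rule above can be checked before any request is made. The following is a minimal sketch of such a check; `normalize_channel_handle` is a hypothetical helper written for illustration and is not part of ytfetcher. It strips a single leading `@` and rejects URLs and display names instead of guessing a handle from them.

```python
def normalize_channel_handle(raw: str) -> str:
    """Return a bare handle such as 'TheOffice'.

    Hypothetical helper for illustration; not part of ytfetcher.
    """
    value = raw.strip()
    # URLs are rejected rather than parsed: the last path segment of a
    # channel URL is not guaranteed to be the channel handle.
    if "://" in value or "/" in value or value.startswith("www."):
        raise ValueError(f"expected a channel handle, got a URL: {raw!r}")
    # A single leading '@' is the only thing removed silently.
    if value.startswith("@"):
        value = value[1:]
    # Display names such as "The Office" contain spaces; handles do not.
    if not value or " " in value:
        raise ValueError(f"not a valid channel handle: {raw!r}")
    return value


print(normalize_channel_handle("@TheOffice"))  # TheOffice
```

Rejecting URLs outright is a deliberate choice here: it surfaces the mistake early rather than passing a guessed handle to the fetcher.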
Here’s how you can get transcripts and metadata such as channel name, description, and published date from a single channel with the from_channel method:
from ytfetcher import YTFetcher
from ytfetcher.models.channel import ChannelData
import asyncio
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=2
)

async def get_channel_data() -> list[ChannelData]:
    channel_data = await fetcher.fetch_youtube_data()
    return channel_data

if __name__ == '__main__':
    data = asyncio.run(get_channel_data())
    print(data)

This will return a list of ChannelData with metadata in DLSnippet objects:
[
    ChannelData(
        video_id='video1',
        transcripts=[
            Transcript(
                text="Hey there",
                start=0.0,
                duration=1.54
            ),
            Transcript(
                text="Happy coding!",
                start=1.56,
                duration=4.46
            )
        ],
        metadata=DLSnippet(
            title='VideoTitle',
            description='VideoDescription',
            url='https://youtu.be/video1',
            duration=120,
            view_count=1000,
            thumbnails=[{'url': 'thumbnail_url'}]
        )
    ),
    # Other ChannelData objects...
]

ytfetcher also supports different fetchers, so you can fetch with a channel_handle, custom video_ids, or a playlist_id.
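Whichever fetcher is used, the "Rest is same" comments in the examples suggest the result has the shape of the sample list above. As a rough illustration of working with that shape, the sketch below joins the transcript segments of one video into plain text. The `Transcript` and `ChannelData` classes here are stand-ins that mirror only the fields shown in the sample; they are not ytfetcher's real classes, and the sketch assumes the real objects expose the same attribute names.

```python
from dataclasses import dataclass, field


# Stand-ins mirroring the fields printed in the sample output.
# These are NOT ytfetcher's real classes.
@dataclass
class Transcript:
    text: str
    start: float
    duration: float


@dataclass
class ChannelData:
    video_id: str
    transcripts: list[Transcript] = field(default_factory=list)


def transcript_text(item: ChannelData) -> str:
    """Join all transcript segments of one video into a single string."""
    return " ".join(segment.text for segment in item.transcripts)


def last_segment_end(item: ChannelData) -> float:
    """End time in seconds of the latest segment, or 0.0 if there are none."""
    return max((s.start + s.duration for s in item.transcripts), default=0.0)


sample = ChannelData(
    video_id="video1",
    transcripts=[
        Transcript(text="Hey there", start=0.0, duration=1.54),
        Transcript(text="Happy coding!", start=1.56, duration=4.46),
    ],
)

print(transcript_text(sample))  # Hey there Happy coding!
```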
Here's how you can fetch bulk transcripts from a specific playlist_id using ytfetcher.
from ytfetcher import YTFetcher
import asyncio
fetcher = YTFetcher.from_playlist_id(
    playlist_id="playlistid1254"
)
# Rest is same ...

Initialize ytfetcher with custom video IDs using the from_video_ids method:
from ytfetcher import YTFetcher
import asyncio
fetcher = YTFetcher.from_video_ids(
    video_ids=['video1', 'video2', 'video3']
)
# Rest is same ...

You can use the languages param to retrieve your desired language (default: en).

fetcher = YTFetcher.from_video_ids(video_ids=video_ids, languages=["tr", "en"])

Here's a quick CLI command that uses the languages param:

ytfetcher from_channel -c TheOffice -m 50 -f csv --print --languages tr en

ytfetcher first tries to fetch the Turkish transcript. If it's not available, it falls back to English.
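The fallback order described above amounts to "take the first requested language that exists". The sketch below shows that rule in isolation; `pick_language` and the set of available codes are hypothetical and only illustrate the ordering, not how ytfetcher resolves languages internally.

```python
from typing import Optional


def pick_language(preferred: list[str], available: set[str]) -> Optional[str]:
    """Return the first preferred language code that is available, else None.

    Hypothetical helper for illustration; not part of ytfetcher.
    """
    for code in preferred:
        if code in available:
            return code
    return None


# Turkish is requested first but only English and German exist,
# so the choice falls back to English.
print(pick_language(["tr", "en"], {"en", "de"}))  # en
```

The order of the list passed to `languages` therefore matters: `["tr", "en"]` and `["en", "tr"]` can yield different transcripts for the same video.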
ytfetcher can fetch only manually created transcripts from a channel, which gives you more precise transcripts.

fetcher = YTFetcher.from_channel(channel_handle="TEDx", manually_created=True) # Set manually_created flag to True

You can also enable this feature with the --manually-created argument in the CLI.

ytfetcher from_channel -c TEDx -f csv --manually-created

Use the Exporter class to export ChannelData in csv, json, or txt:
from ytfetcher.services import Exporter
channel_data = asyncio.run(fetcher.fetch_youtube_data())
exporter = Exporter(
    channel_data=channel_data,
    allowed_metadata_list=['title'],  # You can customize this
    timing=True,  # Include transcript start/duration
    filename='my_export',  # Base filename
    output_dir='./exports'  # Optional output directory
)
exporter.export_as_json()  # or .export_as_txt(), .export_as_csv()

You can also pass arguments in the CLI when exporting to exclude timings and choose which metadata to keep.

ytfetcher from_channel -c TheOffice -m 20 -f json --no-timing --metadata title description

This will exclude timings from transcripts and keep only title and description as metadata.
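To make the effect of those two options concrete, here is a small standard-library sketch of the same filtering applied to plain dictionaries. The function `to_export_rows`, the input layout, and the output layout are all assumptions made for illustration; the real Exporter may structure its files differently.

```python
import json


def to_export_rows(videos: list[dict], metadata_keys: list[str], timing: bool) -> list[dict]:
    """Drop timings if requested and keep only the selected metadata keys.

    Hypothetical helper; the output layout is an assumption, not ytfetcher's format.
    """
    rows = []
    for video in videos:
        segments = []
        for seg in video["transcripts"]:
            entry = {"text": seg["text"]}
            if timing:
                # Timings are included only when explicitly requested.
                entry["start"] = seg["start"]
                entry["duration"] = seg["duration"]
            segments.append(entry)
        metadata = {k: v for k, v in video["metadata"].items() if k in metadata_keys}
        rows.append({"video_id": video["video_id"], "transcripts": segments, **metadata})
    return rows


videos = [
    {
        "video_id": "video1",
        "transcripts": [{"text": "Hey there", "start": 0.0, "duration": 1.54}],
        "metadata": {"title": "VideoTitle", "description": "VideoDescription", "view_count": 1000},
    }
]

rows = to_export_rows(videos, metadata_keys=["title", "description"], timing=False)
print(json.dumps(rows, indent=2))
```

With `timing=False` and `metadata_keys=["title", "description"]`, the `start`, `duration`, and `view_count` fields are absent from the printed JSON.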
You can also fetch only transcript data or metadata with video IDs using fetch_transcripts and fetch_snippets.
fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
async def get_transcript_data():
    return await fetcher.fetch_transcripts()

data = asyncio.run(get_transcript_data())
print(data)

async def get_snippets():
    return await fetcher.fetch_snippets()

data = asyncio.run(get_snippets())
print(data)

YTFetcher supports proxy usage for fetching YouTube transcripts:
from ytfetcher import YTFetcher
from ytfetcher.config import GenericProxyConfig
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=3,
    proxy_config=GenericProxyConfig()
)

YTFetcher already uses custom headers to mimic real browser behavior, but if you want to change them you can use a custom HTTPConfig class.
from ytfetcher import YTFetcher
from ytfetcher.config import HTTPConfig
custom_config = HTTPConfig(
    timeout=4.0,
    headers={"User-Agent": "ytfetcher/1.0"}
)

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=10,
    http_config=custom_config
)

ytfetcher from_channel -c <CHANNEL_HANDLE> -m <MAX_RESULTS> -f <FORMAT>

ytfetcher from_video_ids -v video_id1 video_id2 ... -f json

ytfetcher from_playlist_id -p playlistid123 -f csv -m 25

ytfetcher from_channel -c <CHANNEL_HANDLE> -f json --webshare-proxy-username "<USERNAME>" --webshare-proxy-password "<PASSWORD>"

ytfetcher from_channel -c <CHANNEL_HANDLE> -f json --http-proxy "http://user:pass@host:port" --https-proxy "https://user:pass@host:port"

ytfetcher from_channel -c <CHANNEL_HANDLE> --http-timeout 4.2 --http-headers "{'key': 'value'}"

git clone https://github.com/kaya70875/ytfetcher.git
cd ytfetcher
poetry install

poetry run pytest

This project is licensed under the MIT License. See the LICENSE file for details.
Thanks to everyone who has contributed to ytfetcher ❤️
⭐ If you find this useful, please star the repo or open an issue with feedback!