Danny Aziz 7eb9e38eb6 Added more proxies tests 2019-03-20 17:06:05 +00:00
youtube_transcript_api Added more proxies tests 2019-03-20 17:06:05 +00:00
.coveragerc test suite and corresponding tooling added 2019-03-14 17:04:34 +01:00
.gitignore test suite and corresponding tooling added 2019-03-14 17:04:34 +01:00
.travis.yml added coveralls to ci pipeline 2019-03-14 17:41:43 +01:00
LICENSE LICENSE added 2018-07-03 11:22:32 +02:00
README.md Rebased on PR #11 and added tests 2019-03-16 16:23:42 +00:00
coverage.sh test suite and corresponding tooling added 2019-03-14 17:04:34 +01:00
requirements.txt added coveralls to ci pipeline 2019-03-14 17:41:43 +01:00
setup.py added coveralls to ci pipeline 2019-03-14 17:41:43 +01:00

README.md

YouTube Transcript/Subtitle API (including automatically generated subtitles)

This is a Python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it does not require a headless browser, unlike Selenium-based solutions!

Install

It is recommended to install this module by using pip:

pip install youtube_transcript_api

If you want to use it from source, you'll have to install the dependencies manually:

pip install -r requirements.txt

How to use it

You can either integrate this module into an existing application or just use it via the CLI.

In code

To get a transcript for a given video you can do:

from youtube_transcript_api import YouTubeTranscriptApi

YouTubeTranscriptApi.get_transcript(video_id)

This will return a list of dictionaries looking somewhat like this:

[
    {
        'text': 'Hey there',
        'start': 7.58,
        'duration': 6.13
    },
    {
        'text': 'how are you',
        'start': 14.08,
        'duration': 7.58
    },
    # ...
]
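As a rough illustration of working with this structure, the sketch below renders each entry with a timestamp and joins the text into one string. The sample data is copied from the example above; the helper names `format_timestamp` and `transcript_to_lines` are hypothetical and not part of this module.

```python
def format_timestamp(seconds):
    """Convert a start offset in seconds to an MM:SS string (fractions are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return "{:02d}:{:02d}".format(minutes, secs)


def transcript_to_lines(transcript):
    """Render each transcript entry as '[MM:SS] text'."""
    return [
        "[{}] {}".format(format_timestamp(entry["start"]), entry["text"])
        for entry in transcript
    ]


# Sample data in the shape shown above (not fetched from YouTube).
sample = [
    {"text": "Hey there", "start": 7.58, "duration": 6.13},
    {"text": "how are you", "start": 14.08, "duration": 7.58},
]

lines = transcript_to_lines(sample)
full_text = " ".join(entry["text"] for entry in sample)
print("\n".join(lines))
print(full_text)
```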

You can also pass the languages parameter if you want to make sure the transcripts are retrieved in your desired language (it usually defaults to English).

YouTubeTranscriptApi.get_transcript(video_id, languages=['de', 'en'])

It's a list of language codes in descending priority. In this example it will first try to fetch the German transcript ('de') and, if that fails, fall back to the English transcript ('en'). As I can't provide a complete list of all working language codes with full certainty, you may have to play around with the language codes a bit to find the one that works for you!
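The priority order described above can be pictured with a small sketch. This is only an illustration of "first match in priority order wins"; `pick_language` and the `available` set are hypothetical, and the module's internal logic may well differ.

```python
def pick_language(preferred, available):
    """Return the first code in `preferred` that also appears in `available`.

    `preferred` is ordered from most to least wanted. Returns None when
    none of the preferred codes is available.
    """
    for code in preferred:
        if code in available:
            return code
    return None


# Hypothetical set of languages a video offers.
available = {"en", "fr"}
chosen = pick_language(["de", "en"], available)
print(chosen)
```

Here 'de' is not available, so the second entry in the list is chosen.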

To get transcripts for a list of video IDs you can call:

YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])

The languages parameter is optional here as well.

CLI

Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:

youtube_transcript_api <first_video_id> <second_video_id> ...

The CLI also gives you the option to provide a list of preferred languages:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en

If you would prefer to write the results to a file or pipe them into another application, you can also output them as JSON using the following line:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --json > transcripts.json

Proxy

You can pass a proxy to use for the network requests.

Code:

from youtube_transcript_api import YouTubeTranscriptApi

YouTubeTranscriptApi.get_transcript(video_id, proxy={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})

CLI:

youtube_transcript_api <first_video_id> <second_video_id> --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port

Find out more about using proxies and the type of proxies you can use here: http://docs.python-requests.org/en/master/user/advanced/#proxies
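If you assemble the proxy URLs from separate settings, a helper along these lines may be convenient. It is a minimal sketch: `build_proxies` is a hypothetical name, not part of this module, and it simply produces a dictionary in the same shape as the example above, percent-encoding the credentials so that characters such as '@' do not break the URL.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Build a {'http': ..., 'https': ...} proxy dictionary.

    Credentials are optional; when given, they are percent-encoded
    before being placed in the URL.
    """
    credentials = ""
    if user is not None and password is not None:
        credentials = "{}:{}@".format(quote(user, safe=""), quote(password, safe=""))
    return {
        "http": "http://{}{}:{}".format(credentials, host, port),
        "https": "https://{}{}:{}".format(credentials, host, port),
    }


# Hypothetical proxy host and credentials, for illustration only.
proxies = build_proxies("proxy.example.com", 8080, user="user", password="p@ss")
print(proxies)
```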

Warning

This code uses an undocumented part of the YouTube API, which is called by the YouTube web client. So there is no guarantee that it won't stop working tomorrow if they change how things work. I will, however, do my best to get things working again as soon as possible if that happens. So if it stops working, let me know!
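Because retrieval can fail without notice, calling code may want to guard against errors. The following is a minimal sketch, not part of this module: `fetch_or_none` is a hypothetical wrapper that takes the fetching function as an argument, and it catches every exception only because the specific exception types raised by the module are not listed here.

```python
def fetch_or_none(fetch, video_id):
    """Call fetch(video_id) and return its result, or None if it raises."""
    try:
        return fetch(video_id)
    except Exception as error:  # broad on purpose; see the note above
        print("Could not retrieve transcript for {}: {}".format(video_id, error))
        return None


# Stand-in fetchers so the sketch runs without network access.
def working_fetch(video_id):
    return [{"text": "Hey there", "start": 7.58, "duration": 6.13}]


def broken_fetch(video_id):
    raise RuntimeError("endpoint changed")


ok_result = fetch_or_none(working_fetch, "some_video_id")
failed_result = fetch_or_none(broken_fetch, "some_video_id")
```

In real use you would pass YouTubeTranscriptApi.get_transcript as the fetch argument.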