diff --git a/README.md b/README.md
index 5eaff5a..4292061 100644
--- a/README.md
+++ b/README.md
@@ -1,121 +1,232 @@
-# YouTube Transcript/Subtitle API (including automatically generated subtitles)
-[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)
-[![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api)
-[![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master)
-[![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT)
-[![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
-[![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
+# YouTube Transcript/Subtitle API (including automatically generated subtitles and subtitle translations)
+
+[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)
+[![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api)
+[![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master)
+[![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT)
+[![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
+[![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
+
+This is a Python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, as other Selenium-based solutions do!
+
+## Install
+
+It is recommended to [install this module by using pip](https://pypi.org/project/youtube-transcript-api/):
+
+```
+pip install youtube_transcript_api
+```
+
+If you want to use it from source, you'll have to install the dependencies manually:
+
+```
+pip install -r requirements.txt
+```
-This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require a headless browser, like other selenium based solutions do!
+You can either integrate this module [into an existing application](#api), or just use it via the [CLI](#cli).
+
+## API
+
+The easiest way to get a transcript for a given video is to execute:
+
+```python
+from youtube_transcript_api import YouTubeTranscriptApi
+
+YouTubeTranscriptApi.get_transcript(video_id)
+```
+
+This will return a list of dictionaries looking somewhat like this:
+
+```python
+[
+    {
+        'text': 'Hey there',
+        'start': 7.58,
+        'duration': 6.13
+    },
+    {
+        'text': 'how are you',
+        'start': 14.08,
+        'duration': 7.58
+    },
+    # ...
+]
+```
+
+You can also add the `languages` param if you want to make sure the transcripts are retrieved in your desired language (it defaults to English).
+
+```python
+YouTubeTranscriptApi.get_transcript(video_id, languages=['de', 'en'])
+```
+
+It's a list of language codes in descending priority. In this example it will first try to fetch the German transcript (`'de'`) and then fetch the English transcript (`'en'`) if it fails to do so. If you want to find out which languages are available first, [have a look at `list_transcripts()`](#list-available-transcripts).
+
+To get transcripts for a list of video ids you can call:
+
+```python
+YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
+```
+
+`languages` is optional here as well.
-## Install
+### List available transcripts
-It is recommended to [install this module by using pip](https://pypi.org/project/youtube-transcript-api/):
-
-```
-pip install youtube_transcript_api
-```
-
-If you want to use it from source, you'll have to install the dependencies manually:
-
-```
-pip install -r requirements.txt
-```
-
-## How to use it
-
-You could either integrate this module into an existing application, or just use it via an CLI
-
-### In code
-
-To get a transcript for a given video you can do:
+If you want to list all transcripts which are available for a given video you can call:
 ```python
-from youtube_transcript_api import YouTubeTranscriptApi
-
-YouTubeTranscriptApi.get_transcript(video_id)
+transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
 ```
-This will return a list of dictionaries looking somewhat like this:
+This will return a `TranscriptList` object which is iterable and provides methods to filter the list of transcripts for specific languages and types, like:
 ```python
-[
-    {
-        'text': 'Hey there',
-        'start': 7.58,
-        'duration': 6.13
-    },
-    {
-        'text': 'how are you',
-        'start': 14.08,
-        'duration': 7.58
-    },
-    # ...
-]
+transcript = transcript_list.find_transcript(['de', 'en'])
 ```
-You can also add the `languages` param if you want to make sure the transcripts are retrieved in your desired language (it defaults to english).
+By default this module prefers manually created transcripts over automatically generated ones when a transcript in the requested language is available in both forms. The `TranscriptList` allows you to bypass this default behaviour by searching for specific transcript types:
 ```python
-YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
+# filter for manually created transcripts
+transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
+
+# or automatically generated ones
+transcript = transcript_list.find_generated_transcript(['de', 'en'])
 ```
-It's a list of language codes in a descending priority. In this example it will first try to fetch the german transcript (`'de'`) and then fetch the english transcript (`'en'`) if it fails to do so. As I can't provide a complete list of all working language codes with full certainty, you may have to play around with the language codes a bit, to find the one which is working for you!
-
-To get transcripts for a list fo video ids you can call:
+The methods `find_transcript`, `find_manually_created_transcript` and `find_generated_transcript` return `Transcript` objects. They contain metadata regarding the transcript:
 ```python
-YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
+print(
+    transcript.video_id,
+    transcript.language,
+    transcript.language_code,
+    # whether it has been manually created or generated by YouTube
+    transcript.is_generated,
+    # whether this transcript can be translated or not
+    transcript.is_translatable,
+    # a list of languages the transcript can be translated to
+    transcript.translation_languages,
+)
 ```
-`languages` also is optional here.
-
-### CLI
-
-Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:
-
-```
-youtube_transcript_api ...
-```
-
-The CLI also gives you the option to provide a list of preferred languages:
-
-```
-youtube_transcript_api ... --languages de en
-```
-
-If you would prefer to write it into a file or pipe it into another application, you can also output the results as json using the following line:
-
-```
-youtube_transcript_api ... --languages de en --json > transcripts.json
-```
-
-### Proxy
-
-You can specify a https/http proxy, which will be used during the requests to YouTube:
+and provide a `fetch()` method, which retrieves the actual transcript data:
 ```python
-from youtube_transcript_api import YouTubeTranscriptApi
-
-YouTubeTranscriptApi.get_transcript(video_id, proxies={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})
+transcript.fetch()
 ```
-As the `proxies` dict is passed on to the `requests.get(...)` call, it follows the [format used by the requests library](http://docs.python-requests.org/en/master/user/advanced/#proxies).
+### Translate transcript
-Using the CLI:
+YouTube has a feature which allows you to automatically translate subtitles. This module also makes it possible to access this feature. To do so, `Transcript` objects provide a `translate()` method, which returns a new translated `Transcript` object:
-```
-youtube_transcript_api --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port
+```python
+transcript = transcript_list.find_transcript(['en'])
+translated_transcript = transcript.translate('de')
+print(translated_transcript.fetch())
 ```
+### By example
+```python
+# retrieve the available transcripts
+transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')
+
+# iterate over all available transcripts
+for transcript in transcript_list:
-## Warning
+    # the Transcript object provides metadata properties
+    print(
+        transcript.video_id,
+        transcript.language,
+        transcript.language_code,
+        # whether it has been manually created or generated by YouTube
+        transcript.is_generated,
+        # whether this transcript can be translated or not
+        transcript.is_translatable,
+        # a list of languages the transcript can be translated to
+        transcript.translation_languages,
+    )
+
+    # fetch the actual transcript data
+    print(transcript.fetch())
+
+    # translating the transcript will return another transcript object
+    print(transcript.translate('en').fetch())
+
+# you can also directly filter for the language you are looking for, using the transcript list
+transcript = transcript_list.find_transcript(['de', 'en'])
+
+# or just filter for manually created transcripts
+transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
+
+# or automatically generated ones
+transcript = transcript_list.find_generated_transcript(['de', 'en'])
+```
+
+## CLI
+
+Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:
+
+```
+youtube_transcript_api ...
+```
+
+The CLI also gives you the option to provide a list of preferred languages:
+
+```
+youtube_transcript_api ... --languages de en
+```
- This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work. I will however do my best to make things working again as soon as possible if that happens. So if it stops working, let me know!
+You can also specify if you want to exclude automatically generated or manually created subtitles:
-## Donation
+```
+youtube_transcript_api ... --languages de en --exclude-generated
+youtube_transcript_api ... --languages de en --exclude-manually-created
+```
+
+If you would prefer to write it into a file or pipe it into another application, you can also output the results as JSON using the following line:
+
+```
+youtube_transcript_api ... --languages de en --json > transcripts.json
+```
-If this project makes you happy by reducing your development time, you can make me happy by treating me to a cup of coffee :)
+Translating transcripts using the CLI is also possible:
-[![Donate](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)
+```
+youtube_transcript_api ... --languages en --translate de
+```
+
+If you are not sure which languages are available for a given video, you can list all available transcripts by calling:
+
+```
+youtube_transcript_api --list-transcripts
+```
+
+## Proxy
+
+You can specify an HTTPS/HTTP proxy, which will be used during the requests to YouTube:
+
+```python
+from youtube_transcript_api import YouTubeTranscriptApi
+
+YouTubeTranscriptApi.get_transcript(video_id, proxies={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})
+```
+
+As the `proxies` dict is passed on to the `requests.get(...)` call, it follows the [format used by the requests library](http://docs.python-requests.org/en/master/user/advanced/#proxies).
+
+Using the CLI:
+
+```
+youtube_transcript_api --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port
+```
+
+
+## Warning
+
+ This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work. I will however do my best to get things working again as soon as possible if that happens. So if it stops working, let me know!
+
+## Donation
+
+If this project makes you happy by reducing your development time, you can make me happy by treating me to a cup of coffee :)
+
+[![Donate](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)
\ No newline at end of file
diff --git a/youtube_transcript_api/__init__.py b/youtube_transcript_api/__init__.py
index c9bb4eb..34e9ba7 100644
--- a/youtube_transcript_api/__init__.py
+++ b/youtube_transcript_api/__init__.py
@@ -1,3 +1,11 @@
 from ._api import YouTubeTranscriptApi
 from ._transcripts import TranscriptList, Transcript
-from ._errors import TranscriptsDisabled, NoTranscriptFound, CouldNotRetrieveTranscript, VideoUnavailable
+from ._errors import (
+    TranscriptsDisabled,
+    NoTranscriptFound,
+    CouldNotRetrieveTranscript,
+    VideoUnavailable,
+    NotTranslatable,
+    TranslationLanguageNotAvailable,
+    NoTranscriptAvailable,
+)
diff --git a/youtube_transcript_api/_api.py b/youtube_transcript_api/_api.py
index 3476b9b..c1519ae 100644
--- a/youtube_transcript_api/_api.py
+++ b/youtube_transcript_api/_api.py
@@ -4,17 +4,68 @@ from ._transcripts import TranscriptListFetcher
 class YouTubeTranscriptApi():
+    @classmethod
+    def list_transcripts(cls, video_id, proxies=None):
+        """
+        Retrieves the list of transcripts which are available for a given video. It returns a `TranscriptList` object
+        which is iterable and provides methods to filter the list of transcripts for specific languages. While iterating
+        over the `TranscriptList` the individual transcripts are represented by `Transcript` objects, which provide
+        metadata and can either be fetched by calling `transcript.fetch()` or translated by calling
+        `transcript.translate('en')`. Example::
+
+            # retrieve the available transcripts
+            transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')
+
+            # iterate over all available transcripts
+            for transcript in transcript_list:
+                # the Transcript object provides metadata properties
+                print(
+                    transcript.video_id,
+                    transcript.language,
+                    transcript.language_code,
+                    # whether it has been manually created or generated by YouTube
+                    transcript.is_generated,
+                    # a list of languages the transcript can be translated to
+                    transcript.translation_languages,
+                )
+
+                # fetch the actual transcript data
+                print(transcript.fetch())
+
+                # translating the transcript will return another transcript object
+                print(transcript.translate('en').fetch())
+
+            # you can also directly filter for the language you are looking for, using the transcript list
+            transcript = transcript_list.find_transcript(['de', 'en'])
+
+            # or just filter for manually created transcripts
+            transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
+
+            # or automatically generated ones
+            transcript = transcript_list.find_generated_transcript(['de', 'en'])
+
+        :param video_id: the youtube video id
+        :type video_id: str
+        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
+        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
+        :return: the list of available transcripts
+        :rtype TranscriptList:
+        """
+        with requests.Session() as http_client:
+            http_client.proxies = proxies if proxies else {}
+            return TranscriptListFetcher(http_client).fetch(video_id)
+
@classmethod def get_transcripts(cls, video_ids, languages=('en',), continue_after_error=False, proxies=None): """ Retrieves the transcripts for a list of videos. :param video_ids: a list of youtube video ids - :type video_ids: [str] + :type video_ids: list[str] :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to do so. - :type languages: [str] + :type languages: list[str] :param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving one of the video transcripts :type continue_after_error: bool @@ -22,7 +73,7 @@ class YouTubeTranscriptApi(): :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies :return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of video ids, which could not be retrieved - :rtype: ({str: [{'text': str, 'start': float, 'end': float}]}, [str]}) + :rtype ({str: [{'text': str, 'start': float, 'end': float}]}, [str]}): """ data = {} unretrievable_videos = [] @@ -41,19 +92,19 @@ class YouTubeTranscriptApi(): @classmethod def get_transcript(cls, video_id, languages=('en',), proxies=None): """ - Retrieves the transcript for a single video. + Retrieves the transcript for a single video. This is just a shortcut for calling:: + + YouTubeTranscriptApi.list_transcripts(video_id, proxies).find_transcript(languages).fetch() :param video_id: the youtube video id :type video_id: str :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to do so. 
- :type languages: [str] + :type languages: list[str] :param proxies: a dictionary mapping of http and https proxies to be used for the network requests :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys - :rtype: [{'text': str, 'start': float, 'end': float}] + :rtype [{'text': str, 'start': float, 'end': float}]: """ - with requests.Session() as http_client: - http_client.proxies = proxies if proxies else {} - return TranscriptListFetcher(http_client).fetch(video_id).find_transcript(languages).fetch() + return cls.list_transcripts(video_id, proxies).find_transcript(languages).fetch() diff --git a/youtube_transcript_api/_cli.py b/youtube_transcript_api/_cli.py index c0fcf40..21f816b 100644 --- a/youtube_transcript_api/_cli.py +++ b/youtube_transcript_api/_cli.py @@ -14,22 +14,45 @@ class YouTubeTranscriptCli(): def run(self): parsed_args = self._parse_args() + if parsed_args.exclude_manually_created and parsed_args.exclude_generated: + return '' + proxies = None if parsed_args.http_proxy != '' or parsed_args.https_proxy != '': proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy} - transcripts, unretrievable_videos = YouTubeTranscriptApi.get_transcripts( - parsed_args.video_ids, - languages=parsed_args.languages, - continue_after_error=True, - proxies=proxies - ) + transcripts = [] + exceptions = [] + + for video_id in parsed_args.video_ids: + try: + transcripts.append(self._fetch_transcript(parsed_args, proxies, video_id)) + except Exception as exception: + exceptions.append(exception) return '\n\n'.join( - [str(YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)) for video_id in unretrievable_videos] + [str(exception) for exception in exceptions] + ([json.dumps(transcripts) if parsed_args.json else pprint.pformat(transcripts)] if transcripts else []) ) + def _fetch_transcript(self, parsed_args, 
proxies, video_id): + transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies) + + if parsed_args.list_transcripts: + return str(transcript_list) + + if parsed_args.exclude_manually_created: + transcript = transcript_list.find_generated_transcript(parsed_args.languages) + elif parsed_args.exclude_generated: + transcript = transcript_list.find_manually_created_transcript(parsed_args.languages) + else: + transcript = transcript_list.find_transcript(parsed_args.languages) + + if parsed_args.translate: + transcript = transcript.translate(parsed_args.translate) + + return transcript.fetch() + def _parse_args(self): parser = argparse.ArgumentParser( description=( @@ -38,6 +61,13 @@ class YouTubeTranscriptCli(): 'other selenium based solutions do!' ) ) + parser.add_argument( + '--list-transcripts', + action='store_const', + const=True, + default=False, + help='This will list the languages in which the given videos are available in.', + ) parser.add_argument('video_ids', nargs='+', type=str, help='List of YouTube video IDs.') parser.add_argument( '--languages', @@ -46,11 +76,25 @@ class YouTubeTranscriptCli(): type=str, help=( 'A list of language codes in a descending priority. For example, if this is set to "de en" it will ' - 'first try to fetch the german transcript (de) and then fetch the english transcipt (en) if it fails ' + 'first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails ' 'to do so. As I can\'t provide a complete list of all working language codes with full certainty, you ' 'may have to play around with the language codes a bit, to find the one which is working for you!' 
), ) + parser.add_argument( + '--exclude-generated', + action='store_const', + const=True, + default=False, + help='If this flag is set transcripts which have been generated by YouTube will not be retrieved.', + ) + parser.add_argument( + '--exclude-manually-created', + action='store_const', + const=True, + default=False, + help='If this flag is set transcripts which have been manually created will not be retrieved.', + ) parser.add_argument( '--json', action='store_const', @@ -59,13 +103,24 @@ class YouTubeTranscriptCli(): help='If this flag is set the output will be JSON formatted.', ) parser.add_argument( - '--http-proxy', dest='http_proxy', - default='', metavar='URL', + '--translate', + default='', + help=( + 'The language code for the language you want this transcript to be translated to. Use the ' + '--list-transcripts feature to find out which languages are translatable and which translation ' + 'languages are available.' + ) + ) + parser.add_argument( + '--http-proxy', + default='', + metavar='URL', help='Use the specified HTTP proxy.' ) parser.add_argument( - '--https-proxy', dest='https_proxy', - default='', metavar='URL', + '--https-proxy', + default='', + metavar='URL', help='Use the specified HTTPS proxy.' ) diff --git a/youtube_transcript_api/_errors.py b/youtube_transcript_api/_errors.py index 5dc4d8e..2b67e9e 100644 --- a/youtube_transcript_api/_errors.py +++ b/youtube_transcript_api/_errors.py @@ -11,7 +11,7 @@ class CouldNotRetrieveTranscript(Exception): GITHUB_REFERRAL = ( '\n\nIf you are sure that the described cause is not responsible for this error ' 'and that a transcript should be retrievable, please create an issue at ' - 'https://github.com/jdepoix/youtube-transcript-api/issues.' + 'https://github.com/jdepoix/youtube-transcript-api/issues. ' 'Please add which version of youtube_transcript_api you are using ' 'and provide the information needed to replicate the error. 
' 'Also make sure that there are no open issues which already describe your problem!' @@ -43,6 +43,18 @@ class TranscriptsDisabled(CouldNotRetrieveTranscript): CAUSE_MESSAGE = 'Subtitles are disabled for this video' +class NoTranscriptAvailable(CouldNotRetrieveTranscript): + CAUSE_MESSAGE = 'No transcripts are available for this video' + + +class NotTranslatable(CouldNotRetrieveTranscript): + CAUSE_MESSAGE = 'The requested language is not translatable' + + +class TranslationLanguageNotAvailable(CouldNotRetrieveTranscript): + CAUSE_MESSAGE = 'The requested translation language is not available' + + class NoTranscriptFound(CouldNotRetrieveTranscript): CAUSE_MESSAGE = ( 'No transcripts were found for any of the requested language codes: {requested_language_codes}\n\n' diff --git a/youtube_transcript_api/_transcripts.py b/youtube_transcript_api/_transcripts.py index 0f131ad..6b767ff 100644 --- a/youtube_transcript_api/_transcripts.py +++ b/youtube_transcript_api/_transcripts.py @@ -12,7 +12,14 @@ from xml.etree import ElementTree import re from ._html_unescaping import unescape -from ._errors import VideoUnavailable, NoTranscriptFound, TranscriptsDisabled +from ._errors import ( + VideoUnavailable, + NoTranscriptFound, + TranscriptsDisabled, + NotTranslatable, + TranslationLanguageNotAvailable, + NoTranscriptAvailable, +) from ._settings import WATCH_URL @@ -36,9 +43,14 @@ class TranscriptListFetcher(): raise TranscriptsDisabled(video_id) - return json.loads(splitted_html[1].split(',"videoDetails')[0].replace('\n', ''))[ - 'playerCaptionsTracklistRenderer' - ] + captions_json = json.loads( + splitted_html[1].split(',"videoDetails')[0].replace('\n', '') + )['playerCaptionsTracklistRenderer'] + + if 'captionTracks' not in captions_json: + raise NoTranscriptAvailable(video_id) + + return captions_json def _fetch_html(self, video_id): return self._http_client.get(WATCH_URL.format(video_id=video_id)).text.replace( @@ -53,10 +65,7 @@ class TranscriptList(): This object 
represents a list of transcripts. It can be iterated over to list all transcripts which are available for a given YouTube video. Also it provides functionality to search for a transcript in a given language. """ - - # TODO implement iterator - - def __init__(self, video_id, manually_created_transcripts, generated_transcripts): + def __init__(self, video_id, manually_created_transcripts, generated_transcripts, translation_languages): """ The constructor is only for internal use. Use the static build method instead. @@ -66,10 +75,13 @@ class TranscriptList(): :type manually_created_transcripts: dict[str, Transcript] :param generated_transcripts: dict mapping language codes to the generated transcripts :type generated_transcripts: dict[str, Transcript] + :param translation_languages: list of languages which can be used for translatable languages + :type translation_languages: list[dict[str, str]] """ self.video_id = video_id self._manually_created_transcripts = manually_created_transcripts self._generated_transcripts = generated_transcripts + self._translation_languages = translation_languages @staticmethod def build(http_client, video_id, captions_json): @@ -83,7 +95,7 @@ class TranscriptList(): :param captions_json: the JSON parsed from the YouTube pages static HTML :type captions_json: dict :return: the created TranscriptList - :rtype TranscriptList + :rtype TranscriptList: """ translation_languages = [ { @@ -108,15 +120,19 @@ class TranscriptList(): caption['name']['simpleText'], caption['languageCode'], caption.get('kind', '') == 'asr', - translation_languages if caption['isTranslatable'] else [] + translation_languages if caption.get('isTranslatable', False) else [] ) return TranscriptList( video_id, manually_created_transcripts, generated_transcripts, + translation_languages, ) + def __iter__(self): + return iter(list(self._manually_created_transcripts.values()) + list(self._generated_transcripts.values())) + def find_transcript(self, language_codes): """ Finds 
a transcript for a given language code. Manually created transcripts are returned first and only if none @@ -126,9 +142,9 @@ class TranscriptList(): :param language_codes: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to do so. - :type languages: [str] + :type languages: list[str] :return: the found Transcript - :rtype: Transcript + :rtype Transcript: :raises: NoTranscriptFound """ return self._find_transcript(language_codes, [self._manually_created_transcripts, self._generated_transcripts]) @@ -140,9 +156,9 @@ class TranscriptList(): :param language_codes: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to do so. - :type languages: [str] + :type languages: list[str] :return: the found Transcript - :rtype: Transcript + :rtype Transcript: :raises: NoTranscriptFound """ return self._find_transcript(language_codes, [self._generated_transcripts,]) @@ -154,9 +170,9 @@ class TranscriptList(): :param language_codes: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to do so. 
-        :type languages: [str]
+        :type languages: list[str]
         :return: the found Transcript
-        :rtype: Transcript
+        :rtype Transcript:
         :raises: NoTranscriptFound
         """
         return self._find_transcript(language_codes, [self._manually_created_transcripts,])
@@ -179,22 +195,28 @@ class TranscriptList():
             '(MANUALLY CREATED)\n'
             '{available_manually_created_transcript_languages}\n\n'
             '(GENERATED)\n'
-            '{available_generated_transcripts}'
+            '{available_generated_transcripts}\n\n'
+            '(TRANSLATION LANGUAGES)\n'
+            '{available_translation_languages}'
         ).format(
             video_id=self.video_id,
             available_manually_created_transcript_languages=self._get_language_description(
-                self._manually_created_transcripts.values()
+                str(transcript) for transcript in self._manually_created_transcripts.values()
             ),
             available_generated_transcripts=self._get_language_description(
-                self._generated_transcripts.values()
+                str(transcript) for transcript in self._generated_transcripts.values()
             ),
+            available_translation_languages=self._get_language_description(
+                '{language_code} ("{language}")'.format(
+                    language=translation_language['language'],
+                    language_code=translation_language['language_code'],
+                ) for translation_language in self._translation_languages
+            )
         )
 
-    def _get_language_description(self, transcripts):
-        return '\n'.join(
-            ' - {transcript}'.format(transcript=str(transcript))
-            for transcript in transcripts
-        ) if transcripts else 'None'
+    def _get_language_description(self, transcript_strings):
+        description = '\n'.join(' - {transcript}'.format(transcript=transcript) for transcript in transcript_strings)
+        return description if description else 'None'
 
 
 class Transcript():
@@ -220,45 +242,49 @@ class Transcript():
         self.language_code = language_code
         self.is_generated = is_generated
         self.translation_languages = translation_languages
+        self._translation_languages_dict = {
+            translation_language['language_code']: translation_language['language']
+            for translation_language in translation_languages
+        }
 
     def fetch(self):
         """
         Loads the actual transcript data.
 
         :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
-        :rtype: [{'text': str, 'start': float, 'end': float}]
+        :rtype [{'text': str, 'start': float, 'end': float}]:
         """
         return _TranscriptParser().parse(
             self._http_client.get(self._url).text
         )
 
     def __str__(self):
-        return '{language_code} ("{language}")'.format(
+        return '{language_code} ("{language}"){translation_description}'.format(
             language=self.language,
             language_code=self.language_code,
+            translation_description='[TRANSLATABLE]' if self.is_translatable else ''
         )
 
-# TODO integrate translations in future release
-#     @property
-#     def is_translatable(self):
-#         return len(self.translation_languages) > 0
-#
-#
-# class TranslatableTranscript(Transcript):
-#     def __init__(self, http_client, url, translation_languages):
-#         super(TranslatableTranscript, self).__init__(http_client, url)
-#         self._translation_languages = translation_languages
-#         self._translation_language_codes = {language['language_code'] for language in translation_languages}
-#
-#
-#     def translate(self, language_code):
-#         if language_code not in self._translation_language_codes:
-#             raise TranslatableTranscript.TranslationLanguageNotAvailable()
-#
-#         return Transcript(
-#             self._http_client,
-#             '{url}&tlang={language_code}'.format(url=self._url, language_code=language_code)
-#         )
+    @property
+    def is_translatable(self):
+        return len(self.translation_languages) > 0
+
+    def translate(self, language_code):
+        if not self.is_translatable:
+            raise NotTranslatable(self.video_id)
+
+        if language_code not in self._translation_languages_dict:
+            raise TranslationLanguageNotAvailable(self.video_id)
+
+        return Transcript(
+            self._http_client,
+            self.video_id,
+            '{url}&tlang={language_code}'.format(url=self._url, language_code=language_code),
+            self._translation_languages_dict[language_code],
+            language_code,
+            True,
+            [],
+        )
 
 
 class _TranscriptParser():
@@ -269,7 +295,7 @@ class _TranscriptParser():
             {
                 'text': re.sub(self.HTML_TAG_REGEX, '', unescape(xml_element.text)),
                 'start': float(xml_element.attrib['start']),
-                'duration': float(xml_element.attrib['dur']),
+                'duration': float(xml_element.attrib.get('dur', '0.0')),
             }
             for xml_element in ElementTree.fromstring(plain_data)
             if xml_element.text is not None
diff --git a/youtube_transcript_api/test/assets/youtube_no_transcript_available.html.static b/youtube_transcript_api/test/assets/youtube_no_transcript_available.html.static
new file mode 100644
index 0000000..1ea7ed2
--- /dev/null
+++ b/youtube_transcript_api/test/assets/youtube_no_transcript_available.html.static
@@ -0,0 +1,1349 @@
+MTG Top 10: BAD Cards That Suddenly Became Good - YouTube
\ No newline at end of file
diff --git a/youtube_transcript_api/test/test_api.py b/youtube_transcript_api/test/test_api.py
index b2897af..f506d33 100644
--- a/youtube_transcript_api/test/test_api.py
+++ b/youtube_transcript_api/test/test_api.py
@@ -5,7 +5,15 @@ import os
 
 import httpretty
 
-from youtube_transcript_api import YouTubeTranscriptApi, VideoUnavailable, NoTranscriptFound, TranscriptsDisabled
+from youtube_transcript_api import (
+    YouTubeTranscriptApi,
+    TranscriptsDisabled,
+    NoTranscriptFound,
+    VideoUnavailable,
+    NoTranscriptAvailable,
+    NotTranslatable,
+    TranslationLanguageNotAvailable,
+)
 
 
 def load_asset(filename):
@@ -42,6 +50,51 @@ class TestYouTubeTranscriptApi(TestCase):
             ]
         )
 
+    def test_list_transcripts(self):
+        transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8')
+
+        language_codes = {transcript.language_code for transcript in transcript_list}
+
+        self.assertEqual(language_codes, {'zh', 'de', 'en', 'hi', 'ja', 'ko', 'es', 'cs', 'en'})
+
+    def test_list_transcripts__find_manually_created(self):
+        transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8')
+        transcript = transcript_list.find_manually_created_transcript(['cs'])
+
+        self.assertFalse(transcript.is_generated)
+
+
+    def test_list_transcripts__find_generated(self):
+        transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8')
+
+        with self.assertRaises(NoTranscriptFound):
+            transcript_list.find_generated_transcript(['cs'])
+
+        transcript = transcript_list.find_generated_transcript(['en'])
+
+        self.assertTrue(transcript.is_generated)
+
+    def test_translate_transcript(self):
+        transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en'])
+
+        translated_transcript = transcript.translate('af')
+
+        self.assertEqual(translated_transcript.language_code, 'af')
+        self.assertIn('&tlang=af', translated_transcript._url)
+
+    def test_translate_transcript__translation_language_not_available(self):
+        transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en'])
+
+        with self.assertRaises(TranslationLanguageNotAvailable):
+            transcript.translate('xyz')
+
+    def test_translate_transcript__not_translatable(self):
+        transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en'])
+        transcript.translation_languages = []
+
+        with self.assertRaises(NotTranslatable):
+            transcript.translate('af')
+
     def test_get_transcript__correct_language_is_used(self):
         YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', ['de', 'en'])
         query_string = httpretty.last_request().querystring
@@ -88,6 +141,16 @@ class TestYouTubeTranscriptApi(TestCase):
         with self.assertRaises(NoTranscriptFound):
             YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', languages=['cz'])
 
+    def test_get_transcript__exception_if_no_transcript_available(self):
+        httpretty.register_uri(
+            httpretty.GET,
+            'https://www.youtube.com/watch',
+            body=load_asset('youtube_no_transcript_available.html.static')
+        )
+
+        with self.assertRaises(NoTranscriptAvailable):
+            YouTubeTranscriptApi.get_transcript('MwBPvcYFY2E')
+
     def test_get_transcripts(self):
         video_id_1 = 'video_id_1'
         video_id_2 = 'video_id_2'
diff --git a/youtube_transcript_api/test/test_cli.py b/youtube_transcript_api/test/test_cli.py
index 6f01967..e46789e 100644
--- a/youtube_transcript_api/test/test_cli.py
+++ b/youtube_transcript_api/test/test_cli.py
@@ -3,10 +3,27 @@ from mock import MagicMock
 
 import json
 
-from youtube_transcript_api._cli import YouTubeTranscriptCli, YouTubeTranscriptApi
+from youtube_transcript_api import YouTubeTranscriptApi, VideoUnavailable
+from youtube_transcript_api._cli import YouTubeTranscriptCli
 
 
 class TestYouTubeTranscriptCli(TestCase):
+    def setUp(self):
+        self.transcript_mock = MagicMock()
+        self.transcript_mock.fetch = MagicMock(return_value=[
+            {'text': 'Hey, this is just a test', 'start': 0.0, 'duration': 1.54},
+            {'text': 'this is not the original transcript', 'start': 1.54, 'duration': 4.16},
+            {'text': 'just something shorter, I made up for testing', 'start': 5.7, 'duration': 3.239}
+        ])
+        self.transcript_mock.translate = MagicMock(return_value=self.transcript_mock)
+
+        self.transcript_list_mock = MagicMock()
+        self.transcript_list_mock.find_generated_transcript = MagicMock(return_value=self.transcript_mock)
+        self.transcript_list_mock.find_manually_created_transcript = MagicMock(return_value=self.transcript_mock)
+        self.transcript_list_mock.find_transcript = MagicMock(return_value=self.transcript_mock)
+
+        YouTubeTranscriptApi.list_transcripts = MagicMock(return_value=self.transcript_list_mock)
+
     def test_argument_parsing(self):
         parsed_args = YouTubeTranscriptCli('v1 v2 --json --languages de en'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
@@ -106,32 +123,107 @@ class TestYouTubeTranscriptCli(TestCase):
         self.assertEqual(parsed_args.http_proxy, '')
         self.assertEqual(parsed_args.https_proxy, '')
 
+    def test_argument_parsing__list_transcripts(self):
+        parsed_args = YouTubeTranscriptCli('--list-transcripts v1 v2'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.list_transcripts)
+
+        parsed_args = YouTubeTranscriptCli('v1 v2 --list-transcripts'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.list_transcripts)
+
+    def test_argument_parsing__translate(self):
+        parsed_args = YouTubeTranscriptCli('v1 v2 --languages de en --translate cz'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertEqual(parsed_args.json, False)
+        self.assertEqual(parsed_args.languages, ['de', 'en'])
+        self.assertEqual(parsed_args.translate, 'cz')
+
+        parsed_args = YouTubeTranscriptCli('v1 v2 --translate cz --languages de en'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertEqual(parsed_args.json, False)
+        self.assertEqual(parsed_args.languages, ['de', 'en'])
+        self.assertEqual(parsed_args.translate, 'cz')
+
+    def test_argument_parsing__manually_or_generated(self):
+        parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-manually-created'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.exclude_manually_created)
+        self.assertFalse(parsed_args.exclude_generated)
+
+        parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-generated'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertFalse(parsed_args.exclude_manually_created)
+        self.assertTrue(parsed_args.exclude_generated)
+
+        parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-manually-created --exclude-generated'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.exclude_manually_created)
+        self.assertTrue(parsed_args.exclude_generated)
+
     def test_run(self):
-        YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([], []))
         YouTubeTranscriptCli('v1 v2 --languages de en'.split()).run()
 
-        YouTubeTranscriptApi.get_transcripts.assert_called_once_with(
-            ['v1', 'v2'],
-            languages=['de', 'en'],
-            continue_after_error=True,
-            proxies=None
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v1', proxies=None)
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v2', proxies=None)
+
+        self.transcript_list_mock.find_transcript.assert_any_call(['de', 'en'])
+
+    def test_run__failing_transcripts(self):
+        YouTubeTranscriptApi.list_transcripts = MagicMock(side_effect=VideoUnavailable('video_id'))
+
+        output = YouTubeTranscriptCli('v1 --languages de en'.split()).run()
+
+        self.assertEqual(output, str(VideoUnavailable('video_id')))
+
+    def test_run__exclude_generated(self):
+        YouTubeTranscriptCli('v1 v2 --languages de en --exclude-generated'.split()).run()
+
+        self.transcript_list_mock.find_manually_created_transcript.assert_any_call(['de', 'en'])
+
+    def test_run__exclude_manually_created(self):
+        YouTubeTranscriptCli('v1 v2 --languages de en --exclude-manually-created'.split()).run()
+
+        self.transcript_list_mock.find_generated_transcript.assert_any_call(['de', 'en'])
+
+    def test_run__exclude_manually_created_and_generated(self):
+        self.assertEqual(
+            YouTubeTranscriptCli('v1 v2 --languages de en --exclude-manually-created --exclude-generated'.split()).run(),
+            ''
         )
 
+    def test_run__translate(self):
+        YouTubeTranscriptCli('v1 v2 --languages de en --translate cz'.split()).run(),
+
+        self.transcript_mock.translate.assert_any_call('cz')
+
+    def test_run__list_transcripts(self):
+        YouTubeTranscriptCli('--list-transcripts v1 v2'.split()).run()
+
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v1', proxies=None)
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v2', proxies=None)
+
     def test_run__json_output(self):
-        YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([{'boolean': True}], []))
         output = YouTubeTranscriptCli('v1 v2 --languages de en --json'.split()).run()
 
         # will fail if output is not valid json
         json.loads(output)
 
     def test_run__proxies(self):
-        YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([], []))
         YouTubeTranscriptCli(
-            'v1 v2 --languages de en --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port'.split()).run()
+            (
+                'v1 v2 --languages de en '
+                '--http-proxy http://user:pass@domain:port '
+                '--https-proxy https://user:pass@domain:port'
+            ).split()
+        ).run()
 
-        YouTubeTranscriptApi.get_transcripts.assert_called_once_with(
-            ['v1', 'v2'],
-            languages=['de', 'en'],
-            continue_after_error=True,
+        YouTubeTranscriptApi.list_transcripts.assert_any_call(
+            'v1',
+            proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'}
+        )
+
+        YouTubeTranscriptApi.list_transcripts.assert_any_call(
+            'v2',
             proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'}
         )