Merge pull request #42 from jdepoix/feature/translating-transcripts
Feature/translating transcripts
						commit
						68951600d9
					
				
							
								
								
									
										133
									
								
								README.md
								
								
								
								
							
							
						
						
									
							|  | @ -1,4 +1,5 @@ | ||||||
| # YouTube Transcript/Subtitle API (including automatically generated subtitles) | 
 | ||||||
|  | # YouTube Transcript/Subtitle API (including automatically generated subtitles and subtitle translations)   | ||||||
|    |    | ||||||
| [](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)   | [](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)   | ||||||
| [](https://travis-ci.org/jdepoix/youtube-transcript-api)   | [](https://travis-ci.org/jdepoix/youtube-transcript-api)   | ||||||
|  | @ -7,7 +8,7 @@ | ||||||
| [](https://pypi.org/project/youtube-transcript-api/)   | [](https://pypi.org/project/youtube-transcript-api/)   | ||||||
| [](https://pypi.org/project/youtube-transcript-api/)   | [](https://pypi.org/project/youtube-transcript-api/)   | ||||||
|    |    | ||||||
| This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require a headless browser, like other selenium based solutions do! | This is a Python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, unlike other Selenium-based solutions!   | ||||||
|    |    | ||||||
| ## Install   | ## Install   | ||||||
|    |    | ||||||
|  | @ -23,13 +24,11 @@ If you want to use it from source, you'll have to install the dependencies manua | ||||||
| pip install -r requirements.txt   | pip install -r requirements.txt   | ||||||
| ```   | ```   | ||||||
| 
 | 
 | ||||||
| ## How to use it | You can either integrate this module [into an existing application](#api), or just use it via a [CLI](#cli). | ||||||
|    |    | ||||||
| You could either integrate this module into an existing application, or just use it via an CLI | ## API | ||||||
|    |    | ||||||
| ### In code | The easiest way to get a transcript for a given video is to execute:   | ||||||
| 
 |  | ||||||
| To get a transcript for a given video you can do: |  | ||||||
|    |    | ||||||
| ```python   | ```python   | ||||||
| from youtube_transcript_api import YouTubeTranscriptApi   | from youtube_transcript_api import YouTubeTranscriptApi   | ||||||
|  | @ -61,9 +60,9 @@ You can also add the `languages` param if you want to make sure the transcripts | ||||||
| YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])   | YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])   | ||||||
| ```   | ```   | ||||||
|    |    | ||||||
| It's a list of language codes in a descending priority. In this example it will first try to fetch the german transcript (`'de'`) and then fetch the english transcript (`'en'`) if it fails to do so. As I can't provide a complete list of all working language codes with full certainty, you may have to play around with the language codes a bit, to find the one which is working for you! | It's a list of language codes in a descending priority. In this example it will first try to fetch the german transcript (`'de'`) and then fetch the english transcript (`'en'`) if it fails to do so. If you want to find out which languages are available first, [have a look at `list_transcripts()`](#list-available-transcripts) | ||||||
|    |    | ||||||
| To get transcripts for a list fo video ids you can call: | To get transcripts for a list of video ids you can call:   | ||||||
|    |    | ||||||
| ```python   | ```python   | ||||||
| YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])   | YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])   | ||||||
|  | @ -71,7 +70,100 @@ YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en']) | ||||||
|    |    | ||||||
| `languages` also is optional here.   | `languages` also is optional here.   | ||||||
| 
 | 
 | ||||||
| ### CLI | ### List available transcripts | ||||||
|  | 
 | ||||||
|  | If you want to list all transcripts which are available for a given video you can call: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | transcript_list = YouTubeTranscriptApi.list_transcripts(video_id) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | This will return a `TranscriptList` object, which is iterable and provides methods to filter the list of transcripts for specific languages and types, like: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | transcript = transcript_list.find_transcript(['de', 'en'])   | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | By default, this module always picks a manually created transcript over an automatically generated one if both are available in the requested language. The `TranscriptList` allows you to bypass this default behaviour by searching for specific transcript types: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | # filter for manually created transcripts   | ||||||
|  | transcript = transcript_list.find_manually_created_transcript(['de', 'en'])   | ||||||
|  |    | ||||||
|  | # or automatically generated ones   | ||||||
|  | transcript = transcript_list.find_generated_transcript(['de', 'en']) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | The methods `find_transcript`, `find_manually_created_transcript` and `find_generated_transcript` return `Transcript` objects. They contain metadata regarding the transcript: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | print( | ||||||
|  | 	transcript.video_id,  | ||||||
|  | 	transcript.language,  | ||||||
|  | 	transcript.language_code, | ||||||
|  | 	# whether it has been manually created or generated by YouTube  | ||||||
|  | 	transcript.is_generated, | ||||||
|  | 	# whether this transcript can be translated or not | ||||||
|  | 	transcript.is_translatable, | ||||||
|  | 	# a list of languages the transcript can be translated to  | ||||||
|  | 	transcript.translation_languages,  | ||||||
|  | ) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | They also provide a `fetch()` method, which allows you to fetch the actual transcript data: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | transcript.fetch() | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ### Translate transcript  | ||||||
|  | 
 | ||||||
|  | YouTube has a feature which allows you to automatically translate subtitles. This module also makes it possible to access this feature. To do so `Transcript` objects provide a `translate()` method, which returns a new translated `Transcript` object: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | transcript = transcript_list.find_transcript(['en'])  | ||||||
|  | translated_transcript = transcript.translate('de') | ||||||
|  | print(translated_transcript.fetch()) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ### By example | ||||||
|  | ```python | ||||||
|  | # retrieve the available transcripts   | ||||||
|  | transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')   | ||||||
|  |    | ||||||
|  | # iterate over all available transcripts | ||||||
|  | for transcript in transcript_list: | ||||||
|  | 
 | ||||||
|  |     # the Transcript object provides metadata properties | ||||||
|  |     print( | ||||||
|  |         transcript.video_id, | ||||||
|  |         transcript.language, | ||||||
|  |         transcript.language_code, | ||||||
|  |         # whether it has been manually created or generated by YouTube | ||||||
|  |         transcript.is_generated, | ||||||
|  |         # whether this transcript can be translated or not | ||||||
|  |         transcript.is_translatable, | ||||||
|  |         # a list of languages the transcript can be translated to | ||||||
|  |         transcript.translation_languages, | ||||||
|  |     ) | ||||||
|  |  | ||||||
|  |     # fetch the actual transcript data | ||||||
|  |     print(transcript.fetch()) | ||||||
|  |  | ||||||
|  |     # translating the transcript will return another transcript object | ||||||
|  |     print(transcript.translate('en').fetch()) | ||||||
|  | 	 | ||||||
|  | # you can also directly filter for the language you are looking for, using the transcript list | ||||||
|  | transcript = transcript_list.find_transcript(['de', 'en'])   | ||||||
|  |    | ||||||
|  | # or just filter for manually created transcripts   | ||||||
|  | transcript = transcript_list.find_manually_created_transcript(['de', 'en'])   | ||||||
|  |    | ||||||
|  | # or automatically generated ones   | ||||||
|  | transcript = transcript_list.find_generated_transcript(['de', 'en']) | ||||||
|  | ``` | ||||||
|  |    | ||||||
|  | ## CLI   | ||||||
|    |    | ||||||
| Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:   | Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:   | ||||||
|    |    | ||||||
|  | @ -85,13 +177,32 @@ The CLI also gives you the option to provide a list of preferred languages: | ||||||
| youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en   | youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en   | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
|  | You can also specify if you want to exclude automatically generated or manually created subtitles: | ||||||
|  | 
 | ||||||
|  | ```   | ||||||
|  | youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-generated | ||||||
|  | youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-manually-created | ||||||
|  | ``` | ||||||
|  |    | ||||||
| If you would prefer to write it into a file or pipe it into another application, you can also output the results as json using the following line:   | If you would prefer to write it into a file or pipe it into another application, you can also output the results as json using the following line:   | ||||||
|    |    | ||||||
| ```   | ```   | ||||||
| youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --json > transcripts.json   | youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --json > transcripts.json   | ||||||
| ```   | ```   | ||||||
| 
 | 
 | ||||||
| ### Proxy | Translating transcripts using the CLI is also possible: | ||||||
|  | 
 | ||||||
|  | ```   | ||||||
|  | youtube_transcript_api <first_video_id> <second_video_id> ... --languages en --translate de | ||||||
|  | ```   | ||||||
|  | 
 | ||||||
|  | If you are not sure which languages are available for a given video, you can list all available transcripts: | ||||||
|  | 
 | ||||||
|  | ```   | ||||||
|  | youtube_transcript_api --list-transcripts <first_video_id> | ||||||
|  | ```   | ||||||
|  |    | ||||||
|  | ## Proxy   | ||||||
|    |    | ||||||
| You can specify a https/http proxy, which will be used during the requests to YouTube:   | You can specify a https/http proxy, which will be used during the requests to YouTube:   | ||||||
|    |    | ||||||
|  |  | ||||||
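The README changes above describe two selection rules: language codes are tried in descending priority, and a manually created transcript is preferred over a generated one in the same language. The sketch below is one possible reading of those rules, not the library's implementation. The function name `pick_transcript` and the plain dicts standing in for transcript storage are assumptions made for illustration.

```python
def pick_transcript(manually_created, generated, language_codes):
    """Return the first match, walking language codes in descending priority.

    Both dicts map a language code to some transcript value. They are
    hypothetical stand-ins for whatever the library stores internally.
    """
    for language_code in language_codes:
        # within one language, a manually created transcript wins
        for transcripts in (manually_created, generated):
            if language_code in transcripts:
                return transcripts[language_code]
    raise LookupError('no transcript found for: ' + ', '.join(language_codes))


manual = {'en': 'manual-en'}
generated = {'de': 'generated-de', 'en': 'generated-en'}

# 'de' comes first in the priority list, so it is returned even though
# only a generated transcript exists for it
print(pick_transcript(manual, generated, ['de', 'en']))

# for 'en' both kinds exist, so the manually created one is returned
print(pick_transcript(manual, generated, ['en']))
```

Under this reading, language priority takes precedence over transcript type; whether the library orders the two checks the same way is not confirmed by the diff.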
|  | @ -1,3 +1,11 @@ | ||||||
| from ._api import YouTubeTranscriptApi | from ._api import YouTubeTranscriptApi | ||||||
| from ._transcripts import TranscriptList, Transcript | from ._transcripts import TranscriptList, Transcript | ||||||
| from ._errors import TranscriptsDisabled, NoTranscriptFound, CouldNotRetrieveTranscript, VideoUnavailable | from ._errors import ( | ||||||
|  |     TranscriptsDisabled, | ||||||
|  |     NoTranscriptFound, | ||||||
|  |     CouldNotRetrieveTranscript, | ||||||
|  |     VideoUnavailable, | ||||||
|  |     NotTranslatable, | ||||||
|  |     TranslationLanguageNotAvailable, | ||||||
|  |     NoTranscriptAvailable, | ||||||
|  | ) | ||||||
|  |  | ||||||
|  | @ -4,17 +4,68 @@ from ._transcripts import TranscriptListFetcher | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| class YouTubeTranscriptApi(): | class YouTubeTranscriptApi(): | ||||||
|  |     @classmethod | ||||||
|  |     def list_transcripts(cls, video_id, proxies=None): | ||||||
|  |         """ | ||||||
|  |         Retrieves the list of transcripts which are available for a given video. It returns a `TranscriptList` object | ||||||
|  |         which is iterable and provides methods to filter the list of transcripts for specific languages. While iterating | ||||||
|  |         over the `TranscriptList` the individual transcripts are represented by `Transcript` objects, which provide | ||||||
|  |         metadata and can either be fetched by calling `transcript.fetch()` or translated by calling | ||||||
|  |         `transcript.translate('en')`. Example:: | ||||||
|  | 
 | ||||||
|  |             # retrieve the available transcripts | ||||||
|  |             transcript_list = YouTubeTranscriptApi.list_transcripts('video_id') | ||||||
|  | 
 | ||||||
|  |             # iterate over all available transcripts | ||||||
|  |             for transcript in transcript_list: | ||||||
|  |                 # the Transcript object provides metadata properties | ||||||
|  |                 print( | ||||||
|  |                     transcript.video_id, | ||||||
|  |                     transcript.language, | ||||||
|  |                     transcript.language_code, | ||||||
|  |                     # whether it has been manually created or generated by YouTube | ||||||
|  |                     transcript.is_generated, | ||||||
|  |                     # a list of languages the transcript can be translated to | ||||||
|  |                     transcript.translation_languages, | ||||||
|  |                 ) | ||||||
|  | 
 | ||||||
|  |                 # fetch the actual transcript data | ||||||
|  |                 print(transcript.fetch()) | ||||||
|  | 
 | ||||||
|  |                 # translating the transcript will return another transcript object | ||||||
|  |                 print(transcript.translate('en').fetch()) | ||||||
|  | 
 | ||||||
|  |             # you can also directly filter for the language you are looking for, using the transcript list | ||||||
|  |             transcript = transcript_list.find_transcript(['de', 'en']) | ||||||
|  | 
 | ||||||
|  |             # or just filter for manually created transcripts | ||||||
|  |             transcript = transcript_list.find_manually_created_transcript(['de', 'en']) | ||||||
|  | 
 | ||||||
|  |             # or automatically generated ones | ||||||
|  |             transcript = transcript_list.find_generated_transcript(['de', 'en']) | ||||||
|  | 
 | ||||||
|  |         :param video_id: the youtube video id | ||||||
|  |         :type video_id: str | ||||||
|  |         :param proxies: a dictionary mapping of http and https proxies to be used for the network requests | ||||||
|  |         :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies | ||||||
|  |         :return: the list of available transcripts | ||||||
|  |         :rtype TranscriptList: | ||||||
|  |         """ | ||||||
|  |         with requests.Session() as http_client: | ||||||
|  |             http_client.proxies = proxies if proxies else {} | ||||||
|  |             return TranscriptListFetcher(http_client).fetch(video_id) | ||||||
|  | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|     def get_transcripts(cls, video_ids, languages=('en',), continue_after_error=False, proxies=None): |     def get_transcripts(cls, video_ids, languages=('en',), continue_after_error=False, proxies=None): | ||||||
|         """ |         """ | ||||||
|         Retrieves the transcripts for a list of videos. |         Retrieves the transcripts for a list of videos. | ||||||
| 
 | 
 | ||||||
|         :param video_ids: a list of youtube video ids |         :param video_ids: a list of youtube video ids | ||||||
|         :type video_ids: [str] |         :type video_ids: list[str] | ||||||
|         :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] |         :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] | ||||||
|         it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to |         it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to | ||||||
|         do so. |         do so. | ||||||
|         :type languages: [str] |         :type languages: list[str] | ||||||
|         :param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving |         :param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving | ||||||
|         one of the video transcripts |         one of the video transcripts | ||||||
|         :type continue_after_error: bool |         :type continue_after_error: bool | ||||||
|  | @ -22,7 +73,7 @@ class YouTubeTranscriptApi(): | ||||||
|         :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies |         :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies | ||||||
|         :return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of |         :return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of | ||||||
|         video ids, which could not be retrieved |         video ids, which could not be retrieved | ||||||
|         :rtype: ({str: [{'text': str, 'start': float, 'end': float}]}, [str]}) |         :rtype ({str: [{'text': str, 'start': float, 'duration': float}]}, [str]): | ||||||
|         """ |         """ | ||||||
|         data = {} |         data = {} | ||||||
|         unretrievable_videos = [] |         unretrievable_videos = [] | ||||||
|  | @ -41,19 +92,19 @@ class YouTubeTranscriptApi(): | ||||||
|     @classmethod |     @classmethod | ||||||
|     def get_transcript(cls, video_id, languages=('en',), proxies=None): |     def get_transcript(cls, video_id, languages=('en',), proxies=None): | ||||||
|         """ |         """ | ||||||
|         Retrieves the transcript for a single video. |         Retrieves the transcript for a single video. This is just a shortcut for calling:: | ||||||
|  | 
 | ||||||
|  |             YouTubeTranscriptApi.list_transcripts(video_id, proxies).find_transcript(languages).fetch() | ||||||
| 
 | 
 | ||||||
|         :param video_id: the youtube video id |         :param video_id: the youtube video id | ||||||
|         :type video_id: str |         :type video_id: str | ||||||
|         :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] |         :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] | ||||||
|         it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to |         it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to | ||||||
|         do so. |         do so. | ||||||
|         :type languages: [str] |         :type languages: list[str] | ||||||
|         :param proxies: a dictionary mapping of http and https proxies to be used for the network requests |         :param proxies: a dictionary mapping of http and https proxies to be used for the network requests | ||||||
|         :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies |         :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies | ||||||
|         :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys |         :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys | ||||||
|         :rtype: [{'text': str, 'start': float, 'end': float}] |         :rtype [{'text': str, 'start': float, 'duration': float}]: | ||||||
|         """ |         """ | ||||||
|         with requests.Session() as http_client: |         return cls.list_transcripts(video_id, proxies).find_transcript(languages).fetch() | ||||||
|             http_client.proxies = proxies if proxies else {} |  | ||||||
|             return TranscriptListFetcher(http_client).fetch(video_id).find_transcript(languages).fetch() |  | ||||||
|  |  | ||||||
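The `get_transcripts` docstring above describes a `continue_after_error` flag and a return value made of a dict plus a list of video ids that could not be retrieved. The loop body is not visible in this diff, so the following is only a sketch of how such a collect-or-raise loop could look. `collect_transcripts` and `fake_fetch` are hypothetical names, and the fetch function is injected so the example runs without network access.

```python
def collect_transcripts(video_ids, fetch_one, continue_after_error=False):
    """Fetch each video id, optionally collecting failures instead of raising."""
    data = {}
    unretrievable_videos = []

    for video_id in video_ids:
        try:
            data[video_id] = fetch_one(video_id)
        except Exception:
            if not continue_after_error:
                raise
            unretrievable_videos.append(video_id)

    return data, unretrievable_videos


def fake_fetch(video_id):
    # hypothetical stand-in for the real network call
    if video_id == 'broken':
        raise RuntimeError('could not retrieve transcript')
    return [{'text': 'Hey there', 'start': 0.0, 'duration': 1.5}]


data, failed = collect_transcripts(['ok', 'broken'], fake_fetch, continue_after_error=True)
print(sorted(data), failed)
```

With `continue_after_error=False` the first failure propagates to the caller and no partial result is returned.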
|  | @ -14,22 +14,45 @@ class YouTubeTranscriptCli(): | ||||||
|     def run(self): |     def run(self): | ||||||
|         parsed_args = self._parse_args() |         parsed_args = self._parse_args() | ||||||
| 
 | 
 | ||||||
|  |         if parsed_args.exclude_manually_created and parsed_args.exclude_generated: | ||||||
|  |             return '' | ||||||
|  | 
 | ||||||
|         proxies = None |         proxies = None | ||||||
|         if parsed_args.http_proxy != '' or parsed_args.https_proxy != '': |         if parsed_args.http_proxy != '' or parsed_args.https_proxy != '': | ||||||
|             proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy} |             proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy} | ||||||
| 
 | 
 | ||||||
|         transcripts, unretrievable_videos = YouTubeTranscriptApi.get_transcripts( |         transcripts = [] | ||||||
|             parsed_args.video_ids, |         exceptions = [] | ||||||
|             languages=parsed_args.languages, | 
 | ||||||
|             continue_after_error=True, |         for video_id in parsed_args.video_ids: | ||||||
|             proxies=proxies |             try: | ||||||
|         ) |                 transcripts.append(self._fetch_transcript(parsed_args, proxies, video_id)) | ||||||
|  |             except Exception as exception: | ||||||
|  |                 exceptions.append(exception) | ||||||
| 
 | 
 | ||||||
|         return '\n\n'.join( |         return '\n\n'.join( | ||||||
|             [str(YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)) for video_id in unretrievable_videos] |             [str(exception) for exception in exceptions] | ||||||
|             + ([json.dumps(transcripts) if parsed_args.json else pprint.pformat(transcripts)] if transcripts else []) |             + ([json.dumps(transcripts) if parsed_args.json else pprint.pformat(transcripts)] if transcripts else []) | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|  |     def _fetch_transcript(self, parsed_args, proxies, video_id): | ||||||
|  |         transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies) | ||||||
|  | 
 | ||||||
|  |         if parsed_args.list_transcripts: | ||||||
|  |             return str(transcript_list) | ||||||
|  | 
 | ||||||
|  |         if parsed_args.exclude_manually_created: | ||||||
|  |             transcript = transcript_list.find_generated_transcript(parsed_args.languages) | ||||||
|  |         elif parsed_args.exclude_generated: | ||||||
|  |             transcript = transcript_list.find_manually_created_transcript(parsed_args.languages) | ||||||
|  |         else: | ||||||
|  |             transcript = transcript_list.find_transcript(parsed_args.languages) | ||||||
|  | 
 | ||||||
|  |         if parsed_args.translate: | ||||||
|  |             transcript = transcript.translate(parsed_args.translate) | ||||||
|  | 
 | ||||||
|  |         return transcript.fetch() | ||||||
|  | 
 | ||||||
|     def _parse_args(self): |     def _parse_args(self): | ||||||
|         parser = argparse.ArgumentParser( |         parser = argparse.ArgumentParser( | ||||||
|             description=( |             description=( | ||||||
|  | @ -38,6 +61,13 @@ class YouTubeTranscriptCli(): | ||||||
|                 'other selenium based solutions do!' |                 'other selenium based solutions do!' | ||||||
|             ) |             ) | ||||||
|         ) |         ) | ||||||
|  |         parser.add_argument( | ||||||
|  |             '--list-transcripts', | ||||||
|  |             action='store_const', | ||||||
|  |             const=True, | ||||||
|  |             default=False, | ||||||
|  |             help='This will list the languages in which transcripts are available for the given videos.', | ||||||
|  |         ) | ||||||
|         parser.add_argument('video_ids', nargs='+', type=str, help='List of YouTube video IDs.') |         parser.add_argument('video_ids', nargs='+', type=str, help='List of YouTube video IDs.') | ||||||
|         parser.add_argument( |         parser.add_argument( | ||||||
|             '--languages', |             '--languages', | ||||||
|  | @ -46,11 +76,25 @@ class YouTubeTranscriptCli(): | ||||||
|             type=str, |             type=str, | ||||||
|             help=( |             help=( | ||||||
|                 'A list of language codes in a descending priority. For example, if this is set to "de en" it will ' |                 'A list of language codes in a descending priority. For example, if this is set to "de en" it will ' | ||||||
|                 'first try to fetch the german transcript (de) and then fetch the english transcipt (en) if it fails ' |                 'first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails ' | ||||||
|                 'to do so. As I can\'t provide a complete list of all working language codes with full certainty, you ' |                 'to do so. As I can\'t provide a complete list of all working language codes with full certainty, you ' | ||||||
|                 'may have to play around with the language codes a bit, to find the one which is working for you!' |                 'may have to play around with the language codes a bit, to find the one which is working for you!' | ||||||
|             ), |             ), | ||||||
|         ) |         ) | ||||||
|  |         parser.add_argument( | ||||||
|  |             '--exclude-generated', | ||||||
|  |             action='store_const', | ||||||
|  |             const=True, | ||||||
|  |             default=False, | ||||||
|  |             help='If this flag is set transcripts which have been generated by YouTube will not be retrieved.', | ||||||
|  |         ) | ||||||
|  |         parser.add_argument( | ||||||
|  |             '--exclude-manually-created', | ||||||
|  |             action='store_const', | ||||||
|  |             const=True, | ||||||
|  |             default=False, | ||||||
|  |             help='If this flag is set transcripts which have been manually created will not be retrieved.', | ||||||
|  |         ) | ||||||
|         parser.add_argument( |         parser.add_argument( | ||||||
|             '--json', |             '--json', | ||||||
|             action='store_const', |             action='store_const', | ||||||
|  | @ -59,13 +103,24 @@ class YouTubeTranscriptCli(): | ||||||
|             help='If this flag is set the output will be JSON formatted.', |             help='If this flag is set the output will be JSON formatted.', | ||||||
|         ) |         ) | ||||||
|         parser.add_argument( |         parser.add_argument( | ||||||
|             '--http-proxy', dest='http_proxy', |             '--translate', | ||||||
|             default='', metavar='URL', |             default='', | ||||||
|  |             help=( | ||||||
|  |                 'The language code for the language you want this transcript to be translated to. Use the ' | ||||||
|  |                 '--list-transcripts feature to find out which languages are translatable and which translation ' | ||||||
|  |                 'languages are available.' | ||||||
|  |             ) | ||||||
|  |         ) | ||||||
|  |         parser.add_argument( | ||||||
|  |             '--http-proxy', | ||||||
|  |             default='', | ||||||
|  |             metavar='URL', | ||||||
|             help='Use the specified HTTP proxy.' |             help='Use the specified HTTP proxy.' | ||||||
|         ) |         ) | ||||||
|         parser.add_argument( |         parser.add_argument( | ||||||
|             '--https-proxy', dest='https_proxy', |             '--https-proxy', | ||||||
|             default='', metavar='URL', |             default='', | ||||||
|  |             metavar='URL', | ||||||
|             help='Use the specified HTTPS proxy.' |             help='Use the specified HTTPS proxy.' | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
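To make the interplay of the new CLI flags easier to follow, here is a reduced, standalone sketch of the parser and the finder selection from `_fetch_transcript`. It is not the module's code: `build_parser` and `choose_finder` are hypothetical helpers, `store_true` is used as a shorthand for the `store_const`/`const=True`/`default=False` combination in the diff, and the `nargs` and `default` values for `--languages` are assumptions because those lines are not shown.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('video_ids', nargs='+', type=str)
    # nargs and default are assumed; the diff elides them
    parser.add_argument('--languages', nargs='*', default=['en'], type=str)
    parser.add_argument('--list-transcripts', action='store_true')
    parser.add_argument('--exclude-generated', action='store_true')
    parser.add_argument('--exclude-manually-created', action='store_true')
    parser.add_argument('--translate', default='')
    return parser


def choose_finder(parsed_args):
    """Return the name of the TranscriptList method the flags select."""
    if parsed_args.exclude_manually_created and parsed_args.exclude_generated:
        return None  # nothing left to retrieve, mirrors the early return in run()
    if parsed_args.exclude_manually_created:
        return 'find_generated_transcript'
    if parsed_args.exclude_generated:
        return 'find_manually_created_transcript'
    return 'find_transcript'


args = build_parser().parse_args(
    ['abc123', '--languages', 'de', 'en', '--exclude-generated', '--translate', 'fr']
)
print(args.video_ids, args.languages, args.translate, choose_finder(args))
```

argparse turns `--exclude-manually-created` into the attribute `exclude_manually_created`, which is why the diff can drop the explicit `dest=` arguments on the proxy options.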
|  | @ -43,6 +43,18 @@ class TranscriptsDisabled(CouldNotRetrieveTranscript): | ||||||
|     CAUSE_MESSAGE = 'Subtitles are disabled for this video' |     CAUSE_MESSAGE = 'Subtitles are disabled for this video' | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | class NoTranscriptAvailable(CouldNotRetrieveTranscript): | ||||||
|  |     CAUSE_MESSAGE = 'No transcripts are available for this video' | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | class NotTranslatable(CouldNotRetrieveTranscript): | ||||||
|  |     CAUSE_MESSAGE = 'The requested language is not translatable' | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | class TranslationLanguageNotAvailable(CouldNotRetrieveTranscript): | ||||||
|  |     CAUSE_MESSAGE = 'The requested translation language is not available' | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| class NoTranscriptFound(CouldNotRetrieveTranscript): | class NoTranscriptFound(CouldNotRetrieveTranscript): | ||||||
|     CAUSE_MESSAGE = ( |     CAUSE_MESSAGE = ( | ||||||
|         'No transcripts were found for any of the requested language codes: {requested_language_codes}\n\n' |         'No transcripts were found for any of the requested language codes: {requested_language_codes}\n\n' | ||||||
|  |  | ||||||
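All three new exceptions subclass `CouldNotRetrieveTranscript` and only override `CAUSE_MESSAGE`, so callers can catch the base class and still get a specific message. The base class itself is not part of this diff; the version below is a hypothetical reconstruction for illustration, and its message format is an assumption.

```python
class CouldNotRetrieveTranscript(Exception):
    # hypothetical base class; the real one is not shown in this diff
    CAUSE_MESSAGE = ''

    def __init__(self, video_id):
        self.video_id = video_id
        super().__init__(
            'Could not retrieve a transcript for the video {}: {}'.format(
                video_id, self.CAUSE_MESSAGE
            )
        )


class NotTranslatable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = 'The requested language is not translatable'


class TranslationLanguageNotAvailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = 'The requested translation language is not available'


def describe(error_class, video_id):
    """Raise the given error and show that catching the base class is enough."""
    try:
        raise error_class(video_id)
    except CouldNotRetrieveTranscript as error:
        return str(error)


print(describe(NotTranslatable, 'abc123'))
print(describe(TranslationLanguageNotAvailable, 'abc123'))
```

This is also why the CLI change above can simply collect `str(exception)` for each failed video id.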
|  | @ -12,7 +12,14 @@ from xml.etree import ElementTree | ||||||
| import re | import re | ||||||
| 
 | 
 | ||||||
| from ._html_unescaping import unescape | from ._html_unescaping import unescape | ||||||
| from ._errors import VideoUnavailable, NoTranscriptFound, TranscriptsDisabled | from ._errors import ( | ||||||
|  |     VideoUnavailable, | ||||||
|  |     NoTranscriptFound, | ||||||
|  |     TranscriptsDisabled, | ||||||
|  |     NotTranslatable, | ||||||
|  |     TranslationLanguageNotAvailable, | ||||||
|  |     NoTranscriptAvailable, | ||||||
|  | ) | ||||||
| from ._settings import WATCH_URL | from ._settings import WATCH_URL | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -36,9 +43,14 @@ class TranscriptListFetcher(): | ||||||
| 
 | 
 | ||||||
|             raise TranscriptsDisabled(video_id) |             raise TranscriptsDisabled(video_id) | ||||||
| 
 | 
 | ||||||
|         return json.loads(splitted_html[1].split(',"videoDetails')[0].replace('\n', ''))[ |         captions_json = json.loads( | ||||||
|             'playerCaptionsTracklistRenderer' |             splitted_html[1].split(',"videoDetails')[0].replace('\n', '') | ||||||
|         ] |         )['playerCaptionsTracklistRenderer'] | ||||||
|  | 
 | ||||||
|  |         if 'captionTracks' not in captions_json: | ||||||
|  |             raise NoTranscriptAvailable(video_id) | ||||||
|  | 
 | ||||||
|  |         return captions_json | ||||||
| 
 | 
 | ||||||
|     def _fetch_html(self, video_id): |     def _fetch_html(self, video_id): | ||||||
|         return self._http_client.get(WATCH_URL.format(video_id=video_id)).text.replace( |         return self._http_client.get(WATCH_URL.format(video_id=video_id)).text.replace( | ||||||
|  | @ -53,10 +65,7 @@ class TranscriptList(): | ||||||
|     This object represents a list of transcripts. It can be iterated over to list all transcripts which are available |     This object represents a list of transcripts. It can be iterated over to list all transcripts which are available | ||||||
|     for a given YouTube video. Also it provides functionality to search for a transcript in a given language. |     for a given YouTube video. Also it provides functionality to search for a transcript in a given language. | ||||||
|     """ |     """ | ||||||
| 
 |     def __init__(self, video_id, manually_created_transcripts, generated_transcripts, translation_languages): | ||||||
|     # TODO implement iterator |  | ||||||
| 
 |  | ||||||
|     def __init__(self, video_id, manually_created_transcripts, generated_transcripts): |  | ||||||
|         """ |         """ | ||||||
|         The constructor is only for internal use. Use the static build method instead. |         The constructor is only for internal use. Use the static build method instead. | ||||||
| 
 | 
 | ||||||
|  | @ -66,10 +75,13 @@ class TranscriptList(): | ||||||
|         :type manually_created_transcripts: dict[str, Transcript] |         :type manually_created_transcripts: dict[str, Transcript] | ||||||
|         :param generated_transcripts: dict mapping language codes to the generated transcripts |         :param generated_transcripts: dict mapping language codes to the generated transcripts | ||||||
|         :type generated_transcripts: dict[str, Transcript] |         :type generated_transcripts: dict[str, Transcript] | ||||||
|  |         :param translation_languages: list of languages which can be used for translatable languages | ||||||
|  |         :type translation_languages: list[dict[str, str]] | ||||||
|         """ |         """ | ||||||
|         self.video_id = video_id |         self.video_id = video_id | ||||||
|         self._manually_created_transcripts = manually_created_transcripts |         self._manually_created_transcripts = manually_created_transcripts | ||||||
|         self._generated_transcripts = generated_transcripts |         self._generated_transcripts = generated_transcripts | ||||||
|  |         self._translation_languages = translation_languages | ||||||
| 
 | 
 | ||||||
|     @staticmethod |     @staticmethod | ||||||
|     def build(http_client, video_id, captions_json): |     def build(http_client, video_id, captions_json): | ||||||
|  | @ -83,7 +95,7 @@ class TranscriptList(): | ||||||
|         :param captions_json: the JSON parsed from the YouTube pages static HTML |         :param captions_json: the JSON parsed from the YouTube pages static HTML | ||||||
|         :type captions_json: dict |         :type captions_json: dict | ||||||
|         :return: the created TranscriptList |         :return: the created TranscriptList | ||||||
|         :rtype TranscriptList |         :rtype TranscriptList: | ||||||
|         """ |         """ | ||||||
|         translation_languages = [ |         translation_languages = [ | ||||||
|             { |             { | ||||||
|  | @ -108,15 +120,19 @@ class TranscriptList(): | ||||||
|                 caption['name']['simpleText'], |                 caption['name']['simpleText'], | ||||||
|                 caption['languageCode'], |                 caption['languageCode'], | ||||||
|                 caption.get('kind', '') == 'asr', |                 caption.get('kind', '') == 'asr', | ||||||
|                 translation_languages if caption['isTranslatable'] else [] |                 translation_languages if caption.get('isTranslatable', False) else [] | ||||||
|             ) |             ) | ||||||
| 
 | 
 | ||||||
|         return TranscriptList( |         return TranscriptList( | ||||||
|             video_id, |             video_id, | ||||||
|             manually_created_transcripts, |             manually_created_transcripts, | ||||||
|             generated_transcripts, |             generated_transcripts, | ||||||
|  |             translation_languages, | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|  |     def __iter__(self): | ||||||
|  |         return iter(list(self._manually_created_transcripts.values()) + list(self._generated_transcripts.values())) | ||||||
|  | 
 | ||||||
|     def find_transcript(self, language_codes): |     def find_transcript(self, language_codes): | ||||||
|         """ |         """ | ||||||
|         Finds a transcript for a given language code. Manually created transcripts are returned first and only if none |         Finds a transcript for a given language code. Manually created transcripts are returned first and only if none | ||||||
|  | @ -126,9 +142,9 @@ class TranscriptList(): | ||||||
|         :param language_codes: A list of language codes in a descending priority. For example, if this is set to |         :param language_codes: A list of language codes in a descending priority. For example, if this is set to | ||||||
|         ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if |         ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if | ||||||
|         it fails to do so. |         it fails to do so. | ||||||
|         :type languages: [str] |         :type languages: list[str] | ||||||
|         :return: the found Transcript |         :return: the found Transcript | ||||||
|         :rtype: Transcript |         :rtype Transcript: | ||||||
|         :raises: NoTranscriptFound |         :raises: NoTranscriptFound | ||||||
|         """ |         """ | ||||||
|         return self._find_transcript(language_codes, [self._manually_created_transcripts, self._generated_transcripts]) |         return self._find_transcript(language_codes, [self._manually_created_transcripts, self._generated_transcripts]) | ||||||
|  | @ -140,9 +156,9 @@ class TranscriptList(): | ||||||
|         :param language_codes: A list of language codes in a descending priority. For example, if this is set to |         :param language_codes: A list of language codes in a descending priority. For example, if this is set to | ||||||
|         ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if |         ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if | ||||||
|         it fails to do so. |         it fails to do so. | ||||||
|         :type languages: [str] |         :type languages: list[str] | ||||||
|         :return: the found Transcript |         :return: the found Transcript | ||||||
|         :rtype: Transcript |         :rtype Transcript: | ||||||
|         :raises: NoTranscriptFound |         :raises: NoTranscriptFound | ||||||
|         """ |         """ | ||||||
|         return self._find_transcript(language_codes, [self._generated_transcripts,]) |         return self._find_transcript(language_codes, [self._generated_transcripts,]) | ||||||
|  | @ -154,9 +170,9 @@ class TranscriptList(): | ||||||
|         :param language_codes: A list of language codes in a descending priority. For example, if this is set to |         :param language_codes: A list of language codes in a descending priority. For example, if this is set to | ||||||
|         ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if |         ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if | ||||||
|         it fails to do so. |         it fails to do so. | ||||||
|         :type languages: [str] |         :type languages: list[str] | ||||||
|         :return: the found Transcript |         :return: the found Transcript | ||||||
|         :rtype: Transcript |         :rtype Transcript: | ||||||
|         :raises: NoTranscriptFound |         :raises: NoTranscriptFound | ||||||
|         """ |         """ | ||||||
|         return self._find_transcript(language_codes, [self._manually_created_transcripts,]) |         return self._find_transcript(language_codes, [self._manually_created_transcripts,]) | ||||||
|  | @ -179,22 +195,28 @@ class TranscriptList(): | ||||||
|             '(MANUALLY CREATED)\n' |             '(MANUALLY CREATED)\n' | ||||||
|             '{available_manually_created_transcript_languages}\n\n' |             '{available_manually_created_transcript_languages}\n\n' | ||||||
|             '(GENERATED)\n' |             '(GENERATED)\n' | ||||||
|             '{available_generated_transcripts}' |             '{available_generated_transcripts}\n\n' | ||||||
|  |             '(TRANSLATION LANGUAGES)\n' | ||||||
|  |             '{available_translation_languages}' | ||||||
|         ).format( |         ).format( | ||||||
|             video_id=self.video_id, |             video_id=self.video_id, | ||||||
|             available_manually_created_transcript_languages=self._get_language_description( |             available_manually_created_transcript_languages=self._get_language_description( | ||||||
|                 self._manually_created_transcripts.values() |                 str(transcript) for transcript in self._manually_created_transcripts.values() | ||||||
|             ), |             ), | ||||||
|             available_generated_transcripts=self._get_language_description( |             available_generated_transcripts=self._get_language_description( | ||||||
|                 self._generated_transcripts.values() |                 str(transcript) for transcript in self._generated_transcripts.values() | ||||||
|             ), |             ), | ||||||
|  |             available_translation_languages=self._get_language_description( | ||||||
|  |                 '{language_code} ("{language}")'.format( | ||||||
|  |                     language=translation_language['language'], | ||||||
|  |                     language_code=translation_language['language_code'], | ||||||
|  |                 ) for translation_language in self._translation_languages | ||||||
|  |             ) | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|     def _get_language_description(self, transcripts): |     def _get_language_description(self, transcript_strings): | ||||||
|         return '\n'.join( |         description = '\n'.join(' - {transcript}'.format(transcript=transcript) for transcript in transcript_strings) | ||||||
|             ' - {transcript}'.format(transcript=str(transcript)) |         return description if description else 'None' | ||||||
|             for transcript in transcripts |  | ||||||
|         ) if transcripts else 'None' |  | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| class Transcript(): | class Transcript(): | ||||||
|  | @ -220,45 +242,49 @@ class Transcript(): | ||||||
|         self.language_code = language_code |         self.language_code = language_code | ||||||
|         self.is_generated = is_generated |         self.is_generated = is_generated | ||||||
|         self.translation_languages = translation_languages |         self.translation_languages = translation_languages | ||||||
|  |         self._translation_languages_dict = { | ||||||
|  |             translation_language['language_code']: translation_language['language'] | ||||||
|  |             for translation_language in translation_languages | ||||||
|  |         } | ||||||
| 
 | 
 | ||||||
|     def fetch(self): |     def fetch(self): | ||||||
|         """ |         """ | ||||||
|         Loads the actual transcript data. |         Loads the actual transcript data. | ||||||
| 
 | 
 | ||||||
|         :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys |         :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys | ||||||
|         :rtype: [{'text': str, 'start': float, 'end': float}] |         :rtype [{'text': str, 'start': float, 'end': float}]: | ||||||
|         """ |         """ | ||||||
|         return _TranscriptParser().parse( |         return _TranscriptParser().parse( | ||||||
|             self._http_client.get(self._url).text |             self._http_client.get(self._url).text | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|     def __str__(self): |     def __str__(self): | ||||||
|         return '{language_code} ("{language}")'.format( |         return '{language_code} ("{language}"){translation_description}'.format( | ||||||
|             language=self.language, |             language=self.language, | ||||||
|             language_code=self.language_code, |             language_code=self.language_code, | ||||||
|  |             translation_description='[TRANSLATABLE]' if self.is_translatable else '' | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
| # TODO integrate translations in future release |     @property | ||||||
| #     @property |     def is_translatable(self): | ||||||
| #     def is_translatable(self): |         return len(self.translation_languages) > 0 | ||||||
| #         return len(self.translation_languages) > 0 | 
 | ||||||
| # |     def translate(self, language_code): | ||||||
| # |         if not self.is_translatable: | ||||||
| # class TranslatableTranscript(Transcript): |             raise NotTranslatable(self.video_id) | ||||||
| #     def __init__(self, http_client, url, translation_languages): | 
 | ||||||
| #         super(TranslatableTranscript, self).__init__(http_client, url) |         if language_code not in self._translation_languages_dict: | ||||||
| #         self._translation_languages = translation_languages |             raise TranslationLanguageNotAvailable(self.video_id) | ||||||
| #         self._translation_language_codes = {language['language_code'] for language in translation_languages} | 
 | ||||||
| # |         return Transcript( | ||||||
| # |             self._http_client, | ||||||
| #     def translate(self, language_code): |             self.video_id, | ||||||
| #         if language_code not in self._translation_language_codes: |             '{url}&tlang={language_code}'.format(url=self._url, language_code=language_code), | ||||||
| #             raise TranslatableTranscript.TranslationLanguageNotAvailable() |             self._translation_languages_dict[language_code], | ||||||
| # |             language_code, | ||||||
| #         return Transcript( |             True, | ||||||
| #             self._http_client, |             [], | ||||||
| #             '{url}&tlang={language_code}'.format(url=self._url, language_code=language_code) |         ) | ||||||
| #         ) |  | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| class _TranscriptParser(): | class _TranscriptParser(): | ||||||
|  | @ -269,7 +295,7 @@ class _TranscriptParser(): | ||||||
|             { |             { | ||||||
|                 'text': re.sub(self.HTML_TAG_REGEX, '', unescape(xml_element.text)), |                 'text': re.sub(self.HTML_TAG_REGEX, '', unescape(xml_element.text)), | ||||||
|                 'start': float(xml_element.attrib['start']), |                 'start': float(xml_element.attrib['start']), | ||||||
|                 'duration': float(xml_element.attrib['dur']), |                 'duration': float(xml_element.attrib.get('dur', '0.0')), | ||||||
|             } |             } | ||||||
|             for xml_element in ElementTree.fromstring(plain_data) |             for xml_element in ElementTree.fromstring(plain_data) | ||||||
|             if xml_element.text is not None |             if xml_element.text is not None | ||||||
|  |  | ||||||
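A note on the `_get_language_description` change above: the callers now pass generator expressions, and a generator object is always truthy, so a check of the form `... if transcripts else 'None'` cannot detect an empty input. The sketch below illustrates this in isolation. The names `describe_old` and `describe_new` are hypothetical and only mirror the shape of the two versions; they are not part of the library.

```python
def describe_old(items):
    # Pre-change shape: truthiness is tested on the input object itself.
    # A generator is always truthy, even when it yields nothing.
    return '\n'.join(' - {item}'.format(item=item) for item in items) if items else 'None'


def describe_new(items):
    # Post-change shape: build the string first, then test the result.
    description = '\n'.join(' - {item}'.format(item=item) for item in items)
    return description if description else 'None'


old_result = describe_old(str(item) for item in [])
new_result = describe_new(str(item) for item in [])
filled_result = describe_new(str(item) for item in ['de', 'en'])

print(repr(old_result))
print(repr(new_result))
print(filled_result)
```

With an empty generator, the old shape yields an empty string while the new shape falls back to `'None'`, which appears to be the intent of the change.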
										
											
File diff suppressed because one or more lines are too long
							|  | @ -5,7 +5,15 @@ import os | ||||||
| 
 | 
 | ||||||
| import httpretty | import httpretty | ||||||
| 
 | 
 | ||||||
| from youtube_transcript_api import YouTubeTranscriptApi, VideoUnavailable, NoTranscriptFound, TranscriptsDisabled | from youtube_transcript_api import ( | ||||||
|  |     YouTubeTranscriptApi, | ||||||
|  |     TranscriptsDisabled, | ||||||
|  |     NoTranscriptFound, | ||||||
|  |     VideoUnavailable, | ||||||
|  |     NoTranscriptAvailable, | ||||||
|  |     NotTranslatable, | ||||||
|  |     TranslationLanguageNotAvailable, | ||||||
|  | ) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def load_asset(filename): | def load_asset(filename): | ||||||
|  | @ -42,6 +50,51 @@ class TestYouTubeTranscriptApi(TestCase): | ||||||
|             ] |             ] | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|  |     def test_list_transcripts(self): | ||||||
|  |         transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8') | ||||||
|  | 
 | ||||||
|  |         language_codes = {transcript.language_code for transcript in transcript_list} | ||||||
|  | 
 | ||||||
|  |         self.assertEqual(language_codes, {'zh', 'de', 'en', 'hi', 'ja', 'ko', 'es', 'cs', 'en'}) | ||||||
|  | 
 | ||||||
|  |     def test_list_transcripts__find_manually_created(self): | ||||||
|  |         transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8') | ||||||
|  |         transcript = transcript_list.find_manually_created_transcript(['cs']) | ||||||
|  | 
 | ||||||
|  |         self.assertFalse(transcript.is_generated) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  |     def test_list_transcripts__find_generated(self): | ||||||
|  |         transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8') | ||||||
|  | 
 | ||||||
|  |         with self.assertRaises(NoTranscriptFound): | ||||||
|  |             transcript_list.find_generated_transcript(['cs']) | ||||||
|  | 
 | ||||||
|  |         transcript = transcript_list.find_generated_transcript(['en']) | ||||||
|  | 
 | ||||||
|  |         self.assertTrue(transcript.is_generated) | ||||||
|  | 
 | ||||||
|  |     def test_translate_transcript(self): | ||||||
|  |         transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en']) | ||||||
|  | 
 | ||||||
|  |         translated_transcript = transcript.translate('af') | ||||||
|  | 
 | ||||||
|  |         self.assertEqual(translated_transcript.language_code, 'af') | ||||||
|  |         self.assertIn('&tlang=af', translated_transcript._url) | ||||||
|  | 
 | ||||||
|  |     def test_translate_transcript__translation_language_not_available(self): | ||||||
|  |         transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en']) | ||||||
|  | 
 | ||||||
|  |         with self.assertRaises(TranslationLanguageNotAvailable): | ||||||
|  |             transcript.translate('xyz') | ||||||
|  | 
 | ||||||
|  |     def test_translate_transcript__not_translatable(self): | ||||||
|  |         transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en']) | ||||||
|  |         transcript.translation_languages = [] | ||||||
|  | 
 | ||||||
|  |         with self.assertRaises(NotTranslatable): | ||||||
|  |             transcript.translate('af') | ||||||
|  | 
 | ||||||
|     def test_get_transcript__correct_language_is_used(self): |     def test_get_transcript__correct_language_is_used(self): | ||||||
|         YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', ['de', 'en']) |         YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', ['de', 'en']) | ||||||
|         query_string = httpretty.last_request().querystring |         query_string = httpretty.last_request().querystring | ||||||
|  | @ -88,6 +141,16 @@ class TestYouTubeTranscriptApi(TestCase): | ||||||
|         with self.assertRaises(NoTranscriptFound): |         with self.assertRaises(NoTranscriptFound): | ||||||
|             YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', languages=['cz']) |             YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', languages=['cz']) | ||||||
| 
 | 
 | ||||||
|  |     def test_get_transcript__exception_if_no_transcript_available(self): | ||||||
|  |         httpretty.register_uri( | ||||||
|  |             httpretty.GET, | ||||||
|  |             'https://www.youtube.com/watch', | ||||||
|  |             body=load_asset('youtube_no_transcript_available.html.static') | ||||||
|  |         ) | ||||||
|  | 
 | ||||||
|  |         with self.assertRaises(NoTranscriptAvailable): | ||||||
|  |             YouTubeTranscriptApi.get_transcript('MwBPvcYFY2E') | ||||||
|  | 
 | ||||||
|     def test_get_transcripts(self): |     def test_get_transcripts(self): | ||||||
|         video_id_1 = 'video_id_1' |         video_id_1 = 'video_id_1' | ||||||
|         video_id_2 = 'video_id_2' |         video_id_2 = 'video_id_2' | ||||||
|  |  | ||||||
|  | @ -3,10 +3,27 @@ from mock import MagicMock | ||||||
| 
 | 
 | ||||||
| import json | import json | ||||||
| 
 | 
 | ||||||
| from youtube_transcript_api._cli import YouTubeTranscriptCli, YouTubeTranscriptApi | from youtube_transcript_api import YouTubeTranscriptApi, VideoUnavailable | ||||||
|  | from youtube_transcript_api._cli import YouTubeTranscriptCli | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| class TestYouTubeTranscriptCli(TestCase): | class TestYouTubeTranscriptCli(TestCase): | ||||||
|  |     def setUp(self): | ||||||
|  |         self.transcript_mock = MagicMock() | ||||||
|  |         self.transcript_mock.fetch = MagicMock(return_value=[ | ||||||
|  |             {'text': 'Hey, this is just a test', 'start': 0.0, 'duration': 1.54}, | ||||||
|  |             {'text': 'this is not the original transcript', 'start': 1.54, 'duration': 4.16}, | ||||||
|  |             {'text': 'just something shorter, I made up for testing', 'start': 5.7, 'duration': 3.239} | ||||||
|  |         ]) | ||||||
|  |         self.transcript_mock.translate = MagicMock(return_value=self.transcript_mock) | ||||||
|  | 
 | ||||||
|  |         self.transcript_list_mock = MagicMock() | ||||||
|  |         self.transcript_list_mock.find_generated_transcript = MagicMock(return_value=self.transcript_mock) | ||||||
|  |         self.transcript_list_mock.find_manually_created_transcript = MagicMock(return_value=self.transcript_mock) | ||||||
|  |         self.transcript_list_mock.find_transcript = MagicMock(return_value=self.transcript_mock) | ||||||
|  | 
 | ||||||
|  |         YouTubeTranscriptApi.list_transcripts = MagicMock(return_value=self.transcript_list_mock) | ||||||
|  | 
 | ||||||
|     def test_argument_parsing(self): |     def test_argument_parsing(self): | ||||||
|         parsed_args = YouTubeTranscriptCli('v1 v2 --json --languages de en'.split())._parse_args() |         parsed_args = YouTubeTranscriptCli('v1 v2 --json --languages de en'.split())._parse_args() | ||||||
|         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  | @ -106,32 +123,107 @@ class TestYouTubeTranscriptCli(TestCase): | ||||||
|         self.assertEqual(parsed_args.http_proxy, '') |         self.assertEqual(parsed_args.http_proxy, '') | ||||||
|         self.assertEqual(parsed_args.https_proxy, '') |         self.assertEqual(parsed_args.https_proxy, '') | ||||||
| 
 | 
 | ||||||
|  |     def test_argument_parsing__list_transcripts(self): | ||||||
|  |         parsed_args = YouTubeTranscriptCli('--list-transcripts v1 v2'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertTrue(parsed_args.list_transcripts) | ||||||
|  | 
 | ||||||
|  |         parsed_args = YouTubeTranscriptCli('v1 v2 --list-transcripts'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertTrue(parsed_args.list_transcripts) | ||||||
|  | 
 | ||||||
|  |     def test_argument_parsing__translate(self): | ||||||
|  |         parsed_args = YouTubeTranscriptCli('v1 v2 --languages de en --translate cz'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertEqual(parsed_args.json, False) | ||||||
|  |         self.assertEqual(parsed_args.languages, ['de', 'en']) | ||||||
|  |         self.assertEqual(parsed_args.translate, 'cz') | ||||||
|  | 
 | ||||||
|  |         parsed_args = YouTubeTranscriptCli('v1 v2 --translate cz --languages de en'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertEqual(parsed_args.json, False) | ||||||
|  |         self.assertEqual(parsed_args.languages, ['de', 'en']) | ||||||
|  |         self.assertEqual(parsed_args.translate, 'cz') | ||||||
|  | 
 | ||||||
|  |     def test_argument_parsing__manually_or_generated(self): | ||||||
|  |         parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-manually-created'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertTrue(parsed_args.exclude_manually_created) | ||||||
|  |         self.assertFalse(parsed_args.exclude_generated) | ||||||
|  | 
 | ||||||
|  |         parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-generated'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertFalse(parsed_args.exclude_manually_created) | ||||||
|  |         self.assertTrue(parsed_args.exclude_generated) | ||||||
|  | 
 | ||||||
|  |         parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-manually-created --exclude-generated'.split())._parse_args() | ||||||
|  |         self.assertEqual(parsed_args.video_ids, ['v1', 'v2']) | ||||||
|  |         self.assertTrue(parsed_args.exclude_manually_created) | ||||||
|  |         self.assertTrue(parsed_args.exclude_generated) | ||||||
|  | 
 | ||||||
|     def test_run(self): |     def test_run(self): | ||||||
|         YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([], [])) |  | ||||||
|         YouTubeTranscriptCli('v1 v2 --languages de en'.split()).run() |         YouTubeTranscriptCli('v1 v2 --languages de en'.split()).run() | ||||||
| 
 | 
 | ||||||
|         YouTubeTranscriptApi.get_transcripts.assert_called_once_with( |         YouTubeTranscriptApi.list_transcripts.assert_any_call('v1', proxies=None) | ||||||
|             ['v1', 'v2'], |         YouTubeTranscriptApi.list_transcripts.assert_any_call('v2', proxies=None) | ||||||
|             languages=['de', 'en'], | 
 | ||||||
|             continue_after_error=True, |         self.transcript_list_mock.find_transcript.assert_any_call(['de', 'en']) | ||||||
|             proxies=None | 
 | ||||||
|  |     def test_run__failing_transcripts(self): | ||||||
|  |         YouTubeTranscriptApi.list_transcripts = MagicMock(side_effect=VideoUnavailable('video_id')) | ||||||
|  | 
 | ||||||
|  |         output = YouTubeTranscriptCli('v1 --languages de en'.split()).run() | ||||||
|  | 
 | ||||||
|  |         self.assertEqual(output, str(VideoUnavailable('video_id'))) | ||||||
|  | 
 | ||||||
|  |     def test_run__exclude_generated(self): | ||||||
|  |         YouTubeTranscriptCli('v1 v2 --languages de en --exclude-generated'.split()).run() | ||||||
|  | 
 | ||||||
|  |         self.transcript_list_mock.find_manually_created_transcript.assert_any_call(['de', 'en']) | ||||||
|  | 
 | ||||||
|  |     def test_run__exclude_manually_created(self): | ||||||
|  |         YouTubeTranscriptCli('v1 v2 --languages de en --exclude-manually-created'.split()).run() | ||||||
|  | 
 | ||||||
|  |         self.transcript_list_mock.find_generated_transcript.assert_any_call(['de', 'en']) | ||||||
|  | 
 | ||||||
|  |     def test_run__exclude_manually_created_and_generated(self): | ||||||
|  |         self.assertEqual( | ||||||
|  |             YouTubeTranscriptCli('v1 v2 --languages de en --exclude-manually-created --exclude-generated'.split()).run(), | ||||||
|  |             '' | ||||||
|         ) |         ) | ||||||
| 
 | 
 | ||||||
|  |     def test_run__translate(self): | ||||||
|  |         YouTubeTranscriptCli('v1 v2 --languages de en --translate cz'.split()).run(), | ||||||
|  | 
 | ||||||
|  |         self.transcript_mock.translate.assert_any_call('cz') | ||||||
|  | 
 | ||||||
|  |     def test_run__list_transcripts(self): | ||||||
|  |         YouTubeTranscriptCli('--list-transcripts v1 v2'.split()).run() | ||||||
|  | 
 | ||||||
|  |         YouTubeTranscriptApi.list_transcripts.assert_any_call('v1', proxies=None) | ||||||
|  |         YouTubeTranscriptApi.list_transcripts.assert_any_call('v2', proxies=None) | ||||||
|  | 
 | ||||||
|     def test_run__json_output(self): |     def test_run__json_output(self): | ||||||
|         YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([{'boolean': True}], [])) |  | ||||||
|         output = YouTubeTranscriptCli('v1 v2 --languages de en --json'.split()).run() |         output = YouTubeTranscriptCli('v1 v2 --languages de en --json'.split()).run() | ||||||
| 
 | 
 | ||||||
|         # will fail if output is not valid json |         # will fail if output is not valid json | ||||||
|         json.loads(output) |         json.loads(output) | ||||||
| 
 | 
 | ||||||
|     def test_run__proxies(self): |     def test_run__proxies(self): | ||||||
|         YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([], [])) |  | ||||||
|         YouTubeTranscriptCli( |         YouTubeTranscriptCli( | ||||||
|             'v1 v2 --languages de en --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port'.split()).run() |             ( | ||||||
|  |                 'v1 v2 --languages de en ' | ||||||
|  |                 '--http-proxy http://user:pass@domain:port ' | ||||||
|  |                 '--https-proxy https://user:pass@domain:port' | ||||||
|  |             ).split() | ||||||
|  |         ).run() | ||||||
| 
 | 
 | ||||||
|         YouTubeTranscriptApi.get_transcripts.assert_called_once_with( |         YouTubeTranscriptApi.list_transcripts.assert_any_call( | ||||||
|             ['v1', 'v2'], |             'v1', | ||||||
|             languages=['de', 'en'], |             proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'} | ||||||
|             continue_after_error=True, |         ) | ||||||
|  | 
 | ||||||
|  |         YouTubeTranscriptApi.list_transcripts.assert_any_call( | ||||||
|  |             'v2', | ||||||
|             proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'} |             proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'} | ||||||
|         ) |         ) | ||||||
|  |  | ||||||
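The CLI tests above switch from `assert_called_once_with` on `get_transcripts` to `assert_any_call` on `list_transcripts`, presumably because the CLI now calls `list_transcripts` once per video ID. A minimal sketch of the difference, assuming Python 3's standard `unittest.mock` (the diff itself imports `MagicMock` from the `mock` package) and a hypothetical stand-in mock rather than the real API:

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for YouTubeTranscriptApi.list_transcripts.
list_transcripts = MagicMock(return_value='transcript-list')

# Simulate one call per video ID.
for video_id in ['v1', 'v2']:
    list_transcripts(video_id, proxies=None)

# assert_any_call passes if at least one recorded call matches.
list_transcripts.assert_any_call('v1', proxies=None)
list_transcripts.assert_any_call('v2', proxies=None)

# assert_called_once_with requires exactly one call in total,
# so it raises AssertionError here.
try:
    list_transcripts.assert_called_once_with('v1', proxies=None)
    called_once = True
except AssertionError:
    called_once = False

print(list_transcripts.call_count, called_once)
```

Note that `assert_any_call` does not check how many times the mock was called, so a test that also cares about the call count would need a separate assertion on `call_count`.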