Merge pull request #1 from jdepoix/master

Updating to current master
danielcliu 2020-01-09 18:56:45 -08:00 committed by GitHub
commit edefeeaf1d
GPG Key ID: 4AEE18F83AFDEB23
14 changed files with 6707 additions and 210 deletions

206
README.md

@ -1,13 +1,9 @@
# YouTube Transcript/Subtitle API (including automatically generated subtitles)
[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) # YouTube Transcript/Subtitle API (including automatically generated subtitles and subtitle translations)
[![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api)
[![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master) [![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) [![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api) [![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master) [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT) [![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/) [![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
[![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT)
[![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
[![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require a headless browser, like other selenium based solutions do! This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles and it does not require a headless browser, like other selenium based solutions do!
## Install ## Install
@ -23,13 +19,11 @@ If you want to use it from source, you'll have to install the dependencies manua
pip install -r requirements.txt pip install -r requirements.txt
``` ```
## How to use it You can either integrate this module [into an existing application](#api), or just use it via an [CLI](#cli).
You could either integrate this module into an existing application, or just use it via an CLI ## API
### In code The easiest way to get a transcript for a given video is to execute:
To get a transcript for a given video you can do:
```python ```python
from youtube_transcript_api import YouTubeTranscriptApi from youtube_transcript_api import YouTubeTranscriptApi
@ -55,15 +49,15 @@ This will return a list of dictionaries looking somewhat like this:
] ]
``` ```
You can also add the `languages` param if you want to make sure the transcripts are retrieved in your desired language (it usually defaults to english). You can also add the `languages` param if you want to make sure the transcripts are retrieved in your desired language (it defaults to english).
```python ```python
YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en']) YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
``` ```
It's a list of language codes in a descending priority. In this example it will first try to fetch the german transcript (`'de'`) and then fetch the english transcript (`'en'`) if it fails to do so. As I can't provide a complete list of all working language codes with full certainty, you may have to play around with the language codes a bit, to find the one which is working for you! It's a list of language codes in a descending priority. In this example it will first try to fetch the german transcript (`'de'`) and then fetch the english transcript (`'en'`) if it fails to do so. If you want to find out which languages are available first, [have a look at `list_transcripts()`](#list-available-transcripts)
To get transcripts for a list fo video ids you can call: To get transcripts for a list of video ids you can call:
```python ```python
YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en']) YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
@ -71,51 +65,163 @@ YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
`languages` also is optional here. `languages` also is optional here.
### CLI ### List available transcripts
Execute the CLI script using the video ids as parameters and the results will be printed out to the command line: If you want to list all transcripts which are available for a given video you can call:
```
youtube_transcript_api <first_video_id> <second_video_id> ...
```
The CLI also gives you the option to provide a list of preferred languages:
```
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en
```
If you would prefer to write it into a file or pipe it into another application, you can also output the results as json using the following line:
```
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --json > transcripts.json
```
### Proxy
You can specify a https/http proxy, which will be used during the requests to YouTube:
```python ```python
from youtube_transcript_api import YouTubeTranscriptApi transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
YouTubeTranscriptApi.get_transcript(video_id, proxies={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})
``` ```
As the `proxies` dict is passed on to the `requests.get(...)` call, it follows the [format used by the requests library](http://docs.python-requests.org/en/master/user/advanced/#proxies). This will return a `TranscriptList` object which is iterable and provides methods to filter the list of transcripts for specific languages and types, like:
Using the CLI: ```python
transcript = transcript_list.find_transcript(['de', 'en'])
```
youtube_transcript_api <first_video_id> <second_video_id> --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port
``` ```
By default this module always picks manually created transcripts over automatically created ones, if a transcript in the requested language is available both manually created and generated. The `TranscriptList` allows you to bypass this default behaviour by searching for specific transcript types:
## Warning ```python
# filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work. I will however do my best to make things working again as soon as possible if that happens. So if it stops working, let me know! # or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])
```
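The selection rule described above can be sketched in a few lines. This is only an illustration, assuming the language priority is walked first and the manual-over-generated preference is applied within each language; `pick_transcript` and the plain dicts are hypothetical stand-ins, not part of the module:

```python
def pick_transcript(language_codes, manually_created, generated):
    """Return the first match, walking the language codes in priority order.

    Hypothetical helper: both arguments after `language_codes` are dicts
    mapping a language code to some transcript object.
    """
    for code in language_codes:
        # within one language, prefer the manually created transcript
        for transcripts in (manually_created, generated):
            if code in transcripts:
                return transcripts[code]
    raise LookupError("no transcript found for: " + ", ".join(language_codes))


manual = {"en": "manual-en"}
auto = {"de": "generated-de", "en": "generated-en"}

# 'de' has the higher priority, so the generated german transcript wins
first_choice = pick_transcript(["de", "en"], manual, auto)
# for 'en' both exist, so the manually created one is picked
english_only = pick_transcript(["en"], manual, auto)
```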
## Donation The methods `find_transcript`, `find_manually_created_transcript`, `find_generated_transcript` return `Transcript` objects. They contain metadata regarding the transcript:
If this project makes you happy by reducing your development time, you can make me happy by treating me to a cup of coffee :) ```python
print(
transcript.video_id,
transcript.language,
transcript.language_code,
# whether it has been manually created or generated by YouTube
transcript.is_generated,
# whether this transcript can be translated or not
transcript.is_translatable,
# a list of languages the transcript can be translated to
transcript.translation_languages,
)
```
and provide the `fetch()` method, which allows you to fetch the actual transcript data:
```python
transcript.fetch()
```
### Translate transcript
YouTube has a feature which allows you to automatically translate subtitles. This module also makes it possible to access this feature. To do so `Transcript` objects provide a `translate()` method, which returns a new translated `Transcript` object:
```python
transcript = transcript_list.find_transcript(['en'])
translated_transcript = transcript.translate('de')
print(translated_transcript.fetch())
```
### By example
```python
# retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')
# iterate over all available transcripts
for transcript in transcript_list:
# the Transcript object provides metadata properties
print(
transcript.video_id,
transcript.language,
transcript.language_code,
# whether it has been manually created or generated by YouTube
transcript.is_generated,
# whether this transcript can be translated or not
transcript.is_translatable,
# a list of languages the transcript can be translated to
transcript.translation_languages,
)
# fetch the actual transcript data
print(transcript.fetch())
# translating the transcript will return another transcript object
print(transcript.translate('en').fetch())
# you can also directly filter for the language you are looking for, using the transcript list
transcript = transcript_list.find_transcript(['de', 'en'])
# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])
```
## CLI
Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:
```
youtube_transcript_api <first_video_id> <second_video_id> ...
```
The CLI also gives you the option to provide a list of preferred languages:
```
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en
```
You can also specify if you want to exclude automatically generated or manually created subtitles:
```
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-generated
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-manually-created
```
If you would prefer to write it into a file or pipe it into another application, you can also output the results as json using the following line:
```
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --json > transcripts.json
```
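If you want to post-process the JSON output, something along these lines may help. It is a sketch which assumes the output is a JSON list with one transcript per video, each being a list of `text`/`start`/`duration` dictionaries; the sample string is made up and stands in for the content of `transcripts.json`:

```python
import json

# made-up sample, standing in for the content of transcripts.json
raw = (
    '[[{"text": "Hey there", "start": 7.58, "duration": 6.13}, '
    '{"text": "how are you", "start": 14.08, "duration": 7.58}]]'
)

transcripts = json.loads(raw)

# join the snippets of every transcript into one plain text string per video
plain_texts = [
    " ".join(snippet["text"] for snippet in transcript)
    for transcript in transcripts
]
```

Keep in mind that the CLI prints error messages for videos it could not retrieve in front of the JSON, so the output is only guaranteed to be parseable if every video succeeded.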
Translating transcripts using the CLI is also possible:
```
youtube_transcript_api <first_video_id> <second_video_id> ... --languages en --translate de
```
If you are not sure which languages are available for a given video, you can list all available transcripts:
```
youtube_transcript_api --list-transcripts <first_video_id>
```
## Proxy
You can specify a https/http proxy, which will be used during the requests to YouTube:
```python
from youtube_transcript_api import YouTubeTranscriptApi
YouTubeTranscriptApi.get_transcript(video_id, proxies={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})
```
As the `proxies` dict is passed on to the `requests.get(...)` call, it follows the [format used by the requests library](http://docs.python-requests.org/en/master/user/advanced/#proxies).
Using the CLI:
```
youtube_transcript_api <first_video_id> <second_video_id> --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port
```
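The CLI turns its two proxy options into the `proxies` dict roughly as sketched below. `build_proxies` is a hypothetical helper that mirrors the check in the CLI code of this commit; it is not part of the module:

```python
def build_proxies(http_proxy="", https_proxy=""):
    """Return a requests-style proxies dict, or None if no proxy was given.

    Hypothetical helper: empty strings stand for "option not set", which is
    the default both CLI options use.
    """
    if http_proxy == "" and https_proxy == "":
        return None
    # both keys are always set, even if only one proxy was provided
    return {"http": http_proxy, "https": https_proxy}


no_proxy = build_proxies()
only_http = build_proxies(http_proxy="http://user:pass@domain:8080")
```

The result can be passed on as `proxies=` to `get_transcript`, `get_transcripts` or `list_transcripts`.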
## Warning
This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work. I will however do my best to make things working again as soon as possible if that happens. So if it stops working, let me know!
## Donation
If this project makes you happy by reducing your development time, you can make me happy by treating me to a cup of coffee :)
[![Donate](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) [![Donate](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)

@ -24,7 +24,7 @@ def get_test_suite():
setuptools.setup( setuptools.setup(
name="youtube_transcript_api", name="youtube_transcript_api",
version="0.1.9", version="0.2.1",
author="Jonas Depoix", author="Jonas Depoix",
author_email="jonas.depoix@web.de", author_email="jonas.depoix@web.de",
description="This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require a headless browser, like other selenium based solutions do!", description="This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require a headless browser, like other selenium based solutions do!",

@ -1 +1,11 @@
from ._api import YouTubeTranscriptApi from ._api import YouTubeTranscriptApi
from ._transcripts import TranscriptList, Transcript
from ._errors import (
TranscriptsDisabled,
NoTranscriptFound,
CouldNotRetrieveTranscript,
VideoUnavailable,
NotTranslatable,
TranslationLanguageNotAvailable,
NoTranscriptAvailable,
)

@ -1,56 +1,71 @@
import sys
# This can only be tested by using different python versions, therefore it is not covered by coverage.py
if sys.version_info.major == 2: # pragma: no cover
reload(sys)
sys.setdefaultencoding('utf-8')
from xml.etree import ElementTree
import re
import requests import requests
from ._html_unescaping import unescape from ._transcripts import TranscriptListFetcher
class YouTubeTranscriptApi(): class YouTubeTranscriptApi():
class CouldNotRetrieveTranscript(Exception): @classmethod
def list_transcripts(cls, video_id, proxies=None):
""" """
Raised if a transcript could not be retrieved. Retrieves the list of transcripts which are available for a given video. It returns a `TranscriptList` object
which is iterable and provides methods to filter the list of transcripts for specific languages. While iterating
over the `TranscriptList` the individual transcripts are represented by `Transcript` objects, which provide
metadata and can either be fetched by calling `transcript.fetch()` or translated by calling
`transcript.translate('en')`. Example::
# retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')
# iterate over all available transcripts
for transcript in transcript_list:
# the Transcript object provides metadata properties
print(
transcript.video_id,
transcript.language,
transcript.language_code,
# whether it has been manually created or generated by YouTube
transcript.is_generated,
# a list of languages the transcript can be translated to
transcript.translation_languages,
)
# fetch the actual transcript data
print(transcript.fetch())
# translating the transcript will return another transcript object
print(transcript.translate('en').fetch())
# you can also directly filter for the language you are looking for, using the transcript list
transcript = transcript_list.find_transcript(['de', 'en'])
# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])
:param video_id: the youtube video id
:type video_id: str
:param proxies: a dictionary mapping of http and https proxies to be used for the network requests
:type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
:return: the list of available transcripts
:rtype TranscriptList:
""" """
with requests.Session() as http_client:
ERROR_MESSAGE = ( http_client.proxies = proxies if proxies else {}
'Could not get the transcript for the video {video_url}! ' return TranscriptListFetcher(http_client).fetch(video_id)
'This usually happens if one of the following things is the case:\n'
' - subtitles have been disabled by the uploader\n'
' - none of the language codes you provided are valid\n'
' - none of the languages you provided are supported by the video\n'
' - the video is no longer available.\n\n'
'If none of these things is the case, please create an issue at '
'https://github.com/jdepoix/youtube-transcript-api/issues.'
'Please add which version of youtube_transcript_api you are using and make sure that there '
'are no open issues which already describe your problem!'
)
def __init__(self, video_id):
super(YouTubeTranscriptApi.CouldNotRetrieveTranscript, self).__init__(
self.ERROR_MESSAGE.format(video_url=_TranscriptFetcher.WATCH_URL.format(video_id=video_id))
)
self.video_id = video_id
@classmethod @classmethod
def get_transcripts(cls, video_ids, languages=None, continue_after_error=False, proxies=None): def get_transcripts(cls, video_ids, languages=('en',), continue_after_error=False, proxies=None):
""" """
Retrieves the transcripts for a list of videos. Retrieves the transcripts for a list of videos.
:param video_ids: a list of youtube video ids :param video_ids: a list of youtube video ids
:type video_ids: [str] :type video_ids: list[str]
:param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en']
it will first try to fetch the german transcript (de) and then fetch the english transcipt (en) if it fails to it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to
do so. As I can't provide a complete list of all working language codes with full certainty, you may have to do so.
play around with the language codes a bit, to find the one which is working for you! :type languages: list[str]
:type languages: [str]
:param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving :param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving
one of the video transcripts one of the video transcripts
:type continue_after_error: bool :type continue_after_error: bool
@ -58,7 +73,7 @@ class YouTubeTranscriptApi():
:type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
:return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of :return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of
video ids, which could not be retrieved video ids, which could not be retrieved
:rtype: ({str: [{'text': str, 'start': float, 'end': float}]}, [str]} :rtype ({str: [{'text': str, 'start': float, 'end': float}]}, [str]}):
""" """
data = {} data = {}
unretrievable_videos = [] unretrievable_videos = []
@ -75,90 +90,21 @@ class YouTubeTranscriptApi():
return data, unretrievable_videos return data, unretrievable_videos
@classmethod @classmethod
def get_transcript(cls, video_id, languages=None, proxies=None): def get_transcript(cls, video_id, languages=('en',), proxies=None):
""" """
Retrieves the transcript for a single video. Retrieves the transcript for a single video. This is just a shortcut for calling::
YouTubeTranscriptApi.list_transcripts(video_id, proxies).find_transcript(languages).fetch()
:param video_id: the youtube video id :param video_id: the youtube video id
:type video_id: str :type video_id: str
:param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en']
it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to
do so. As I can't provide a complete list of all working language codes with full certainty, you may have to do so.
play around with the language codes a bit, to find the one which is working for you! :type languages: list[str]
:type languages: [str]
:param proxies: a dictionary mapping of http and https proxies to be used for the network requests :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
:type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
:return: a list of dictionaries containing the 'text', 'start' and 'duration' keys :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
:rtype: [{'text': str, 'start': float, 'end': float}] :rtype [{'text': str, 'start': float, 'end': float}]:
""" """
try: return cls.list_transcripts(video_id, proxies).find_transcript(languages).fetch()
return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
except Exception:
raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
class _TranscriptFetcher():
WATCH_URL = 'https://www.youtube.com/watch?v={video_id}'
API_BASE_URL = 'https://www.youtube.com/api/{api_url}'
LANGUAGE_REGEX = re.compile(r'(&lang=.*&)|(&lang=.*)')
TIMEDTEXT_STRING = 'timedtext?v='
def __init__(self, video_id, languages, proxies):
self.video_id = video_id
self.languages = languages
self.proxies = proxies
def fetch(self):
if self.proxies:
fetched_site = requests.get(self.WATCH_URL.format(video_id=self.video_id), proxies=self.proxies).text
else:
fetched_site = requests.get(self.WATCH_URL.format(video_id=self.video_id)).text
timedtext_splits = fetched_site.split(self.TIMEDTEXT_STRING)
timedtext_url_start = (
timedtext_splits[2].find(self.TIMEDTEXT_STRING)
+ len(timedtext_splits[0])
+ len(timedtext_splits[1])
+ len(self.TIMEDTEXT_STRING) + 1
)
for language in (self.languages if self.languages else [None,]):
response = self._execute_api_request(fetched_site, timedtext_url_start, language)
if response:
return response
return None
def _execute_api_request(self, fetched_site, timedtext_url_start, language):
url = self.API_BASE_URL.format(
api_url=fetched_site[
timedtext_url_start:timedtext_url_start + fetched_site[timedtext_url_start:].find('"')
].replace(
'\\u0026', '&'
).replace(
'\\', ''
)
)
if language:
url = re.sub(self.LANGUAGE_REGEX, '&lang={language}&'.format(language=language), url)
if self.proxies:
return requests.get(url, proxies=self.proxies).text
else:
return requests.get(url).text
class _TranscriptParser():
HTML_TAG_REGEX = re.compile(r'<[^>]*>', re.IGNORECASE)
def __init__(self, plain_data):
self.plain_data = plain_data
def parse(self):
return [
{
'text': re.sub(self.HTML_TAG_REGEX, '', unescape(xml_element.text)),
'start': float(xml_element.attrib['start']),
'duration': float(xml_element.attrib['dur']),
}
for xml_element in ElementTree.fromstring(self.plain_data)
if xml_element.text is not None
]
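
For reference, the parsing step removed from this file can be reproduced standalone as sketched below. The sketch swaps the package's own `unescape` helper for `html.unescape` from the Python 3 standard library, and the sample XML is an invented payload, so it shows the idea and is not the exact code path:

```python
import re
from html import unescape
from xml.etree import ElementTree

HTML_TAG_REGEX = re.compile(r"<[^>]*>")


def parse_transcript(xml_text):
    """Turn timed-text XML into a list of text/start/duration dicts."""
    return [
        {
            # unescape HTML entities first, then strip any markup tags
            "text": HTML_TAG_REGEX.sub("", unescape(element.text)),
            "start": float(element.attrib["start"]),
            "duration": float(element.attrib["dur"]),
        }
        for element in ElementTree.fromstring(xml_text)
        if element.text is not None  # empty elements are skipped
    ]


# invented sample payload, only used for illustration
sample = (
    "<transcript>"
    '<text start="0.5" dur="1.5">Hello &amp;amp; welcome</text>'
    '<text start="2.0" dur="1.0"></text>'
    "</transcript>"
)
parsed = parse_transcript(sample)
```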

@ -14,22 +14,45 @@ class YouTubeTranscriptCli():
def run(self): def run(self):
parsed_args = self._parse_args() parsed_args = self._parse_args()
if parsed_args.exclude_manually_created and parsed_args.exclude_generated:
return ''
proxies = None proxies = None
if parsed_args.http_proxy != '' or parsed_args.https_proxy != '': if parsed_args.http_proxy != '' or parsed_args.https_proxy != '':
proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy} proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy}
transcripts, unretrievable_videos = YouTubeTranscriptApi.get_transcripts( transcripts = []
parsed_args.video_ids, exceptions = []
languages=parsed_args.languages,
continue_after_error=True, for video_id in parsed_args.video_ids:
proxies=proxies try:
) transcripts.append(self._fetch_transcript(parsed_args, proxies, video_id))
except Exception as exception:
exceptions.append(exception)
return '\n\n'.join( return '\n\n'.join(
[str(YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)) for video_id in unretrievable_videos] [str(exception) for exception in exceptions]
+ ([json.dumps(transcripts) if parsed_args.json else pprint.pformat(transcripts)] if transcripts else []) + ([json.dumps(transcripts) if parsed_args.json else pprint.pformat(transcripts)] if transcripts else [])
) )
def _fetch_transcript(self, parsed_args, proxies, video_id):
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)
if parsed_args.list_transcripts:
return str(transcript_list)
if parsed_args.exclude_manually_created:
transcript = transcript_list.find_generated_transcript(parsed_args.languages)
elif parsed_args.exclude_generated:
transcript = transcript_list.find_manually_created_transcript(parsed_args.languages)
else:
transcript = transcript_list.find_transcript(parsed_args.languages)
if parsed_args.translate:
transcript = transcript.translate(parsed_args.translate)
return transcript.fetch()
def _parse_args(self): def _parse_args(self):
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description=( description=(
@ -38,19 +61,40 @@ class YouTubeTranscriptCli():
'other selenium based solutions do!' 'other selenium based solutions do!'
) )
) )
parser.add_argument(
'--list-transcripts',
action='store_const',
const=True,
default=False,
help='This will list the languages in which the given videos are available.',
)
parser.add_argument('video_ids', nargs='+', type=str, help='List of YouTube video IDs.') parser.add_argument('video_ids', nargs='+', type=str, help='List of YouTube video IDs.')
parser.add_argument( parser.add_argument(
'--languages', '--languages',
nargs='*', nargs='*',
default=[], default=['en',],
type=str, type=str,
help=( help=(
'A list of language codes in a descending priority. For example, if this is set to "de en" it will ' 'A list of language codes in a descending priority. For example, if this is set to "de en" it will '
'first try to fetch the german transcript (de) and then fetch the english transcipt (en) if it fails ' 'first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails '
'to do so. As I can\'t provide a complete list of all working language codes with full certainty, you ' 'to do so. As I can\'t provide a complete list of all working language codes with full certainty, you '
'may have to play around with the language codes a bit, to find the one which is working for you!' 'may have to play around with the language codes a bit, to find the one which is working for you!'
), ),
) )
parser.add_argument(
'--exclude-generated',
action='store_const',
const=True,
default=False,
help='If this flag is set transcripts which have been generated by YouTube will not be retrieved.',
)
parser.add_argument(
'--exclude-manually-created',
action='store_const',
const=True,
default=False,
help='If this flag is set transcripts which have been manually created will not be retrieved.',
)
parser.add_argument( parser.add_argument(
'--json', '--json',
action='store_const', action='store_const',
@ -59,13 +103,24 @@ class YouTubeTranscriptCli():
help='If this flag is set the output will be JSON formatted.', help='If this flag is set the output will be JSON formatted.',
) )
parser.add_argument( parser.add_argument(
'--http-proxy', dest='http_proxy', '--translate',
default='', metavar='URL', default='',
help=(
'The language code for the language you want this transcript to be translated to. Use the '
'--list-transcripts feature to find out which languages are translatable and which translation '
'languages are available.'
)
)
parser.add_argument(
'--http-proxy',
default='',
metavar='URL',
help='Use the specified HTTP proxy.' help='Use the specified HTTP proxy.'
) )
parser.add_argument( parser.add_argument(
'--https-proxy', dest='https_proxy', '--https-proxy',
default='', metavar='URL', default='',
metavar='URL',
help='Use the specified HTTPS proxy.' help='Use the specified HTTPS proxy.'
) )
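
The new options can be tried out in isolation with a reduced parser like the one below. It is a sketch and not the CLI itself: help texts and the proxy and JSON options are left out, and `action='store_true'` is used as shorthand for the `store_const`/`const=True`/`default=False` combination in the diff:

```python
import argparse


def build_parser():
    """Reduced, illustrative version of the parser set up in _parse_args()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--list-transcripts", action="store_true")
    parser.add_argument("video_ids", nargs="+", type=str)
    parser.add_argument("--languages", nargs="*", default=["en"], type=str)
    parser.add_argument("--exclude-generated", action="store_true")
    parser.add_argument("--exclude-manually-created", action="store_true")
    parser.add_argument("--translate", default="")
    return parser


args = build_parser().parse_args(
    ["abc123", "--languages", "de", "en", "--translate", "fr"]
)
defaults = build_parser().parse_args(["abc123"])
```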

@ -0,0 +1,74 @@
from ._settings import WATCH_URL
class CouldNotRetrieveTranscript(Exception):
"""
Raised if a transcript could not be retrieved.
"""
ERROR_MESSAGE = '\nCould not retrieve a transcript for the video {video_url}!'
CAUSE_MESSAGE_INTRO = ' This is most likely caused by:\n\n{cause}'
CAUSE_MESSAGE = ''
GITHUB_REFERRAL = (
'\n\nIf you are sure that the described cause is not responsible for this error '
'and that a transcript should be retrievable, please create an issue at '
'https://github.com/jdepoix/youtube-transcript-api/issues. '
'Please add which version of youtube_transcript_api you are using '
'and provide the information needed to replicate the error. '
'Also make sure that there are no open issues which already describe your problem!'
)
def __init__(self, video_id):
self.video_id = video_id
super(CouldNotRetrieveTranscript, self).__init__(self._build_error_message())
def _build_error_message(self):
cause = self.cause
error_message = self.ERROR_MESSAGE.format(video_url=WATCH_URL.format(video_id=self.video_id))
if cause:
error_message += self.CAUSE_MESSAGE_INTRO.format(cause=cause) + self.GITHUB_REFERRAL
return error_message
@property
def cause(self):
return self.CAUSE_MESSAGE
class VideoUnavailable(CouldNotRetrieveTranscript):
CAUSE_MESSAGE = 'The video is no longer available'
class TranscriptsDisabled(CouldNotRetrieveTranscript):
CAUSE_MESSAGE = 'Subtitles are disabled for this video'
class NoTranscriptAvailable(CouldNotRetrieveTranscript):
CAUSE_MESSAGE = 'No transcripts are available for this video'
class NotTranslatable(CouldNotRetrieveTranscript):
CAUSE_MESSAGE = 'The requested language is not translatable'
class TranslationLanguageNotAvailable(CouldNotRetrieveTranscript):
CAUSE_MESSAGE = 'The requested translation language is not available'
class NoTranscriptFound(CouldNotRetrieveTranscript):
CAUSE_MESSAGE = (
'No transcripts were found for any of the requested language codes: {requested_language_codes}\n\n'
'{transcript_data}'
)
def __init__(self, video_id, requested_language_codes, transcript_data):
self._requested_language_codes = requested_language_codes
self._transcript_data = transcript_data
super(NoTranscriptFound, self).__init__(video_id)
@property
def cause(self):
return self.CAUSE_MESSAGE.format(
requested_language_codes=self._requested_language_codes,
transcript_data=str(self._transcript_data),
)
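
The pattern used by these exception classes (a base class builds the message, subclasses only override the cause) can be condensed as follows. The class names here are shortened, hypothetical stand-ins, and the GitHub referral text is left out:

```python
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


class CouldNotRetrieve(Exception):
    ERROR_MESSAGE = "Could not retrieve a transcript for the video {video_url}!"
    CAUSE_MESSAGE = ""

    def __init__(self, video_id):
        self.video_id = video_id
        super().__init__(self._build_error_message())

    def _build_error_message(self):
        message = self.ERROR_MESSAGE.format(
            video_url=WATCH_URL.format(video_id=self.video_id)
        )
        if self.cause:
            message += " This is most likely caused by: " + self.cause
        return message

    @property
    def cause(self):
        return self.CAUSE_MESSAGE


class Disabled(CouldNotRetrieve):
    # a static cause only needs to override the class attribute
    CAUSE_MESSAGE = "Subtitles are disabled for this video"


class NotFound(CouldNotRetrieve):
    CAUSE_MESSAGE = "No transcripts were found for: {codes}"

    def __init__(self, video_id, codes):
        # has to be set before the base constructor builds the message
        self._codes = codes
        super().__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(codes=self._codes)


try:
    raise NotFound("abc123", ["de", "en"])
except CouldNotRetrieve as error:  # catching the base class covers all causes
    message = str(error)
```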

@ -0,0 +1 @@
WATCH_URL = 'https://www.youtube.com/watch?v={video_id}'

@ -0,0 +1,302 @@
import sys
# This can only be tested by using different python versions, therefore it is not covered by coverage.py
if sys.version_info.major == 2: # pragma: no cover
reload(sys)
sys.setdefaultencoding('utf-8')
import json
from xml.etree import ElementTree
import re
from ._html_unescaping import unescape
from ._errors import (
VideoUnavailable,
NoTranscriptFound,
TranscriptsDisabled,
NotTranslatable,
TranslationLanguageNotAvailable,
NoTranscriptAvailable,
)
from ._settings import WATCH_URL


class TranscriptListFetcher():
    def __init__(self, http_client):
        self._http_client = http_client

    def fetch(self, video_id):
        return TranscriptList.build(
            self._http_client,
            video_id,
            self._extract_captions_json(self._fetch_html(video_id), video_id)
        )

    def _extract_captions_json(self, html, video_id):
        splitted_html = html.split('"captions":')

        if len(splitted_html) <= 1:
            if '"playabilityStatus":' not in html:
                raise VideoUnavailable(video_id)

            raise TranscriptsDisabled(video_id)

        captions_json = json.loads(
            splitted_html[1].split(',"videoDetails')[0].replace('\n', '')
        )['playerCaptionsTracklistRenderer']

        if 'captionTracks' not in captions_json:
            raise NoTranscriptAvailable(video_id)

        return captions_json

    def _fetch_html(self, video_id):
        return self._http_client.get(WATCH_URL.format(video_id=video_id)).text.replace(
            '\\u0026', '&'
        ).replace(
            '\\', ''
        )
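
Reviewer note: `_extract_captions_json` cuts the captions object out of the watch page by splitting on two string markers rather than parsing the full page. A minimal sketch of that idea follows; the `sample_html` string is invented for illustration and is far simpler than a real watch page, and `extract_captions` is a hypothetical helper that returns `None` where the code above raises.

```python
import json


def extract_captions(html):
    """Return the captions renderer dict, or None if the marker is missing."""
    parts = html.split('"captions":')
    if len(parts) <= 1:
        return None
    # Everything between '"captions":' and ',"videoDetails' is assumed to be
    # one self-contained JSON object.
    raw = parts[1].split(',"videoDetails')[0].replace('\n', '')
    return json.loads(raw)['playerCaptionsTracklistRenderer']


# Invented sample data, shaped like the fragment the splitting logic expects.
sample_html = (
    '{"playabilityStatus":{"status":"OK"},'
    '"captions":{"playerCaptionsTracklistRenderer":{"captionTracks":'
    '[{"languageCode":"en","baseUrl":"https://example.invalid/tt"}]}}'
    ',"videoDetails":{"videoId":"abc"}}'
)

captions = extract_captions(sample_html)
print(captions['captionTracks'][0]['languageCode'])
```

The approach depends on `"videoDetails"` directly following the captions object in the page, so it is sensitive to changes in the page layout.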


class TranscriptList():
    """
    This object represents a list of transcripts. It can be iterated over to list all transcripts which are available
    for a given YouTube video. It also provides functionality to search for a transcript in a given language.
    """

    def __init__(self, video_id, manually_created_transcripts, generated_transcripts, translation_languages):
        """
        The constructor is only for internal use. Use the static build method instead.

        :param video_id: the id of the video this TranscriptList is for
        :type video_id: str
        :param manually_created_transcripts: dict mapping language codes to the manually created transcripts
        :type manually_created_transcripts: dict[str, Transcript]
        :param generated_transcripts: dict mapping language codes to the generated transcripts
        :type generated_transcripts: dict[str, Transcript]
        :param translation_languages: list of languages which translatable transcripts can be translated to
        :type translation_languages: list[dict[str, str]]
        """
        self.video_id = video_id
        self._manually_created_transcripts = manually_created_transcripts
        self._generated_transcripts = generated_transcripts
        self._translation_languages = translation_languages

    @staticmethod
    def build(http_client, video_id, captions_json):
        """
        Factory method for TranscriptList.

        :param http_client: http client which is used to make the transcript retrieving http calls
        :type http_client: requests.Session
        :param video_id: the id of the video this TranscriptList is for
        :type video_id: str
        :param captions_json: the JSON parsed from the YouTube page's static HTML
        :type captions_json: dict
        :return: the created TranscriptList
        :rtype: TranscriptList
        """
        translation_languages = [
            {
                'language': translation_language['languageName']['simpleText'],
                'language_code': translation_language['languageCode'],
            } for translation_language in captions_json['translationLanguages']
        ]

        manually_created_transcripts = {}
        generated_transcripts = {}

        for caption in captions_json['captionTracks']:
            if caption.get('kind', '') == 'asr':
                transcript_dict = generated_transcripts
            else:
                transcript_dict = manually_created_transcripts

            transcript_dict[caption['languageCode']] = Transcript(
                http_client,
                video_id,
                caption['baseUrl'],
                caption['name']['simpleText'],
                caption['languageCode'],
                caption.get('kind', '') == 'asr',
                translation_languages if caption.get('isTranslatable', False) else []
            )

        return TranscriptList(
            video_id,
            manually_created_transcripts,
            generated_transcripts,
            translation_languages,
        )

    def __iter__(self):
        return iter(list(self._manually_created_transcripts.values()) + list(self._generated_transcripts.values()))

    def find_transcript(self, language_codes):
        """
        Finds a transcript for a given language code. For each language code, a manually created transcript is
        preferred and a generated transcript is only used if no manually created one exists. If you only want
        generated transcripts use find_generated_transcript instead.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype: Transcript
        :raises: NoTranscriptFound
        """
        return self._find_transcript(language_codes, [self._manually_created_transcripts, self._generated_transcripts])

    def find_generated_transcript(self, language_codes):
        """
        Finds an automatically generated transcript for a given language code.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype: Transcript
        :raises: NoTranscriptFound
        """
        return self._find_transcript(language_codes, [self._generated_transcripts,])

    def find_manually_created_transcript(self, language_codes):
        """
        Finds a manually created transcript for a given language code.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype: Transcript
        :raises: NoTranscriptFound
        """
        return self._find_transcript(language_codes, [self._manually_created_transcripts,])

    def _find_transcript(self, language_codes, transcript_dicts):
        for language_code in language_codes:
            for transcript_dict in transcript_dicts:
                if language_code in transcript_dict:
                    return transcript_dict[language_code]

        raise NoTranscriptFound(
            self.video_id,
            language_codes,
            self
        )

    def __str__(self):
        return (
            'For this video ({video_id}) transcripts are available in the following languages:\n\n'
            '(MANUALLY CREATED)\n'
            '{available_manually_created_transcript_languages}\n\n'
            '(GENERATED)\n'
            '{available_generated_transcripts}\n\n'
            '(TRANSLATION LANGUAGES)\n'
            '{available_translation_languages}'
        ).format(
            video_id=self.video_id,
            available_manually_created_transcript_languages=self._get_language_description(
                str(transcript) for transcript in self._manually_created_transcripts.values()
            ),
            available_generated_transcripts=self._get_language_description(
                str(transcript) for transcript in self._generated_transcripts.values()
            ),
            available_translation_languages=self._get_language_description(
                '{language_code} ("{language}")'.format(
                    language=translation_language['language'],
                    language_code=translation_language['language_code'],
                ) for translation_language in self._translation_languages
            )
        )

    def _get_language_description(self, transcript_strings):
        description = '\n'.join(' - {transcript}'.format(transcript=transcript) for transcript in transcript_strings)
        return description if description else 'None'
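
Reviewer note: in `_find_transcript` the outer loop runs over language codes and the inner loop over the transcript dicts, so language priority outranks the manual-versus-generated preference. The sketch below isolates that lookup; `find_by_priority`, the sample dicts and the use of `LookupError` are illustrative assumptions rather than part of the diff.

```python
def find_by_priority(language_codes, transcript_dicts):
    """Return the first match, trying every dict for a code before moving on."""
    for language_code in language_codes:
        for transcript_dict in transcript_dicts:
            if language_code in transcript_dict:
                return transcript_dict[language_code]
    # The real code raises NoTranscriptFound; LookupError keeps this sketch standalone.
    raise LookupError('no transcript for {}'.format(language_codes))


# Illustrative data: strings stand in for Transcript objects.
manual = {'en': 'manual-en'}
generated = {'de': 'generated-de', 'en': 'generated-en'}

# 'de' has the highest priority, so the generated German transcript is chosen
# even though a manually created English one exists.
print(find_by_priority(['de', 'en'], [manual, generated]))

# Within one language code, the manually created transcript comes first.
print(find_by_priority(['en'], [manual, generated]))
```

Callers who prefer any manually created transcript over a generated one would need to call `find_manually_created_transcript` first and fall back to `find_generated_transcript` themselves.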


class Transcript():
    def __init__(self, http_client, video_id, url, language, language_code, is_generated, translation_languages):
        """
        You probably don't want to initialize this directly. Usually you'll access Transcript objects using a
        TranscriptList.

        :param http_client: http client which is used to make the transcript retrieving http calls
        :type http_client: requests.Session
        :param video_id: the id of the video this Transcript is for
        :type video_id: str
        :param url: the url which needs to be called to fetch the transcript
        :param language: the name of the language this transcript uses
        :param language_code: the language code of this transcript
        :param is_generated: whether this transcript was automatically generated
        :param translation_languages: the languages this transcript can be translated to (empty if not translatable)
        """
        self._http_client = http_client
        self.video_id = video_id
        self._url = url
        self.language = language
        self.language_code = language_code
        self.is_generated = is_generated
        self.translation_languages = translation_languages
        self._translation_languages_dict = {
            translation_language['language_code']: translation_language['language']
            for translation_language in translation_languages
        }

    def fetch(self):
        """
        Loads the actual transcript data.

        :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
        :rtype: [{'text': str, 'start': float, 'duration': float}]
        """
        return _TranscriptParser().parse(
            self._http_client.get(self._url).text
        )

    def __str__(self):
        return '{language_code} ("{language}"){translation_description}'.format(
            language=self.language,
            language_code=self.language_code,
            translation_description='[TRANSLATABLE]' if self.is_translatable else ''
        )

    @property
    def is_translatable(self):
        return len(self.translation_languages) > 0

    def translate(self, language_code):
        if not self.is_translatable:
            raise NotTranslatable(self.video_id)

        if language_code not in self._translation_languages_dict:
            raise TranslationLanguageNotAvailable(self.video_id)

        return Transcript(
            self._http_client,
            self.video_id,
            '{url}&tlang={language_code}'.format(url=self._url, language_code=language_code),
            self._translation_languages_dict[language_code],
            language_code,
            True,
            [],
        )


class _TranscriptParser():
    HTML_TAG_REGEX = re.compile(r'<[^>]*>', re.IGNORECASE)

    def parse(self, plain_data):
        return [
            {
                'text': re.sub(self.HTML_TAG_REGEX, '', unescape(xml_element.text)),
                'start': float(xml_element.attrib['start']),
                'duration': float(xml_element.attrib.get('dur', '0.0')),
            }
            for xml_element in ElementTree.fromstring(plain_data)
            if xml_element.text is not None
        ]
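
Reviewer note: `_TranscriptParser.parse` unescapes each element's text, strips any HTML tags, skips elements without text, and defaults a missing `dur` attribute to `0.0`. The sketch below reproduces those steps on an invented XML sample. It assumes Python 3's `html.unescape` as a stand-in for the project's `_html_unescaping.unescape` wrapper, and the `<transcript>`/`<text>` layout of the sample is an assumption about the response format.

```python
import re
from html import unescape  # stand-in for the project's ._html_unescaping.unescape
from xml.etree import ElementTree

HTML_TAG_REGEX = re.compile(r'<[^>]*>', re.IGNORECASE)


def parse(plain_data):
    """Turn transcript XML into a list of text/start/duration dicts."""
    return [
        {
            'text': HTML_TAG_REGEX.sub('', unescape(element.text)),
            'start': float(element.attrib['start']),
            'duration': float(element.attrib.get('dur', '0.0')),
        }
        for element in ElementTree.fromstring(plain_data)
        if element.text is not None  # empty cues are dropped
    ]


# Invented sample: the first cue carries double-escaped markup, the second has
# no text, and the third has no 'dur' attribute.
sample = (
    '<transcript>'
    '<text start="0.0" dur="1.54">Hey, this is &amp;lt;i&amp;gt;just&amp;lt;/i&amp;gt; a test</text>'
    '<text start="1.54"/>'
    '<text start="5.7">no duration</text>'
    '</transcript>'
)

for cue in parse(sample):
    print(cue)
```

The XML parser resolves the first level of escaping (`&amp;lt;` becomes `&lt;`), `unescape` resolves the second, and only then can the regular expression remove the resulting `<i>` tags.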

File diff suppressed because one or more lines are too long

@@ -5,7 +5,15 @@ import os
 import httpretty
-from youtube_transcript_api._api import YouTubeTranscriptApi
+from youtube_transcript_api import (
+    YouTubeTranscriptApi,
+    TranscriptsDisabled,
+    NoTranscriptFound,
+    VideoUnavailable,
+    NoTranscriptAvailable,
+    NotTranslatable,
+    TranslationLanguageNotAvailable,
+)
 def load_asset(filename):
@@ -42,6 +50,51 @@ class TestYouTubeTranscriptApi(TestCase):
             ]
         )
+    def test_list_transcripts(self):
+        transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8')
+        language_codes = {transcript.language_code for transcript in transcript_list}
+        self.assertEqual(language_codes, {'zh', 'de', 'en', 'hi', 'ja', 'ko', 'es', 'cs', 'en'})
+    def test_list_transcripts__find_manually_created(self):
+        transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8')
+        transcript = transcript_list.find_manually_created_transcript(['cs'])
+        self.assertFalse(transcript.is_generated)
+    def test_list_transcripts__find_generated(self):
+        transcript_list = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8')
+        with self.assertRaises(NoTranscriptFound):
+            transcript_list.find_generated_transcript(['cs'])
+        transcript = transcript_list.find_generated_transcript(['en'])
+        self.assertTrue(transcript.is_generated)
+    def test_translate_transcript(self):
+        transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en'])
+        translated_transcript = transcript.translate('af')
+        self.assertEqual(translated_transcript.language_code, 'af')
+        self.assertIn('&tlang=af', translated_transcript._url)
+    def test_translate_transcript__translation_language_not_available(self):
+        transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en'])
+        with self.assertRaises(TranslationLanguageNotAvailable):
+            transcript.translate('xyz')
+    def test_translate_transcript__not_translatable(self):
+        transcript = YouTubeTranscriptApi.list_transcripts('GJLlxj_dtq8').find_transcript(['en'])
+        transcript.translation_languages = []
+        with self.assertRaises(NotTranslatable):
+            transcript.translate('af')
     def test_get_transcript__correct_language_is_used(self):
         YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', ['de', 'en'])
         query_string = httpretty.last_request().querystring
@@ -53,26 +106,50 @@ class TestYouTubeTranscriptApi(TestCase):
     def test_get_transcript__fallback_language_is_used(self):
         httpretty.register_uri(
             httpretty.GET,
-            'https://www.youtube.com/api/timedtext',
-            body=''
+            'https://www.youtube.com/watch',
+            body=load_asset('youtube_ww1_nl_en.html.static')
         )
-        YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', ['de', 'en'])
+        YouTubeTranscriptApi.get_transcript('F1xioXWb8CY', ['de', 'en'])
         query_string = httpretty.last_request().querystring
         self.assertIn('lang', query_string)
         self.assertEqual(len(query_string['lang']), 1)
         self.assertEqual(query_string['lang'][0], 'en')
-    def test_get_transcript__exception_is_raised_when_not_available(self):
+    def test_get_transcript__exception_if_video_unavailable(self):
         httpretty.register_uri(
             httpretty.GET,
-            'https://www.youtube.com/api/timedtext',
-            body=''
+            'https://www.youtube.com/watch',
+            body=load_asset('youtube_video_unavailable.html.static')
         )
-        with self.assertRaises(YouTubeTranscriptApi.CouldNotRetrieveTranscript):
-            YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8')
+        with self.assertRaises(VideoUnavailable):
+            YouTubeTranscriptApi.get_transcript('abc')
+    def test_get_transcript__exception_if_transcripts_disabled(self):
+        httpretty.register_uri(
+            httpretty.GET,
+            'https://www.youtube.com/watch',
+            body=load_asset('youtube_transcripts_disabled.html.static')
+        )
+        with self.assertRaises(TranscriptsDisabled):
+            YouTubeTranscriptApi.get_transcript('dsMFmonKDD4')
+    def test_get_transcript__exception_if_language_unavailable(self):
+        with self.assertRaises(NoTranscriptFound):
+            YouTubeTranscriptApi.get_transcript('GJLlxj_dtq8', languages=['cz'])
+    def test_get_transcript__exception_if_no_transcript_available(self):
+        httpretty.register_uri(
+            httpretty.GET,
+            'https://www.youtube.com/watch',
+            body=load_asset('youtube_no_transcript_available.html.static')
+        )
+        with self.assertRaises(NoTranscriptAvailable):
+            YouTubeTranscriptApi.get_transcript('MwBPvcYFY2E')
     def test_get_transcripts(self):
         video_id_1 = 'video_id_1'
@@ -99,8 +176,8 @@ class TestYouTubeTranscriptApi(TestCase):
         YouTubeTranscriptApi.get_transcripts(['video_id_1', 'video_id_2'], continue_after_error=True)
-        YouTubeTranscriptApi.get_transcript.assert_any_call(video_id_1, None, None)
-        YouTubeTranscriptApi.get_transcript.assert_any_call(video_id_2, None, None)
+        YouTubeTranscriptApi.get_transcript.assert_any_call(video_id_1, ('en',), None)
+        YouTubeTranscriptApi.get_transcript.assert_any_call(video_id_2, ('en',), None)
     def test_get_transcript__with_proxies(self):
         proxies = {'http': '', 'https:': ''}
@@ -118,4 +195,4 @@ class TestYouTubeTranscriptApi(TestCase):
         )
         YouTubeTranscriptApi.get_transcript = MagicMock()
         YouTubeTranscriptApi.get_transcripts(['GJLlxj_dtq8'], proxies=proxies)
-        YouTubeTranscriptApi.get_transcript.assert_any_call('GJLlxj_dtq8', None, proxies)
+        YouTubeTranscriptApi.get_transcript.assert_any_call('GJLlxj_dtq8', ('en',), proxies)

@@ -3,10 +3,27 @@ from mock import MagicMock
 import json
-from youtube_transcript_api._cli import YouTubeTranscriptCli, YouTubeTranscriptApi
+from youtube_transcript_api import YouTubeTranscriptApi, VideoUnavailable
+from youtube_transcript_api._cli import YouTubeTranscriptCli
 class TestYouTubeTranscriptCli(TestCase):
+    def setUp(self):
+        self.transcript_mock = MagicMock()
+        self.transcript_mock.fetch = MagicMock(return_value=[
+            {'text': 'Hey, this is just a test', 'start': 0.0, 'duration': 1.54},
+            {'text': 'this is not the original transcript', 'start': 1.54, 'duration': 4.16},
+            {'text': 'just something shorter, I made up for testing', 'start': 5.7, 'duration': 3.239}
+        ])
+        self.transcript_mock.translate = MagicMock(return_value=self.transcript_mock)
+        self.transcript_list_mock = MagicMock()
+        self.transcript_list_mock.find_generated_transcript = MagicMock(return_value=self.transcript_mock)
+        self.transcript_list_mock.find_manually_created_transcript = MagicMock(return_value=self.transcript_mock)
+        self.transcript_list_mock.find_transcript = MagicMock(return_value=self.transcript_mock)
+        YouTubeTranscriptApi.list_transcripts = MagicMock(return_value=self.transcript_list_mock)
     def test_argument_parsing(self):
         parsed_args = YouTubeTranscriptCli('v1 v2 --json --languages de en'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
@@ -60,7 +77,7 @@ class TestYouTubeTranscriptCli(TestCase):
         parsed_args = YouTubeTranscriptCli('v1 v2'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
         self.assertEqual(parsed_args.json, False)
-        self.assertEqual(parsed_args.languages, [])
+        self.assertEqual(parsed_args.languages, ['en'])
     def test_argument_parsing__fail_without_video_ids(self):
         with self.assertRaises(SystemExit):
@@ -70,12 +87,12 @@ class TestYouTubeTranscriptCli(TestCase):
         parsed_args = YouTubeTranscriptCli('v1 v2 --json'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
         self.assertEqual(parsed_args.json, True)
-        self.assertEqual(parsed_args.languages, [])
+        self.assertEqual(parsed_args.languages, ['en'])
         parsed_args = YouTubeTranscriptCli('--json v1 v2'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
         self.assertEqual(parsed_args.json, True)
-        self.assertEqual(parsed_args.languages, [])
+        self.assertEqual(parsed_args.languages, ['en'])
     def test_argument_parsing__languages(self):
         parsed_args = YouTubeTranscriptCli('v1 v2 --languages de en'.split())._parse_args()
@@ -106,32 +123,107 @@ class TestYouTubeTranscriptCli(TestCase):
         self.assertEqual(parsed_args.http_proxy, '')
         self.assertEqual(parsed_args.https_proxy, '')
+    def test_argument_parsing__list_transcripts(self):
+        parsed_args = YouTubeTranscriptCli('--list-transcripts v1 v2'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.list_transcripts)
+        parsed_args = YouTubeTranscriptCli('v1 v2 --list-transcripts'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.list_transcripts)
+    def test_argument_parsing__translate(self):
+        parsed_args = YouTubeTranscriptCli('v1 v2 --languages de en --translate cz'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertEqual(parsed_args.json, False)
+        self.assertEqual(parsed_args.languages, ['de', 'en'])
+        self.assertEqual(parsed_args.translate, 'cz')
+        parsed_args = YouTubeTranscriptCli('v1 v2 --translate cz --languages de en'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertEqual(parsed_args.json, False)
+        self.assertEqual(parsed_args.languages, ['de', 'en'])
+        self.assertEqual(parsed_args.translate, 'cz')
+    def test_argument_parsing__manually_or_generated(self):
+        parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-manually-created'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.exclude_manually_created)
+        self.assertFalse(parsed_args.exclude_generated)
+        parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-generated'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertFalse(parsed_args.exclude_manually_created)
+        self.assertTrue(parsed_args.exclude_generated)
+        parsed_args = YouTubeTranscriptCli('v1 v2 --exclude-manually-created --exclude-generated'.split())._parse_args()
+        self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
+        self.assertTrue(parsed_args.exclude_manually_created)
+        self.assertTrue(parsed_args.exclude_generated)
     def test_run(self):
-        YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([], []))
         YouTubeTranscriptCli('v1 v2 --languages de en'.split()).run()
-        YouTubeTranscriptApi.get_transcripts.assert_called_once_with(
-            ['v1', 'v2'],
-            languages=['de', 'en'],
-            continue_after_error=True,
-            proxies=None
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v1', proxies=None)
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v2', proxies=None)
+        self.transcript_list_mock.find_transcript.assert_any_call(['de', 'en'])
+    def test_run__failing_transcripts(self):
+        YouTubeTranscriptApi.list_transcripts = MagicMock(side_effect=VideoUnavailable('video_id'))
+        output = YouTubeTranscriptCli('v1 --languages de en'.split()).run()
+        self.assertEqual(output, str(VideoUnavailable('video_id')))
+    def test_run__exclude_generated(self):
+        YouTubeTranscriptCli('v1 v2 --languages de en --exclude-generated'.split()).run()
+        self.transcript_list_mock.find_manually_created_transcript.assert_any_call(['de', 'en'])
+    def test_run__exclude_manually_created(self):
+        YouTubeTranscriptCli('v1 v2 --languages de en --exclude-manually-created'.split()).run()
+        self.transcript_list_mock.find_generated_transcript.assert_any_call(['de', 'en'])
+    def test_run__exclude_manually_created_and_generated(self):
+        self.assertEqual(
+            YouTubeTranscriptCli('v1 v2 --languages de en --exclude-manually-created --exclude-generated'.split()).run(),
+            ''
         )
+    def test_run__translate(self):
+        YouTubeTranscriptCli('v1 v2 --languages de en --translate cz'.split()).run(),
+        self.transcript_mock.translate.assert_any_call('cz')
+    def test_run__list_transcripts(self):
+        YouTubeTranscriptCli('--list-transcripts v1 v2'.split()).run()
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v1', proxies=None)
+        YouTubeTranscriptApi.list_transcripts.assert_any_call('v2', proxies=None)
     def test_run__json_output(self):
-        YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([{'boolean': True}], []))
         output = YouTubeTranscriptCli('v1 v2 --languages de en --json'.split()).run()
         # will fail if output is not valid json
         json.loads(output)
     def test_run__proxies(self):
-        YouTubeTranscriptApi.get_transcripts = MagicMock(return_value=([], []))
         YouTubeTranscriptCli(
-            'v1 v2 --languages de en --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port'.split()).run()
+            (
+                'v1 v2 --languages de en '
+                '--http-proxy http://user:pass@domain:port '
+                '--https-proxy https://user:pass@domain:port'
+            ).split()
+        ).run()
-        YouTubeTranscriptApi.get_transcripts.assert_called_once_with(
-            ['v1', 'v2'],
-            languages=['de', 'en'],
-            continue_after_error=True,
+        YouTubeTranscriptApi.list_transcripts.assert_any_call(
+            'v1',
+            proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'}
+        )
+        YouTubeTranscriptApi.list_transcripts.assert_any_call(
+            'v2',
             proxies={'http': 'http://user:pass@domain:port', 'https': 'https://user:pass@domain:port'}
         )
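
Reviewer note: the CLI tests replace `list_transcripts` with a mock whose finder methods return a transcript mock, and `translate` returns that same mock so calls can be chained. The sketch below shows the same mocking pattern on its own, using the standard library's `unittest.mock` instead of the `mock` package the tests import. `fetch_translated` is a hypothetical function standing in for the CLI's internal logic, which is not shown in this diff.

```python
from unittest.mock import MagicMock


def fetch_translated(transcript_list, languages, translate_to=None):
    """Hypothetical stand-in for the code under test."""
    transcript = transcript_list.find_transcript(languages)
    if translate_to:
        transcript = transcript.translate(translate_to)
    return transcript.fetch()


# A transcript mock that returns canned cues and translates to itself,
# mirroring the setUp method in the diff.
transcript_mock = MagicMock()
transcript_mock.fetch.return_value = [
    {'text': 'Hey, this is just a test', 'start': 0.0, 'duration': 1.54},
]
transcript_mock.translate.return_value = transcript_mock

transcript_list_mock = MagicMock()
transcript_list_mock.find_transcript.return_value = transcript_mock

cues = fetch_translated(transcript_list_mock, ['de', 'en'], translate_to='cz')

# The assertions check how the collaborators were called, not real network output.
transcript_list_mock.find_transcript.assert_any_call(['de', 'en'])
transcript_mock.translate.assert_any_call('cz')
print(cues)
```

Because `translate` returns the same mock, the test does not need a second object for the translated transcript, at the cost of not distinguishing the original from the translation.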