Merge pull request #43 from jdepoix/bugfix/cli-language-default

fixed bug in cli where no transcript could be retrieved if no language was specified
jdepoix 2019-12-31 00:12:39 +01:00 committed by GitHub
commit bfecd64b85
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 97 additions and 102 deletions
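
The behaviour this commit changes can be reproduced with a small stand-alone `argparse` sketch. It is not the project's actual CLI code: the `build_parser` helper is hypothetical, and `nargs='+'` for `video_ids` is an assumption. Only the `--languages` argument mirrors the diff below.

```python
import argparse


def build_parser(default_languages):
    """Build a parser resembling the CLI's (hypothetical helper, not the project's code)."""
    parser = argparse.ArgumentParser()
    # assumption: video ids are one or more positional arguments
    parser.add_argument('video_ids', nargs='+', type=str)
    # mirrors the argument touched by this commit; only the default differs
    parser.add_argument('--languages', nargs='*', default=default_languages, type=str)
    return parser


# before the fix: omitting --languages yields an empty list
before = build_parser([]).parse_args(['v1', 'v2'])
# after the fix: omitting --languages falls back to English
after = build_parser(['en']).parse_args(['v1', 'v2'])
# an explicit value still overrides the default
explicit = build_parser(['en']).parse_args(['v1', 'v2', '--languages', 'de', 'en'])

print(before.languages)    # []
print(after.languages)     # ['en']
print(explicit.languages)  # ['de', 'en']
```

With the old default, an empty language list was presumably passed on to the transcript lookup, which is consistent with the bug described in the commit message.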

191
README.md

@@ -1,74 +1,69 @@
 # YouTube Transcript/Subtitle API (including automatically generated subtitles and subtitle translations)
-[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)
-[![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api)
-[![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master)
-[![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT)
-[![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
-[![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
+[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) [![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api) [![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master) [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT) [![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/) [![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
 This is a Python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, as other Selenium-based solutions do!
 ## Install
 It is recommended to [install this module by using pip](https://pypi.org/project/youtube-transcript-api/):
 ```
 pip install youtube_transcript_api
 ```
 If you want to use it from source, you'll have to install the dependencies manually:
 ```
 pip install -r requirements.txt
 ```
 You can either integrate this module [into an existing application](#api), or just use it via a [CLI](#cli).
 ## API
 The easiest way to get a transcript for a given video is to execute:
 ```python
 from youtube_transcript_api import YouTubeTranscriptApi
 YouTubeTranscriptApi.get_transcript(video_id)
 ```
 This will return a list of dictionaries looking somewhat like this:
 ```python
 [
     {
         'text': 'Hey there',
         'start': 7.58,
         'duration': 6.13
     },
     {
         'text': 'how are you',
         'start': 14.08,
         'duration': 7.58
     },
     # ...
 ]
 ```
 You can also add the `languages` param if you want to make sure the transcripts are retrieved in your desired language (it defaults to English).
 ```python
 YouTubeTranscriptApi.get_transcript(video_id, languages=['de', 'en'])
 ```
 It's a list of language codes in descending priority. In this example it will first try to fetch the German transcript (`'de'`) and then fetch the English transcript (`'en'`) if it fails to do so. If you want to find out which languages are available first, [have a look at `list_transcripts()`](#list-available-transcripts).
 To get transcripts for a list of video ids you can call:
 ```python
 YouTubeTranscriptApi.get_transcripts(video_ids, languages=['de', 'en'])
 ```
 `languages` is optional here as well.
 ### List available transcripts
@@ -81,16 +76,16 @@ transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, languages=['de
 This will return a `TranscriptList` object which is iterable and provides methods to filter the list of transcripts for specific languages and types, like:
 ```python
 transcript = transcript_list.find_transcript(['de', 'en'])
 ```
 By default this module always picks manually created transcripts over automatically created ones, if a transcript in the requested language is available both manually created and generated. The `TranscriptList` allows you to bypass this default behaviour by searching for specific transcript types:
 ```python
 # filter for manually created transcripts
 transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
 # or automatically generated ones
 transcript = transcript_list.find_generated_transcript(['de', 'en'])
 ```
@@ -98,15 +93,15 @@ The methods `find_generated_transcript`, `find_manually_created_transcript`, `fi
 ```python
 print(
     transcript.video_id,
     transcript.language,
     transcript.language_code,
     # whether it has been manually created or generated by YouTube
     transcript.is_generated,
     # whether this transcript can be translated or not
     transcript.is_translatable,
     # a list of languages the transcript can be translated to
     transcript.translation_languages,
 )
 ```
@@ -116,42 +111,42 @@ and provide the method, which allows you to fetch the actual transcript data:
 transcript.fetch()
 ```
 ### Translate transcript
 YouTube has a feature which allows you to automatically translate subtitles. This module also makes it possible to access this feature. To do so `Transcript` objects provide a `translate()` method, which returns a new translated `Transcript` object:
 ```python
 transcript = transcript_list.find_transcript(['en'])
 translated_transcript = transcript.translate('de')
 print(translated_transcript.fetch())
 ```
 ### By example
 ```python
 # retrieve the available transcripts
 transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')
 # iterate over all available transcripts
 for transcript in transcript_list:
     # the Transcript object provides metadata properties
     print(
         transcript.video_id,
         transcript.language,
         transcript.language_code,
         # whether it has been manually created or generated by YouTube
         transcript.is_generated,
         # whether this transcript can be translated or not
         transcript.is_translatable,
         # a list of languages the transcript can be translated to
         transcript.translation_languages,
     )
     # fetch the actual transcript data
     print(transcript.fetch())
     # translating the transcript will return another transcript object
     print(transcript.translate('en').fetch())
 # you can also directly filter for the language you are looking for, using the transcript list
 transcript = transcript_list.find_transcript(['de', 'en'])


@@ -72,7 +72,7 @@ class YouTubeTranscriptCli():
         parser.add_argument(
             '--languages',
             nargs='*',
-            default=[],
+            default=['en',],
             type=str,
             help=(
                 'A list of language codes in a descending priority. For example, if this is set to "de en" it will '


@@ -77,7 +77,7 @@ class TestYouTubeTranscriptCli(TestCase):
         parsed_args = YouTubeTranscriptCli('v1 v2'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
         self.assertEqual(parsed_args.json, False)
-        self.assertEqual(parsed_args.languages, [])
+        self.assertEqual(parsed_args.languages, ['en'])

     def test_argument_parsing__fail_without_video_ids(self):
         with self.assertRaises(SystemExit):
@@ -87,12 +87,12 @@ class TestYouTubeTranscriptCli(TestCase):
         parsed_args = YouTubeTranscriptCli('v1 v2 --json'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
         self.assertEqual(parsed_args.json, True)
-        self.assertEqual(parsed_args.languages, [])
+        self.assertEqual(parsed_args.languages, ['en'])

         parsed_args = YouTubeTranscriptCli('--json v1 v2'.split())._parse_args()
         self.assertEqual(parsed_args.video_ids, ['v1', 'v2'])
         self.assertEqual(parsed_args.json, True)
-        self.assertEqual(parsed_args.languages, [])
+        self.assertEqual(parsed_args.languages, ['en'])

     def test_argument_parsing__languages(self):
         parsed_args = YouTubeTranscriptCli('v1 v2 --languages de en'.split())._parse_args()
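
The "descending priority" semantics of `--languages` can be illustrated with a minimal sketch. The `pick_language` helper and the `available` set are hypothetical and are not taken from the library. The sketch only assumes that the first requested code with an available transcript wins.

```python
def pick_language(available_codes, requested_codes):
    """Return the first requested code that is available, or None (hypothetical helper)."""
    for code in requested_codes:
        if code in available_codes:
            return code
    return None


# assumption: language codes for which a video offers transcripts
available = {'en', 'de', 'fr'}

print(pick_language(available, ['de', 'en']))  # de
print(pick_language(available, ['es', 'en']))  # en
print(pick_language(available, []))            # None
```

Under this reading, an empty list can never match anything, which would explain why the old CLI default of `[]` found no transcript.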