Compare commits

...

38 Commits

Author SHA1 Message Date
df 5183884250 2021.06.06.1 2021-06-16 15:11:43 +01:00
df f04341939d (Re-)Add YoutubeSearchURLIE
https://github.com/ytdl-org/youtube-dl/pull/27749 (@pukkandan)
Code taken from: https://github.com/pukkandan/yt-dlc
Enable tests
2021-06-09 17:37:10 +01:00
df cebc7be09a Set a (partial) random cookie to avoid HTTP 429 errors on the "watch?v=" page
https://github.com/ytdl-org/youtube-dl/pull/29175 (@Stefan311)
2021-06-09 13:26:19 +01:00
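The idea behind this commit is to make each client present a slightly different cookie value, so the server's rate limiter does not treat repeated requests as identical. A purely illustrative sketch of randomising part of a cookie (the cookie name and format here are placeholders, not necessarily what the patch actually sends):

```python
import random

def partial_random_cookie():
    # Illustrative only: vary a numeric suffix so each client presents a
    # slightly different cookie value to the server.
    return 'CONSENT=YES+%03d' % random.randint(100, 999)
```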
df 3cc7e72d9c [webarchive] Added new extractor for the web archive
https://github.com/ytdl-org/youtube-dl/pull/28842 (@alex-gedeon)
Closes ytdl-org#13655
2021-06-09 12:50:22 +01:00
df 35bf1f5971 Handle user:pass in URLs
https://github.com/ytdl-org/youtube-dl/pull/28801 (@hhirtz)
Fixes "nonnumeric port" errors when youtube-dl is given URLs with
usernames and passwords such as:

    http://username:password@example.com/myvideo.mp4

    Refs:
    - https://en.wikipedia.org/wiki/Basic_access_authentication
    - https://tools.ietf.org/html/rfc1738#section-3.1
    - https://docs.python.org/3.8/library/urllib.parse.html#urllib.parse.urlsplit

    Fixes ytdl-org#18276 (point 4)
    Fixes ytdl-org#20258
    Fixes ytdl-org#26211 (see comment)
2021-06-09 12:45:00 +01:00
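The gist of the change is to split any embedded credentials out of the URL before it reaches the port-parsing code. A Python 3 sketch of the behaviour described by the `test_extract_user_pass` cases added in this changeset (the real implementation targets Python 2/3 through youtube-dl's compat layer and may differ in detail):

```python
from urllib.parse import urlsplit, urlunsplit

def extract_user_pass(url):
    # Split user:pass credentials out of the URL so later port parsing
    # does not choke on them ("nonnumeric port" errors).
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url, None, None
    netloc = parts.hostname or ''
    if parts.port:
        netloc += ':%d' % parts.port
    clean = urlunsplit((parts.scheme, netloc) + parts[2:])
    return clean, parts.username or '', parts.password or ''
```

The stripped credentials can then be re-attached as an Authorization header, which is what the accompanying `sanitized_Request` tests exercise.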
df 0345a064a7 [itv] fixed extraction (closes ytdl-org#28906)
https://github.com/ytdl-org/youtube-dl/pull/28955 (@sleaux-meaux)
ITV changes require media stream and subtitles to be taken
from different playlist resources. Incorporates suggestion in
ytdl-org#28906 (comment)
2021-06-09 12:32:35 +01:00
df 2d714ecb56 Merge branch 'master' of https://git.hpkg.tv/df/youtube-dl into Hummy 2021-06-09 12:24:46 +01:00
df 0ad58e61a7 Merge branch 'master' into Hummy 2021-04-26 00:13:09 +01:00
df 0b9bac2d45 Make description for __INITIAL_DATA__ more functionally 2021-04-22 18:57:56 +01:00
df 635225fa3f Move example URL from now invalid case to current case
Added in commit baf39a1aa8 but page structure has changed.
2021-04-22 18:57:55 +01:00
dirkf ab196ce69e Implement @dstftw review comments
Improve JS parsing  (Co-authored-by: Sergey M. <dstftw@gmail.com>)
Refactor populating entry from playlistObject
2021-04-22 18:57:54 +01:00
Sergey M․ 208509b528 release 2021.04.07 2021-04-22 18:57:53 +01:00
df e3a336bf4e [BBCIE] always strip channel name from either end of playlist title 2021-04-22 18:57:52 +01:00
df 56fc561c8d Extract first supportingMedia if no leadMedia in component-based Morph format 2021-04-22 18:57:51 +01:00
df 216c65d467 Add Bitesize, try to restore components-style Morph 2021-04-22 18:57:50 +01:00
df 42f1bb6506 Avoid \ continuation, fix other formatting and merge issues 2021-04-22 18:57:49 +01:00
df 8524dd70f9 Morph-based pages: add Weather
For Python 3 cast dict.views() from view to list
2021-04-22 18:57:48 +01:00
df b094d67002 Morph-based pages: fix articles and implement playlists
In Morph-based pages the target is JSON in a <script> element containing `Morph.setPayload(arg1, arg2);`

The first argument of Morph.setPayload() has to be matched to pick the correct instance of the call. In that case, the second argument contains the target JSON.

Morph-based articles were previously implemented but looked for a `body.components` member of the JSON object that no longer seems to be sent. Instead `body` and `body.media` have to be used.
2021-04-22 18:57:47 +01:00
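A rough sketch of matching the first argument and decoding the second (the regex and helper name are illustrative; the real extractor uses youtube-dl's search and JSON helpers and a more robust pattern):

```python
import json
import re

def morph_payload(webpage, payload_name):
    # Scan Morph.setPayload(<name>, <json>); calls: the first argument
    # identifies the payload instance, the second is the JSON we want.
    for m in re.finditer(
            r'Morph\.setPayload\(([^,]+),\s*(\{.+?\})\s*\);',
            webpage, re.DOTALL):
        if payload_name in m.group(1):
            return json.loads(m.group(2))
    return None
```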
df e23abe407e Support playlist vs video selection, separate --yes-playlist semantics 2021-04-22 18:57:46 +01:00
df 92e0b02ec2 Make --no-playlist and --yes-playlist independent: default depends on URL+page 2021-04-22 18:57:45 +01:00
dirkf 40152ecb68 Support full playlist and --no-playlist for Reel 2021-04-22 18:57:45 +01:00
df 081846a711 Make IBL access safer 2021-04-22 18:57:44 +01:00
df 1f3bdf9ad8 Allow for whitespace in 'next page' HTML 2021-04-22 18:57:43 +01:00
dirkf 200b9eebb3 Add test-case for audio description 2021-04-22 18:57:42 +01:00
df f3a33f91e2 Use embedded playlist data for iPlayer, support audio-described videos 2021-04-22 18:57:40 +01:00
df a4cbe8f909 Support embedded video, data in playlistObject of playerSettings 2021-04-22 18:57:40 +01:00
df 1189422cd0 Release 2021-02-04 2021-04-22 18:57:39 +01:00
df 62c225ef44 Release 2020-12-31 2021-04-22 18:57:36 +01:00
df beb803cd3b Release 2020.11.19 2021-04-22 18:57:33 +01:00
df 779663d086 Fix hiding old .opk versions from git 2021-04-22 18:57:30 +01:00
df 5d2bc1461c Merge 2020.11.18 2021-04-22 18:57:29 +01:00
df bf9254077b Align package directory structure with upstream.
Use PYTHONPATH to unify  bin/youtube-dl.
2021-04-22 18:57:29 +01:00
df d8e6815fef Add script to update masterGL branch from GitLab 2021-04-22 18:57:25 +01:00
df 9bcc47eef0 Update from youtube-dl_2020.11.01.1-1 package 2021-04-22 18:57:24 +01:00
df 4fc9148ab7 Add script to process iPlayer subtitles for use on HD/R-Fox 2021-04-22 18:57:24 +01:00
df f800b76250 Add build and installation scripts for Hummy package 2021-04-22 18:57:22 +01:00
df 90745d224b Hummy package youtube-dl_2020.11.01.1 2021-04-22 18:57:21 +01:00
df 48d5aba7b6 Support BBC World (etc) pages with data in SIMORGH_DATA JSON 2021-04-22 12:47:17 +01:00
30 changed files with 998 additions and 118 deletions



@@ -18,7 +18,7 @@ title: ''
<!--
Carefully read and work through this check list in order to prevent the most common mistakes and misuse of youtube-dl:
- First of, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.06.06. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- First of all, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.04.26. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- Make sure that all provided video/audio/playlist URLs (if any) are alive and playable in a browser.
- Make sure that all URLs and arguments with special characters are properly quoted or escaped as explained in http://yt-dl.org/escape.
- Search the bugtracker for similar issues: http://yt-dl.org/search-issues. DO NOT post duplicates.


@@ -19,7 +19,7 @@ labels: 'site-support-request'
<!--
Carefully read and work through this check list in order to prevent the most common mistakes and misuse of youtube-dl:
- First of, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.06.06. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- First of all, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.04.26. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- Make sure that all provided video/audio/playlist URLs (if any) are alive and playable in a browser.
- Make sure that site you are requesting is not dedicated to copyright infringement, see https://yt-dl.org/copyright-infringement. youtube-dl does not support such sites. In order for site support request to be accepted all provided example URLs should not violate any copyrights.
- Search the bugtracker for similar site support requests: http://yt-dl.org/search-issues. DO NOT post duplicates.


@@ -18,7 +18,7 @@ title: ''
<!--
Carefully read and work through this check list in order to prevent the most common mistakes and misuse of youtube-dl:
- First of, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.06.06. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- First of all, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.04.26. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- Search the bugtracker for similar site feature requests: http://yt-dl.org/search-issues. DO NOT post duplicates.
- Finally, put x into all relevant boxes (like this [x])
-->


@@ -18,7 +18,7 @@ title: ''
<!--
Carefully read and work through this check list in order to prevent the most common mistakes and misuse of youtube-dl:
- First of, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.06.06. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- First of all, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.04.26. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- Make sure that all provided video/audio/playlist URLs (if any) are alive and playable in a browser.
- Make sure that all URLs and arguments with special characters are properly quoted or escaped as explained in http://yt-dl.org/escape.
- Search the bugtracker for similar issues: http://yt-dl.org/search-issues. DO NOT post duplicates.


@@ -19,7 +19,7 @@ labels: 'request'
<!--
Carefully read and work through this check list in order to prevent the most common mistakes and misuse of youtube-dl:
- First of, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.06.06. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- First of all, make sure you are using the latest version of youtube-dl. Run `youtube-dl --version` and ensure your version is 2021.04.26. If it's not, see https://yt-dl.org/update on how to update. Issues with outdated version will be REJECTED.
- Search the bugtracker for similar feature requests: http://yt-dl.org/search-issues. DO NOT post duplicates.
- Finally, put x into all relevant boxes (like this [x])
-->

Hummy/.gitignore vendored Normal file

@@ -0,0 +1 @@
*.sav

Hummy/Makefile Normal file

@@ -0,0 +1,45 @@
#
# Makefile for Hummy youtube-dl package
#
DISTPKG := python2.7/dist-packages
.PHONY: all clean opkg clean-opk clean-opkg release opkg-bin opkg-lib opkg-sav
MV ?= mv -f
define nl
endef
clean-opk: $(wildcard *.opk)
-$(RM) $?
clean-opkg:
-$(RM) -r opkg/bin
-$(RM) -r opkg/lib
clean: clean-opk clean-opkg
opkg-lib: ../youtube_dl
-$(RM) -r opkg/lib/$(DISTPKG)
install -d opkg/lib/$(DISTPKG)
cp -rpP $< opkg/lib/$(DISTPKG)
opkg-bin: $(wildcard ../bin/*) $(wildcard bin/*)
install -d opkg/bin
install -p $? opkg/bin
opkg-sav: $(wildcard *.opk)
$(foreach opk,$^,-$(MV) "$(opk)" "$(opk).sav"$(nl))
opkg: $(wildcard opkg/CONTROL/*) opkg-bin opkg-lib opkg-sav
opkg-pack opkg
-for opk in youtube-dl*.opk; do $(RM) "$${opk}.sav"; done
release: opkg
tagname="$$(for opk in youtube-dl*.opk; do echo "$$opk"; break; done)" && \
tagname="$${tagname%_*.*}" && \
test -n "$${tagname}" && \
git tag -f -a -m "Release $${tagname}" "$${tagname}"

Hummy/README.md Normal file

@@ -0,0 +1,14 @@
## Build and installation scripts for the Hummy youtube-dl package
# make clean
Remove build artefacts.
# make opkg
Create package using youtube-dl files from parent directory.
# make release
Tag the release with the version of the package that was built.

Hummy/bin/fixsttl Executable file

@@ -0,0 +1,26 @@
#!/bin/sh
# Usage: fixsttl media_file
# for a media file, convert its .locale.srt, if any, to plain text .srt
# can be overridden with STTL_LANG=da-DK, etc.
STTL_LANG=${STTL_LANG:-en-GB}
main() {
local ext froot srt
[ -n "$1" ] || exit
# any other extensions?
for ext in mp4 mpg mkv; do
froot=${1%.$ext}
[ "$1" != "$froot" ] && break
done
[ "$1" = "$froot" ] && return 1
srt=${froot}.${STTL_LANG}.srt
[ -r "$srt" ] || return
# *.en-GB.srt -> *.srt
iconv -f UTF-8 -t LATIN1 "$srt" |
# strip <tags> and </tags>
sed -r -e 's@<[/a-zA-Z]+( [^>]*)?>@@g' > "${froot}.srt" &&
{ rm -f -- "$srt"; return 0; }
}
main "$@"

Hummy/bin/iplayer-episodes Executable file

@@ -0,0 +1,55 @@
#!/bin/sh
# scrape iPlayer programme URLs from a BBC web page
# args: [--queue|-q] iplayer_series_url
mung_url()
{ # prefix
local url
while read url; do
url=${url##href=\"};
echo $1${url%%\"}
done
}
case $1 in
--queue|-q)
if which qtube >/dev/null; then
qqq() {
while read -r line; do
qtube "$@" "$line"
done
}
else
printf "No qtube program is installed; listing qtube commands\n" >&2
qqq() {
while read -r line; do
echo qtube "$@" $(printf "'%s'" "$line")
done
}
fi
shift
;;
--help|-h) {
printf "Usage:\n\n%s [--queue|-q] iplayer_series_url\n\n" "${0##*/}"
printf "Extract iPlayer programme URLs from series page and pass to youtube-dl.\n\n"
printf "With queue option, instead try to queue each URL for download.\n\n"
} 1>&2
exit
;;
*) qqq() { youtube -a -; }
;;
esac
# get BBC's base address
bbc="$1"; bbc="${bbc%%/iplayer*}"
# parse the web page for episode URLs, extract and prepare them for youtube-dl
# curl: -k insecure, needed due to Humax's old SSL libs, -s silent, -S show errors anyway
# grep: -o print matching substring, -E match extended regular expression
curl -k -s -S $1 | grep -oE "href=('|\")/iplayer/episode/[^'\"]+\\1" | mung_url $bbc | \
sort | uniq | qqq

Hummy/bin/youtube Executable file

@@ -0,0 +1,2 @@
#!/bin/sh
python /mod/lib/python2.7/dist-packages/youtube_dl "$@"


@@ -0,0 +1,8 @@
Package: youtube-dl
Priority: optional
Section: misc
Version: 2021.06.06.1
Architecture: mipsel
Maintainer: prpr
Depends: ffmpeg(>=4.1),wget(>=1.20),python,libiconv
Description: Download videos from youtube.com or other video platforms

Hummy/opkg/CONTROL/postinst Executable file

@@ -0,0 +1,81 @@
#!/bin/sh
distpkgs=/mod/lib/python2.7/dist-packages
cfgfile='/mod/etc/youtube-dl.conf'
oldcfgfile="${cfgfile}.old"
# the default settings
def_settings() {
cat << EOM
--restrict-filenames
--prefer-ffmpeg
-f|--format "best[height<=?1080][fps<=?30]"
-o|--output "$outdir/%(title)s.%(ext)s"
EOM
}
is_set() { # option
# -w whole words -q just set return code -E extended regexp
[ -f "$cfgfile" ] && grep -wq -E -e "($1)" "$cfgfile"
}
case "$(cat /etc/model)" in
HDR)
outdir='/mnt/hd2/My Video'
;;
*) # HD
outdir='/media/drive1/Video'
;;
esac
settings="$(mktemp)"
def_settings |
while read opt val; do
# only add settings that aren't already set
if ! is_set "$opt"; then
echo "${opt%%|*}" "$val" >>"$settings"
fi
done
if [ -s "$settings" ]; then
if [ -f "$cfgfile" ]; then
cp "$cfgfile" "$oldcfgfile"
echo "Your youtube-dl settings file has been updated and"
echo "the previous settings file saved as $oldcfgfile"
fi
cat "$settings" >>"$cfgfile"
fi
rm "$settings"
sed -i 's/fps<=?30/fps<=?60/' "$cfgfile"
# make python recognise the distribution pkg directory
patch_python() {
profile=/mod/etc/profile/python
if ! grep -qF "$distpkgs" "$profile"; then
printf 'export PYTHONPATH="%s"\n' "$distpkgs" >> "$profile"
printf "\nLog out and in again to set PYTHONPATH\n\n"
fi
}
patch_python
find "${distpkgs}/youtube_dl" -name '*.pyc' -exec rm -f "{}" \;
# remove pre-20201112 installation
for tag in /tmp/.ytdl_*; do
[ -e "$tag" ] || continue
echo "$tag" |
( while IFS=_ read _ ver _; do
if [ "$ver" -lt 20201112 -a -e "${distpkgs}/youtube-dl" ]; then
rm -f "$tag"
find "${distpkgs}/youtube-dl" -name '*.pyc' -exec rm -f "{}" \;
rmdir "${distpkgs}/youtube-dl" || true
exit
fi
done )
done
# background compile
youtube-dl --version >/dev/null &
exit 0

Hummy/opkg/CONTROL/postrm Executable file

@@ -0,0 +1,8 @@
#!/bin/sh
cfgfile='/mod/etc/youtube-dl.conf'
oldcfgfile="${cfgfile}.old"
pkgdir="/mod/lib/python2.7/dist-packages/youtube_dl"
[ -f "$cfgfile" ] && rm "$cfgfile"
[ -f "$oldcfgfile" ] && rm "$oldcfgfile"
[ -d "$pkgdir" ] && rm -r "$pkgdir"
exit 0

Hummy/opkg/CONTROL/preinst Executable file

@@ -0,0 +1,8 @@
#!/bin/sh
CTL=/mod/var/opkg/info/youtube-dl.control
[ -r "$CTL" ] &&
grep -E '^Version:' "$CTL" |
( while IFS=".${IFS}" read _ yy mm dd _; do
echo >"/tmp/.ytdl_${yy}${mm}${dd}"
break
done )


@@ -1,4 +1,4 @@
#!/usr/bin/env python
#!/bin/env python
import youtube_dl


@@ -66,9 +66,9 @@ class TestAllURLsMatching(unittest.TestCase):
self.assertMatch('https://www.youtube.com/feed/watch_later', ['youtube:tab'])
self.assertMatch('https://www.youtube.com/feed/subscriptions', ['youtube:tab'])
# def test_youtube_search_matching(self):
# self.assertMatch('http://www.youtube.com/results?search_query=making+mustard', ['youtube:search_url'])
# self.assertMatch('https://www.youtube.com/results?baz=bar&search_query=youtube-dl+test+video&filters=video&lclk=video', ['youtube:search_url'])
def test_youtube_search_matching(self):
self.assertMatch('http://www.youtube.com/results?search_query=making+mustard', ['youtube:search_url'])
self.assertMatch('https://www.youtube.com/results?baz=bar&search_query=youtube-dl+test+video&filters=video&lclk=video', ['youtube:search_url'])
def test_facebook_matching(self):
self.assertTrue(FacebookIE.suitable('https://www.facebook.com/Shiniknoh#!/photo.php?v=10153317450565268'))


@@ -65,6 +65,8 @@ from youtube_dl.utils import (
sanitize_filename,
sanitize_path,
sanitize_url,
extract_user_pass,
sanitized_Request,
expand_path,
prepend_extension,
replace_extension,
@@ -237,6 +239,26 @@ class TestUtil(unittest.TestCase):
self.assertEqual(sanitize_url('rmtps://foo.bar'), 'rtmps://foo.bar')
self.assertEqual(sanitize_url('https://foo.bar'), 'https://foo.bar')
def test_extract_user_pass(self):
self.assertEqual(extract_user_pass('http://foo.bar'), ('http://foo.bar', None, None))
self.assertEqual(extract_user_pass('http://:foo.bar'), ('http://:foo.bar', None, None))
self.assertEqual(extract_user_pass('http://@foo.bar'), ('http://foo.bar', '', ''))
self.assertEqual(extract_user_pass('http://:pass@foo.bar'), ('http://foo.bar', '', 'pass'))
self.assertEqual(extract_user_pass('http://user:@foo.bar'), ('http://foo.bar', 'user', ''))
self.assertEqual(extract_user_pass('http://user:pass@foo.bar'), ('http://foo.bar', 'user', 'pass'))
def test_sanitized_Request(self):
self.assertFalse(sanitized_Request('http://foo.bar').has_header('Authorization'))
self.assertFalse(sanitized_Request('http://:foo.bar').has_header('Authorization'))
self.assertEqual(sanitized_Request('http://@foo.bar').get_header('Authorization'),
'Basic Og==')
self.assertEqual(sanitized_Request('http://:pass@foo.bar').get_header('Authorization'),
'Basic OnBhc3M=')
self.assertEqual(sanitized_Request('http://user:@foo.bar').get_header('Authorization'),
'Basic dXNlcjo=')
self.assertEqual(sanitized_Request('http://user:pass@foo.bar').get_header('Authorization'),
'Basic dXNlcjpwYXNz')
def test_expand_path(self):
def env(var):
return '%{0}%'.format(var) if sys.platform == 'win32' else '${0}'.format(var)

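The Authorization values in the `test_sanitized_Request` cases above are simply `base64("user:pass")` per RFC 7617 (empty user or password is allowed). A standalone sketch of building such a header (the helper name is illustrative, not youtube-dl's API):

```python
import base64

def basic_auth_header(username, password):
    # RFC 7617 Basic scheme: "Basic " + base64 of "user:pass";
    # empty username/password components are permitted.
    token = ('%s:%s' % (username, password)).encode('utf-8')
    return 'Basic ' + base64.b64encode(token).decode('ascii')
```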
tools/pullFromGL Executable file

@@ -0,0 +1,2 @@
#!/bin/sh
git pull masterGL master:masterGL


@@ -1,4 +1,4 @@
#!/usr/bin/env python
#!/bin/env python
# coding: utf-8
from __future__ import absolute_import, unicode_literals


@@ -1,4 +1,4 @@
#!/usr/bin/env python
#!/bin/env python
# coding: utf-8
from __future__ import unicode_literals


@@ -1,4 +1,4 @@
#!/usr/bin/env python
#!/mod/bin/busybox/env python
from __future__ import unicode_literals
# Execute with


@@ -24,6 +24,7 @@ from ..utils import (
get_element_by_class,
int_or_none,
js_to_json,
parse_bitrate,
parse_duration,
parse_iso8601,
strip_or_none,
@@ -68,6 +69,8 @@ class BBCCoUkIE(InfoExtractor):
_EMP_PLAYLIST_NS = 'http://bbc.co.uk/2008/emp/playlist'
_DESCRIPTION_KEY = 'synopses'
_TESTS = [
{
'url': 'http://www.bbc.co.uk/programmes/b039g8p7',
@@ -262,6 +265,21 @@ class BBCCoUkIE(InfoExtractor):
}, {
'url': 'https://www.bbc.co.uk/programmes/w172w4dww1jqt5s',
'only_matching': True,
}, {
# audio-described
'url': 'https://www.bbc.co.uk/iplayer/episode/m000b1v0/ad/his-dark-materials-series-1-1-lyras-jordan',
'info_dict': {
'id': 'p07ss5kj',
'ext': 'mp4',
'title': 'His Dark Materials - Series 1: 1. Lyra\u2019s Jordan - Audio Described',
'description': 'Orphan Lyra Belacqua\'s world is turned upside-down by her long-absent uncle\'s return from the north, while the glamorous Mrs Coulter visits Jordan College with a proposition.',
'duration': 3407,
},
'params': {
# rtmp download
'skip_download': True,
},
'skip': 'geolocation',
}]
def _login(self):
@@ -317,6 +335,10 @@ class BBCCoUkIE(InfoExtractor):
def _extract_connections(self, media):
return media.get('connection') or []
def _get_description(self, data):
synopses = try_get(data, lambda x: x[self._DESCRIPTION_KEY], dict) or {}
return dict_get(synopses, ('large', 'medium', 'small'))
def _get_subtitles(self, media, programme_id):
subtitles = {}
for connection in self._extract_connections(media):
@@ -352,7 +374,9 @@ class BBCCoUkIE(InfoExtractor):
last_exception = e
continue
self._raise_extractor_error(e)
self._raise_extractor_error(last_exception)
if last_exception and not formats:
self._raise_extractor_error(last_exception)
return formats, subtitles
def _download_media_selector_url(self, url, programme_id=None):
media_selection = self._download_json(
@@ -542,15 +566,45 @@ class BBCCoUkIE(InfoExtractor):
programme_id = None
duration = None
description = None
thumbnail = None
tviplayer = self._search_regex(
r'mediator\.bind\(({.+?})\s*,\s*document\.getElementById',
webpage, 'player', default=None)
# current pages embed data from http://www.bbc.co.uk/programmes/PID.json
# similar data available at http://ibl.api.bbc.co.uk/ibl/v1/episodes/PID
redux_state = self._parse_json(self._html_search_regex(
r'<script\b[^>]+id=(["\'])tvip-script-app-store\1[^>]*>[^<]*_REDUX_STATE__\s*=\s*(?P<json>[^<]+)\s*;\s*<',
webpage, 'redux state', default='{}', group='json'), group_id, fatal=False)
episode = redux_state.get('episode', {})
if episode.get('id') == group_id:
# try to match the version against the page's version
current_version = episode.get('currentVersion')
kinds = ['original']
if current_version == 'ad':
kinds.insert(0, 'audio-described')
for kind in kinds:
for version in redux_state.get('versions', {}):
if try_get(version, lambda x: x['kind'], compat_str) == kind:
programme_id = version.get('id')
duration = try_get(version, lambda x: x['duration']['seconds'], int)
break
if programme_id:
break
if programme_id:
description = self._get_description(episode)
thumbnail = try_get(episode, lambda x: x['images']['standard'], compat_str)
if thumbnail:
thumbnail = thumbnail.format(recipe='raw')
if tviplayer:
player = self._parse_json(tviplayer, group_id).get('player', {})
duration = int_or_none(player.get('duration'))
programme_id = player.get('vpid')
if not programme_id:
# still valid?
tviplayer = self._search_regex(
r'mediator\.bind\(({.+?})\s*,\s*document\.getElementById',
webpage, 'player', default=None)
if tviplayer:
player = self._parse_json(tviplayer, group_id).get('player', {})
duration = int_or_none(player.get('duration'))
programme_id = player.get('vpid')
if not programme_id:
programme_id = self._search_regex(
@@ -561,7 +615,7 @@ class BBCCoUkIE(InfoExtractor):
title = self._og_search_title(webpage, default=None) or self._html_search_regex(
(r'<h2[^>]+id="parent-title"[^>]*>(.+?)</h2>',
r'<div[^>]+class="info"[^>]*>\s*<h1>(.+?)</h1>'), webpage, 'title')
description = self._search_regex(
description = description or self._search_regex(
(r'<p class="[^"]*medium-description[^"]*">([^<]+)</p>',
r'<div[^>]+class="info_+synopsis"[^>]*>([^<]+)</div>'),
webpage, 'description', default=None)
@@ -576,7 +630,7 @@ class BBCCoUkIE(InfoExtractor):
'id': programme_id,
'title': title,
'description': description,
'thumbnail': self._og_search_thumbnail(webpage, default=None),
'thumbnail': thumbnail or self._og_search_thumbnail(webpage, default=None),
'duration': duration,
'formats': formats,
'subtitles': subtitles,
@@ -638,9 +692,7 @@ class BBCIE(BBCCoUkIE):
'skip_download': True,
}
}, {
# article with single video embedded with data-playable containing XML playlist
# with direct video links as progressiveDownloadUrl (for now these are extracted)
# and playlist with f4m and m3u8 as streamingUrl
# article with single video (formerly) embedded, now using SIMORGH_DATA JSON
'url': 'http://www.bbc.com/turkce/haberler/2015/06/150615_telabyad_kentin_cogu',
'info_dict': {
'id': '150615_telabyad_kentin_cogu',
@@ -652,12 +704,13 @@ class BBCIE(BBCCoUkIE):
},
'params': {
'skip_download': True,
}
},
'skip': 'Video no longer embedded, 2021',
}, {
# single video embedded with data-playable containing XML playlists (regional section)
# single video embedded, legacy media, in promo object of SIMORGH_DATA JSON
'url': 'http://www.bbc.com/mundo/video_fotos/2015/06/150619_video_honduras_militares_hospitales_corrupcion_aw',
'info_dict': {
'id': '150619_video_honduras_militares_hospitales_corrupcion_aw',
'id': '39275083',
'ext': 'mp4',
'title': 'Honduras militariza sus hospitales por nuevo escándalo de corrupción',
'description': 'md5:1525f17448c4ee262b64b8f0c9ce66c8',
@@ -750,6 +803,16 @@ class BBCIE(BBCCoUkIE):
'description': 'Fast-paced football, wit, wisdom and a ready smile - why Liverpool fans should come to love new boss Jurgen Klopp.',
},
'playlist_count': 3,
}, {
# single video embedded, data in playlistObject of playerSettings
'url': 'https://www.bbc.com/news/av/embed/p07xmg48/50670843',
'info_dict': {
'id': 'p07xmg48',
'ext': 'mp4',
'title': 'General election 2019: From the count, to your TV',
'description': 'General election 2019: From the count, to your TV',
'duration': 160,
},
}, {
# school report article with single video
'url': 'http://www.bbc.co.uk/schoolreport/35744779',
@@ -813,6 +876,17 @@ class BBCIE(BBCCoUkIE):
}, {
# BBC Reel
'url': 'https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness',
'info_dict': {
'id': 'mind-matters',
'title': 'Mind Matters',
'description': 'Uncovering the mysteries of our minds and the importance of mental health and well-being.',
'duration': 3083,
'upload_date': '20181214',
},
'playlist_count': 13,
}, {
# BBC Reel playlist and video => video
'url': 'https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness',
'info_dict': {
'id': 'p07c6sb9',
'ext': 'mp4',
@@ -824,6 +898,86 @@ class BBCIE(BBCCoUkIE):
'upload_date': '20190604',
'categories': ['Psychology'],
},
'params': {
'no-playlist': True,
},
}, {
# BBC Reel video and playlist => video
'url': 'https://www.bbc.com/reel/video/p099tghy/is-phrenology-the-weirdest-pseudoscience-of-them-all-',
'info_dict': {
'id': 'p07c6sb9',
'ext': 'mp4',
'title': 'How positive thinking is harming your happiness',
'alt_title': 'The downsides of positive thinking',
'description': 'md5:fad74b31da60d83b8265954ee42d85b4',
'duration': 235,
'thumbnail': r're:https?://.+/p07c9dsr.jpg',
'upload_date': '20190604',
'categories': ['Psychology'],
},
}, {
# BBC World Service etc: media nested in content object of SIMORGH_DATA JSON
'url': 'http://www.bbc.co.uk/scotland/articles/cm49v4x1r9lo',
'info_dict': {
'id': 'p06p040v',
'ext': 'mp4',
'title': 'Five things ants can teach us about management',
'description': 'They may be tiny, but us humans could learn a thing or two from ants.',
'duration': 191,
'thumbnail': r're:https?://.+/p06p0qzv.jpg',
'upload_date': '20181016',
},
}, {
# BBC Reel specified video and playlist => video
'url': 'https://www.bbc.com/reel/playlist/mind-matters?vpid=p0962h5x',
'info_dict': {
'id': 'p095rkvg',
'ext': 'mp4',
'title': 'Can you really have a \'photographic\' memory?',
'alt_title': 'Why your memory is not like a camera',
'description': 'md5:00000000000000000000000000000000',
'duration': 211,
'thumbnail': r're:https?://.+/p095rrbz.jpg',
'upload_date': '20210202',
'categories': ['Neuroscience'],
},
}, {
# BBC Reel specified video and playlist => playlist
'info_dict': {
'id': 'mind-matters',
'title': 'Mind Matters',
'description': 'Uncovering the mysteries of our minds and the importance of mental health and well-being.',
'duration': 3083,
'upload_date': '20181214',
},
'playlist_count': 13,
'params': {
'no-playlist': False,
},
}, {
# BBC Weather
'url': 'https://www.bbc.co.uk/weather/features/55581056',
'info_dict': {
'id': 'p093xhxl',
'ext': 'mp4',
'title': 'Weather for the Week Ahead',
'description': 'There\'ll be a battle between colder and milder weather in the coming few days, before it turns chillier once again.',
'duration': 209,
'thumbnail': r're:https?://.+/p093xk3z.jpg',
'upload_date': '20210113',
},
}, {
# BBC Bitesize
'url': 'https://www.bbc.co.uk/bitesize/guides/zgvq4qt/revision/6',
'info_dict': {
'id': 'p04yj749',
'ext': 'mp4',
'title': 'Circuits',
'description': 'Learn about and revise electrical circuits, charge, current, power and resistance with GCSE Bitesize Combined Science.',
'duration': 205,
'thumbnail': r're:https?://.+/p04z1ckk.jpg',
'upload_date': '20180223',
},
}]
@classmethod
@@ -873,25 +1027,56 @@ class BBCIE(BBCCoUkIE):
'subtitles': subtitles,
}
def _extract_from_playlist_object(self, playlist_object):
title = playlist_object.get('title')
item_0 = try_get(playlist_object, lambda x: x['items'][0], dict)
if item_0 and title:
description = playlist_object.get('summary')
duration = int_or_none(item_0.get('duration'))
programme_id = dict_get(item_0, ('vpid', 'versionID'))
if programme_id:
return {
'id': programme_id,
'title': title,
'description': description,
'duration': duration,
}
return {}
def _get_playlist_entry(self, entry):
programme_id = entry.get('id')
if not programme_id:
return
formats, subtitles = self._download_media_selector(programme_id)
self._sort_formats(formats)
entry.update({
'formats': formats,
'subtitles': subtitles,
})
return entry
def _real_extract(self, url):
playlist_id = self._match_id(url)
webpage = self._download_webpage(url, playlist_id)
json_ld_info = self._search_json_ld(webpage, playlist_id, default={})
timestamp = json_ld_info.get('timestamp')
playlist_title = json_ld_info.get('title')
if not playlist_title:
playlist_title = self._og_search_title(
webpage, default=None) or self._html_search_regex(
r'<title>(.+?)</title>', webpage, 'playlist title', default=None)
playlist_title = (self._html_search_regex(r'<title\b[^>]*>(.+)</title>', webpage, 'playlist title', default=None)
or self._og_search_title(webpage, name='playlist title', default=None)
or self._html_search_meta('title', webpage, display_name='playlist title'))
if playlist_title:
playlist_title = re.sub(r'(.+)\s*-\s*BBC.*?$', r'\1', playlist_title).strip()
playlist_title = re.sub(r'^(BBC.*?\s*-\s*)?(.+)(?(1)|\s*-\s*BBC.*?)$', r'\2', playlist_title).strip()
playlist_description = json_ld_info.get(
'description') or self._og_search_description(webpage, default=None)
playlist_description = json_ld_info.get('description')
if not playlist_description:
playlist_description = (self._og_search_description(webpage, default=None)
or self._html_search_meta('description', webpage, default=None))
if playlist_description:
playlist_description = playlist_description.strip()
timestamp = json_ld_info.get('timestamp')
if not timestamp:
timestamp = parse_iso8601(self._search_regex(
[r'<meta[^>]+property="article:published_time"[^>]+content="([^"]+)"',
@@ -903,6 +1088,7 @@ class BBCIE(BBCCoUkIE):
# article with multiple videos embedded with playlist.sxml (e.g.
# http://www.bbc.com/sport/0/football/34475836)
# - obsolete?
playlists = re.findall(r'<param[^>]+name="playlist"[^>]+value="([^"]+)"', webpage)
playlists.extend(re.findall(r'data-media-id="([^"]+/playlist\.sxml)"', webpage))
if playlists:
@@ -920,27 +1106,17 @@ class BBCIE(BBCCoUkIE):
continue
settings = data_playable.get('settings', {})
if settings:
# data-playable with video vpid in settings.playlistObject.items (e.g.
# http://www.bbc.com/news/world-us-canada-34473351)
# data-playable with video vpid in settings.playlistObject.items
# obsolete? example previously quoted uses __INITIAL_DATA__ now
playlist_object = settings.get('playlistObject', {})
if playlist_object:
items = playlist_object.get('items')
if items and isinstance(items, list):
title = playlist_object['title']
description = playlist_object.get('summary')
duration = int_or_none(items[0].get('duration'))
programme_id = items[0].get('vpid')
formats, subtitles = self._download_media_selector(programme_id)
self._sort_formats(formats)
entries.append({
'id': programme_id,
'title': title,
'description': description,
entry = self._extract_from_playlist_object(playlist_object)
entry = self._get_playlist_entry(entry)
if entry:
entry.update({
'timestamp': timestamp,
'duration': duration,
'formats': formats,
'subtitles': subtitles,
})
entries.append(entry)
else:
# data-playable without vpid but with a playlist.sxml URLs
# in otherSettings.playlist (e.g.
@ -970,7 +1146,25 @@ class BBCIE(BBCCoUkIE):
if entry:
self._sort_formats(entry['formats'])
entries.append(entry)
else:
# embed video with playerSettings, eg
# https://www.bbc.com/news/av/embed/p07xmg48/50670843
settings = self._html_search_regex(
r'<script\b[^>]+>.+\.playerSettings\s*=\s*(?P<json>\{.*\})\s*(?:,\s*function\s*\(\s*\)\s*\{\s*["\']use strict.+\(\s*\)\s*)?</script\b',
webpage, 'player settings', default='{}', group='json')
settings = self._parse_json(settings, playlist_id, transform_source=js_to_json, fatal=False)
if settings:
playlist_object = settings.get('playlistObject', {})
if playlist_object:
entry = self._extract_from_playlist_object(playlist_object)
entry = self._get_playlist_entry(entry)
if entry:
thumbnail = playlist_object.get('holdingImageURL')
entry.update({
'timestamp': timestamp,
'thumbnail': thumbnail.replace('$recipe', 'raw') if thumbnail else None,
})
entries.append(entry)
if entries:
return self.playlist_result(entries, playlist_id, playlist_title, playlist_description)
@ -1012,56 +1206,96 @@ class BBCIE(BBCCoUkIE):
}
# bbc reel (e.g. https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness)
# playlist pages have a current video (first in the list), plus links to the other videos
initial_data = self._parse_json(self._html_search_regex(
r'<script[^>]+id=(["\'])initial-data\1[^>]+data-json=(["\'])(?P<json>(?:(?!\2).)+)',
webpage, 'initial data', default='{}', group='json'), playlist_id, fatal=False)
if initial_data:
init_data = try_get(
initial_data, lambda x: x['initData']['items'][0], dict) or {}
smp_data = init_data.get('smpData') or {}
clip_data = try_get(smp_data, lambda x: x['items'][0], dict) or {}
version_id = clip_data.get('versionID')
if version_id:
title = smp_data['title']
formats, subtitles = self._download_media_selector(version_id)
self._sort_formats(formats)
image_url = smp_data.get('holdingImageURL')
display_date = init_data.get('displayDate')
topic_title = init_data.get('topicTitle')
init_items = try_get(
initial_data, lambda x: x['initData']['items'], list) or []
# Reel pages may have an active video and a playlist as well
# If the URL implies playlist, let --no-playlist select the video
# If the URL implies video (includes a PID string other than 'playlist'),
# let --yes-playlist select the playlist
# If the URL has parameter vpid set in the query string, treat it as
# implying a video and find that exact versionID in the playlist
noplaylist = self._downloader.params.get('noplaylist')
qs = compat_urlparse.parse_qs(compat_urlparse.urlparse(url).query)
vpid = try_get(qs, lambda x: x['vpid'][0], compat_str)
single_pid = vpid or \
re.search(r'[/=](?!playlist\b)%s\b' % self._ID_REGEX, url)
if len(init_items) > 1:
if noplaylist and not single_pid:
self.to_screen('Downloading single video because of --no-playlist')
elif noplaylist is False and single_pid:
self.to_screen('Downloading playlist because of --yes-playlist')
if noplaylist is None:
noplaylist = single_pid
elif vpid and not noplaylist:
vpid = None
for item in init_items:
smp_data = try_get(item, lambda x: x['smpData'])
if not smp_data:
continue
entry = None
clip_data = try_get(smp_data, lambda x: x['items'][0], dict) or {}
version_id = clip_data.get('versionID')
if version_id:
if vpid and vpid != version_id:
continue
title = smp_data['title']
formats, subtitles = self._download_media_selector(version_id)
self._sort_formats(formats)
image_url = smp_data.get('holdingImageURL')
display_date = item.get('displayDate')
topic_title = item.get('topicTitle')
return {
'id': version_id,
'title': title,
'formats': formats,
'alt_title': init_data.get('shortTitle'),
'thumbnail': image_url.replace('$recipe', 'raw') if image_url else None,
'description': smp_data.get('summary') or init_data.get('shortSummary'),
'upload_date': display_date.replace('-', '') if display_date else None,
'subtitles': subtitles,
'duration': int_or_none(clip_data.get('duration')),
'categories': [topic_title] if topic_title else None,
}
entry = {
'id': version_id,
'title': title,
'formats': formats,
'alt_title': item.get('shortTitle'),
'thumbnail': image_url.replace('$recipe', 'raw') if image_url else None,
'description': smp_data.get('summary') or item.get('shortSummary'),
'upload_date': display_date.replace('-', '') if display_date else None,
'subtitles': subtitles,
'duration': int_or_none(clip_data.get('duration')),
'categories': [topic_title] if topic_title else None,
}
if entry:
if noplaylist:
return entry
entries.append(entry)
if entries:
initial_data = initial_data['initData']
title = initial_data.get('title')
description = initial_data.get('summary')
return self.playlist_result(entries, playlist_id, title, description)
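The --no-playlist/--yes-playlist selection rules described in the comments above boil down to "an explicit option wins, otherwise follow what the URL implies"; a simplified standalone sketch (not the extractor's exact code):

```python
def choose_noplaylist(noplaylist, single_pid):
    # noplaylist: True for --no-playlist, False for --yes-playlist,
    # None when neither option was given (the new default).
    # single_pid: truthy when the URL itself identifies a single video.
    if noplaylist is None:
        # No explicit option: follow what the URL implies.
        noplaylist = bool(single_pid)
    return noplaylist

print(choose_noplaylist(None, 'p01234567'))  # True: URL implies a video
```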
# Morph based embed (e.g. http://www.bbc.co.uk/sport/live/olympics/36895975)
# Several setPayload calls may be present, but the video
# always seems to relate to the first one
# Several setPayload calls may be present so pick the one with 'asset-data'
# or 'page-component-data'
# For Weather, use 'asset-with-media'
# For Bitesize, use 'guide-data'
morph_payload = self._parse_json(
self._search_regex(
r'Morph\.setPayload\([^,]+,\s*({.+?})\);',
r'Morph\.setPayload\s*\([^,]+-(?:asset-data|page-component-data|asset-with-media|guide-data)/[^,]+,\s*(\{.+[]}]\s*})\s*\)(?:\s*;\s*}\s*\))?\s*;\s*</script',
webpage, 'morph payload', default='{}'),
playlist_id, fatal=False)
if morph_payload:
# try for components
components = try_get(morph_payload, lambda x: x['body']['components'], list) or []
for component in components:
if not isinstance(component, dict):
continue
lead_media = try_get(component, lambda x: x['props']['leadMedia'], dict)
if not lead_media:
lead_media = try_get(component, lambda x: x['props']['supportingMedia'][0], dict)
if not lead_media:
continue
identifiers = lead_media.get('identifiers')
if not identifiers or not isinstance(identifiers, dict):
continue
programme_id = identifiers.get('vpid') or identifiers.get('playablePid')
programme_id = dict_get(identifiers, ('vpid', 'playablePid'))
if not programme_id:
continue
title = lead_media.get('title') or self._og_search_title(webpage)
@ -1085,6 +1319,233 @@ class BBCIE(BBCCoUkIE):
'formats': formats,
'subtitles': subtitles,
}
# another type (asset-data/)
body_media = try_get(morph_payload, lambda x: x['body'], dict) or {}
# check for variant but similar format found with Weather
# dict.values() is a view in Python 3, a list in Python 2
primary_video = try_get(body_media, lambda x: list(x['media']['videos']['primary'].values())[0], dict)
if primary_video:
body_media.update(primary_video)
programme_id = body_media.get('versionPid')
else:
# Bite-size
page_children = try_get(body_media, lambda x: x['chapterData']['page']['children'], list) or []
def chdata_extract_media(children):
for child in children:
type = try_get(child, lambda x: x['type'], compat_str)
if type != 'element':
continue
if child.get('name') == 'media':
return try_get(child, lambda x: x['attributes'], dict)
media = chdata_extract_media(child.get('children'))
if media:
return media
media = chdata_extract_media(page_children)
if media:
programme_id = media.get('vpid')
if programme_id:
body_media.update(media)
if not programme_id:
body_media.update(body_media.get('media') or {})
programme_id = body_media.get('pid')
if programme_id:
title = (body_media.get('title')
or self._og_search_title(webpage)
or self._html_search_meta('title', webpage))
formats, subtitles = self._download_media_selector(programme_id)
self._sort_formats(formats)
image_url = dict_get(body_media, ('holdingImageUrl', 'holdingImage'))
return {
'id': programme_id,
'title': title,
'formats': formats,
'subtitles': subtitles,
'thumbnail': re.sub(r'(\{width}xn|\$recipe)', 'raw', image_url) if image_url else None,
'duration': parse_duration(dict_get(body_media, ('duration', 'durationSeconds'))),
'description': (try_get(body_media, lambda x: x['promos']['summary'], compat_str)
or dict_get(body_media, ('summary', 'shortSynopsis'))
or self._html_search_meta('description', webpage)),
'timestamp': parse_iso8601(dict_get(body_media, ('dateTime', 'lastUpdated', 'lastModified'))),
}
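The recursive chapterData walk above (`chdata_extract_media`) can be exercised on a minimal tree; the shape below is a hypothetical mirror of the Bitesize structure, not captured site data:

```python
def find_media_attributes(children):
    # Depth-first search for the first element node named 'media',
    # returning its attributes dict (cf. chdata_extract_media above).
    for child in children or []:
        if not isinstance(child, dict) or child.get('type') != 'element':
            continue
        if child.get('name') == 'media':
            return child.get('attributes')
        found = find_media_attributes(child.get('children'))
        if found:
            return found

tree = [{'type': 'element', 'name': 'section', 'children': [
    {'type': 'element', 'name': 'media',
     'attributes': {'vpid': 'p01234567'}},
]}]
print(find_media_attributes(tree))  # {'vpid': 'p01234567'}
```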
# morph-based playlist (replaces playlist.sxml)
# a JS setPayload call with arg1 containing the playlist_id has JSON in arg2;
# deeply nested within it is our target string containing more JSON ...
morph_payload = self._parse_json(
self._search_regex(
r'Morph\.setPayload\s*\([^,]+%s%s%s[^,]+,\s*(\{.+[]}]\s*})\s*\)\s*;' % ('%2F', playlist_id, '%22%2CisStory%3Atrue'),
webpage, 'morph playlist payload', default='{}'),
playlist_id, fatal=False)
if morph_payload:
# looking for a string containing a JSON list
components = try_get(morph_payload, lambda x: x['body']['content']['article']['body'], compat_str) or '[]'
components = self._parse_json(components, playlist_id, fatal=False) or []
for component in components:
if component.get('name') != 'video':
continue
component = component.get('videoData') or {}
programme_id = dict_get(component, ('vpid', 'pid'))
if programme_id:
formats, subtitles = self._download_media_selector(programme_id)
if not formats:
continue
self._sort_formats(formats)
entries.append({
'id': programme_id,
'title': component.get('title', 'Unnamed clip %s' % programme_id),
'formats': formats,
'subtitles': subtitles,
'thumbnail': dict_get(component, ('iChefImage', 'image')),
'duration': parse_duration(component.get('duration')),
'description': component.get('caption'),
})
if entries:
return self.playlist_result(
entries,
playlist_id,
playlist_title,
playlist_description)
body_media = try_get(morph_payload, lambda x: x['body'], dict) or {}
body_media.update(body_media.get('media') or {})
programme_id = body_media.get('pid')
if programme_id:
title = (body_media.get('title')
or self._og_search_title(webpage)
or self._html_search_meta('title', webpage))
formats, subtitles = self._download_media_selector(programme_id)
self._sort_formats(formats)
image_url = body_media.get('holdingImageUrl')
return {
'id': programme_id,
'title': title,
'formats': formats,
'subtitles': subtitles,
'thumbnail': image_url.replace('{width}xn', 'raw') if image_url else None,
'duration': parse_duration(body_media.get('duration')),
'description': (try_get(body_media, lambda x: x['promos']['summary'], str)
or self._html_search_meta('description', webpage)),
'timestamp': parse_iso8601(body_media.get('dateTime')),
}
# morph-based playlist (replaces playlist.sxml?)
# a JS setPayload call with arg1 containing the playlist_id has JSON in arg2;
# deeply nested within it is our target string containing more JSON ...
morph_payload = self._parse_json(
self._search_regex(
r'Morph\.setPayload\s*\([^,]+%s%s%s[^,]+,\s*(\{.+[]}]\s*})\s*\)\s*;' % ('%2F', playlist_id, '%22%2CisStory%3Atrue'),
webpage, 'morph playlist payload', default='{}'),
playlist_id, fatal=False)
if morph_payload:
# looking for a string containing a JSON list
components = try_get(morph_payload, lambda x: x['body']['content']['article']['body'], compat_str) or '[]'
components = self._parse_json(components, playlist_id, fatal=False) or []
for component in components:
if component.get('name') != 'video':
continue
component = component.get('videoData') or {}
programme_id = dict_get(component, ('vpid', 'pid'))
if programme_id:
formats, subtitles = self._download_media_selector(programme_id)
if not formats:
continue
self._sort_formats(formats)
entries.append({
'id': programme_id,
'title': component.get('title', 'Unnamed clip %s' % programme_id),
'formats': formats,
'subtitles': subtitles,
'thumbnail': dict_get(component, ('iChefImage', 'image')),
'duration': parse_duration(component.get('duration')),
'description': component.get('caption'),
})
if entries:
return self.playlist_result(
entries,
playlist_id,
playlist_title,
playlist_description)
# simorgh-based playlist (see https://github.com/bbc/simorgh)
# JSON assigned to window.SIMORGH_DATA in a <script> element
simorgh_data = self._parse_json(
self._search_regex(
r'window\.SIMORGH_DATA\s*=\s*(\{[^<]+})\s*</',
webpage, 'simorgh playlist', default='{}'),
playlist_id, fatal=False)
# legacy media, video in promo object (eg, http://www.bbc.com/mundo/video_fotos/2015/06/150619_video_honduras_militares_hospitales_corrupcion_aw)
playlist = try_get(simorgh_data, lambda x: x['pageData']['promo']['media']['playlist']) or []
if playlist:
media = simorgh_data['pageData']['promo']
if media['media'].get('format') == 'video':
media.update(media['media'])
formats = []
keys = {'url', 'format', 'format_id', 'language', 'quality', 'tbr', 'resolution'}
for format in playlist:
if not (format.get('url') and format.get('format')):
continue
bitrate = format.pop('bitrate')
if bitrate:
bitrate = re.sub(r'000\s*$', 'kbps', bitrate)
format['tbr'] = parse_bitrate(bitrate)
format['language'] = media.get('language')
# format id: penultimate item from the url split on _ and .
(fmt,) = re.split('[_.]', format['url'])[-2:][:1]
format['format_id'] = '%s_%s' % (format['format'], fmt)
if not format.get('resolution'):
format['resolution'] = fmt
format['quality'] = -1
formats.append(dict((k, format[k]) for k in keys))
self._sort_formats(formats)
return {
'id': media.get('id'),
'title': (dict_get(media.get('headlines') or {},
('shortHeadline', 'headline'))
or playlist_title),
'description': media.get('summary') or playlist_description,
'formats': formats,
'subtitles': None,
'thumbnail': try_get(media, lambda x: x['image']['href']),
'timestamp': int_or_none(media.get('timestamp'), scale=1000)
}
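The "penultimate item" idiom above, `re.split('[_.]', url)[-2:][:1]`, is terse; a standalone sketch with a made-up media URL:

```python
import re

# Hypothetical URL of the shape the promo playlist uses:
# .../<name>_<tag>.<ext> -- the tag before the extension labels the format.
url = 'https://example.com/media/150619_honduras_clip_720p.mp4'

# re.split('[_.]', url) splits on both '_' and '.';
# [-2:] keeps the last two pieces (['720p', 'mp4']) and
# [:1] keeps the first of those, i.e. the penultimate piece overall.
(fmt,) = re.split('[_.]', url)[-2:][:1]
print(fmt)  # 720p
```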
# general case: media nested in content object
# test: https://www.bbc.co.uk/scotland/articles/cm49v4x1r9lo
if simorgh_data:
def extract_media_from_simorgh(model):
if not isinstance(model, dict):
return
for block in model.get('blocks') or []:
if block.get('type') == 'aresMediaMetadata':
vpid = try_get(block, lambda x: x['model']['versions'][0]['versionId'])
if vpid:
formats, subtitles = self._download_media_selector(vpid)
self._sort_formats(formats)
model = block['model']
version = model['versions'][0]
thumbnail = model.get('imageUrl')
return {
'id': vpid,
'title': model.get('title') or 'unnamed clip',
'description': dict_get(model.get('synopses') or {}, ('long', 'medium', 'short')),
'duration': (int_or_none(version.get('duration'))
or parse_duration(version.get('durationISO8601'))),
'timestamp': version.get('availableFrom'),
'thumbnail': urljoin(url, thumbnail.replace('$recipe', 'raw')) if thumbnail else None,
'formats': formats,
'subtitles': subtitles,
}
else:
entry = extract_media_from_simorgh(block.get('model'))
if entry:
return entry
playlist = extract_media_from_simorgh(try_get(simorgh_data, lambda x: x['pageData']['content']['model']))
if playlist:
return playlist
preload_state = self._parse_json(self._search_regex(
r'window\.__PRELOADED_STATE__\s*=\s*({.+?});', webpage,
@ -1162,6 +1623,7 @@ class BBCIE(BBCCoUkIE):
return self.playlist_result(
entries, playlist_id, playlist_title, playlist_description)
# eg, http://www.bbc.com/news/world-us-canada-34473351
initial_data = self._parse_json(self._search_regex(
r'window\.__INITIAL_DATA__\s*=\s*({.+?});', webpage,
'preload state', default='{}'), playlist_id, fatal=False)
@ -1176,7 +1638,11 @@ class BBCIE(BBCCoUkIE):
continue
formats, subtitles = self._download_media_selector(item_id)
self._sort_formats(formats)
item_desc = None
# make description by combining any .model.text strings in the .summary.blocks list
item_desc = ('\n\n'.join(filter(lambda x: x is not None,
map(lambda blk: try_get(blk, lambda x: x['model']['text'], compat_str),
try_get(media, lambda x: x['summary']['blocks'], list) or [])))
or None)
blocks = try_get(media, lambda x: x['summary']['blocks'], list)
if blocks:
summary = []
@ -1352,7 +1818,7 @@ class BBCCoUkPlaylistBaseIE(InfoExtractor):
if single_page:
return
next_page = self._search_regex(
r'<li[^>]+class=(["\'])pagination_+next\1[^>]*><a[^>]+href=(["\'])(?P<url>(?:(?!\2).)+)\2',
r'<li[^>]+class=(["\'])pagination_+next\1[^>]*>\s*<a[^>]+href=(["\'])(?P<url>(?:(?!\2).)+)\2',
webpage, 'next page url', default=None, group='url')
if not next_page:
break
@ -1360,6 +1826,13 @@ class BBCCoUkPlaylistBaseIE(InfoExtractor):
compat_urlparse.urljoin(url, next_page), playlist_id,
'Downloading page %d' % page_num, page_num)
def _extract_title_and_description(self, webpage):
title = (self._og_search_title(webpage, default=None)
or self._html_search_meta('title', webpage, display_name='playlist title', default='Unnamed playlist'))
description = (self._og_search_description(webpage, default=None)
or self._html_search_meta('description', webpage, default=None))
return title, description
def _real_extract(self, url):
playlist_id = self._match_id(url)
@ -1416,7 +1889,7 @@ class BBCCoUkIPlayerPlaylistBaseIE(InfoExtractor):
per_page = 36 if page else self._PAGE_SIZE
fetch_page = functools.partial(self._fetch_page, pid, per_page, series_id)
entries = fetch_page(int(page) - 1) if page else OnDemandPagedList(fetch_page, self._PAGE_SIZE)
playlist_data = self._get_playlist_data(self._call_api(pid, 1))
playlist_data = self._get_playlist_data(self._call_api(pid, 1) or {})
return self.playlist_result(
entries, pid, self._get_playlist_title(playlist_data),
self._get_description(playlist_data))
@ -1481,7 +1954,7 @@ class BBCCoUkIPlayerEpisodesIE(BBCCoUkIPlayerPlaylistBaseIE):
@staticmethod
def _get_elements(data):
return data['entities']['results']
return try_get(data, lambda x: x['entities']['results'], list)
@staticmethod
def _get_episode(element):
@ -1553,7 +2026,7 @@ class BBCCoUkIPlayerGroupIE(BBCCoUkIPlayerPlaylistBaseIE):
@staticmethod
def _get_elements(data):
return data['elements']
return try_get(data, lambda x: x['elements'], list)
@staticmethod
def _get_episode(element):
@ -1574,6 +2047,14 @@ class BBCCoUkIPlayerGroupIE(BBCCoUkIPlayerPlaylistBaseIE):
def _get_playlist_title(self, data):
return data.get('title')
def _extract_title_and_description(self, webpage):
title, description = super(BBCCoUkIPlayerGroupIE, self)._extract_title_and_description(webpage)
title = self._html_search_regex(r'<h1>([^<]+)</h1>', webpage, 'title', default=title)
description = self._html_search_regex(
r'<p[^>]+class=(["\'])subtitle\1[^>]*>(?P<value>[^<]+)</p>',
webpage, 'description', group='value', default=description)
return title, description
class BBCCoUkPlaylistIE(BBCCoUkPlaylistBaseIE):
IE_NAME = 'bbc.co.uk:playlist'
@ -1616,8 +2097,3 @@ class BBCCoUkPlaylistIE(BBCCoUkPlaylistBaseIE):
'url': 'http://www.bbc.co.uk/programmes/b055jkys/episodes/player',
'only_matching': True,
}]
def _extract_title_and_description(self, webpage):
title = self._og_search_title(webpage, fatal=False)
description = self._og_search_description(webpage)
return title, description

View File: youtube_dl/extractor/extractors.py

@ -1520,6 +1520,7 @@ from .wdr import (
WDRElefantIE,
WDRMobileIE,
)
from .webarchive import WebArchiveIE
from .webcaster import (
WebcasterIE,
WebcasterFeedIE,
@ -1610,7 +1611,7 @@ from .youtube import (
YoutubeRecommendedIE,
YoutubeSearchDateIE,
YoutubeSearchIE,
#YoutubeSearchURLIE,
YoutubeSearchURLIE,
YoutubeSubscriptionsIE,
YoutubeTruncatedIDIE,
YoutubeTruncatedURLIE,

View File: youtube_dl/extractor/itv.py

@ -15,6 +15,7 @@ from ..utils import (
merge_dicts,
parse_duration,
smuggle_url,
try_get,
url_or_none,
)
@ -23,15 +24,20 @@ class ITVIE(InfoExtractor):
_VALID_URL = r'https?://(?:www\.)?itv\.com/hub/[^/]+/(?P<id>[0-9a-zA-Z]+)'
_GEO_COUNTRIES = ['GB']
_TESTS = [{
'url': 'https://www.itv.com/hub/liar/2a4547a0012',
'url': 'https://www.itv.com/hub/the-durrells/2a4156a0001',
'info_dict': {
'id': '2a4547a0012',
'id': '2a4156a0001',
'ext': 'mp4',
'title': 'Liar - Series 2 - Episode 6',
'description': 'md5:d0f91536569dec79ea184f0a44cca089',
'series': 'Liar',
'season_number': 2,
'episode_number': 6,
'title': 'The Durrells - Series 1 - Episode 1',
'description': 'md5:43ae58e27aa91720fc933a68a37e9e95',
'series': 'The Durrells',
'season_number': 1,
'episode_number': 1,
'subtitles': {
'en': [
{'url': 'https://itvpnpsubtitles.content.itv.com/2-4156-0001-003/Subtitles/3/WebVTT-OUT-OF-BAND/2-4156-0001-003_Series1590486890_TX000000.vtt'}
]
},
},
'params': {
# m3u8 download
@ -87,13 +93,13 @@ class ITVIE(InfoExtractor):
},
'variantAvailability': {
'featureset': {
'min': ['hls', 'aes', 'outband-webvtt'],
'max': ['hls', 'aes', 'outband-webvtt']
'min': ['hls', 'aes'],
'max': ['hls', 'aes']
},
'platformTag': 'dotcom'
'platformTag': 'mobile'
}
}).encode(), headers=headers)
video_data = ios_playlist['Playlist']['Video']
video_data = try_get(ios_playlist, lambda x: x['Playlist']['Video'], dict) or {}
ios_base_url = video_data.get('Base')
formats = []
@ -114,8 +120,36 @@ class ITVIE(InfoExtractor):
})
self._sort_formats(formats)
subs_playlist = self._download_json(
ios_playlist_url, video_id, data=json.dumps({
'user': {
'itvUserId': '',
'entitlements': [],
'token': ''
},
'device': {
'manufacturer': 'Safari',
'model': '5',
'os': {
'name': 'Windows NT',
'version': '6.1',
'type': 'desktop'
}
},
'client': {
'version': '4.1',
'id': 'browser'
},
'variantAvailability': {
'featureset': {
'min': ['mpeg-dash', 'widevine', 'outband-webvtt'],
'max': ['mpeg-dash', 'widevine', 'outband-webvtt']
},
'platformTag': 'mobile'
}
}).encode(), headers=headers)
subs = try_get(subs_playlist, lambda x: x['Playlist']['Video']['Subtitles'], list) or []
subtitles = {}
subs = video_data.get('Subtitles') or []
for sub in subs:
if not isinstance(sub, dict):
continue

View File: youtube_dl/extractor/webarchive.py

@ -0,0 +1,54 @@
# coding: utf-8
from __future__ import unicode_literals
from .common import InfoExtractor
class WebArchiveIE(InfoExtractor):
_VALID_URL = r'https?://(?:www\.)?web\.archive\.org/web/([0-9]+)/https?://(?:www\.)?youtube\.com/watch\?v=(?P<id>[0-9A-Za-z_-]{1,11})$'
_TEST = {
'url': 'https://web.archive.org/web/20150415002341/https://www.youtube.com/watch?v=aYAGB11YrSs',
'md5': 'ec44dc1177ae37189a8606d4ca1113ae',
'info_dict': {
'url': 'https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/aYAGB11YrSs',
'id': 'aYAGB11YrSs',
'ext': 'mp4',
'title': 'Team Fortress 2 - Sandviches!',
'author': 'Zeurel',
}
}
def _real_extract(self, url):
# Get video ID and page
video_id = self._match_id(url)
webpage = self._download_webpage(url, video_id)
# Extract title and author
title = self._html_search_regex(r'<title>(.+?)</title>', webpage, 'title').strip()
author = self._html_search_regex(r'"author":"([a-zA-Z0-9]+)"', webpage, 'author').strip()
# Parse title
if title.endswith(' - YouTube'):
title = title[:-10]
# Use link translator mentioned in https://github.com/ytdl-org/youtube-dl/issues/13655
link_stub = "https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/"
# Extract hash from url
hash_idx = url.find("watch?v=") + len("watch?v=")
youtube_hash = url[hash_idx:]
# If there's an ampersand, cut off before it
ampersand = youtube_hash.find('&')
if ampersand != -1:
youtube_hash = youtube_hash[:ampersand]
# Recreate the fixed pattern url and return
reconstructed_url = link_stub + youtube_hash
return {
'url': reconstructed_url,
'id': video_id,
'title': title,
'author': author,
'ext': "mp4"
}
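The fixed-pattern reconstruction that `WebArchiveIE` performs can be shown in isolation; this sketch reuses the same link-translator stub (the helper name is ours):

```python
# The Wayback Machine's link translator (see ytdl-org#13655) serves the
# archived media for a YouTube video ID from this fixed prefix.
LINK_STUB = 'https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/'

def reconstruct_wayback_url(url):
    # Everything after 'watch?v=', with any trailing parameters dropped.
    youtube_hash = url.partition('watch?v=')[2].partition('&')[0]
    return LINK_STUB + youtube_hash

print(reconstruct_wayback_url(
    'https://web.archive.org/web/20150415002341/https://www.youtube.com/watch?v=aYAGB11YrSs'))
```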

View File: youtube_dl/extractor/youtube.py

@ -7,6 +7,7 @@ import json
import os.path
import random
import re
import string
import traceback
from .common import InfoExtractor, SearchInfoExtractor
@ -1478,8 +1479,13 @@ class YoutubeIE(YoutubeBaseInfoExtractor):
video_id = self._match_id(url)
base_url = self.http_scheme() + '//www.youtube.com/'
webpage_url = base_url + 'watch?v=' + video_id
# setting a random cookie helps to avoid http 429 errors
rnd1 = ''.join(random.choice(string.ascii_letters+string.digits) for i in range(11))
rnd2 = ''.join(random.choice(string.ascii_letters+string.digits) for i in range(11))
cookie = 'CONSENT=YES+cb.20210608-18-p0.de+FX+696; GPS=1; YSC='+rnd1+'; VISITOR_INFO1_LIVE='+rnd2+'; PREF=tz=Europe.London'
webpage = self._download_webpage(
webpage_url + '&bpctr=9999999999&has_verified=1', video_id, fatal=False)
webpage_url + '&bpctr=9999999999&has_verified=1', video_id, fatal=False, headers={'Cookie':cookie})
player_response = None
if webpage:
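The consent-cookie workaround above can be sketched standalone; the cookie template is the one from the diff, while the helper name is ours:

```python
import random
import string

def random_token(n=11):
    # Same shape as the YSC / VISITOR_INFO1_LIVE values above:
    # 11 random letters and digits.
    return ''.join(random.choice(string.ascii_letters + string.digits)
                   for _ in range(n))

# Pre-setting CONSENT plus randomised session identifiers mimics a browser
# that has already clicked through the consent interstitial, which is what
# sidesteps the HTTP 429 responses on the watch page.
cookie = ('CONSENT=YES+cb.20210608-18-p0.de+FX+696; GPS=1; YSC=' + random_token()
          + '; VISITOR_INFO1_LIVE=' + random_token() + '; PREF=tz=Europe.London')
print(cookie)
```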
@ -2002,7 +2008,7 @@ class YoutubeTabIE(YoutubeBaseInfoExtractor):
(?:
(?:channel|c|user|feed|hashtag)/|
(?:playlist|watch)\?.*?\blist=|
(?!(?:watch|embed|v|e)\b)
(?!(?:watch|embed|v|e|results)\b)
)
(?P<id>[^/?\#&]+)
'''
@ -3079,11 +3085,10 @@ class YoutubeSearchDateIE(YoutubeSearchIE):
_SEARCH_PARAMS = 'CAI%3D'
r"""
class YoutubeSearchURLIE(YoutubeSearchIE):
IE_DESC = 'YouTube.com search URLs'
IE_NAME = 'youtube:search_url'
_VALID_URL = r'https?://(?:www\.)?youtube\.com/results\?(.*?&)?(?:search_query|q)=(?P<query>[^&]+)(?:[&]|$)'
_VALID_URL = r'https?://(?:www\.)?youtube\.com/results\?(.*?&)?(?:search_query|q)=(?:[^&]+)(?:[&]|$)'
_TESTS = [{
'url': 'https://www.youtube.com/results?baz=bar&search_query=youtube-dl+test+video&filters=video&lclk=video',
'playlist_mincount': 5,
@ -3095,9 +3100,20 @@ class YoutubeSearchURLIE(YoutubeSearchIE):
'only_matching': True,
}]
@classmethod
def _make_valid_url(cls):
return cls._VALID_URL
def _real_extract(self, url):
qs = compat_parse_qs(compat_urllib_parse_urlparse(url).query)
query = (qs.get('search_query') or qs.get('q'))[0]
self._SEARCH_PARAMS = qs.get('sp', ('',))[0]
return self._get_n_results(query, self._MAX_RESULTS)
r"""
mobj = re.match(self._VALID_URL, url)
query = compat_urllib_parse_unquote_plus(mobj.group('query'))
# url_result(url, ie=None, video_id=None, video_title=None)
#_SEARCH_KEY='ytsearch'+ ()
webpage = self._download_webpage(url, query)
return self.playlist_result(self._process_page(webpage), playlist_title=query)
"""

View File: youtube_dl/options.py

@ -173,7 +173,7 @@ def parseOpts(overrideArguments=None):
'--ignore-config',
action='store_true',
help='Do not read configuration files. '
'When given in the global configuration file /etc/youtube-dl.conf: '
'When given in the global configuration file /mod/etc/youtube-dl.conf: '
'Do not read the user configuration in ~/.config/youtube-dl/config '
'(%APPDATA%/youtube-dl/config.txt on Windows)')
general.add_option(
@ -330,11 +330,11 @@ def parseOpts(overrideArguments=None):
))
selection.add_option(
'--no-playlist',
action='store_true', dest='noplaylist', default=False,
action='store_true', dest='noplaylist', default=None,
help='Download only the video, if the URL refers to a video and a playlist.')
selection.add_option(
'--yes-playlist',
action='store_false', dest='noplaylist', default=False,
action='store_false', dest='noplaylist', default=None,
help='Download the playlist, if the URL refers to a video and a playlist.')
selection.add_option(
'--age-limit',
@ -903,7 +903,7 @@ def parseOpts(overrideArguments=None):
elif '--ignore-config' in command_line_conf:
pass
else:
system_conf = _readOptions('/etc/youtube-dl.conf')
system_conf = _readOptions('/mod/etc/youtube-dl.conf')
if '--ignore-config' not in system_conf:
user_conf = _readUserConf()

View File: youtube_dl/utils.py

@ -1,4 +1,4 @@
#!/usr/bin/env python
#!/bin/env python
# coding: utf-8
from __future__ import unicode_literals
@ -2153,9 +2153,36 @@ def sanitize_url(url):
return re.sub(mistake, fixup, url)
return url
def extract_user_pass(url):
parts = compat_urlparse.urlsplit(url)
username = parts.username
password = parts.password
if username is not None:
if password is None:
password = ''
netloc = parts.hostname
if parts.port is not None:
netloc = parts.hostname + ':' + str(parts.port)
parts = parts._replace(netloc=netloc)
url = compat_urlparse.urlunsplit(parts)
return url, username, password
def sanitized_Request(url, *args, **kwargs):
return compat_urllib_request.Request(sanitize_url(url), *args, **kwargs)
url = sanitize_url(url)
url, username, password = extract_user_pass(url)
if username is not None:
# extract_user_pass() guarantees password is not None here
auth_payload = username + ':' + password
auth_payload = base64.b64encode(auth_payload.encode('utf-8')).decode('utf-8')
auth_header = 'Basic ' + auth_payload
if len(args) >= 2:
args[1]['Authorization'] = auth_header
else:
if 'headers' not in kwargs:
kwargs['headers'] = {}
kwargs['headers']['Authorization'] = auth_header
return compat_urllib_request.Request(url, *args, **kwargs)
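`extract_user_pass` plus the Authorization header construction above follow RFC 7617 Basic auth; a self-contained sketch using the Python 3 stdlib directly instead of the compat shims (the helper name is ours):

```python
import base64
from urllib.parse import urlsplit, urlunsplit

def split_user_pass(url):
    # Mirror extract_user_pass() above: pull user:pass out of the netloc
    # and return the credential-free URL alongside the credentials.
    parts = urlsplit(url)
    username, password = parts.username, parts.password
    if username is not None:
        if password is None:
            password = ''
        netloc = parts.hostname
        if parts.port is not None:
            netloc += ':%d' % parts.port
        url = urlunsplit(parts._replace(netloc=netloc))
    return url, username, password

clean_url, user, pw = split_user_pass('http://alice:secret@example.com/myvideo.mp4')
auth = 'Basic ' + base64.b64encode((user + ':' + pw).encode('utf-8')).decode('utf-8')
print(clean_url)  # http://example.com/myvideo.mp4
print(auth)       # Basic YWxpY2U6c2VjcmV0
```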
def expand_path(s):

View File: youtube_dl/version.py

@ -1,3 +1,3 @@
from __future__ import unicode_literals
__version__ = '2021.06.06'
__version__ = '2021.06.06.1'