Kankids extractor #32825
base: master
Conversation
Fixed a bug in matching non-`p-` playlist IDs.
Please link a Site Support Request for this site, or paste the completed template here. Also, there are some more flake8 nits to clear.
Thanks for your work!
This is a good start and I've made some suggestions for how to complete the job.
```python
    def _real_extract(self, url):
        m = super()._match_valid_url(url)
```
- As we don't hate Python 2, `super()` has to mention the class and instance: `super(KanKidsIE, self)`, but...
- Just say this, since Python knows how to find the method to run from the class and its base classes:
```diff
-        m = super()._match_valid_url(url)
+        m = self._match_valid_url(url)
```
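A quick illustration of the second point, using hypothetical stub classes rather than the real `InfoExtractor`: attribute lookup on `self` already walks the class and its bases, so no `super()` call is needed when the subclass doesn't override the method.

```python
class InfoExtractorStub(object):  # hypothetical stand-in for InfoExtractor
    def _match_valid_url(self, url):
        return 'matched: ' + url


class KanKidsStub(InfoExtractorStub):
    def _real_extract(self, url):
        # no super() needed: Python's method resolution order (MRO) finds
        # the inherited _match_valid_url() on the base class
        return self._match_valid_url(url)


print(KanKidsStub()._real_extract('https://www.kankids.org.il/'))
```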
```python
CONTENT_DIR = r'/content/kids/'
DOMAIN = r'kankids.org.il'
```
- You don't need raw strings unless there's a `\` in the string.
- Possibly `CONTENT_DIR` is only ever used when preceded by `DOMAIN`, and these could be one constant?
- Possibly the constant(s) could be class variables rather than module variables?
- But maybe this will be irrelevant.
```python
class KanKidsIE(InfoExtractor):
    _VALID_URL = r'https?://(?:www\.)?' +\
        DOMAIN.replace('.', '\\.') + CONTENT_DIR +\
```
There's a more general solution to this:
```diff
-        DOMAIN.replace('.', '\\.') + CONTENT_DIR +\
+        re.escape(DOMAIN + CONTENT_DIR) +\
```
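For reference, a standalone sketch (not PR code) of what `re.escape()` buys here: it backslash-escapes every regex metacharacter in its argument, so the literal dots in the domain can no longer match arbitrary characters.

```python
import re

DOMAIN = 'kankids.org.il'
CONTENT_DIR = '/content/kids/'

# escape the whole literal prefix in one call
prefix = re.escape(DOMAIN + CONTENT_DIR)
pattern = r'https?://(?:www\.)?' + prefix

assert re.match(pattern, 'https://www.kankids.org.il/content/kids/')
# the escaped dots only match a literal '.', not any character
assert not re.match(pattern, 'https://www.kankidsXorg.il/content/kids/')
```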
```python
        series_id = m.group('id')
        category = m.group('category')
        playlist_season = m.group('season')
```
You can do this in one call:
```diff
-        series_id = m.group('id')
-        category = m.group('category')
-        playlist_season = m.group('season')
+        series_id, category, playlist_season = m.group('id', 'category', 'season')
```
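That works because `match.group()` with several arguments returns a tuple, one item per group. A standalone sketch with a simplified pattern (not the PR's actual `_VALID_URL`):

```python
import re

m = re.match(
    r'/content/kids/(?P<category>\w+)-main/(?P<id>[\w-]+)/(?:(?P<season>s\d+)/)?',
    '/content/kids/ktantanim-main/p-11732/s1/')

# one call, three values: group() returns a tuple for multiple arguments
series_id, category, playlist_season = m.group('id', 'category', 'season')
print(series_id, category, playlist_season)  # p-11732 ktantanim s1
```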
```python
        title_pattern = r'<title>(?P<title>.+) \|'
        series_title = re.search(title_pattern, webpage)
        if not series_title:
            series_title = re.search(title_pattern[:-1] + r'-', webpage)
        if series_title:
            series_title = series_title.group('title')
```
- Use `_search_regex()`, or, when matching displayed content rather than inside HTML tags, `_html_search_regex()`.
- Then alternative patterns can be in a tuple or list if, unlike this case, the alternates can't be expressed easily in the pattern.

Like so:
```diff
-        title_pattern = r'<title>(?P<title>.+) \|'
-        series_title = re.search(title_pattern, webpage)
-        if not series_title:
-            series_title = re.search(title_pattern[:-1] + r'-', webpage)
-        if series_title:
-            series_title = series_title.group('title')
+        series_title = self._html_search_regex(
+            r'<title\b[^>]*>([^>]+)(?:\s+[|-][^>]*)?</title', webpage, 'title')
```
I'll break that regex down using the `x` flag, which allows comments and swallows insignificant white-space, in a multi-line `'''` string:
```python
r'''(?x)
    <title\b[^>]*>  # allow the site to send attributes in the `title` tag
                    # `\b` matches only if the next character doesn't match `[a-zA-Z0-9_]`
                    # strictly we should exclude `-` too: `(?!-)\b`
    ([^>]+)         # `_html_search_regex()` finds group 1 by default
                    # match characters that are not (`^`) `>`: the tag content should end at `</title>`
                    # to skip everything after the first separator rather than the last, use `([^>]+?)`
                    # could allow for other embedded tags by using `(?:(?!</title)[\s\S])+` but:
                    # 1. the content of a `<title>` is just text
                    # 2. actually we (probably) never do that anyway
    (?:             # an unnumbered ("non-captured") group
        \s+         # allow the site to send one or more white-space before any separator
        [|-]        # separator is either `|` or `-`: inside `[...]`:
                    # 1. no need to escape special regex characters `|` (also `? . * + { } ( )`)
                    # 2. `-` must be first or last
        [^>]*       # again match not `>`
    )?              # optionally match group: maybe this title has no, or an unexpected, separator
    </title         # end of content
'''
```
But if/as the suffix removal should be a common pattern or routine, it would be better to extract the entire `<title>` content and then remove the suffix, perhaps with `re.sub()`.
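A sketch of that approach (`strip_title_suffix` is a hypothetical helper name, not youtube-dl API): take the full `<title>` text and drop the final `" | …"` or `" - …"` part with `re.sub()`.

```python
import re


def strip_title_suffix(title):
    # remove the last " | suffix" or " - suffix" (typically the site name),
    # keeping any earlier separators that belong to the title itself
    return re.sub(r'\s+[|-][^|-]*$', '', title)


print(strip_title_suffix('Wizard School | Episode 7 | kankids'))
```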
```python
        season = playlist_season if playlist_season else r'(?P<season>\w+)'
        content_dir = CONTENT_DIR + category + r'-main/'
        playlist = set(re.findall(
```
`set()` is a good solution if you're concerned about duplicate URLs, but it won't help if there are different URLs for the same video (say, with different query parameters). In that case you have to extract a video ID for each URL and process the IDs with `set()`.
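A standalone sketch of deduplicating on the video ID instead of the raw URL (hypothetical markup and a simplified pattern):

```python
import re

webpage = '''
<a href="/content/kids/ktantanim-main/p-11732/s1/100026/">Ep 7</a>
<a href="/content/kids/ktantanim-main/p-11732/s1/100026/?autoplay=1">Ep 7 again</a>
<a href="/content/kids/ktantanim-main/p-11732/s1/100574/">Ep 36</a>
'''

# two distinct hrefs point at video 100026; keying the set on the captured
# numeric ID collapses them into one entry
video_ids = set(re.findall(r'href="/content/kids/[\w-]+/[\w-]+/s\d+/(\d+)/', webpage))
print(sorted(video_ids))  # ['100026', '100574']
```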
```python
DOMAIN = r'kankids.org.il'


class KanKidsIE(InfoExtractor):
```
You've written the `KanKidsPlaylistIE` extractor!

There should also be a `KanKidsIE` that extracts single video pages. At the moment I think this is relying on the ld+json extraction in `GenericIE`, but that's a bit clumsy.
If there were a support request issue I expect (my inference as a non-Hebrew speaker) that the requester would be asking to support your test playlist pages
- playlist: https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/
- series/season (?) page: https://www.kankids.org.il/content/kids/hinuchit-main/cramel_main/s1/
And presumably also
- series (?) page: https://www.kankids.org.il/content/kids/hinuchit-main/cramel_main/
And also, from those playlist pages, video pages like
- https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/s1/100574/
- https://www.kankids.org.il/content/kids/hinuchit-main/cramel_main/s1/136623/
I'll discuss the two cases in the next two posts.
Looking at the playlist pages, it seems that the video links are in anchor tags like this:
```html
<a href="/content/kids/ktantanim-main/p-11732/s1/100026/" class="card-link" title="בית ספר לקוסמים | קורעים עיתונים | פרק 7">
```
The tactic for this sort of link is:

1. find each tag
2. pass the tag to `extract_attributes()` (so the order of the attributes isn't significant)
3. `urljoin()` the page URL with the value of the `href` attribute
4. skip the item if that URL is `None`
5. extract any `title`
6. pass the URL and any `title` to `url_result()`

But as the links belong to the site we can do better, like this:

6a. check the URL with `KanKidsIE._match_valid_url()`
6b. skip the item if no match (or just do 6. as before)
6c. pass the URL, with the ID from the match and `KanKidsIE.ie_key()`, and any `title` to `url_result()`
At 1. we should have a utility function equivalent to JS `document.querySelector(".card-link")` but instead have to roll a pattern like:

```python
# match group 1
r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)'
# or, as the `class` attribute value can be a space-separated list
r'(<a\s[^>]*\bclass\s*=\s*("|\')\s*(?:\S+\s+)*?card-link(?:\s+\S+)*\s*\2[^>]*>)'
```
At 5. maybe factor out a pattern or routine for removing unwanted suffixes.
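A stdlib-only sketch of steps 1.-6. above (the real extractor would use `extract_attributes()`, `urljoin()` and `url_result()` from youtube-dl's utils; `parse_attrs` below is a crude hypothetical stand-in):

```python
import re
from urllib.parse import urljoin


def parse_attrs(tag):
    # crude stand-in for extract_attributes(): attribute order doesn't matter
    return dict(re.findall(r'([\w-]+)="([^"]*)"', tag))


def playlist_entries(page_url, webpage):
    # 1. find each tag
    for tag in re.findall(r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)', webpage):
        attrs = parse_attrs(tag)      # 2. order-insensitive attributes
        href = attrs.get('href')
        if not href:                  # 4. nothing to join: skip the item
            continue
        url = urljoin(page_url, href)  # 3. resolve against the page URL
        # 5./6. extract any title and yield what url_result() would be given
        yield {'url': url, 'title': attrs.get('title')}


page = '<a href="/content/kids/ktantanim-main/p-11732/s1/100026/" class="card-link" title="Ep 7">'
print(list(playlist_entries('https://www.kankids.org.il/', page)))
```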
Looking at the video pages, there is a `<script>` tag like this:
```html
<script type="application/ld+json">
{
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "בית ספר לקוסמים | הקלף החתוך | פרק 36",
    "description": "דיקו הקוסם ויואב השוליה מלמדים איך מנבאים בדיוק היכן יבחר המתנדב לחתוך את הקלפים",
    "thumbnailUrl": "https://kan-media.kan.org.il/media/omhdwt20/2019-2-25-imgid-5344_b.jpeg",
    "uploadDate": "2023-04-22T22:05:32+03:00",
    "contentUrl": "https://cdnapisec.kaltura.com/p/2717431/sp/271743100/playManifest/entryId/1_en2re1iu/format/applehttp/protocol/https/desiredFileName.m3u8",
    "embedUrl": "https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/s1/100574/"
}
</script>
```
The tactic for this is just to call `self._search_json_ld(webpage, video_id, expected_type='VideoObject')`. From the above JSON this will return a `dict` with `url`, `timestamp`, `title`, `description`, `thumbnail` set. To make a good return value, just update it with `{'id': video_id}`.

Optionally:

- apply the common suffix removal process to the `title`
- pass the `url` to `determine_ext()` and, if it's `'m3u8'` (as above), pass the popped `url` to `_extract_m3u8_formats()` and its resulting `formats` to `_sort_formats()`, then update the info-dict with `{'formats': formats}`
- update the info-dict with any other known attributes available in the page, maybe from this JS fragment:
```javascript
dataLayer.push({
    ...
    season: '1',
    episode_number: '36',
    episode_title: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
    genre_tags: '',
    item_duration: '0',
    program_type: 'טלויזיה',
    program_genre: 'קטנטנים',
    program_format: 'סדרה',
    article_type: '',
    page_name: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
    program_name: 'בית ספר לקוסמים',
    channel_name: 'חינוכית - קטנטנים'
    ...
```
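A stdlib-only sketch of the ld+json tactic (in the real extractor this is just one call to `self._search_json_ld(...)`; `extract_ld_info` is a hypothetical helper, and the regex is naive, assuming a flat JSON object as in the sample above):

```python
import json
import re


def extract_ld_info(webpage, video_id):
    # locate the ld+json <script> block and parse its JSON payload
    # (lazy {.*?} is naive: fine for flat JSON, wrong for nested braces)
    raw = re.search(
        r'<script type="application/ld\+json">\s*(\{.*?\})\s*</script>',
        webpage, re.DOTALL).group(1)
    ld = json.loads(raw)
    # map the schema.org VideoObject fields onto info-dict keys
    info = {
        'title': ld.get('name'),
        'description': ld.get('description'),
        'thumbnail': ld.get('thumbnailUrl'),
        'url': ld.get('contentUrl'),
    }
    # then add the ID to make a good return value
    info.update({'id': video_id})
    return info


page = '''<script type="application/ld+json">
{"@type": "VideoObject", "name": "Ep 36",
 "contentUrl": "https://example.com/x/master.m3u8"}
</script>'''
info = extract_ld_info(page, '100574')
print(info['id'], info['title'], info['url'])
```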
```python
        season = playlist_season if playlist_season else r'(?P<season>\w+)'
        content_dir = CONTENT_DIR + category + r'-main/'
        playlist = set(re.findall(
            r'href="' + content_dir  # Content dir
```
See above for a preferable tactic here. Maybe a string format would have been better than concatenation, though.
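For comparison, concatenation versus a `%` format string (youtube-dl supports Python 2, so `%` formatting rather than f-strings; the values are hypothetical):

```python
CONTENT_DIR = '/content/kids/'
category = 'ktantanim'

by_concat = CONTENT_DIR + category + '-main/'
by_format = '%s%s-main/' % (CONTENT_DIR, category)

# both build the same string; the format string reads as one template
assert by_concat == by_format
print(by_format)  # /content/kids/ktantanim-main/
```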
Please follow the guide below: put an x into all the boxes [ ] relevant to your pull request (like that [x]). Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
Implementation of a new extractor for a new streaming site: kankids.org.il.