
Kankids extractor #32825

Open · deepspy wants to merge 8 commits into master

Conversation

@deepspy commented Jun 24, 2024

Please follow the guide below

  • You will be asked some questions; please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like this [x])
  • Use the Preview tab to see how your pull request will actually look

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

Implementation of a new extractor for a new streaming site: kankids.org.il.

@deepspy marked this pull request as ready for review June 24, 2024 20:20
@dirkf (Contributor) commented Jul 3, 2024

Please link a Site Support Request for this site or paste the completed template here. Also, there are some more flake8 nits to clear.

@dirkf (Contributor) left a comment

Thanks for your work!

This is a good start and I've made some suggestions for how to complete the job.

    ]

    def _real_extract(self, url):
        m = super()._match_valid_url(url)
@dirkf (Contributor):

  1. As we don't hate Python 2 (youtube-dl still supports it), super() has to mention the class and instance: super(KanKidsIE, self), but...
  2. Just say this, since Python knows how to find the method to run from the class and its base classes:
Suggested change
-        m = super()._match_valid_url(url)
+        m = self._match_valid_url(url)

Comment on lines +7 to +8
CONTENT_DIR = r'/content/kids/'
DOMAIN = r'kankids.org.il'
@dirkf (Contributor):

  1. You don't need raw strings unless there's a \ in the string.
  2. Possibly CONTENT_DIR is only ever used when preceded by DOMAIN and these could be one constant?
  3. Possibly the constant(s) could be class variables rather than module variables (see the sketch below)?
  4. But maybe this will be irrelevant.
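For points 2. and 3., a minimal sketch of what a merged, class-level constant might look like (the attribute name is only illustrative, not from the PR):

from .common import InfoExtractor


class KanKidsIE(InfoExtractor):
    # hypothetical: a single class-level constant, not a raw string,
    # replacing the module-level CONTENT_DIR and DOMAIN
    _CONTENT_PREFIX = 'kankids.org.il/content/kids/'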


class KanKidsIE(InfoExtractor):
    _VALID_URL = r'https?://(?:www\.)?' +\
        DOMAIN.replace('.', '\\.') + CONTENT_DIR +\
@dirkf (Contributor):

There's a more general solution to this:

Suggested change
-        DOMAIN.replace('.', '\\.') + CONTENT_DIR +\
+        re.escape(DOMAIN + CONTENT_DIR) +\
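Combining the re.escape() suggestion with the class-level constant sketched above could give something like this; the tail of the pattern is a guess based on the group names the PR uses later, not its actual regex:

import re

from .common import InfoExtractor


class KanKidsIE(InfoExtractor):
    _CONTENT_PREFIX = 'kankids.org.il/content/kids/'  # as sketched above
    _VALID_URL = (
        r'https?://(?:www\.)?'
        + re.escape(_CONTENT_PREFIX)
        # hypothetical tail: the PR defines `category`, `season` and `id` groups
        + r'(?P<category>\w+)-main/(?P<id>[\w-]+)(?:/(?P<season>\w+))?/?'
    )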

Comment on lines +38 to +40
        series_id = m.group('id')
        category = m.group('category')
        playlist_season = m.group('season')
@dirkf (Contributor):

You can do this in one call:

Suggested change
-        series_id = m.group('id')
-        category = m.group('category')
-        playlist_season = m.group('season')
+        series_id, category, playlist_season = m.group('id', 'category', 'season')

Comment on lines +44 to +49
        title_pattern = r'<title>(?P<title>.+) \|'
        series_title = re.search(title_pattern, webpage)
        if not series_title:
            series_title = re.search(title_pattern[:-1] + r'-', webpage)
        if series_title:
            series_title = series_title.group('title')
@dirkf (Contributor):

  1. Use _search_regex(), or when matching displayed content rather than inside HTML tags, _html_search_regex().
  2. Then alternative patterns can be in a tuple or list, if, unlike this case, the alternates can't be expressed easily in the pattern.

Like so:

Suggested change
-        title_pattern = r'<title>(?P<title>.+) \|'
-        series_title = re.search(title_pattern, webpage)
-        if not series_title:
-            series_title = re.search(title_pattern[:-1] + r'-', webpage)
-        if series_title:
-            series_title = series_title.group('title')
+        series_title = self._html_search_regex(
+            r'<title\b[^>]*>([^>]+)(?:\s+[|-][^>]*)?</title', webpage, 'title')

I'll break that regex down using the x flag that allows comments and swallows insignificant white-space, in a multi-line string with '''s:

r'''(?x)
    <title\b[^>]*>   # allow the site to send attributes in the `title` tag
                     # `\b` matches only if the next character doesn't match `[a-zA-Z0-9_]`
                     # strictly we should exclude `-` too: `(?!-)\b`
    ([^>]+)          # `_html_search_regex()` finds group 1 by default
                     # match characters that are not (`^`) `<`: the tag content should end at `</title>`
                     # to skip everything after the first separator rather than the last, use `([^>]+?)`
                     # could allow for other embedded tags by using `(?:(?!</title)[\s\S])+` but:
                     # 1. the content of a `<title>` is just text
                     # 2. actually we (probably) never do that anyway
    (?:              # an unnumbered ("non-captured") group  
       \s+           # allow the site to send one or more white-space before any separator 
       [|-]          # separator is either `|` or `-`: inside `[...]`:
                     # 1. no need to escape special regex characters `|` (also `? . * + { } ( )`)
                     # 2. `-` must be first or last
       [^>]*         # again match not `>`
    )?               # optionally match group: maybe this title has no, or an unexpected, separator
    </title          # end of content
'''

But if/as the suffix removal should be a common pattern or routine, it would be better to extract the entire <title> content and then remove the suffix, perhaps with re.sub().
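A minimal sketch of that alternative; the strip_title_suffix() helper is hypothetical (not part of the PR or of youtube-dl), and the exact suffix pattern is only indicative:

import re


def strip_title_suffix(title):
    # drop a trailing ' | xxx' or ' - xxx' suffix, if present
    return re.sub(r'\s+[|-]\s*[^|-]*$', '', title or '')


# inside _real_extract():
#     series_title = strip_title_suffix(self._html_search_regex(
#         r'<title\b[^>]*>([^>]+)</title', webpage, 'title'))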


        season = playlist_season if playlist_season else r'(?P<season>\w+)'
        content_dir = CONTENT_DIR + category + r'-main/'
        playlist = set(re.findall(
@dirkf (Contributor):

set() is a good solution if you're concerned about duplicate URLs but won't help if there are different URLs for the same video (say, with different query parameters). In that case you have to extract a video ID for each URL and process the ID with set().
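For illustration, de-duplicating by ID rather than by URL could look like this fragment inside _real_extract(); candidate_urls is a hypothetical list of hrefs collected from the page, and KanKidsIE is the single-video extractor suggested further down:

seen_ids = set()
entries = []
for item_url in candidate_urls:  # hypothetical: hrefs found in the playlist page
    item_m = KanKidsIE._match_valid_url(item_url)
    if not item_m:
        continue
    video_id = item_m.group('id')
    if video_id in seen_ids:
        continue  # same video reached via a different URL (e.g. extra query parameters)
    seen_ids.add(video_id)
    entries.append(self.url_result(item_url, ie=KanKidsIE.ie_key(), video_id=video_id))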

DOMAIN = r'kankids.org.il'


class KanKidsIE(InfoExtractor):
@dirkf (Contributor):

You've written the KanKidsPlaylistIE extractor!

There should also be a KanKidsIE that extracts single video pages. At the moment I think this is relying on the ld+json extraction in GenericIE, but that's a bit clumsy.

If there were a support request issue I expect (my inference as a non-Hebrew speaker) that the requester would be asking to support your test playlist pages

And presumably also

And also, from those playlist pages, video pages like

I'll discuss the two cases in the next two posts.

@dirkf (Contributor):

Looking at the playlist pages, it seems that the video links are in anchor tags like this:

<a href="/content/kids/ktantanim-main/p-11732/s1/100026/" class="card-link" title="בית ספר לקוסמים | קורעים עיתונים | פרק 7">

The tactic for this sort of link is:

  1. find each tag
  2. pass the tag to extract_attributes() (so the order of the attributes isn't significant)
  3. urljoin() the page URL with the value of the href attribute
  4. skip the item if that URL is None
  5. extract any title
  6. pass the URL and any title to url_result()

But as the links belong to the site we can do better, like this:
  6a. check the URL with KanKidsIE._match_valid_url()
  6b. skip the item if no match (or just do 6. as before)
  6c. pass the URL with the ID from the match and KanKidsIE.ie_key() and any title to url_result()

At 1. we should have a utility function equivalent to JS document.querySelector(".card-link") but instead have to roll a pattern like

# match group 1
r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)'
# or as the `class` attribute value can be a space-separated list
r'(<a\s[^>]*\bclass\s*=\s*("|\')\s*(?:\S+\s+)*?card-link(?:\s+\S+)*\s*\2[^>]*>)'

At 5. maybe factor out a pattern or routine for removing unwanted suffixes. Putting steps 1.-6c. together might then look like the sketch below.
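A sketch of the whole tactic, assuming the split into KanKidsPlaylistIE and a single-video KanKidsIE discussed above; the class name, the _VALID_URL and the reliance on KanKidsIE are assumptions, not the PR's code:

import re

from .common import InfoExtractor
from ..utils import extract_attributes, urljoin


class KanKidsPlaylistIE(InfoExtractor):
    # hypothetical URL pattern for series/playlist pages
    _VALID_URL = r'https?://(?:www\.)?kankids\.org\.il/content/kids/(?P<category>\w+)-main/(?P<id>[\w-]+)/?'

    def _real_extract(self, url):
        m = self._match_valid_url(url)
        series_id = m.group('id')
        webpage = self._download_webpage(url, series_id)
        entries = []
        # 1. find each "card-link" anchor tag (pattern from above)
        for a_tag in re.findall(r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)', webpage):
            # 2. parse the attributes so their order doesn't matter
            attrs = extract_attributes(a_tag)
            # 3./4. resolve the href against the page URL and skip unusable items
            item_url = urljoin(url, attrs.get('href'))
            if not item_url:
                continue
            # 6a./6b. only keep links the single-video extractor recognises
            item_m = KanKidsIE._match_valid_url(item_url)
            if not item_m:
                continue
            # 5./6c. hand URL, ID and any title over to KanKidsIE
            entries.append(self.url_result(
                item_url, ie=KanKidsIE.ie_key(), video_id=item_m.group('id'),
                video_title=attrs.get('title')))
        return self.playlist_result(entries, series_id)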

@dirkf (Contributor):

Looking at the video pages, there is a <script> tag like this:

<script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "VideoObject",

                "name": "בית ספר לקוסמים | הקלף החתוך | פרק 36",
                "description": "דיקו הקוסם ויואב השוליה מלמדים איך מנבאים בדיוק היכן יבחר המתנדב לחתוך את הקלפים",
                "thumbnailUrl": "https://kan-media.kan.org.il/media/omhdwt20/2019-2-25-imgid-5344_b.jpeg",
                "uploadDate": "2023-04-22T22:05:32+03:00",
                "contentUrl": "https://cdnapisec.kaltura.com/p/2717431/sp/271743100/playManifest/entryId/1_en2re1iu/format/applehttp/protocol/https/desiredFileName.m3u8",
                "embedUrl": "https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/s1/100574/"
                }
</script>

The tactic for this is just to call self._search_json_ld(webpage, video_id, expected_type='VideoObject'). From the above JSON this will return a dict with url, timestamp, title, description, thumbnail set. To make a good return value just update with {'id': video_id} (a sketch follows the optional steps below).

Optionally

  1. apply the common suffix removal process to the title
  2. pass the url to determine_ext() and if it's 'm3u8' (as above) pass the popped url to _extract_m3u8_formats() and its resulting formats to _sort_formats(), and update the info-dict with {'formats': formats}
  3. update the info-dict with any other known attributes available in the page, maybe from this JS fragment:
        dataLayer.push({
            ...
            season: '1',
            episode_number: '36',
            episode_title: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
            genre_tags: '',
            item_duration: '0',
            program_type: 'טלויזיה',
            program_genre: 'קטנטנים',
            program_format: 'סדרה',
            article_type: '',
            page_name: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
            program_name: 'בית ספר לקוסמים',
            channel_name: 'חינוכית - קטנטנים'
            ...
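A sketch of a single-video KanKidsIE along those lines; the _VALID_URL and the m3u8 parameter choices are assumptions rather than anything from the PR:

from .common import InfoExtractor
from ..utils import determine_ext


class KanKidsIE(InfoExtractor):
    # hypothetical pattern for video pages like /content/kids/ktantanim-main/p-11732/s1/100574/
    _VALID_URL = r'https?://(?:www\.)?kankids\.org\.il/content/kids/\w+-main/[\w-]+/s\d+/(?P<id>\d+)/?'

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        # url, timestamp, title, description, thumbnail come from the ld+json block
        info = self._search_json_ld(webpage, video_id, expected_type='VideoObject')
        info['id'] = video_id
        # optional step 2.: expand an HLS manifest into individual formats
        if determine_ext(info.get('url')) == 'm3u8':
            formats = self._extract_m3u8_formats(
                info.pop('url'), video_id, ext='mp4',
                entry_protocol='m3u8_native', m3u8_id='hls', fatal=False)
            self._sort_formats(formats)
            info['formats'] = formats
        return info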

        season = playlist_season if playlist_season else r'(?P<season>\w+)'
        content_dir = CONTENT_DIR + category + r'-main/'
        playlist = set(re.findall(
            r'href="' + content_dir  # Content dir
@dirkf (Contributor):

See above for a preferable tactic here. Maybe a string format would have been better than concatenation, though.
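For the formatting point, something along these lines; the tail of the pattern is invented purely for illustration, the point being only the format string in place of the + concatenation:

playlist = set(re.findall(
    r'href="%s%s/([\w-]+)/"' % (content_dir, season),  # hypothetical tail after the season
    webpage))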
