[BUG] Google News link schema changed? #645

Open
moehmeni opened this issue Jul 24, 2024 · 6 comments
Labels
bug Something isn't working

Comments

moehmeni commented Jul 24, 2024

When decoding Google News URLs into their real ones, I am getting an error:

import base64
import re

# Some url encoding related constants
_ENCODED_URL_PREFIX = "https://news.google.com/rss/articles/"
_ENCODED_URL_PREFIX_WITH_CONSENT = (
    "https://consent.google.com/m?continue=https://news.google.com/rss/articles/"
)
# Keep the two patterns under separate names so neither shadows the other.
_ENCODED_URL_WITH_CONSENT_RE = re.compile(
    rf"^{re.escape(_ENCODED_URL_PREFIX_WITH_CONSENT)}(?P<encoded_url>[^?]+)"
)
_ENCODED_URL_RE = re.compile(
    rf"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)"
)
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


def prepare_gnews_url(url):
    # There seems to be a case when we get a URL with consent.google.com
    # see https://github.com/ranahaani/GNews/issues/62
    # Also, the URL is directly decoded, no need to go through news.google.com

    # Accept both the plain prefix and the consent.google.com variant.
    match = _ENCODED_URL_RE.match(url) or _ENCODED_URL_WITH_CONSENT_RE.match(url)
    encoded_text = match.groupdict()["encoded_url"]
    # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    encoded_text += "==="
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)

    primary_url = match.groupdict()["primary_url"]
    primary_url = primary_url.decode()
    return primary_url


# Test the function
url = "https://news.google.com/rss/articles/CBMi2AFBVV95cUxQOHZlbFBOSXZDQTVDNWhibW9nMlUzaWpfbVRZaTNKMXd4VFNtQ2YxQWt2UmtDbHdia2xvbHZDMU03eXVabzFscDdMcHV4aGFnNW1zdU9zakVyaEFmMm1FVDVBRVotdktTbkJBOUFrT3dwNTY5bVNzZWRJQk1RT3l5SnBBeWdXS1laeVpwejQzN3luZjgwVjN0bFB5NkZSM2oxRXJ6Q0ItbDNMUDZJRTdEZXhjbUV1Z3NYMHdXV1hKV3N3YndWOVZjVE9uZlBGNkk0SS1mbTZ3b0Q?oc=5"
result = prepare_gnews_url(url)
print("Result:", result)

This raises:

AttributeError: 'NoneType' object has no attribute 'groupdict'

I think Google changed the link format recently; it was working before.
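
One way to check whether the payload format changed is to decode the article ID by hand and look at the bytes. The sketch below is a minimal diagnostic, not part of GNews; `inspect_gnews_payload` is a hypothetical helper name. It reuses the URL and the `_DECODED_URL_RE` pattern from the snippet above and reports whether the decoded bytes still contain a plain `http` URL.

```python
import base64
import re

# Pattern copied from the snippet above: the old payload embedded the article URL directly.
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


def inspect_gnews_payload(url):
    """Hypothetical diagnostic helper: decode the article ID and describe the payload."""
    # Take the last path segment and drop the query string (e.g. "?oc=5").
    encoded = url.rsplit("/", 1)[-1].split("?", 1)[0]
    # Pad to a multiple of 4 so urlsafe_b64decode accepts the input.
    encoded += "=" * (-len(encoded) % 4)
    decoded = base64.urlsafe_b64decode(encoded)
    return {
        "prefix": decoded[:11],
        "contains_http": b"http" in decoded,
        "old_regex_matches": _DECODED_URL_RE.match(decoded) is not None,
    }


url = "https://news.google.com/rss/articles/CBMi2AFBVV95cUxQOHZlbFBOSXZDQTVDNWhibW9nMlUzaWpfbVRZaTNKMXd4VFNtQ2YxQWt2UmtDbHdia2xvbHZDMU03eXVabzFscDdMcHV4aGFnNW1zdU9zakVyaEFmMm1FVDVBRVotdktTbkJBOUFrT3dwNTY5bVNzZWRJQk1RT3l5SnBBeWdXS1laeVpwejQzN3luZjgwVjN0bFB5NkZSM2oxRXJ6Q0ItbDNMUDZJRTdEZXhjbUV1Z3NYMHdXV1hKV3N3YndWOVZjVE9uZlBGNkk0SS1mbTZ3b0Q?oc=5"
info = inspect_gnews_payload(url)
print(info)
```

For this URL the payload still starts with the familiar `\x08\x13"` header, but what follows is an opaque `AU_yqL...` token rather than an `http...` string. That is why `_DECODED_URL_RE.match` returns `None` and `match.groupdict()` raises the `AttributeError`.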

@moehmeni moehmeni added the bug Something isn't working label Jul 24, 2024
@ljiang22

I got the same error. Not sure if Google changed anything. Does anyone have an idea about it?

@bckenstler

Same

@trentinrossi

Same issue

@renatocaliari

Same here

Nyveon commented Aug 6, 2024

It seems a solution was figured out here: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=4500912, and @Glyphosate69 wrote a Python version. Maybe this can replace the internal prepare_gnews_url function? It's working for our use case, at least.

Ronkiro commented Sep 20, 2024

As I replied in codelucas/newspaper#1003 (comment), it's not really a solution: you may still hit some 429s and need to work around them. But I found no other solution than to follow this method for now (or use the Custom Search API, which is kinda meh for the job).
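
Since the network-based method can be rate limited, a common way to work around occasional 429 responses is to retry with exponential backoff. The sketch below is generic and not tied to any Google endpoint; `fetch_with_backoff` and the `fetch` callback (returning a status code and a body) are hypothetical names, and the attempt count and delays are arbitrary.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it stops returning HTTP 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            # Not rate limited: return the body on success, None on other errors.
            return body if status == 200 else None
        if attempt < max_attempts - 1:
            # Double the wait after each rate-limited attempt.
            sleep(base_delay * 2 ** attempt)
    return None
```

Backoff only softens the problem: if the rate limit is strict, caching resolved URLs or lowering request volume will likely matter more than the retry policy.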

7 participants