[BUG] Google News link schema changed? #645

moehmeni · 2024-07-24T16:29:50Z

When decoding Google News URLs into the real article URLs, I get an error with the following code:

import base64
import re

# Some url encoding related constants
_ENCODED_URL_PREFIX = "https://news.google.com/rss/articles/"
_ENCODED_URL_PREFIX_WITH_CONSENT = (
    "https://consent.google.com/m?continue=https://news.google.com/rss/articles/"
)
_ENCODED_URL_RE = re.compile(
    rf"^{re.escape(_ENCODED_URL_PREFIX_WITH_CONSENT)}(?P<encoded_url>[^?]+)"
)
_ENCODED_URL_RE = re.compile(
    rf"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)"
)
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


def prepare_gnews_url(url):
    # There seems to be a case when we get a URL with consent.google.com
    # see https://github.com/ranahaani/GNews/issues/62
    # Also, the URL is directly decoded, no need to go through news.google.com

    match = _ENCODED_URL_RE.match(url)
    encoded_text = match.groupdict()["encoded_url"]
    # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    encoded_text += "==="
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)

    primary_url = match.groupdict()["primary_url"]
    primary_url = primary_url.decode()
    return primary_url


# Test the function
url = "https://news.google.com/rss/articles/CBMi2AFBVV95cUxQOHZlbFBOSXZDQTVDNWhibW9nMlUzaWpfbVRZaTNKMXd4VFNtQ2YxQWt2UmtDbHdia2xvbHZDMU03eXVabzFscDdMcHV4aGFnNW1zdU9zakVyaEFmMm1FVDVBRVotdktTbkJBOUFrT3dwNTY5bVNzZWRJQk1RT3l5SnBBeWdXS1laeVpwejQzN3luZjgwVjN0bFB5NkZSM2oxRXJ6Q0ItbDNMUDZJRTdEZXhjbUV1Z3NYMHdXV1hKV3N3YndWOVZjVE9uZlBGNkk0SS1mbTZ3b0Q?oc=5"
result = prepare_gnews_url(url)
print("Result:", result)

AttributeError: 'NoneType' object has no attribute 'groupdict'

I think Google changed the link format recently, since this was working before.
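One way to narrow this down is to look at the decoded payload directly. The sketch below is only a diagnostic, not a fix: `decode_gnews_payload` and `OLD_FORMAT_RE` are hypothetical names introduced here, and the regex is copied from the snippet above. For the URL in this report, the payload still starts with the `\x08\x13"` header, but what follows looks like an opaque token beginning with `AU_yqL` rather than an embedded `http...` URL, so the old pattern finds no match.

```python
import base64
import re

# Same pattern as _DECODED_URL_RE in the report above.
OLD_FORMAT_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


def decode_gnews_payload(url):
    """Return the raw bytes encoded in a news.google.com/rss/articles/ URL.

    Hypothetical helper for illustration; not part of GNews.
    """
    prefix = "https://news.google.com/rss/articles/"
    if not url.startswith(prefix):
        raise ValueError("not a Google News RSS article URL")
    # Drop the query string, keeping only the encoded article id.
    encoded = url[len(prefix):].split("?", 1)[0]
    # Pad to a multiple of four characters before decoding.
    encoded += "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded)


url = "https://news.google.com/rss/articles/CBMi2AFBVV95cUxQOHZlbFBOSXZDQTVDNWhibW9nMlUzaWpfbVRZaTNKMXd4VFNtQ2YxQWt2UmtDbHdia2xvbHZDMU03eXVabzFscDdMcHV4aGFnNW1zdU9zakVyaEFmMm1FVDVBRVotdktTbkJBOUFrT3dwNTY5bVNzZWRJQk1RT3l5SnBBeWdXS1laeVpwejQzN3luZjgwVjN0bFB5NkZSM2oxRXJ6Q0ItbDNMUDZJRTdEZXhjbUV1Z3NYMHdXV1hKV3N3YndWOVZjVE9uZlBGNkk0SS1mbTZ3b0Q?oc=5"
payload = decode_gnews_payload(url)
old_match = OLD_FORMAT_RE.match(payload)

print("First bytes:", payload[:11])
print("Old pattern matches:", old_match is not None)
```

Checking `old_match` for `None` before calling `groupdict()` would at least turn the `AttributeError` into an explicit "unsupported format" condition.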


ljiang22 · 2024-07-25T02:03:42Z

I got the same error. Not sure if Google changed anything. Does anyone have an idea about it?

bckenstler · 2024-07-25T04:10:41Z

Same

trentinrossi · 2024-08-01T18:10:48Z

Same issue

renatocaliari · 2024-08-05T11:24:55Z

Same here

Nyveon · 2024-08-06T13:58:58Z

It seems a solution was figured out here: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=4500912, and @Glyphosate69 wrote a Python version. Maybe this can replace the internal prepare_gnews_url function? It's working for our use case, at least.

Ronkiro · 2024-09-20T23:44:32Z

As I replied in codelucas/newspaper#1003 (comment), it's not really a solution: you may still hit some 429s and need to work around them. But I found no other option than to follow this method for now (or use the Custom Search API for it, which is kinda meh for the job).
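For the 429s mentioned above, one common workaround is to retry with exponential backoff. The following is a minimal sketch under stated assumptions: `fetch_with_backoff` is a hypothetical helper, and `fetch` stands for whatever callable performs the actual request and returns a `(status_code, body)` pair. Nothing here is specific to Google News or to the gist's method.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or attempts run out.

    fetch is any zero-argument callable returning (status_code, body).
    Hypothetical helper for illustration only.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... base_delay between attempts.
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the last attempt; hand back the 429.
    return status, body


# Offline demo: two simulated 429 responses, then a success.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` applies, and the delay values would need tuning against whatever limits are actually observed.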

moehmeni added the "bug" (Something isn't working) label on Jul 24, 2024
