Update time extraction code for all configured news sites #5

Open
wants to merge 16 commits into
base: main

paultcochrane

This PR updates the publication date/time extraction code for all sites defined in the extension's manifest file. In some cases it was necessary to extend the match patterns in the manifest, as the news sites have changed their URLs slightly. I've tried to document in each commit what I changed and why, so that individual changes can be cherry-picked if desired.

This PR is submitted in the hope that it is useful; if you want anything changed, I'll be more than happy to update and resubmit as necessary.

... because BBC uses bbc.com outside of the UK.
The publication date on BBC News is now found either in the
`datePublished` property of a JSON object stored in a `<script>`
element or, if that doesn't exist, in the `datetime` attribute of a
`<time>` element.  With this change the stale news warning works for
the BBC again.
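
The two-step lookup described above could be sketched roughly as follows. This is a minimal illustration rather than the code in the commit: the function name `extractBbcPublicationDate` is hypothetical, `doc` stands for the page's `document` inside a content script, and the `application/ld+json` script type is an assumption (the commit message only says the JSON lives in a `<script>` element).

```javascript
// Hypothetical helper: returns the raw publication date string, or null.
// `doc` is any object offering querySelector/querySelectorAll, such as the
// page's `document` in a content script.
function extractBbcPublicationDate(doc) {
  // Assumption: the JSON metadata sits in an application/ld+json script.
  const scripts = doc.querySelectorAll('script[type="application/ld+json"]');
  for (const script of scripts) {
    try {
      const data = JSON.parse(script.textContent);
      if (data && typeof data.datePublished === "string") {
        return data.datePublished;
      }
    } catch (e) {
      // Script content was not valid JSON; try the next one.
    }
  }
  // Fallback: the `datetime` attribute of a <time> element.
  const time = doc.querySelector("time[datetime]");
  return time ? time.getAttribute("datetime") : null;
}
```

In a content script this would be called as `extractBbcPublicationDate(document)`; returning the raw string leaves date parsing to the caller.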
Although CNN uses what seems to have become a standard across many news
sites for specifying the publication date/time (i.e. the `content`
attribute of the `<meta>` element with the `article:published_time`
property), this element doesn't seem to be available when the extension
is loaded.  However, CNN also provides a `<meta>` element with the name
`pubdate`, which stores the date information in its `content`
attribute.  This change gets the stale news warning working again on
CNN.
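
As a rough sketch of the `pubdate` lookup (not the commit's actual code): the function name `extractCnnPublicationDate` is hypothetical and `doc` again stands for the page's `document`. The validity check is an addition of this sketch, so that an unparseable value yields `null` instead of an invalid `Date`.

```javascript
// Hypothetical helper: reads <meta name="pubdate" content="..."> and
// returns a Date, or null when the element or a parseable value is missing.
function extractCnnPublicationDate(doc) {
  const meta = doc.querySelector('meta[name="pubdate"]');
  const content = meta ? meta.getAttribute("content") : null;
  if (!content) {
    return null;
  }
  const date = new Date(content);
  // An invalid Date has a NaN timestamp.
  return Number.isNaN(date.getTime()) ? null : date;
}
```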
The DailyMail uses what seems to have become a standard on news sites:
the `<meta>` element with the `article:published_time` property contains
the publication date/time data.
As with other news agencies, The Guardian is now using the
`article:published_time` meta property to store the publication
date/time.
... because they also use huffpost.com now.
As with many other news sites, the publication date/time is stored in
the `article:published_time` meta property.  This change allows the
publication date to be extracted; however, the extension still won't
work because the site prevents `alert()` calls from running.
As with many other news sites, the publication date/time is stored in
the `article:published_time` meta property.
As with many other news sites, the publication date/time is stored in
the `article:published_time` meta property.  However, sometimes this
isn't seen by the stale news warning extension, hence the
`datePublished` property of the page's JSON metadata (stored in a
`<script>` element) is used as a fall-back.

The weird thing with the India Times is that when a page is loaded for
the first time, the extension doesn't pick up any publication date/time
information, but on reload it *does*.  Odd.
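
The fall-back order described above (meta property first, JSON metadata second) might be expressed as a list of strategies tried in turn. This is only a sketch under stated assumptions: `firstValidDate`, `fromMetaProperty` and `fromScriptJson` are hypothetical names, and `doc` stands for the page's `document`.

```javascript
// Try each extractor in order; return the first value that parses to a
// valid Date, or null if none does.
function firstValidDate(doc, extractors) {
  for (const extract of extractors) {
    const raw = extract(doc);
    if (!raw) {
      continue;
    }
    const date = new Date(raw);
    if (!Number.isNaN(date.getTime())) {
      return date;
    }
  }
  return null;
}

// Strategy 1: the article:published_time meta property.
const fromMetaProperty = (doc) => {
  const meta = doc.querySelector('meta[property="article:published_time"]');
  return meta ? meta.getAttribute("content") : null;
};

// Strategy 2: datePublished in JSON metadata held in a <script> element.
const fromScriptJson = (doc) => {
  for (const script of doc.querySelectorAll("script")) {
    try {
      const data = JSON.parse(script.textContent);
      if (data && data.datePublished) {
        return data.datePublished;
      }
    } catch (e) {
      // Ordinary JavaScript rather than JSON; skip it.
    }
  }
  return null;
};
```

A call such as `firstValidDate(document, [fromMetaProperty, fromScriptJson])` keeps the ordering explicit, so adding or reordering strategies is a one-line change.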
The Times of India uses a similar technique to the BBC: the date is
stored in the `datePublished` property of a JSON object which is
embedded in a `<script>` element on the page.
As with many other news sites, the publication date/time is stored
within the page's `article:published_time` meta property.
... which uses the `datePublished` property of a JSON object provided
via a `<script>` in the page.  This is only a slight change over what
this extension's code used to do; the JSON is no longer in an element
with a well-defined id: one has to search through all `<script>` tags to
find the one which contains the relevant information.
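
Searching every `<script>` tag for the relevant JSON could look something like the sketch below. The function name `findDatePublished` is hypothetical, and the cheap substring check before parsing is a design choice of this sketch, not something the commit message describes. It also assumes `datePublished` is a top-level property of the JSON object.

```javascript
// Hypothetical helper: given the text content of every <script> on the
// page, return the first top-level `datePublished` value found, or null.
function findDatePublished(scriptTexts) {
  for (const text of scriptTexts) {
    // Skip scripts that cannot contain the property before trying to parse.
    if (typeof text !== "string" || !text.includes('"datePublished"')) {
      continue;
    }
    try {
      const data = JSON.parse(text);
      if (data && typeof data.datePublished === "string") {
        return data.datePublished;
      }
    } catch (e) {
      // Mentions datePublished but is not valid JSON; keep looking.
    }
  }
  return null;
}
```

In a content script the input could be gathered with `Array.from(document.scripts, (s) => s.textContent)`. Keeping the function free of DOM access makes it easy to exercise with plain strings.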
... which now use a `<time>` element with the `itemprop` attribute of
`dateCreated` (which is the publication date being looked for by the
stale news warning extension).  The date/time data is stored in the
`datetime` attribute.
... which puts the publication date/time info in the `<meta>` tag with
the name `DCSext.articleFirstPublished`.
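
The two attribute-based lookups in the preceding commit messages (a `<time>` element with `itemprop="dateCreated"`, and a `<meta>` element named `DCSext.articleFirstPublished`) differ only in selector and attribute, so they could be sketched as a small rule table. The rule keys, `ATTRIBUTE_RULES` and `extractByRule` are hypothetical names for illustration; the assumption that the `<meta>` value lives in its `content` attribute is also this sketch's, since the commit message does not say which attribute is read.

```javascript
// Hypothetical rule table: one selector/attribute pair per technique.
const ATTRIBUTE_RULES = {
  timeItemprop: {
    selector: 'time[itemprop="dateCreated"]',
    attribute: "datetime",
  },
  dcsextMeta: {
    selector: 'meta[name="DCSext.articleFirstPublished"]',
    attribute: "content", // assumed attribute, see the note above
  },
};

// Look up the element for the named rule and return the raw attribute
// value, or null when the rule or the element does not exist.
function extractByRule(doc, ruleName) {
  const rule = ATTRIBUTE_RULES[ruleName];
  if (!rule) {
    return null;
  }
  const element = doc.querySelector(rule.selector);
  return element ? element.getAttribute(rule.attribute) : null;
}
```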
... which puts the publication date/time in the `datePublished` element
of a JSON object embedded within a `<script>` element.

Unfortunately, even though the publication date/time is extracted
correctly, the extension can't show it because `alert()`s are forbidden
on Yahoo News.
... which puts the publication date/time in the `datePublished` element
of a JSON object embedded within a `<script>` element.

Unfortunately, even though the publication date/time is extracted
correctly, the extension can't show it because `alert()`s are forbidden
on Yahoo News.