Make the finder return only published documents #158

Open

octopusinvitro wants to merge 7 commits into master from scrape-unlinked-pages
Conversation

Contributor

@octopusinvitro octopusinvitro commented Apr 11, 2017

In order to show links to unlinked pages in the scraping view, we need to distinguish between published and unpublished posts. This is because posts/events that have the published field off won't be linked on the site. Info pages and summaries are always published, but they have to be linked manually on the site, so those that were not linked won't be scraped either.

At the moment, find_all doesn't make a distinction between documents with the published field on or off, which is a bug. To fix it, the class was modified so that find_all selects only the documents with published set to true, and a new method, find_unpublished, was added for the unpublished documents.

Finally, the find_all method was renamed to find_published.

After that, we can send all the unpublished and unlinked documents to the scraper view, since those are the documents that will not be linked anywhere on the site.
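For context, a minimal sketch of what the finder could look like after this change (method names as described above; create_documents and the published? predicate on documents are assumed from the diffs below):

# Only documents whose frontmatter has published set to true (or whose
# collection defaults to published) are returned here.
def find_published
  create_documents.select(&:published?)
end

# The unpublished documents, which the scraper route can send to the view.
def find_unpublished
  create_documents.reject(&:published?)
end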

Related issues

Closes #151

@octopusinvitro octopusinvitro requested a review from mhl April 11, 2017 17:33
@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch 3 times, most recently from 71e7098 to b452406 on April 18, 2017 at 15:17
Contributor

@mhl mhl left a comment


I've left some comments on the details of this pull request, but, following the guideline that one should do code review "outside-in", I was quite tempted to stop at the pull request description, because it's not completely clear to me.

There are two distinct concepts here:

  1. Whether a document is "published" in the sense that it has a published: true attribute in the frontmatter, or being published is the default for that collection of documents.
  2. Whether a page is linked to from somewhere on the site, so will be scraped by default.

When I mentioned the issue of unlinked-to pages not being scraped, I was specifically thinking about info pages, and not worrying about whether they're considered published or not. (i.e. I was thinking of point 2, not point 1).

On the other hand, you have clearly been thinking mostly about the "published" status of documents, point 1, and might have been conflating 1 and 2 somewhat, given that you added unpublished posts to the scraper start page described as "Unlinked". I think you've also spotted and fixed a bug to do with the published status of posts, in that /blog would previously have included unpublished posts, which is clearly a bug!

The question of whether unpublished posts should be linked to on the scraper start page is one that I hadn't thought about, but I guess you could argue it either way ("it might be convenient to see unpublished posts on the live site if you know the URL" vs "someone could guess the URL of an unpublished blog post"). The reason I particularly wanted the unlinked-to info pages to be scraped was to help with us testing the site. The same reasoning could apply to posts and events. And draft blog posts aren't really private anyway, since the GitHub repository is publicly browsable. Perhaps the best solution is to scrape the unpublished pages too, but make sure we let our partners know that drafts are possible for people to discover, so they should never put any sensitive information in them.

@@ -7,11 +7,32 @@
let(:finder) { Document::Finder.new(pattern: "#{filename}.md", baseurl: '/path/') }

it 'finds a single document' do
-    Dir.stub :glob, [filename] do
+    Dir.stub :glob, [new_tempfile('')] do
Contributor

This doesn't have to change, does it? If you want to change to using new_tempfile for all of these consistently, I'd make that a separate preparatory commit, and also change the remaining use of filename.

Contributor Author

@octopusinvitro octopusinvitro Apr 19, 2017


Ah, this is because things have changed now: when you call find_single you are also calling find_all. With the introduction of select(&:published?), find_all now has to look at the contents of the file, which it didn't have to before, so this test fails. But I wouldn't create a temp file for the other tests if it is not needed... 🤔 What do you think?
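For reference, a hypothetical sketch of a new_tempfile test helper along these lines; the real helper in the repo may differ:

require 'tempfile'

# Writes the given content to a temporary file and returns its path,
# so Dir.glob can be stubbed with real, readable files.
def new_tempfile(content)
  file = Tempfile.new('document')
  file.write(content)
  file.close
  file.path
end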

@@ -21,7 +21,7 @@ def find_or_empty
end

def find_all
-    filenames.map { |filename| create_document(filename) }
+    filenames.map { |filename| create_document(filename) }.select(&:published?)
Contributor

It's a bit confusing to me that this is called find_all while it's not returning all the documents, just those for which published? is true. Perhaps you could call this find_published instead? (I realise that we don't have published fields for some of the collections, but I think it's less confusing to assume that every document in those collections is published than to have something called find_all that doesn't return all documents.)

create_documents.select(&:published?)
end

def find_previews
Contributor

Calling this find_previews adds a surprising interpretation of what it returns, I think. They're not previews in any sense yet: it's still the full document. I'd call this find_unpublished.

@@ -33,6 +33,18 @@
end
end

it 'can find unpublished documents' do
Contributor

In terms of robustness of the tests, in this case even if you reversed the logic of find_previews, so that it only returned published documents, the tests would still pass. A simple improvement so that can't happen so easily is to stub it with [new_tempfile(published), new_tempfile(unpublished), new_tempfile(unpublished)], so that there are two unpublished documents and one published one, and assert that find_previews must return 2.

(This is a general point that I try to think about when writing tests: "How could this test still pass if I were making plausible but incorrect changes to the code under test, and how likely is someone to make such a change?" In this case, getting the logic of a predicate the wrong way round is a very easy and common mistake to make, so I'd try to make sure the test would fail in that case.)
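A sketch of that suggested test, assuming the new_tempfile helper; the exact frontmatter strings here are illustrative assumptions:

it 'can find unpublished documents' do
  published = "---\npublished: true\n---\n"
  unpublished = "---\npublished: false\n---\n"
  # Two unpublished documents and one published one, so reversed
  # predicate logic would return 1 and fail the assertion below.
  tempfiles = [new_tempfile(published), new_tempfile(unpublished), new_tempfile(unpublished)]
  Dir.stub :glob, tempfiles do
    finder.find_previews.count.must_equal 2
  end
end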

get '/scraper-start-page.html'
refute_empty(subject.xpath('//a[contains(@href, "/blog/")]'))
refute_empty(subject.xpath('//a[contains(@href, "/events/")]'))
refute_empty(subject.xpath('//a[contains(@href, "/info/") and not(contains(@href, "events"))][1]'))
Contributor

I don't think this tests what the test says it tests. e.g. any info page other than /info/events existing will allow the test to pass, not just unlinked-to ones.
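One way to tighten this might be to assert on the href of a known unlinked-to info page specifically (unlinked-page is a hypothetical slug here):

refute_empty(subject.xpath('//a[contains(@href, "/info/unlinked-page")]'))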

@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch 3 times, most recently from b044632 to 8575c96 on April 20, 2017 at 12:06
This method allows the scraper route to send the unpublished
documents to the view so that they can be scraped and pushed to
the static repo.

Documents that are unpublished are not linked anywhere on the
site. With this change, they can be loaded as a preview without
being explicitly linked from the site.

This means that all documents, linked or unlinked, are published.
There is a small nuance between "published" documents and "unlinked"
documents that may not be obvious to a newcomer to the codebase. So
an explanation was filed under the tribal knowledge section of the
contributing guide.
@coveralls

Coverage Status

Coverage increased (+0.003%) to 99.758% when pulling e7e0ad0 on scrape-unlinked-pages into ce9d5bd on master.

@theyworkforyou theyworkforyou deleted a comment from coveralls Jul 12, 2017
@octopusinvitro
Contributor Author

@mhl I rebased this branch as it was ready to be reviewed after addressing feedback 👍

Successfully merging this pull request may close these issues.

add info and event pages to the scraper start page