RFC: next-gen scraper interface #85
Replies: 9 comments 1 reply
-
Somewhat separate, but also worth noting from the example: part of this implementation would be the introduction of an XPath object that we'd encourage using over elem.xpath(...). This class is currently prototyped in the linked example (https://github.com/openstates/people/pull/303/files) and allows usage along the lines of the sketch below.
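A minimal sketch of the kind of helper being described; the method names (match, match_one), the min_items check, and the whitespace handling are assumptions for illustration rather than the prototype's actual API:

```python
class XPath:
    """Wrap an XPath expression so pages can declare their selectors up front.

    `element` is expected to be an lxml element (anything with an .xpath() method).
    """

    def __init__(self, expression, *, min_items=1):
        self.expression = expression
        self.min_items = min_items

    def match(self, element):
        # return all matches, failing loudly if the page changed shape
        results = element.xpath(self.expression)
        if len(results) < self.min_items:
            raise ValueError(
                f"{self.expression!r} matched {len(results)} items, "
                f"expected at least {self.min_items}"
            )
        # normalize whitespace on text results so callers don't need .strip()
        return [r.strip() if isinstance(r, str) else r for r in results]

    def match_one(self, element):
        # return exactly one match or raise
        results = self.match(element)
        if len(results) != 1:
            raise ValueError(
                f"{self.expression!r} matched {len(results)} items, expected exactly 1"
            )
        return results[0]
```

Usage might then look like `name = XPath("//h2/text()").match_one(doc)` instead of indexing into `doc.xpath(...)` and stripping the result by hand.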
I can imagine adding a few other niceties to this class; for example, it might prove desirable to have it handle whitespace better by default so that comparisons all over the place don't need .strip() calls appended.
-
Thanks for sharing this for feedback. I think your goal of fitting the unit of logic to a unit of input (a certain URL, a certain snippet of HTML) is a good idea, and I affirm the disadvantages you note with the status quo (long procedures are hard to debug, test, and verify).

In the Scrapy library a common pattern is the CrawlSpider: its inputs are a) a set of rules defining which URLs to follow from a given starting URL, and b) a method that processes any given URL (which could yield one or more objects). This is both pretty simple to start with and simple to test/debug as a unit (a single URL). What you're describing sounds similar.

I got lost in numbered item 3 ("If MDPersonList defines any subpage objects..."), so the above approach sounds perhaps more complicated (compared to Scrapy), but I may just not understand the "subpage object" concept... a concrete example of "MDPersonList.get_data() and MDPersonDetail.get_data() will be merged" might help. Like, would both yield a Bill object, and the final Bill object would be their merger?

The concern that occurs to me (assuming my read above is correct) is that it might not be clear what the inputs/outputs of each stage (page vs. subpage) are. It might be hard to reason about if the Page can yield a Person object with arbitrary properties defined and the Subpage can also yield a Person object with arbitrary properties defined: if I'm working with the Subpage, how do I know exactly what the responsibility of my code is without simultaneously keeping the Page in context? An alternative might be to define specific inputs from one to another (more along a directed acyclic graph type of thinking); pseudocode:
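A rough sketch of that idea; the PersonStub/Person dataclasses, the selectors, and the consumes/produces comments are illustrative placeholders rather than anything from the actual proposal:

```python
from dataclasses import dataclass


@dataclass
class PersonStub:
    """Everything the list page is responsible for producing."""

    name: str
    detail_url: str


@dataclass
class Person:
    """Everything the detail page is responsible for producing, given a stub."""

    name: str
    detail_url: str
    phone: str


class PersonListPage:
    # consumes: the parsed listing page; produces: PersonStub objects, nothing more
    def get_data(self, doc) -> list[PersonStub]:
        return [
            PersonStub(
                name=row.xpath("string(.//a)").strip(),
                detail_url=row.xpath("string(.//a/@href)"),
            )
            for row in doc.xpath("//table//tr[td]")  # illustrative selector
        ]


class PersonDetailPage:
    # consumes: the parsed detail page plus the PersonStub it came from
    # produces: exactly one fully-populated Person
    def get_data(self, doc, stub: PersonStub) -> Person:
        return Person(
            name=stub.name,
            detail_url=stub.detail_url,
            phone=doc.xpath("string(//*[@class='phone'])").strip(),  # illustrative
        )
```

The point of the explicit stub type is that someone working on PersonDetailPage can see its entire input contract without reading the list page's code.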
-
Thanks for the response & sorry for the delay - I've been thinking things through a bit more & playing with the working MD code to figure out what the right call is. Right now all pages return dictionaries, and the subpage dictionaries are merged into the parent dictionary. So PersonPage might yield {"name": "James Turk", "url": "https://example.com/jamesturk"}, PersonContactDetailsPage might yield {"phone": "202-555-1234"}, and then MDPersonScraper.to_object converts these dictionaries into the final Person object that is yielded. The reasoning I had in mind is that it makes testing a bit easier; I'd like to give people tools like the one sketched below, to make working on a particular portion of the scrape easier.
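Purely illustrative (the fixture path, the module import, and the assertion are assumptions), but roughly, a test against a saved page might look like:

```python
from pathlib import Path

import lxml.html

from md_people import PersonContactDetailsPage  # hypothetical module path


def test_person_contact_details_page():
    # run one page class against a saved HTML fixture, no network needed
    html = Path("tests/fixtures/jamesturk.html").read_text()
    doc = lxml.html.fromstring(html)
    data = PersonContactDetailsPage().get_data(doc)
    assert data == {"phone": "202-555-1234"}
```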
There are some trade-offs, as you note: notably not knowing the responsibility of each page, and the added step of writing the to_object function instead of passing around partially-completed objects. I'm rethinking a bit of this and will post more ideas soon, but wanted to at least provide an update.
-
Alright, I've been rethinking this a bit... the unwanted complexity in the last draft came from delegating to subpages; I think the trade-off of a single point of responsibility vs. ease of that kind of testing wasn't quite balanced. Outlining top-level goals:
With that in mind, I'm thinking the correct boundary is: a ListPage returns a list of additional pages to scrape, and nothing else. The trade-off in enforcing this kind of simplicity is that if the ListPage has useful details that are harder to scrape from the DetailPage, we aren't defining a way to capture them. That restriction is probably necessary to force DetailPage scrapers to be as complete as possible, which, as Jesse pointed out, makes understanding and testing them a lot easier (a rough sketch of the split is below). I need to update a few people scrapers to get legislators working, so I'm going to work with this pattern in the people repo. I'll update once I've written 2-3 of these.
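A minimal sketch of that split, reusing the MDPersonList / MDPersonDetail names from earlier in the thread; the selectors and constructor are placeholders, not the real implementation:

```python
class MDPersonList:
    # ListPage: only responsible for finding the detail pages to scrape
    url = "https://example.com/legislators"  # illustrative listing URL

    def get_data(self, doc):
        for href in doc.xpath("//table//td/a/@href"):  # illustrative selector
            yield MDPersonDetail(url=href)


class MDPersonDetail:
    # DetailPage: responsible for *all* of the data about one person
    def __init__(self, url):
        self.url = url

    def get_data(self, doc):
        return {
            "name": doc.xpath("string(//h2)").strip(),
            "phone": doc.xpath("string(//*[@class='phone'])").strip(),
            "url": self.url,
        }
```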
-
How do you see this model handling things like the CSV/FTP based scrapers?
This is somewhat orthogonal, but as we're thinking about this stuff I have a few pain points I think we should take a look at:
1. XPath can be a real bear for some use cases, and is definitely a barrier to entry these days. I think picking a selector library and standardizing on that as an option would be great (without dropping XPath, of course).
2. Same with a time and date parser lib; I feel like we're often writing awkward code to parse stuff like Zulu dates that would be better handled by a well-tested library. Also, one that can just try to human-parse without defining the format would be nice, especially for states that break it every once in a while, so you end up writing a 3-level try/catch because someone wrote "3 PM" instead of "3PM".
3. We could build in some standardized way to just say "here's the timezone for all dates on this object" rather than littering the code with self._tz.localize(...).
4. (This is less easy.) Every ASP.net page requires the same work of figuring out which crazy session request vars are needed, pulling those out, and writing a handler to post it all.
-
As for CSV/FTP based scrapers, the basic format would be the same, but they wouldn't inherit from HtmlPage/HtmlListPage; they'd instead inherit from either the generic Page or a CSVPage, etc., if one proves useful. The core of the interface here is pushing scrapers to split apart listing and detail collection more explicitly so that the entry points are callable separately. There may be cases where all of the data is collected from the list (in the case of a CSV), in which case there may not be a detail view. I'm still thinking through whether there are options there, but I think the status quo will remain possible at a minimum.
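A rough sketch of how that hierarchy might hang together; the method names (process_page, process_item) and the shape of the response object are assumptions for illustration, not the proposed API:

```python
import csv
import io

import lxml.html


class Page:
    """Generic base: turn a fetched response into data."""

    def get_data(self, response):
        raise NotImplementedError


class HtmlPage(Page):
    """HTML sources: subclasses work against a parsed lxml document."""

    def get_data(self, response):
        doc = lxml.html.fromstring(response.text)
        return self.process_page(doc)

    def process_page(self, doc):
        raise NotImplementedError


class CSVPage(Page):
    """CSV/FTP sources: subclasses work against parsed rows instead of HTML."""

    def get_data(self, response):
        rows = csv.DictReader(io.StringIO(response.text))
        return [self.process_item(row) for row in rows]

    def process_item(self, row):
        raise NotImplementedError
```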
Thanks for laying out the other pain points, current thoughts below:
1. Agree 100%. I was thinking of at least altering this to also have an option for CSS selectors; would there be others you'd want to support?
2. Same here. My general thinking on this kind of thing (also relates to #3) is that the scrape object can do a lot of the cleaning: scrapers can yield back messier data and there'll be field-specific cleaners that run before validation (party=R becomes party=Republican; dates are parsed according to a known set of rules, either from a third-party library or a predefined set of formats). See the sketch after this list.
3. I think the aforementioned cleaning pathway could include a localize step; where it gets the TZ from is an open question, but I 100% agree we should move this out of scrapers generally.
4. I’m generally on board with this one too, but haven’t thought about how it ties into these other changes yet.
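A minimal sketch of what field-specific cleaners could look like; the field names, formats, and cleaner registry here are assumptions for illustration, not a spec:

```python
from datetime import datetime

import pytz  # assumed tz library, matching the _tz.localize usage above

PARTIES = {"R": "Republican", "D": "Democratic", "I": "Independent"}
KNOWN_DATE_FORMATS = ["%m/%d/%Y %I:%M %p", "%m/%d/%Y %I %p", "%m/%d/%Y %I%p", "%m/%d/%Y"]


def clean_party(value: str) -> str:
    # expand abbreviations; pass through values that are already full names
    return PARTIES.get(value.strip(), value.strip())


def clean_datetime(value: str, tz_name: str = "US/Eastern") -> datetime:
    # try a predefined set of formats instead of ad-hoc try/except in each scraper,
    # then localize once here rather than littering scrapers with _tz.localize(...)
    tz = pytz.timezone(tz_name)
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return tz.localize(datetime.strptime(value.strip(), fmt))
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


CLEANERS = {"party": clean_party, "start_date": clean_datetime}


def clean(record: dict) -> dict:
    # run field-specific cleaners before validation; unknown fields pass through
    return {key: CLEANERS.get(key, lambda v: v)(value) for key, value in record.items()}
```

So `clean({"party": "R", "start_date": "12/21/2020 3 PM"})` would come back with a full party name and a timezone-aware datetime without the scraper doing either conversion itself.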
-
Cool, yeah the general architecture makes a lot of sense. I tend to go w/ the very naive
-
I have an updated version of this working that I'm somewhat happy with. A README is here: https://github.com/openstates/people/blob/spatula2/scrape/spatula/README.md and the PR with the implementation as well as two working examples (OK and MD people) is here: https://github.com/openstates/people/pull/371/files Both examples are IMO much easier to read than prior iterations, and I'd love to get some feedback. If folks like this, my plan is to extract this back out into the spatula library (or a successor), and new people scrapers will be the guinea pig since they just need to output YAML and are already in various states of disrepair. Then, when things are less crazy (mid-2021), I'll introduce the tooling into the bill scrapers repo, but there won't need to be a mass migration or anything like that; this is just a new framework for rewritten scrapers.
-
I meant to update earlier: I've pulled this back into the main spatula lib (https://spatula.readthedocs.io/en/latest/). There are a few edge cases I need to handle, and one very big question mark for Bill/Vote scrapers around how it'll handle returning multiple types of objects from the same stream, but the CLI features I have in there were really useful for writing people scrapers. My goal is to port the existing FL spatula bill scraper to the new format and then mostly stabilize the API; if folks have time, feedback on what's there would be really helpful.
-
I'd like to begin the process of moving the scrapers to a new interface that makes them easier to write, maintain, and test (yes, test). I've played around with this a bit with prior projects and introduced piecemeal bits of this, but I think that the current state of person/committee scrapers (mostly disabled, many not working) would be a good place to start, and the lessons learned there can be brought over to the Bill/Vote scrapers when ready.
To head off the biggest anticipated concern: I'm not proposing another move anywhere near as big as the billy->pupa migration; in fact, I'd like to do this in a way where individual scrapers can be ported. Initially they'll output backwards-compatible JSON, and only the way they're invoked will need to change when they're rewritten.
The main goals of this next iteration:
Non-goals:
Proposed 'Public' Interface (internal methods are hidden, a rough but working implementation of this exists at https://github.com/openstates/people/pull/303/files):
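Rather than restating the linked PR, here is a rough, non-authoritative sketch of the shape being proposed; the names echo ones used elsewhere in the thread, but the details are placeholders:

```python
class Page:
    """A unit of logic tied to a unit of input (one URL / one chunk of HTML)."""

    def __init__(self, url=None):
        self.url = url

    def get_data(self, doc):
        """Return this page's data as a dict, or yield further Page objects."""
        raise NotImplementedError


class ListPage(Page):
    """A page whose only job is yielding the additional pages to scrape."""


class Scraper:
    """Drives a start page, merges page/subpage dicts, and emits final objects."""

    start_page = None  # e.g. MDPersonList

    def to_object(self, data: dict):
        """Convert a merged dict into the final object (e.g. a Person)."""
        raise NotImplementedError
```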
Usage would look like this (assume an executable named os-scrape that can load defined subclasses of the above as needed):
This structure also makes a command like this possible:
This would instantiate an MDPersonDetail with the given URL, use a default scraper to fetch the URL, and return whatever data is obtained from that page. This makes it possible to work on scrapers that handle given pages from the command line, greatly speeding up the feedback loop. It might also be desirable to pass multiple URLs in, which would be possible as well.
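A sketch of roughly what that command might do internally; the os-scrape name comes from the proposal above, but the argument handling, class loading, and output format here are assumptions for illustration:

```python
# hypothetical: os-scrape md.people.MDPersonDetail https://example.com/jamesturk
import importlib
import json
import sys

import lxml.html
import requests


def run_page(class_path: str, *urls: str) -> None:
    # load the page class by dotted path, e.g. "md.people.MDPersonDetail"
    module_name, _, class_name = class_path.rpartition(".")
    page_cls = getattr(importlib.import_module(module_name), class_name)
    for url in urls:
        response = requests.get(url)
        doc = lxml.html.fromstring(response.text)
        data = page_cls(url=url).get_data(doc)  # assumed constructor / entry point
        json.dump(data, sys.stdout, indent=2, default=str)
        print()


if __name__ == "__main__":
    run_page(sys.argv[1], *sys.argv[2:])
```

Because the loop is per-URL, accepting multiple URLs (as mentioned above) falls out naturally.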
A few advantages that might not be apparent as well:
Open questions: