RFC: next-gen scraper interface #85
Replies: 9 comments 1 reply
-
Somewhat separate, but also worth noting from the example: part of this implementation would be the introduction of an XPath object that we'd encourage using over elem.xpath(...). This class is currently prototyped in the linked example (https://github.com/openstates/people/pull/303/files) and allows usage along the lines of the sketch below.
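A minimal sketch of the kind of helper being described; the method names (match, match_one), the min_items check, and the whitespace handling are assumptions for illustration rather than the prototype's actual API:

```python
class XPath:
    """Wrap an XPath expression so pages can declare their selectors up front.

    `element` is expected to be an lxml element (anything with an .xpath() method).
    """

    def __init__(self, expression, *, min_items=1):
        self.expression = expression
        self.min_items = min_items

    def match(self, element):
        # return all matches, failing loudly if the page changed shape
        results = element.xpath(self.expression)
        if len(results) < self.min_items:
            raise ValueError(
                f"{self.expression!r} matched {len(results)} items, "
                f"expected at least {self.min_items}"
            )
        # normalize whitespace on text results so callers don't need .strip()
        return [r.strip() if isinstance(r, str) else r for r in results]

    def match_one(self, element):
        # return exactly one match or raise
        results = self.match(element)
        if len(results) != 1:
            raise ValueError(
                f"{self.expression!r} matched {len(results)} items, expected exactly 1"
            )
        return results[0]
```

Usage might then look like `name = XPath("//h2/text()").match_one(doc)` instead of indexing into `doc.xpath(...)` and stripping the result by hand.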
I can imagine adding a few other niceties to this class; for example, it might prove desirable to have it handle whitespace better by default so that comparisons all over the place don't need .strip() calls appended.
-
Thanks for sharing this for feedback. I think your goal of fitting the unit of logic to a unit of input (a certain URL, a certain snippet of HTML) is a good idea, and I affirm the disadvantages you note with the status quo (long procedures are hard to debug, test, and verify).

In the Scrapy library a common pattern is the CrawlSpider: its inputs are a) a set of rules defining which URLs to follow from a given starting URL, and b) a method that processes any given URL (which could yield one or more objects). This is both pretty simple to start with and simple to test/debug as a unit (a single URL). What you're describing sounds similar.

I got lost in numbered item 3 ("If MDPersonList defines any subpage objects..."), so the above approach sounds perhaps more complicated (compared to Scrapy), but I may just not understand the "subpage object" concept... a concrete example of "MDPersonList.get_data() and MDPersonDetail.get_data() will be merged" might help. Like, would both yield a Bill object, and the final Bill object would be their merger?

The concern that occurs to me (assuming my read above is correct) is that it might not be clear what the inputs/outputs of each stage (page vs. subpage) are. It might be hard to reason about if the Page can yield a Person object with arbitrary properties defined and the Subpage can also yield a Person object with arbitrary properties defined: if I'm working with the Subpage, how do I know exactly what the responsibility of my code is without simultaneously keeping the Page in context? An alternative might be to define specific inputs from one to another (more along a directed acyclic graph type of thinking); pseudocode:
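A rough sketch of that idea; the PersonStub/Person dataclasses, the selectors, and the consumes/produces comments are illustrative placeholders rather than anything from the actual proposal:

```python
from dataclasses import dataclass


@dataclass
class PersonStub:
    """Everything the list page is responsible for producing."""

    name: str
    detail_url: str


@dataclass
class Person:
    """Everything the detail page is responsible for producing, given a stub."""

    name: str
    detail_url: str
    phone: str


class PersonListPage:
    # consumes: the parsed listing page; produces: PersonStub objects, nothing more
    def get_data(self, doc) -> list[PersonStub]:
        return [
            PersonStub(
                name=row.xpath("string(.//a)").strip(),
                detail_url=row.xpath("string(.//a/@href)"),
            )
            for row in doc.xpath("//table//tr[td]")  # illustrative selector
        ]


class PersonDetailPage:
    # consumes: the parsed detail page plus the PersonStub it came from
    # produces: exactly one fully-populated Person
    def get_data(self, doc, stub: PersonStub) -> Person:
        return Person(
            name=stub.name,
            detail_url=stub.detail_url,
            phone=doc.xpath("string(//*[@class='phone'])").strip(),  # illustrative
        )
```

The point of the explicit stub type is that someone working on PersonDetailPage can see its entire input contract without reading the list page's code.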
-
Thanks for the response & sorry for the delay - I've been thinking things through a bit more & playing with the working MD code to figure out what the right call is. Right now all pages return dictionaries, and the subpage dictionaries are merged into the parent dictionary. So PersonPage might yield {"name": "James Turk", "url": "https://example.com/jamesturk"}, PersonContactDetailsPage might yield {"phone": "202-555-1234"}, and then MDPersonScraper.to_object converts these dictionaries into the final Person object that is yielded. The reasoning I had in mind is that it makes testing a bit easier; I'd like to give people tools like the one sketched below, to make working on a particular portion of the scrape easier.
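Purely illustrative (the fixture path, the module import, and the assertion are assumptions), but roughly, a test against a saved page might look like:

```python
from pathlib import Path

import lxml.html

from md_people import PersonContactDetailsPage  # hypothetical module path


def test_person_contact_details_page():
    # run one page class against a saved HTML fixture, no network needed
    html = Path("tests/fixtures/jamesturk.html").read_text()
    doc = lxml.html.fromstring(html)
    data = PersonContactDetailsPage().get_data(doc)
    assert data == {"phone": "202-555-1234"}
```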
There are some trade-offs, as you note: notably not knowing the responsibility of each page, and the added step of writing the to_object function instead of passing around partially-completed objects. I'm rethinking a bit of this and will post more ideas soon, but wanted to at least provide an update.
-
Alright, I've been rethinking this a bit... the unwanted complexity in the last draft came from delegating to subpages; I think the trade-off of a single point of responsibility vs. ease of that kind of testing wasn't quite balanced. Outlining top-level goals:
With that in mind, I'm thinking the correct boundary is: a ListPage returns a list of additional pages to scrape, and nothing else. The trade-off in enforcing this kind of simplicity is that if the ListPage has useful details that are harder to scrape from the DetailPage, we aren't defining a way to capture them. That restriction is probably necessary to force DetailPage scrapers to be as complete as possible, which, as Jesse pointed out, makes understanding and testing them a lot easier (a rough sketch of the split is below). I need to update a few people scrapers to get legislators working, so I'm going to work with this pattern in the people repo. I'll update once I've written 2-3 of these.
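A minimal sketch of that split, reusing the MDPersonList / MDPersonDetail names from earlier in the thread; the selectors and constructor are placeholders, not the real implementation:

```python
class MDPersonList:
    # ListPage: only responsible for finding the detail pages to scrape
    url = "https://example.com/legislators"  # illustrative listing URL

    def get_data(self, doc):
        for href in doc.xpath("//table//td/a/@href"):  # illustrative selector
            yield MDPersonDetail(url=href)


class MDPersonDetail:
    # DetailPage: responsible for *all* of the data about one person
    def __init__(self, url):
        self.url = url

    def get_data(self, doc):
        return {
            "name": doc.xpath("string(//h2)").strip(),
            "phone": doc.xpath("string(//*[@class='phone'])").strip(),
            "url": self.url,
        }
```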
-
How do you see this model handling things like the CSV/FTP based scrapers?
This is somewhat orthogonal, but as we're thinking about this stuff I have a few pain points I think we should take a look at:
1. XPath can be a real bear for some use cases, and is definitely a barrier to entry these days. I think picking a selector library and standardizing on that as an option would be great (without dropping XPath, of course).
2. Same with a time and date parser lib; I feel like we're often writing awkward code to parse stuff like Zulu dates that would be better handled by a well-tested library. Also, one that can just try to human-parse without defining the format would be nice, especially for states that break it every once in a while, so you end up writing a 3-level try/catch because someone wrote "3 PM" instead of "3PM".
3. We could build in some standardized way to just say "here's the timezone for all dates on this object" rather than littering the code with self._tz.localize(...).
4. (This is less easy.) Every ASP.net page requires the same work of figuring out which crazy session request vars are needed, pulling those out, and writing a handler to post it all.
-
As for CSV/FTP based scrapers, the basic format would be the same, but they wouldn't inherit from HtmlPage/HtmlListPage; they'd instead inherit from either the generic Page or a CSVPage, etc., if one proves useful. The core of the interface here is pushing scrapers to split apart listing and detail collection more explicitly so that the entry points are callable separately. There may be cases where all of the data is collected from the list (in the case of a CSV), in which case there may not be a detail view. I'm still thinking through whether there are options there, but I think the status quo will remain possible at a minimum.
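A rough sketch of how that hierarchy might hang together; the method names (process_page, process_item) and the shape of the response object are assumptions for illustration, not the proposed API:

```python
import csv
import io

import lxml.html


class Page:
    """Generic base: turn a fetched response into data."""

    def get_data(self, response):
        raise NotImplementedError


class HtmlPage(Page):
    """HTML sources: subclasses work against a parsed lxml document."""

    def get_data(self, response):
        doc = lxml.html.fromstring(response.text)
        return self.process_page(doc)

    def process_page(self, doc):
        raise NotImplementedError


class CSVPage(Page):
    """CSV/FTP sources: subclasses work against parsed rows instead of HTML."""

    def get_data(self, response):
        rows = csv.DictReader(io.StringIO(response.text))
        return [self.process_item(row) for row in rows]

    def process_item(self, row):
        raise NotImplementedError
```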
Thanks for laying out the other pain points, current thoughts below:
1. Agree 100%. I was thinking of at least altering this to also have an option for CSS selectors; would there be others you'd want to support?
2. Same here. My general thinking on this kind of thing (also relates to #3) is that the scrape object can do a lot of the cleaning: scrapers can yield back messier data and there'll be field-specific cleaners that run before validation (party=R becomes party=Republican; dates are parsed according to a known set of rules, either from a third-party library or a predefined set of formats). See the sketch after this list.
3. I think the aforementioned cleaning pathway could include a localize step; where it gets the TZ from is an open question, but I 100% agree we should move this out of scrapers generally.
4. I’m generally on board with this one too, but haven’t thought about how it ties into these other changes yet.
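A minimal sketch of what field-specific cleaners could look like; the field names, formats, and cleaner registry here are assumptions for illustration, not a spec:

```python
from datetime import datetime

import pytz  # assumed tz library, matching the _tz.localize usage above

PARTIES = {"R": "Republican", "D": "Democratic", "I": "Independent"}
KNOWN_DATE_FORMATS = ["%m/%d/%Y %I:%M %p", "%m/%d/%Y %I %p", "%m/%d/%Y %I%p", "%m/%d/%Y"]


def clean_party(value: str) -> str:
    # expand abbreviations; pass through values that are already full names
    return PARTIES.get(value.strip(), value.strip())


def clean_datetime(value: str, tz_name: str = "US/Eastern") -> datetime:
    # try a predefined set of formats instead of ad-hoc try/except in each scraper,
    # then localize once here rather than littering scrapers with _tz.localize(...)
    tz = pytz.timezone(tz_name)
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return tz.localize(datetime.strptime(value.strip(), fmt))
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


CLEANERS = {"party": clean_party, "start_date": clean_datetime}


def clean(record: dict) -> dict:
    # run field-specific cleaners before validation; unknown fields pass through
    return {key: CLEANERS.get(key, lambda v: v)(value) for key, value in record.items()}
```

So `clean({"party": "R", "start_date": "12/21/2020 3 PM"})` would come back with a full party name and a timezone-aware datetime without the scraper doing either conversion itself.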
-
Cool, yeah the general architecture makes a lot of sense. I tend to go w/ the very naive
-
I have an updated version of this working that I'm somewhat happy with. A README is here: https://github.com/openstates/people/blob/spatula2/scrape/spatula/README.md and the PR with the implementation as well as two working examples (OK and MD people) is here: https://github.com/openstates/people/pull/371/files Both examples are IMO much easier to read than prior iterations, and I'd love to get some feedback. If folks like this, my plan is to extract this back out into the spatula library (or a successor), and new people scrapers will be the guinea pig since they just need to output YAML and are already in various states of disrepair. Then, when things are less crazy (mid-2021), I'll introduce the tooling into the bill scrapers repo, but there won't need to be a mass migration or anything like that; this is just a new framework for rewritten scrapers.
-
I meant to update earlier: I've pulled this back into the main spatula lib (https://spatula.readthedocs.io/en/latest/). There are a few edge cases I need to handle, and one very big question mark for Bill/Vote scrapers around how it'll handle returning multiple types of objects from the same stream, but the CLI features I have in there were really useful for writing people scrapers. My goal is to port the existing FL spatula bill scraper to the new format and then mostly stabilize the API; if folks have time, feedback on what's there would be really helpful.
-
I'd like to begin the process of moving the scrapers to a new interface that makes them easier to write, maintain, and test (yes, test). I've played around with this a bit with prior projects and introduced piecemeal bits of this, but I think that the current state of person/committee scrapers (mostly disabled, many not working) would be a good place to start, and the lessons learned there can be brought over to the Bill/Vote scrapers when ready.
To head off the biggest anticipated concern: I'm not proposing another move anywhere near as big as the billy->pupa migration; in fact, I'd like to do this in a way where individual scrapers can be ported. Initially they'll output backwards-compatible JSON, and only the way they're invoked will need to change when they're rewritten.
The main goals of this next iteration:
Non-goals:
Proposed 'Public' Interface (internal methods are hidden, a rough but working implementation of this exists at https://github.com/openstates/people/pull/303/files):
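Rather than restating the linked PR, here is a rough, non-authoritative sketch of the shape being proposed; the names echo ones used elsewhere in the thread, but the details are placeholders:

```python
class Page:
    """A unit of logic tied to a unit of input (one URL / one chunk of HTML)."""

    def __init__(self, url=None):
        self.url = url

    def get_data(self, doc):
        """Return this page's data as a dict, or yield further Page objects."""
        raise NotImplementedError


class ListPage(Page):
    """A page whose only job is yielding the additional pages to scrape."""


class Scraper:
    """Drives a start page, merges page/subpage dicts, and emits final objects."""

    start_page = None  # e.g. MDPersonList

    def to_object(self, data: dict):
        """Convert a merged dict into the final object (e.g. a Person)."""
        raise NotImplementedError
```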
Usage would look like this (assume an executable named os-scrape that can load defined subclasses of the above as needed):
This structure also makes a command like this possible:
This would instantiate an MDPersonDetail with the given URL, use a default scraper to fetch the URL, and return whatever data is obtained from that page. This makes it possible to work on scrapers that handle given pages from the command line, greatly speeding up the feedback loop. It might also be desirable to pass multiple URLs in, which would be possible as well.
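A sketch of roughly what that command might do internally; the os-scrape name comes from the proposal above, but the argument handling, class loading, and output format here are assumptions for illustration:

```python
# hypothetical: os-scrape md.people.MDPersonDetail https://example.com/jamesturk
import importlib
import json
import sys

import lxml.html
import requests


def run_page(class_path: str, *urls: str) -> None:
    # load the page class by dotted path, e.g. "md.people.MDPersonDetail"
    module_name, _, class_name = class_path.rpartition(".")
    page_cls = getattr(importlib.import_module(module_name), class_name)
    for url in urls:
        response = requests.get(url)
        doc = lxml.html.fromstring(response.text)
        data = page_cls(url=url).get_data(doc)  # assumed constructor / entry point
        json.dump(data, sys.stdout, indent=2, default=str)
        print()


if __name__ == "__main__":
    run_page(sys.argv[1], *sys.argv[2:])
```

Because the loop is per-URL, accepting multiple URLs (as mentioned above) falls out naturally.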
A few advantages that might not be apparent as well:
Open questions: