WIP Potential Fix for CT Scrape Issue #3658
mzagaja commented Apr 14, 2021
- Potential fix for the CT scraping issue: adopts an XPath search pattern to grab the PDF link instead of the “link next to the htm” (which happened to be a PDF in the previous iteration of the website). Currently it neither works nor emits an error on my machine.
Thanks for this, it's on my list to review this week.
When the scrape finishes, what does the summary block look like?
As you can see, I'm not seeing any votes from this change. I'm curious whether yours was different.
I also got:
I do see the ones with votes in the cache, but now that I have run it fully I have to map which bill JSON files they go with to investigate further. I'm going to investigate a bit more this weekend and maybe take another swing at it.
One bill that has a vote is:
{
"legislative_session": "2021",
"identifier": "SJ00025",
"title": "RESOLUTION CONFIRMING THE NOMINATION OF DR. JOHN BONETTI OF FARMINGTON TO BE A MEMBER OF THE PSYCHIATRIC SECURITY REVIEW BOARD.",
"from_organization": "~{\"classification\": \"upper\"}",
"classification": [
"joint resolution"
],
"subject": [
],
"abstracts": [
],
"other_titles": [
],
"other_identifiers": [
],
"actions": [
{
"description": "NOMINATION REFERRED TO COMMITTEE ON Executive and Legislative Nominations",
"date": "2021-01-06",
"organization_id": "~{\"classification\": \"upper\"}",
"classification": [
"referral-committee"
],
"related_entities": [
]
},
{
"description": "PUBLIC HEARING 0311",
"date": "2021-03-11",
"organization_id": "~{\"classification\": \"upper\"}",
"classification": [
],
"related_entities": [
]
},
{
"description": "FAV. RPT., TAB. FOR CAL., SEN.",
"date": "2021-03-15",
"organization_id": "~{\"classification\": \"upper\"}",
"classification": [
],
"related_entities": [
]
},
{
"description": "SENATE CALENDAR NUMBER 71",
"date": "2021-03-15",
"organization_id": "~{\"classification\": \"upper\"}",
"classification": [
],
"related_entities": [
]
},
{
"description": "ADOPTED, SENATE",
"date": "2021-03-23",
"organization_id": "~{\"classification\": \"upper\"}",
"classification": [
"passage"
],
"related_entities": [
]
},
{
"description": "ON CONSENT CALENDAR",
"date": "2021-03-23",
"organization_id": "~{\"classification\": \"upper\"}",
"classification": [
],
"related_entities": [
]
},
{
"description": "RULES SUSPENDED,TRANS.TO HOUSE",
"date": "2021-03-23",
"organization_id": "~{\"classification\": \"lower\"}",
"classification": [
],
"related_entities": [
]
},
{
"description": "FAV. RPT., TABLED FOR HOUSE CALENDER",
"date": "2021-04-12",
"organization_id": "~{\"classification\": \"lower\"}",
"classification": [
],
"related_entities": [
]
},
{
"description": "HOUSE CALENDAR NUMBER 272",
"date": "2021-04-12",
"organization_id": "~{\"classification\": \"lower\"}",
"classification": [
],
"related_entities": [
]
}
],
"sponsorships": [
{
"name": "Duff, Bob",
"classification": "primary",
"entity_type": "person",
"primary": true,
"person_id": "~{\"name\": \"Duff, Bob\"}",
"organization_id": null
},
{
"name": "Concepcion, Julio A.",
"classification": "primary",
"entity_type": "person",
"primary": true,
"person_id": "~{\"name\": \"Concepcion, Julio A.\"}",
"organization_id": null
}
],
"related_bills": [
],
"versions": [
{
"note": "Senate Joint Nomination",
"links": [
{
"url": "https://www.cga.ct.gov/2021/TOB/S/PDF/2021SJ-00025-R00-SB.PDF",
"media_type": "application/pdf"
}
],
"date": "",
"classification": ""
}
],
"documents": [
],
"sources": [
{
"url": "ftp://ftp.cga.ct.gov/pub/data/bill_info.csv",
"note": ""
},
{
"url": "https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=SJ25&which_year=2021",
"note": ""
}
],
"extras": {
},
"_id": "900ea9ec-a2ef-11eb-b816-0242ac120002"
}
The cache links to a vote PDF:
mzagaja@MacBook-Pro ~/D/openstates-scrapers (ct-scrape-issue)> rg 'VOTE' _cache/
_cache/www.cga.ct.gov,asp,cgabillstatus,cgabillstatus.asp,selBillType=Bill&bill_num=SJ25&which_year=2021,5d50bf64a9cf8bf185c29c5b6b132421
1091:<tr><td bgcolor=black> </td><td><a href="/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF">Senate Roll Call Vote 72 </a></td></tr></tbody></table>
Which is a valid URL: https://cga.ct.gov/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF
I can confirm by dropping in a debugger statement that we see a link:
for link in page.xpath(
    "//a[(contains(@href, '/pdf/') or contains(@href, '/PDF/')) and contains(@href, '/VOTE/')]"
):
    # 2011 HJ 31 has a blank vote, others might too
    print(link.text)
    import pdb; pdb.set_trace()
    if link.text:
        pdf_link = link
    if pdf_link:
        yield from self.scrape_vote(
            bill, pdf_link.text.strip(), link.attrib["href"]
        )
print('Finished scraping webpage')
13:18:58 INFO scrapelib: GET - https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=HB6423&which_year=2021
House Roll Call Vote 32 AS AMENDED
> /opt/openstates/openstates/scrapers/ct/bills.py(153)scrape_bill_page()
-> if link.text:
(Pdb) link
<Element a at 0xffffb884b8f0>
(Pdb) pdf_link.text.strip()
*** NameError: name 'pdf_link' is not defined
(Pdb) link.text
'House Roll Call Vote 32 AS AMENDED '
(Pdb) link.text.strip()
'House Roll Call Vote 32 AS AMENDED'
Thus even though my change might be a necessary part of the fix, it seems it isn't sufficient to resolve this issue. A new clue arrives when we test the truthiness of pdf_link:
House Roll Call Vote 32 AS AMENDED
> /opt/openstates/openstates/scrapers/ct/bills.py(155)scrape_bill_page()
-> if pdf_link:
(Pdb) pdf_link
<Element a at 0xffffb84f3350>
(Pdb) bool(pdf_link)
/opt/openstates/openstates/scrapers/ct/bills.py:1: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
Updating this section to:
for link in page.xpath(
    "//a[(contains(@href, '/pdf/') or contains(@href, '/PDF/')) and contains(@href, '/VOTE/')]"
):
    # 2011 HJ 31 has a blank vote, others might too
    print(link.text)
    if link.text:
        pdf_link = link
    if pdf_link is not None:
        yield from self.scrape_vote(
            bill, pdf_link.text.strip(), link.attrib["href"]
        )
print('Finished scraping webpage')
surmounts the URL issue, but then we land on this problem, which is a bit more cryptic to me:
13:34:32 INFO scrapelib: GET - https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=HB6423&which_year=2021
House Roll Call Vote 32 AS AMENDED
Scraping the vote
13:34:32 INFO scrapelib: GET - https://www.cga.ct.gov/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF
/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.cga.ct.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
Traceback (most recent call last):
File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/bin/os-update", line 8, in <module>
sys.exit(main())
File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 318, in main
report = do_update(args, other, juris)
File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 205, in do_update
report["scrape"] = do_scrape(juris, args, scrapers)
File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 89, in do_scrape
report[scraper_name] = scraper.do_scrape(**scrape_args)
File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/scrape/base.py", line 163, in do_scrape
for obj in self.scrape(**kwargs) or []:
File "/opt/openstates/openstates/scrapers/ct/bills.py", line 34, in scrape
yield from self.scrape_bill_info(session, chambers)
File "/opt/openstates/openstates/scrapers/ct/bills.py", line 94, in scrape_bill_info
yield from self.scrape_bill_page(bill)
File "/opt/openstates/openstates/scrapers/ct/bills.py", line 156, in scrape_bill_page
bill, pdf_link.text.strip(), link.attrib["href"]
File "/opt/openstates/openstates/scrapers/ct/bills.py", line 183, in scrape_vote
yes_count = int(re.match(r"[^\d]*(\d+)[^\d]*", yes_count).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
ERROR: 1
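For reference, the FutureWarning quoted earlier in this comment comes from lxml: truth-testing an element reports whether it has child elements, not whether it exists, so an `<a>` that holds only text is falsy. Below is a minimal sketch of the safer pattern, using the standard library's xml.etree.ElementTree as a stand-in for lxml. The helper name pick_vote_link and the blank second link are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET


def pick_vote_link(links):
    """Return the last link with non-blank text, or None if there is none."""
    pdf_link = None  # bound before the loop, so it can never raise NameError
    for link in links:
        if link.text and link.text.strip():
            pdf_link = link
    return pdf_link


# Hypothetical page fragment: one vote link with text and one blank link.
page = ET.fromstring(
    "<table>"
    '<a href="/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF">'
    "House Roll Call Vote 32 AS AMENDED </a>"
    '<a href="/2021/VOTE/H/PDF/blank.PDF"></a>'
    "</table>"
)

chosen = pick_vote_link(page.iter("a"))

# The chosen <a> has text but no child elements, so len(chosen) is 0;
# a truthiness test based on children would wrongly skip it.
if chosen is not None:
    print(chosen.text.strip(), chosen.get("href"))
```

Initialising pdf_link to None also avoids the NameError seen in the first debugger session, where the name was inspected before any link had been assigned to it.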
Some further poking suggests the issue is that the format of the PDF does not line up with what the scraper expects. My understanding from the code is that it wants parsable HTML and is not getting that from the PDF:
Is some debugger output I used to inspect this situation. Is it fair to think the next step might be to write a PDF scraper/parser, using a library other than libxml that can work with PDFs?
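The AttributeError in the traceback means re.match returned None, which for this pattern happens only when the string it was given contains no digits at all. Here is a minimal sketch of a guard that surfaces that case instead of crashing. parse_count is a hypothetical helper, not part of the scraper, and the sample strings are invented.

```python
import re

# Same pattern as the line in the traceback: skip non-digits, capture digits.
COUNT_RE = re.compile(r"[^\d]*(\d+)[^\d]*")


def parse_count(raw):
    """Return the first integer found in raw, or None when it has no digits."""
    match = COUNT_RE.match(raw or "")
    if match is None:
        return None
    return int(match.group(1))


print(parse_count("YEA 98"))          # 98
print(parse_count("no digits here"))  # None
```

Returning None lets the caller log the offending vote URL and skip it, which makes it easier to tell a layout mismatch from a genuinely empty vote.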
Ah, if the votes are in PDF format, then yes, pdftotext will need to be used (you can search other scrapers for convert_pdf to see examples). I can also take a closer look at this soon if that'd help.
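A rough sketch of the pdftotext route, assuming the poppler-utils pdftotext binary is on PATH. It calls the command line tool directly rather than the scrapers' convert_pdf helper, whose signature is not shown in this thread. The names build_pdftotext_command and pdf_to_text are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    """Build the argv list: -layout keeps columns aligned, '-' writes to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Return the text of a PDF by shelling out to pdftotext."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout/stderr instead of printing them
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The downloaded vote PDF would need to be written to a local file first and that path passed to pdf_to_text; the returned text can then be parsed line by line.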
I can take a poke at it this weekend and see how big of a lift it might be, but if you happen to have bandwidth, I'm happy to see it happen sooner 😄.
I haven't fully figured out the votes scraping yet, but made some good progress on this today. I pushed a work-in-progress commit if you want to see what I've done so far and have any suggestions.
* Add vote code.
* Separate CT votes by whitespace and push into a dict for easier parsing.
* Take dict and attempt to add the vote with add_vote. This fails with:
(Pdb) self.add_vote('Y', 'ABERCROMBIE')
*** TypeError: add_vote() takes 2 positional arguments but 3 were given
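One way the "split on whitespace, push into a dict" step could look, as a sketch only. It assumes the extracted text is a stream of vote codes each followed by a member name, and that the codes are Y, N, X and A; neither assumption is confirmed in this thread. group_votes is a hypothetical name and every sample name other than ABERCROMBIE is invented.

```python
# Assumed vote codes; the real PDF may use a different set.
VOTE_CODES = ("Y", "N", "X", "A")


def group_votes(text):
    """Group member names by the vote code that precedes them."""
    votes = {code: [] for code in VOTE_CODES}
    current_code = None
    name_parts = []
    for token in text.split():
        if token in VOTE_CODES:
            # A new code starts: store the name collected for the previous one.
            if current_code and name_parts:
                votes[current_code].append(" ".join(name_parts))
            current_code, name_parts = token, []
        elif current_code:
            name_parts.append(token)
    # Store the final name once the token stream ends.
    if current_code and name_parts:
        votes[current_code].append(" ".join(name_parts))
    return votes


sample = "Y ABERCROMBIE N DOE, J.\nY FOX"
print(group_votes(sample))
```

A bare initial that collides with a vote code (for example a member listed as "SMITH, A") would be misread by this sketch, so the real layout needs checking before relying on it.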
@jamesturk Made some updates to this but am getting a weird error in the
@@ -16,7 +17,16 @@ class SkipBill(Exception):
class CTBillScraper(Scraper):
    latest_only = True

    def add_vote(vote, voter):
add the explicit self parameter :)
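To illustrate why the missing self produces exactly that TypeError: Python passes the instance as the first positional argument, so a two-parameter method called with two arguments receives three. A small self-contained sketch follows; BrokenTally and FixedTally are hypothetical classes, not the scraper itself.

```python
class BrokenTally:
    def add_vote(vote, voter):  # missing self, as in the diff above
        pass


class FixedTally:
    def __init__(self):
        self.votes = {}

    def add_vote(self, vote, voter):
        # Group voter names under their vote code.
        self.votes.setdefault(vote, []).append(voter)


try:
    BrokenTally().add_vote("Y", "ABERCROMBIE")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

tally = FixedTally()
tally.add_vote("Y", "ABERCROMBIE")
print(tally.votes)  # {'Y': ['ABERCROMBIE']}
```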