WIP Potential Fix for CT Scrape Issue #3658

mzagaja · 2021-04-14T01:31:43Z

This potential fix for the CT scraping issue uses an XPath search pattern to grab the PDF link, instead of taking the “link next to the htm” (which happened to be a PDF in the previous iteration of the website). It is currently not working, but it is not emitting an error on my machine either.

jamesturk · 2021-04-20T15:09:48Z

thanks for this, on my list to review this week

jamesturk · 2021-04-21T19:19:56Z

when the scrape finishes what does the summary block look like?

ct (scrape)
  bills: {}
bills scrape:
  duration:  0:02:02.136713
  objects:
    bill: 3218
jurisdiction scrape:
  duration:  0:00:00.005249
  objects:
    jurisdiction: 1
    organization: 3

As you can see, I'm not seeing any votes from this change.

I'm curious if yours was different?

mzagaja · 2021-04-23T00:58:48Z

I also got:

ct (scrape)
  bills: {}
bills scrape:
  duration:  0:10:43.968659
  objects:
    bill: 3202
jurisdiction scrape:
  duration:  0:00:00.021614
  objects:
    jurisdiction: 1
    organization: 3

I do see the ones with votes in the cache, but have to try and map which bill JSON files they go with to investigate further now that I ran it fully. Going to try and investigate a bit more this weekend and maybe take another swing at it.

mzagaja · 2021-04-24T13:38:31Z

One bill that has a vote is:

_data/ct//bill_900ea9ec-a2ef-11eb-b816-0242ac120002.json

{
    "legislative_session": "2021",
    "identifier": "SJ00025",
    "title": "RESOLUTION CONFIRMING THE NOMINATION OF DR. JOHN BONETTI OF FARMINGTON TO BE A MEMBER OF THE PSYCHIATRIC SECURITY REVIEW BOARD.",
    "from_organization": "~{\"classification\": \"upper\"}",
    "classification": [
        "joint resolution"
    ],
    "subject": [
        
    ],
    "abstracts": [
        
    ],
    "other_titles": [
        
    ],
    "other_identifiers": [
        
    ],
    "actions": [
        {
            "description": "NOMINATION REFERRED TO COMMITTEE ON Executive and Legislative Nominations",
            "date": "2021-01-06",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                "referral-committee"
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "PUBLIC HEARING 0311",
            "date": "2021-03-11",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "FAV. RPT., TAB. FOR CAL., SEN.",
            "date": "2021-03-15",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "SENATE CALENDAR NUMBER 71",
            "date": "2021-03-15",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "ADOPTED, SENATE",
            "date": "2021-03-23",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                "passage"
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "ON CONSENT CALENDAR",
            "date": "2021-03-23",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "RULES SUSPENDED,TRANS.TO HOUSE",
            "date": "2021-03-23",
            "organization_id": "~{\"classification\": \"lower\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "FAV. RPT., TABLED FOR HOUSE CALENDER",
            "date": "2021-04-12",
            "organization_id": "~{\"classification\": \"lower\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "HOUSE CALENDAR NUMBER 272",
            "date": "2021-04-12",
            "organization_id": "~{\"classification\": \"lower\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        }
    ],
    "sponsorships": [
        {
            "name": "Duff, Bob",
            "classification": "primary",
            "entity_type": "person",
            "primary": true,
            "person_id": "~{\"name\": \"Duff, Bob\"}",
            "organization_id": null
        },
        {
            "name": "Concepcion, Julio A.",
            "classification": "primary",
            "entity_type": "person",
            "primary": true,
            "person_id": "~{\"name\": \"Concepcion, Julio A.\"}",
            "organization_id": null
        }
    ],
    "related_bills": [
        
    ],
    "versions": [
        {
            "note": "Senate Joint Nomination",
            "links": [
                {
                    "url": "https://www.cga.ct.gov/2021/TOB/S/PDF/2021SJ-00025-R00-SB.PDF",
                    "media_type": "application/pdf"
                }
            ],
            "date": "",
            "classification": ""
        }
    ],
    "documents": [
        
    ],
    "sources": [
        {
            "url": "ftp://ftp.cga.ct.gov/pub/data/bill_info.csv",
            "note": ""
        },
        {
            "url": "https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=SJ25&which_year=2021",
            "note": ""
        }
    ],
    "extras": {
        
    },
    "_id": "900ea9ec-a2ef-11eb-b816-0242ac120002"
}

The cache links to a vote PDF:

mzagaja@MacBook-Pro ~/D/openstates-scrapers (ct-scrape-issue)> rg 'VOTE' _cache/
_cache/www.cga.ct.gov,asp,cgabillstatus,cgabillstatus.asp,selBillType=Bill&bill_num=SJ25&which_year=2021,5d50bf64a9cf8bf185c29c5b6b132421
1091:<tr><td bgcolor=black>&nbsp;</td><td><a href="/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF">Senate Roll Call Vote 72 </a></td></tr></tbody></table>

Which is a valid URL: https://cga.ct.gov/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF
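For reference, the root-relative href from the cached page can be resolved against the bill status page URL with the standard library. This is only a minimal sketch using the two URLs shown above; the variable names are illustrative and not taken from the scraper.

```python
from urllib.parse import urljoin

# Bill status page the href was found on (from the cache file name above).
page_url = (
    "https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp"
    "?selBillType=Bill&bill_num=SJ25&which_year=2021"
)
# Root-relative href copied from the cached HTML.
href = "/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF"

# A root-relative href replaces the path and query of the base URL.
vote_url = urljoin(page_url, href)
print(vote_url)
```

Note that this resolves to the www.cga.ct.gov host, since that is the host of the page the link was scraped from.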

I can confirm by dropping in a debugger statement that we see a link:

        for link in page.xpath(
            "//a[(contains(@href, '/pdf/') or contains(@href, '/PDF/')) and contains(@href, '/VOTE/')]"
        ):
            # 2011 HJ 31 has a blank vote, others might too
            print(link.text)
            import pdb; pdb.set_trace()
            if link.text:
                pdf_link = link
                if pdf_link:
                    yield from self.scrape_vote(
                        bill, pdf_link.text.strip(), link.attrib["href"]
                    )
        print('Finished scraping webpage')

13:18:58 INFO scrapelib: GET - https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=HB6423&which_year=2021
House Roll Call Vote 32 AS AMENDED
> /opt/openstates/openstates/scrapers/ct/bills.py(153)scrape_bill_page()
-> if link.text:
(Pdb) link
<Element a at 0xffffb884b8f0>
(Pdb) pdf_link.text.strip()
*** NameError: name 'pdf_link' is not defined
(Pdb) link.text
'House Roll Call Vote 32 AS AMENDED '
(Pdb) link.text.strip()
'House Roll Call Vote 32 AS AMENDED'

Thus even though my change might be a necessary part of the fix, it seems like it isn't sufficient to resolve this issue.

A new clue arrives when we test pdf_link with the built-in bool():

House Roll Call Vote 32 AS AMENDED
> /opt/openstates/openstates/scrapers/ct/bills.py(155)scrape_bill_page()
-> if pdf_link:
(Pdb) pdf_link
<Element a at 0xffffb84f3350>
(Pdb) bool(pdf_link)
/opt/openstates/openstates/scrapers/ct/bills.py:1: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
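As far as I understand it, the warning exists because an element behaves like a sequence of its child elements, so its truth value says whether it has children, not whether it was found. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example has no third-party dependency; the scraper itself uses lxml) to show the two explicit tests the warning recommends.

```python
import xml.etree.ElementTree as ET

# An <a> element with text but no child elements, like the vote link.
link = ET.fromstring(
    '<a href="/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF">'
    "House Roll Call Vote 32 AS AMENDED </a>"
)

# len() counts child elements, not text, so a plain link has length 0.
child_count = len(link)

# The explicit tests the warning recommends instead of bool(link):
exists = link is not None
has_children = len(link) > 0

print(child_count, exists, has_children)
```

A link with text and no nested tags therefore exists but has no children, which would explain why a bare `if pdf_link:` check never reached scrape_vote.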

Updating this section to:

        for link in page.xpath(
            "//a[(contains(@href, '/pdf/') or contains(@href, '/PDF/')) and contains(@href, '/VOTE/')]"
        ):
            # 2011 HJ 31 has a blank vote, others might too
            print(link.text)
            if link.text:
                pdf_link = link
                if pdf_link is not None:
                    yield from self.scrape_vote(
                        bill, pdf_link.text.strip(), link.attrib["href"]
                    )
        print('Finished scraping webpage')

That gets past the link check, but then we land on this problem, which is a bit more cryptic to me:

13:34:32 INFO scrapelib: GET - https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=HB6423&which_year=2021
House Roll Call Vote 32 AS AMENDED
Scraping the vote
13:34:32 INFO scrapelib: GET - https://www.cga.ct.gov/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF
/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.cga.ct.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/bin/os-update", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 318, in main
    report = do_update(args, other, juris)
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 205, in do_update
    report["scrape"] = do_scrape(juris, args, scrapers)
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 89, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/scrape/base.py", line 163, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 34, in scrape
    yield from self.scrape_bill_info(session, chambers)
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 94, in scrape_bill_info
    yield from self.scrape_bill_page(bill)
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 156, in scrape_bill_page
    bill, pdf_link.text.strip(), link.attrib["href"]
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 183, in scrape_vote
    yes_count = int(re.match(r"[^\d]*(\d+)[^\d]*", yes_count).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
ERROR: 1
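The AttributeError at the bottom of the traceback is what happens when re.match finds no digits and returns None. A minimal sketch of a guarded version follows; parse_count is a hypothetical helper name, not something in the scraper, and only the regex is taken from the traceback.

```python
import re
from typing import Optional

# Same pattern as the line in the traceback.
COUNT_RE = re.compile(r"[^\d]*(\d+)[^\d]*")


def parse_count(text: str) -> Optional[int]:
    """Return the first integer found in text, or None if there is none."""
    match = COUNT_RE.match(text)
    if match is None:
        # re.match returns None when the text has no digits (e.g. "").
        return None
    return int(match.group(1))


print(parse_count("Yea 95"))
print(parse_count(""))
```

Returning None (or raising a clearer error) at least makes it obvious that the input text was empty, rather than failing on `.group`.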

mzagaja · 2021-05-01T18:45:12Z

Some further poking at this suggests the issue is that the format of the PDF does not line up with what the scraper expects. My understanding from the code is that it wants parsable HTML, and is not getting that from the PDF:

18:38:57 INFO scrapelib: GET - https://www.cga.ct.gov/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF
> /opt/openstates/openstates/scrapers/ct/bills.py(184)scrape_vote()
-> yes_count = int(re.match(r"[^\d]*(\d+)[^\d]*", yes_count).group(1))
(Pdb) yes_count
''
(Pdb) page
<Element p at 0xffffb30e49b0>
(Pdb) page.content
*** AttributeError: 'HtmlElement' object has no attribute 'content'
(Pdb) page.text
'%PDF-1.5\r\n%[... binary PDF stream data, not decodable as text, truncated ...]'
(Pdb) date = page.xpath("string(//span[contains(., 'Taken on')])")
(Pdb) date
''
(Pdb) date.text
*** AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'text'

That is some of the debugger output I used to inspect the situation. Is it fair to think the next step might be to write a PDF scraper/parser, using a library other than lxml that can work with PDFs?
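One cheap way to confirm this kind of mismatch before parsing is to check the first bytes of the response for the PDF header, as seen in the page.text output above. This is a sketch only; looks_like_pdf is a hypothetical helper, not part of the scraper.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF file header."""
    return data.startswith(PDF_MAGIC)


print(looks_like_pdf(b"%PDF-1.5\r\n1 0 obj"))
print(looks_like_pdf(b"<html><body>vote</body></html>"))
```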

jamesturk · 2021-05-03T14:08:19Z

Ah, if the votes are in PDF format, then yes, pdftotext will need to be used (you can search other scrapers for convert_pdf to see examples). I can also take a closer look at this soon if that'd help.
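For illustration, calling the pdftotext command-line tool from Python might look roughly like the sketch below. It is not the project's convert_pdf helper, whose signature is not shown in this thread; pdftotext_command and pdf_to_text are hypothetical names, and the sketch assumes the poppler pdftotext binary is on the PATH.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    # "-layout" keeps the physical column layout; "-" writes text to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext on a local PDF file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed")
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With something like this, the vote PDF would first be saved to a temporary file and the returned text parsed line by line, instead of being fed to the HTML parser.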

mzagaja · 2021-05-05T02:08:37Z

I can take a poke at it this weekend and see how big of a lift it might be, but if you happen to have bandwidth, I'm happy to see it happen sooner 😄.

mzagaja · 2021-05-09T01:22:23Z

I haven't fully figured out the votes scraping yet, but made some good progress on this today. I pushed a work in progress commit if you want to see what I've done so far and have any suggestions.

* Add vote code.

* Separate CT votes by whitespace and push into a dict for easier parsing.
* Take dict and attempt to add the vote with add_vote. This fails with:
  (Pdb) self.add_vote('Y', 'ABERCROMBIE')
  *** TypeError: add_vote() takes 2 positional arguments but 3 were given
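The "split by whitespace and push into a dict" idea could be sketched as below. Everything about the input format here is an assumption: the set of vote codes, the alternating code/name layout, and the sample names other than ABERCROMBIE are all hypothetical, and the real pdftotext output may well differ.

```python
from typing import Dict, List, Optional

# Hypothetical vote codes; the real PDF may use different markers.
VOTE_CODES = {"Y", "N", "A", "X"}


def split_votes(text: str) -> Dict[str, List[str]]:
    """Group single-token names by the vote code that precedes them."""
    votes: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for token in text.split():
        if current is None and token in VOTE_CODES:
            # A vote code opens a new (code, name) pair.
            current = token
        elif current is not None:
            # The token after a code is treated as the voter's name.
            votes.setdefault(current, []).append(token)
            current = None
    return votes


# Hypothetical sample line; only ABERCROMBIE appears in the thread.
sample = "Y ABERCROMBIE N EXAMPLE Y SAMPLE"
print(split_votes(sample))
```

Names containing spaces would break this simple tokenization, so a real parser would probably need to rely on the column layout instead.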

mzagaja · 2021-06-27T23:05:35Z

@jamesturk Made some updates to this but am getting a weird error in the add_vote part of it. The error message claims I'm providing 3 positional arguments to add_vote but as far as I can tell I'm only providing two so am a bit flummoxed.

jamesturk · 2021-06-29T15:41:20Z

scrapers/ct/bills.py

@@ -16,7 +17,16 @@ class SkipBill(Exception):
 class CTBillScraper(Scraper):
    latest_only = True

+    def add_vote(vote, voter):


add the explicit self parameter :)
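To spell out why the error mentions three arguments: calling a method on an instance passes the instance implicitly as the first argument, so a method declared without self receives one more argument than it declares. A small standalone sketch (the class names are illustrative, not the scraper's):

```python
class BrokenScraper:
    def add_vote(vote, voter):  # missing self: "vote" receives the instance
        return (vote, voter)


class FixedScraper:
    def add_vote(self, vote, voter):  # explicit self, as suggested
        return (vote, voter)


try:
    # Python passes (instance, "Y", "ABERCROMBIE"): three positional arguments.
    BrokenScraper().add_vote("Y", "ABERCROMBIE")
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = ""

fixed_result = FixedScraper().add_vote("Y", "ABERCROMBIE")
print(error_message)
print(fixed_result)
```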

WIP CT PDF Scraping

c9bcc75

mzagaja added 3 commits May 9, 2021 10:56

WIP CT Update

a0c3e9f

* Add vote code.

Better Regex Matching

777cb56

Break Out CT Votes

6837276

* Separate CT votes by whitespace and push into a dict for easier parsing.
* Take dict and attempt to add the vote with add_vote. This fails with:
  (Pdb) self.add_vote('Y', 'ABERCROMBIE')
  *** TypeError: add_vote() takes 2 positional arguments but 3 were given

jamesturk reviewed Jun 29, 2021


jessemortenson mentioned this pull request Dec 19, 2023

CT: Vote data incorrect openstates/issues#233

Open

showerst closed this Nov 6, 2024
