WIP Potential Fix for CT Scrape Issue #3658

Closed
wants to merge 5 commits into from

Conversation

mzagaja

@mzagaja mzagaja commented Apr 14, 2021

  • Potential fix for the CT scraping issue: adopts an XPath search pattern to grab the PDF link instead of the “link next to the htm” (which happened to be a PDF in the previous iteration of the website). Currently it neither works nor emits an error on my machine.

@jamesturk
Member

thanks for this, on my list to review this week

@jamesturk
Member

when the scrape finishes what does the summary block look like?

ct (scrape)
  bills: {}
bills scrape:
  duration:  0:02:02.136713
  objects:
    bill: 3218
jurisdiction scrape:
  duration:  0:00:00.005249
  objects:
    jurisdiction: 1
    organization: 3

As you can see, I'm not seeing any votes from this change.

I'm curious if yours was different?

@mzagaja
Author

mzagaja commented Apr 23, 2021

I also got:

ct (scrape)
  bills: {}
bills scrape:
  duration:  0:10:43.968659
  objects:
    bill: 3202
jurisdiction scrape:
  duration:  0:00:00.021614
  objects:
    jurisdiction: 1
    organization: 3

I do see the ones with votes in the cache, but have to try and map which bill JSON files they go with to investigate further now that I ran it fully. Going to try and investigate a bit more this weekend and maybe take another swing at it.

@mzagaja
Author

mzagaja commented Apr 24, 2021

One bill that has a vote is:

_data/ct//bill_900ea9ec-a2ef-11eb-b816-0242ac120002.json

{
    "legislative_session": "2021",
    "identifier": "SJ00025",
    "title": "RESOLUTION CONFIRMING THE NOMINATION OF DR. JOHN BONETTI OF FARMINGTON TO BE A MEMBER OF THE PSYCHIATRIC SECURITY REVIEW BOARD.",
    "from_organization": "~{\"classification\": \"upper\"}",
    "classification": [
        "joint resolution"
    ],
    "subject": [
        
    ],
    "abstracts": [
        
    ],
    "other_titles": [
        
    ],
    "other_identifiers": [
        
    ],
    "actions": [
        {
            "description": "NOMINATION REFERRED TO COMMITTEE ON Executive and Legislative Nominations",
            "date": "2021-01-06",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                "referral-committee"
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "PUBLIC HEARING 0311",
            "date": "2021-03-11",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "FAV. RPT., TAB. FOR CAL., SEN.",
            "date": "2021-03-15",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "SENATE CALENDAR NUMBER 71",
            "date": "2021-03-15",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "ADOPTED, SENATE",
            "date": "2021-03-23",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                "passage"
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "ON CONSENT CALENDAR",
            "date": "2021-03-23",
            "organization_id": "~{\"classification\": \"upper\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "RULES SUSPENDED,TRANS.TO HOUSE",
            "date": "2021-03-23",
            "organization_id": "~{\"classification\": \"lower\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "FAV. RPT., TABLED FOR HOUSE CALENDER",
            "date": "2021-04-12",
            "organization_id": "~{\"classification\": \"lower\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        },
        {
            "description": "HOUSE CALENDAR NUMBER 272",
            "date": "2021-04-12",
            "organization_id": "~{\"classification\": \"lower\"}",
            "classification": [
                
            ],
            "related_entities": [
                
            ]
        }
    ],
    "sponsorships": [
        {
            "name": "Duff, Bob",
            "classification": "primary",
            "entity_type": "person",
            "primary": true,
            "person_id": "~{\"name\": \"Duff, Bob\"}",
            "organization_id": null
        },
        {
            "name": "Concepcion, Julio A.",
            "classification": "primary",
            "entity_type": "person",
            "primary": true,
            "person_id": "~{\"name\": \"Concepcion, Julio A.\"}",
            "organization_id": null
        }
    ],
    "related_bills": [
        
    ],
    "versions": [
        {
            "note": "Senate Joint Nomination",
            "links": [
                {
                    "url": "https://www.cga.ct.gov/2021/TOB/S/PDF/2021SJ-00025-R00-SB.PDF",
                    "media_type": "application/pdf"
                }
            ],
            "date": "",
            "classification": ""
        }
    ],
    "documents": [
        
    ],
    "sources": [
        {
            "url": "ftp://ftp.cga.ct.gov/pub/data/bill_info.csv",
            "note": ""
        },
        {
            "url": "https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=SJ25&which_year=2021",
            "note": ""
        }
    ],
    "extras": {
        
    },
    "_id": "900ea9ec-a2ef-11eb-b816-0242ac120002"
}

The cache links to a vote PDF:

mzagaja@MacBook-Pro ~/D/openstates-scrapers (ct-scrape-issue)> rg 'VOTE' _cache/
_cache/www.cga.ct.gov,asp,cgabillstatus,cgabillstatus.asp,selBillType=Bill&bill_num=SJ25&which_year=2021,5d50bf64a9cf8bf185c29c5b6b132421
1091:<tr><td bgcolor=black>&nbsp;</td><td><a href="/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF">Senate Roll Call Vote 72 </a></td></tr></tbody></table>

Which is a valid URL: https://cga.ct.gov/2021/VOTE/S/PDF/2021SV-00072-R00SJ00025-SV.PDF

I can confirm by dropping in a debugger statement that we see a link:

        for link in page.xpath(
            "//a[(contains(@href, '/pdf/') or contains(@href, '/PDF/')) and contains(@href, '/VOTE/')]"
        ):
            # 2011 HJ 31 has a blank vote, others might too
            print(link.text)
            import pdb; pdb.set_trace()
            if link.text:
                pdf_link = link
                if pdf_link:
                    yield from self.scrape_vote(
                        bill, pdf_link.text.strip(), link.attrib["href"]
                    )
        print('Finished scraping webpage')
13:18:58 INFO scrapelib: GET - https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=HB6423&which_year=2021
House Roll Call Vote 32 AS AMENDED
> /opt/openstates/openstates/scrapers/ct/bills.py(153)scrape_bill_page()
-> if link.text:
(Pdb) link
<Element a at 0xffffb884b8f0>
(Pdb) pdf_link.text.strip()
*** NameError: name 'pdf_link' is not defined
(Pdb) link.text
'House Roll Call Vote 32 AS AMENDED '
(Pdb) link.text.strip()
'House Roll Call Vote 32 AS AMENDED'

Thus even though my change might be a necessary part of the fix, it seems like it isn't sufficient to resolve this issue.

A new clue appears when we test pdf_link with the built-in bool() function:

House Roll Call Vote 32 AS AMENDED
> /opt/openstates/openstates/scrapers/ct/bills.py(155)scrape_bill_page()
-> if pdf_link:
(Pdb) pdf_link
<Element a at 0xffffb84f3350>
(Pdb) bool(pdf_link)
/opt/openstates/openstates/scrapers/ct/bills.py:1: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
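
For context on that warning: lxml elements behave like sequences of their children, so a truth test on a childless element such as this `<a>` evaluates to false, which would explain why `if pdf_link:` silently skips the vote. The sketch below illustrates the distinction using the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption, chosen so the snippet runs without extra dependencies); the markup is modeled on the cached row shown earlier.

```python
import xml.etree.ElementTree as ET

# Markup modeled on the cached bill status page; ElementTree stands in for lxml here.
row = ET.fromstring(
    '<td><a href="/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF">'
    "House Roll Call Vote 32 AS AMENDED </a></td>"
)
link = row.find("a")

# An element's length is its number of child elements; this <a> holds only text.
child_count = len(link)

# The explicit checks the warning recommends: state which question is being asked.
link_exists = link is not None
link_has_children = child_count > 0

print(link_exists, link_has_children, repr(link.text.strip()))
```

The link exists and has text, yet it has no child elements, so only the `is not None` test expresses what the scraper actually wants to know.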

Updating this section to:

        for link in page.xpath(
            "//a[(contains(@href, '/pdf/') or contains(@href, '/PDF/')) and contains(@href, '/VOTE/')]"
        ):
            # 2011 HJ 31 has a blank vote, others might too
            print(link.text)
            if link.text:
                pdf_link = link
                if pdf_link is not None:
                    yield from self.scrape_vote(
                        bill, pdf_link.text.strip(), link.attrib["href"]
                    )
        print('Finished scraping webpage')

That surmounts the URL issue, but then we run into this problem, which is a bit more cryptic to me:

13:34:32 INFO scrapelib: GET - https://www.cga.ct.gov/asp/cgabillstatus/cgabillstatus.asp?selBillType=Bill&bill_num=HB6423&which_year=2021
House Roll Call Vote 32 AS AMENDED
Scraping the vote
13:34:32 INFO scrapelib: GET - https://www.cga.ct.gov/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF
/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.cga.ct.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/bin/os-update", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 318, in main
    report = do_update(args, other, juris)
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 205, in do_update
    report["scrape"] = do_scrape(juris, args, scrapers)
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/cli/update.py", line 89, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/root/.cache/pypoetry/virtualenvs/openstates-scrapers-vRcYrsYN-py3.7/lib/python3.7/site-packages/openstates/scrape/base.py", line 163, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 34, in scrape
    yield from self.scrape_bill_info(session, chambers)
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 94, in scrape_bill_info
    yield from self.scrape_bill_page(bill)
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 156, in scrape_bill_page
    bill, pdf_link.text.strip(), link.attrib["href"]
  File "/opt/openstates/openstates/scrapers/ct/bills.py", line 183, in scrape_vote
    yes_count = int(re.match(r"[^\d]*(\d+)[^\d]*", yes_count).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
ERROR: 1
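
The `AttributeError` at the bottom of the traceback comes from calling `.group(1)` on the result of `re.match`, which is `None` whenever the input contains no digits. A minimal sketch of a guarded version follows; `parse_count` is a hypothetical helper, not part of the scraper, and the sample strings are made up for illustration.

```python
import re
from typing import Optional

# Same pattern as in the traceback: skip non-digits, capture the first run of digits.
COUNT_RE = re.compile(r"[^\d]*(\d+)[^\d]*")


def parse_count(text: str) -> Optional[int]:
    """Return the first integer found in text, or None when there is none."""
    match = COUNT_RE.match(text)
    if match is None:
        # re.match returns None on failure, so .group() must not be called here.
        return None
    return int(match.group(1))


empty_result = parse_count("")           # hypothetical input: nothing was extracted
digit_result = parse_count("Yea: 117")   # hypothetical input: a count is present
```

Returning `None` instead of raising lets the caller decide whether a missing count means "skip this vote" or "the page format changed".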

@mzagaja
Author

mzagaja commented May 1, 2021

Further poking suggests the issue is that the format of the PDF does not line up with what the scraper expects. My understanding from the code is that it wants parsable HTML and is not getting that from the PDF:

18:38:57 INFO scrapelib: GET - https://www.cga.ct.gov/2021/VOTE/H/PDF/2021HV-00032-R00HB06423-HV.PDF
> /opt/openstates/openstates/scrapers/ct/bills.py(184)scrape_vote()
-> yes_count = int(re.match(r"[^\d]*(\d+)[^\d]*", yes_count).group(1))
(Pdb) yes_count
''
(Pdb) page
<Element p at 0xffffb30e49b0>
(Pdb) page.content
*** AttributeError: 'HtmlElement' object has no attribute 'content'
(Pdb) page.text
'%PDF-1.5\r\n%����\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<>\r\nendobj\r\n3 0 obj\r\n<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<>\r\nstream\r\n[... compressed binary stream data truncated ...]'
(Pdb) date = page.xpath("string(//span[contains(., 'Taken on')])")
(Pdb) date
''
(Pdb) date.text
*** AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'text'

That is the debugger output I used to inspect the situation. Is it fair to think the next step might be to write a PDF scraper/parser using a library other than libxml that can work with PDFs?

@jamesturk
Copy link
Member

Ah, if the votes are in PDF format, then yes, pdftotext will need to be used (you can search other scrapers for convert_pdf to see examples). I can also take a closer look at this soon if that'd help.
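
As a rough illustration of that suggestion, the sketch below first checks whether a response body is a PDF (the debugger output above starts with `%PDF-1.5`) and then shells out to the `pdftotext` command-line tool. Both function names are hypothetical, and this is not the project's `convert_pdf` helper, whose signature is not shown in this thread; it assumes `pdftotext` is installed and on the PATH.

```python
import subprocess

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(body: bytes) -> bool:
    """Return True when the response body starts with the PDF signature."""
    return body.lstrip().startswith(PDF_MAGIC)


def pdf_to_text(path: str) -> str:
    """Extract text from a PDF file by calling the pdftotext CLI."""
    # "-layout" preserves the column layout; "-" writes the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Checking the signature before parsing avoids feeding binary PDF data to an HTML parser, which is what produced the empty strings seen in the debugger session.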

@mzagaja
Author

mzagaja commented May 5, 2021

I can take a poke at it this weekend and see how big of a lift it might be, but if you happen to have bandwidth happy to see it happen sooner 😄.

@mzagaja
Author

mzagaja commented May 9, 2021

I haven't fully figured out the votes scraping yet, but made some good progress on this today. I pushed a work in progress commit if you want to see what I've done so far and have any suggestions.

* Add vote code.
* Separate CT votes by whitespace and push into a dict for easier parsing.
* Take dict and attempt to add the vote with add_vote.
This fails with:
(Pdb) self.add_vote('Y', 'ABERCROMBIE')
*** TypeError: add_vote() takes 2 positional arguments but 3 were given
@mzagaja
Author

mzagaja commented Jun 27, 2021

@jamesturk Made some updates to this but am getting a weird error in the add_vote part of it. The error message claims I'm providing 3 positional arguments to add_vote but as far as I can tell I'm only providing two so am a bit flummoxed.

@@ -16,7 +17,16 @@ class SkipBill(Exception):
class CTBillScraper(Scraper):
    latest_only = True

    def add_vote(vote, voter):
Member

add the explicit self parameter :)
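
To spell out why the error mentions three arguments: calling a method on an instance passes the instance itself as the first positional argument, so a method declared without `self` receives one more argument than it lists. A small self-contained sketch, using hypothetical class names rather than the scraper's own code:

```python
class WithoutSelf:
    def add_vote(vote, voter):
        # The instance is bound to "vote", so a two-argument call passes three.
        return (vote, voter)


class WithSelf:
    def add_vote(self, vote, voter):
        return (vote, voter)


try:
    WithoutSelf().add_vote("Y", "ABERCROMBIE")
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

recorded = WithSelf().add_vote("Y", "ABERCROMBIE")
print(error_message)
print(recorded)
```

The same call succeeds once `self` is declared, because the implicit instance argument then has a parameter to land in.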
