Are there CivicPlus sites that run on non-CivicPlus domains? #82

zstumgoren · 2020-12-25T23:25:29Z

The ~1500 known CivicPlus sites on our list largely run on civicplus.com subdomains.

For example:

https://nm-lascruces.civicplus.com/AgendaCenter/

However, there appears to be at least one site (and possibly others) that is only accessible via a non-CivicPlus domain (presumably a domain the government agency set up or manages itself).

Napa County is one known example:

# Broken CivicPlus subdomain
https://napa-county.civicplus.com/AgendaCenter

# Working AgendaCenter location
https://www.countyofnapa.org/AgendaCenter

This issue first cropped up in #63 and affects #80


DiPierro · 2020-12-29T12:28:17Z

While https://napa-county.civicplus.com/AgendaCenter is not valid, https://ca-napacounty.civicplus.com/AgendaCenter -- which follows the same general formula as other counties with civicplus.com domains -- is live. https://napa-county.civicplus.com/AgendaCenter is a typo.

I've spent about an hour checking whether any CivicPlus sites with .gov or .org URLs lack a corresponding URL of the form stateabbreviation-agencyname.civicplus.com/AgendaCenter, and I have yet to find an example. Here are two websites that demonstrate this point:

# Valid
https://www.ks25jd.org/agendacenter

# Also valid
https://ks-25thjudicialdistrict.civicplus.com/agendacenter

# Valid
https://www.chickasha.org/AgendaCenter

# But also valid
https://ok-chickasha.civicplus.com/AgendaCenter

However, I can't definitively prove that this is always true. A more comprehensive fix would be to have more robust site detection capability (not to be confused with the method discussed in #69).
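A rough way to test a URL against the naming formula described above is sketched below. The function name and the regular expression are my own assumptions: the pattern treats the subdomain as a two-letter state abbreviation, a hyphen, and an agency name made of letters and digits, which may be too strict for some real sites.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of a standard subdomain: two-letter state abbreviation,
# a hyphen, then an agency name made of letters and digits.
SUBDOMAIN_RE = re.compile(r"^[a-z]{2}-[a-z0-9]+$")
CIVICPLUS_SUFFIX = ".civicplus.com"


def follows_standard_pattern(url):
    """Return True if url looks like <state>-<agency>.civicplus.com/AgendaCenter."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host.endswith(CIVICPLUS_SUFFIX):
        return False
    subdomain = host[: -len(CIVICPLUS_SUFFIX)]
    # Compare the path case-insensitively and ignore a trailing slash.
    path_ok = parts.path.rstrip("/").lower() == "/agendacenter"
    return bool(SUBDOMAIN_RE.match(subdomain)) and path_ok


for candidate in (
    "https://ca-napacounty.civicplus.com/AgendaCenter",
    "https://napa-county.civicplus.com/AgendaCenter",
    "https://www.countyofnapa.org/AgendaCenter",
):
    print(candidate, follows_standard_pattern(candidate))
```

A check like this only says whether a URL has the expected shape. It does not tell us whether the site is actually live.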

At present, our method of identifying Agenda Center sites involves manually searching an online subdomain enumeration tool. We could develop a way to programmatically identify websites built using CivicPlus's Agenda Center product. More generally, in the future, we may want to automatically detect websites built using other meeting software, e.g., Legistar.

The best solution I can think of is to write a script that uses both a Google Custom Search API and subdomain enumerating libraries. The Google API could be used to detect, for example, the first 1,000 or so results for the searches site:.gov/AgendaCenter, site:.com/AgendaCenter and site:.org/AgendaCenter. The enumerating libraries would merely search for all civicplus.com subdomains.
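As a starting point for the search half of that idea, here is a minimal sketch that only builds the request parameters for Google's Custom Search JSON API, one dict per page of results. The function name, the placeholder key and engine ID, and the defaults are assumptions. As far as I know the API returns at most 10 results per request and roughly 100 per query, so the "first 1,000 or so" target may not be reachable with a single query; that should be verified against Google's documentation.

```python
# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

# The three searches suggested above.
QUERIES = [
    "site:.gov/AgendaCenter",
    "site:.com/AgendaCenter",
    "site:.org/AgendaCenter",
]


def search_request_params(query, api_key, engine_id, max_results=100, page_size=10):
    """Yield one parameter dict per page of results for a query.

    `start` is 1-based, so pages begin at 1, 11, 21, and so on.
    """
    for start in range(1, max_results + 1, page_size):
        yield {
            "key": api_key,    # API key (placeholder below)
            "cx": engine_id,   # search engine ID (placeholder below)
            "q": query,
            "start": start,
            "num": page_size,
        }


for query in QUERIES:
    pages = list(search_request_params(query, "YOUR_API_KEY", "YOUR_ENGINE_ID"))
    print(query, "->", len(pages), "requests against", API_ENDPOINT)
```

Each dict could then be passed as the `params` argument of `requests.get(API_ENDPOINT, params=...)`. The subdomain enumeration half is left out because it depends on which tool or library we pick.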

zstumgoren · 2020-12-29T19:22:35Z

@DiPierro Thanks for digging into this! This sounds like good news -- i.e. it appears we can generally assume that CivicPlus sites have a working subdomain. It may be that our initial site discovery methodology, which you describe, unearthed URLs that are no longer valid, so it may simply be a matter of identifying and updating the canonical URLs for problematic sites in our canonical list of known CivicPlus sites. That list includes a lot of http URLs rather than https URLs. The former often seem to redirect to the latter, and can significantly slow down or outright break the scraping process. In the few cases I've tested, using the https version of the site seems to fix the slowness/breakage, although the Napa County case is one where I didn't realize the site also had a working, standard URL that follows the expected pattern of https://<place>-<agencyname>.civicplus.com/AgendaCenter (nice find on that!).

I think we can address this as a mixed task -- part coding and part research. We should be able to easily write a script that steps through all URLs and tests http sites for redirects and/or equivalent https URLs. The requests library has support for checking redirect status and seems like the simplest initial approach. That should let us flag for additional research any URLs that do indeed redirect to https or return 404s on https.
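A minimal sketch of that redirect check might look like the following. The function names and the four labels are hypothetical, and comparing URLs case-insensitively with trailing slashes ignored is an assumption on my part. Only `check` touches the network; `requests` is imported inside it so the pure helpers can be used and tested without it.

```python
from urllib.parse import urlsplit, urlunsplit


def https_variant(url):
    """Return the same URL with its scheme forced to https."""
    parts = urlsplit(url)
    return urlunsplit(("https",) + tuple(parts[1:]))


def _normalize(url):
    # Assumption: treat URLs as equal regardless of case or trailing slash.
    return url.rstrip("/").lower()


def classify(original_url, final_url, status_code):
    """Label the outcome of requesting original_url with redirects followed."""
    if status_code >= 400:
        return "error"
    if _normalize(final_url) == _normalize(original_url):
        return "ok"
    if _normalize(final_url) == _normalize(https_variant(original_url)):
        return "redirects_to_https"
    return "redirects_elsewhere"


def check(url, timeout=10):
    """Fetch url with requests and return (final_url, status_code, label)."""
    import requests  # third-party; imported lazily so the helpers above work without it

    response = requests.get(url, allow_redirects=True, timeout=timeout)
    label = classify(url, response.url, response.status_code)
    return response.url, response.status_code, label
```

Anything labeled `redirects_to_https`, `redirects_elsewhere`, or `error` would be a candidate for the additional research step. For example, `check("http://ar-garlandcounty.civicplus.com/AgendaCenter")` would report where that URL currently ends up.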

That process should help us figure out if all CivicPlus sites that we're aware of have standard subdomains on CivicPlus and help us decide what, if any, changes are needed to address the "unique name" issue described in #80.

@DiPierro Do you want to take on that scripting/research as part of the aw-scripts library? Alternatively, we can flag this as a "help wanted" issue to see if we can find volunteers to take a stab.

DiPierro · 2020-12-30T13:45:11Z

Hi @zstumgoren - would you mind flagging this scripting/research task as "help wanted" for now? I'm not certain how much time I'll have in the coming week or so. The task strikes me as a good fit for other volunteers should they have interest, and I wouldn't want to delay. Thank you.

DiPierro · 2021-02-03T16:29:29Z

@zstumgoren I've started stepping through our list of CivicPlus domains using a modified version of generate_civicplus_sites.py so that we know we're using a clean list of domains. The script produces a csv that includes these fields:

The original URL we've fed into the scraper - ex. http://ar-garlandcounty.civicplus.com/AgendaCenter
The year this website started posting meeting documents
The year this website last posted meeting documents
Name of the site (based on its URL; will need to be cleaned up by hand)
State
Country
Government level (based on its URL; will need to be cleaned up by hand)
Names of meeting bodies listed on site
Status code - The response value generated by a call to requests.get(URL, allow_redirects = True)
History - The result of accessing the history attribute of a response object; blank if no redirects occurred, or a 302 status code if a redirect happened
Alias - The URL we've been redirected to.
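For reference, the last three fields could be derived roughly as sketched below. This is not the modified generate_civicplus_sites.py itself; the function name is hypothetical and the response is a stand-in object with made-up values. In requests, `response.history` is a list of the intermediate response objects rather than a bare status code, so the sketch stores the status code of each hop.

```python
import csv
import io
from types import SimpleNamespace


def redirect_fields(response):
    """Build the status_code, history and alias columns from a response-like object."""
    hops = [hop.status_code for hop in response.history]
    return {
        "status_code": response.status_code,
        # One status code per redirect hop, e.g. "302"; blank if there were none.
        "history": " ".join(str(code) for code in hops),
        # Final URL after redirects; left blank when no redirect occurred (assumption).
        "alias": response.url if hops else "",
    }


# Stand-in for requests.get(url, allow_redirects=True); the values are made up.
fake = SimpleNamespace(
    status_code=200,
    history=[SimpleNamespace(status_code=302)],
    url="https://example.civicplus.com/AgendaCenter",
)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["status_code", "history", "alias"])
writer.writeheader()
writer.writerow(redirect_fields(fake))
print(buffer.getvalue())
```

Storing the hop codes as plain text keeps the csv readable; writing the raw `history` list would produce strings like `[<Response [302]>]`.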

Can you think of other fields I should be tracking? Should I separately pass each domain into civic-scraper to see if there are any problems?

DiPierro · 2021-02-03T22:02:25Z

Here's a csv merging the public list of URLs with the status_code, history, and alias fields described above:

https://docs.google.com/spreadsheets/d/19t6vnl514kUyoSHKq3rMVA8y3O_hQ6KXk-HiUBB78xo/edit?usp=sharing

zstumgoren added the research and help wanted labels on Dec 25, 2020

This was referenced Dec 25, 2020

Fix unique name for CivicPlus sites #80

Open

test_download_asset_list fails #63

Closed

DiPierro self-assigned this Dec 29, 2020
