Are there CivicPlus sites that run on non-CivicPlus domains? #82

Open
zstumgoren opened this issue Dec 25, 2020 · 5 comments

Comments

@zstumgoren
Member

Most of the ~1,500 known CivicPlus sites on our list run on subdomains of civicplus.com.

For example:

https://nm-lascruces.civicplus.com/AgendaCenter/

However, there appears to be at least one site (and possibly others) that is only accessible via a non-CivicPlus domain (presumably a domain the government agency set up or manages itself).

Napa County is one known example:

# Broken CivicPlus subdomain
https://napa-county.civicplus.com/AgendaCenter

# Working AgendaCenter location
https://www.countyofnapa.org/AgendaCenter

This issue first cropped up in #63 and affects #80

@zstumgoren zstumgoren added research help wanted Extra attention is needed labels Dec 25, 2020
@DiPierro DiPierro self-assigned this Dec 29, 2020
@DiPierro
Copy link
Contributor

While https://napa-county.civicplus.com/AgendaCenter is not valid, https://ca-napacounty.civicplus.com/AgendaCenter -- which follows the same general formula as other counties with civicplus.com domains -- is live. The https://napa-county.civicplus.com/AgendaCenter URL appears to be a typo.

I've spent about an hour checking whether any other CivicPlus sites with .gov or .org URLs fail to correspond to a URL of the form stateabbreviation-agencyname.civicplus.com/AgendaCenter and have yet to find one. Here are two websites that demonstrate this point:

# Valid
https://www.ks25jd.org/agendacenter

# Also valid
https://ks-25thjudicialdistrict.civicplus.com/agendacenter

# Valid
https://www.chickasha.org/AgendaCenter

# But also valid
https://ok-chickasha.civicplus.com/AgendaCenter
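
For anyone who wants to spot-check more pairs, here's a minimal sketch of that comparison in Python (the civicplus_url helper and the pattern it encodes are assumptions drawn from the examples above, not project code):

import requests

def civicplus_url(state_abbr, agency_slug):
    # Assumed pattern: stateabbreviation-agencyname.civicplus.com/AgendaCenter
    return f"https://{state_abbr}-{agency_slug}.civicplus.com/AgendaCenter"

def is_live(url):
    try:
        resp = requests.get(url, allow_redirects=True, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Pairs from the examples above; both URLs in each pair should be live
print(is_live("https://www.chickasha.org/AgendaCenter"), is_live(civicplus_url("ok", "chickasha")))
print(is_live("https://www.ks25jd.org/agendacenter"), is_live(civicplus_url("ks", "25thjudicialdistrict")))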

However, I can't definitively prove that this is always true. A more comprehensive fix would be to build more robust site-detection capability (not to be confused with the method discussed in #69).

At present, our method of identifying Agenda Center sites involves manually searching an online subdomain enumeration tool. We could develop a way to programmatically identify websites built using CivicPlus's Agenda Center product. More generally, in the future, we may want to automatically detect websites built using other meeting software, e.g., Legistar.

The best solution I can think of is to write a script that uses both the Google Custom Search API and subdomain-enumeration libraries. The Google API could be used to collect, for example, the first 1,000 or so results for the searches site:.gov/AgendaCenter, site:.com/AgendaCenter, and site:.org/AgendaCenter. The enumeration libraries would search for all civicplus.com subdomains.
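
Here's a rough sketch of the search half of that idea using the Custom Search JSON API (the API key and engine ID are placeholders, and the exact query strings would likely need tuning):

import requests

API_KEY = "YOUR_API_KEY"      # placeholder
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder for the cx parameter

def search_urls(query, max_results=100):
    # The API returns at most 10 results per request, so page through them
    urls = []
    for start in range(1, max_results, 10):
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "start": start},
            timeout=10,
        )
        items = resp.json().get("items", [])
        if not items:
            break
        urls.extend(item["link"] for item in items)
    return urls

for query in ("site:.gov AgendaCenter", "site:.org AgendaCenter"):
    print(query, len(search_urls(query)))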

@zstumgoren
Member Author

zstumgoren commented Dec 29, 2020

@DiPierro Thanks for digging into this! This sounds like good news -- i.e. it appears we can generally assume that CivicPlus sites have a working subdomain. It may be that our initial site discovery methodology, which you describe above, unearthed URLs that are no longer valid, so it may simply be a matter of identifying and updating the canonical URLs for problematic sites in our list of known CivicPlus sites. That list includes a lot of http URLs rather than https URLs. The former often redirect to the latter, which can significantly slow down or outright break the scraping process. In the few cases I've tested, using the https version of the site seems to fix the slowness/breakage. The Napa County case, though, is one where I didn't realize the site also had a working, standard URL that follows the expected pattern of https://<place>-<agencyname>.civicplus.com/AgendaCenter (nice find on that!).

I think we can address this as a mixed task -- part coding and part research. We should be able to easily write a script that steps through all URLs and tests http sites for redirects and/or equivalent https URLs. The requests library has support for checking redirect status and seems like the simplest initial approach. That should let us flag for additional research any URLs that do indeed redirect to https or return 404s on https.
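
Something like this minimal sketch, assuming requests and a plain list of URLs (the output fields are illustrative, not a final schema):

import requests

def check_redirects(url):
    # Follow redirects and record where we end up
    try:
        resp = requests.get(url, allow_redirects=True, timeout=10)
    except requests.RequestException as exc:
        return {"url": url, "status": None, "redirected": False, "final_url": None, "error": str(exc)}
    return {
        "url": url,
        "status": resp.status_code,
        # resp.history is non-empty only if one or more redirects occurred
        "redirected": bool(resp.history),
        "final_url": resp.url,
        "error": None,
    }

for url in ["http://nm-lascruces.civicplus.com/AgendaCenter"]:
    row = check_redirects(url)
    if row["redirected"] or row["status"] != 200:
        print("flag for research:", row)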

That process should help us figure out if all CivicPlus sites that we're aware of have standard subdomains on CivicPlus and help us decide what, if any, changes are needed to address the "unique name" issue described in #80.

@DiPierro Do you want to take on that scripting/research as part of the aw-scripts library? Alternatively, we can flag this as a "help wanted" issue to see if we can find volunteers to take a stab.

@DiPierro
Contributor

Hi @zstumgoren - would you mind flagging this scripting/research task as "help wanted" for now? I'm not certain how much time I'll have in the coming week or so. The task strikes me as a good fit for other volunteers should they have interest, and I wouldn't want to delay. Thank you.

@DiPierro
Contributor

DiPierro commented Feb 3, 2021

@zstumgoren I've started stepping through our list of CivicPlus domains using a modified version of generate_civicplus_sites.py so that we know we're using a clean list of domains. The script produces a csv that includes these fields:

  • The original URL we've fed into the scraper - ex. http://ar-garlandcounty.civicplus.com/AgendaCenter
  • The year this website started posting meeting documents
  • The year this website last posted meeting documents
  • Name of the site (based on its URL; will need to be cleaned up by hand)
  • State
  • Country
  • Government level (based on its URL; will need to be cleaned up by hand)
  • Names of meeting bodies listed on site
  • Status code - The response value generated by a call to requests.get(URL, allow_redirects=True); a minimal sketch of this check follows the list
  • History - The result of accessing the history attribute of the response object; blank if no redirects occurred, or a 302 status code if a redirect happened
  • Alias - The URL we've been redirected to
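
For the last three fields, here's a minimal sketch of how they could be captured and written out (column names match the list above; everything else is illustrative):

import csv
import requests

def audit_row(url):
    resp = requests.get(url, allow_redirects=True, timeout=10)
    return {
        "url": url,
        "status_code": resp.status_code,
        # history is a list of intermediate responses; empty means no redirects
        "history": ";".join(str(r.status_code) for r in resp.history),
        "alias": resp.url if resp.history else "",
    }

with open("civicplus_url_audit.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "status_code", "history", "alias"])
    writer.writeheader()
    writer.writerow(audit_row("http://ar-garlandcounty.civicplus.com/AgendaCenter"))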

Can you think of other fields I should be tracking? Should I separately pass each domain into civic-scraper to see if there are any problems?

@DiPierro
Contributor

DiPierro commented Feb 3, 2021

Here's a csv merging the public list of URLs with the status_code, history, and alias fields described above:

https://docs.google.com/spreadsheets/d/19t6vnl514kUyoSHKq3rMVA8y3O_hQ6KXk-HiUBB78xo/edit?usp=sharing
