Create Manually Curated List of Sites to Crawl #7
As per our last meeting, I've been doing some preliminary research on the sites we should crawl, and I've found some options we could consider.
Those are just a couple of the options available; however, a lot of the other ones I've seen look like they require some sort of payment, only allow you to view the top 50 per category, or don't offer a .csv download.
What about a custom list? The problem with Tranco is that it has .gov, .org, and other sites where probably not much is going on in terms of data collection and sharing.
Possibly, we can start with Tranco and adapt it.
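For illustration, here is a minimal Python sketch of what "adapting" Tranco could look like, assuming the standard Tranco CSV format of one `rank,domain` pair per line; the file names and the excluded TLDs are placeholders, not project decisions.

```python
# Sketch: trim a Tranco-style CSV by dropping TLDs where we expect little
# data collection (.gov, .org), keeping the top N of what remains.
import csv

EXCLUDED_TLDS = {"gov", "org"}  # hypothetical starting point

def filter_tranco(in_path: str, out_path: str, limit: int = 2000) -> None:
    kept = []
    with open(in_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            rank, domain = row
            tld = domain.rsplit(".", 1)[-1].lower()
            if tld not in EXCLUDED_TLDS:
                kept.append((rank, domain))
            if len(kept) >= limit:
                break
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(kept)

filter_tranco("tranco.csv", "tranco_filtered.csv")
```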
I was able to download the BuiltWith list for free, which has their top million sites. It also says on this page that we can use it however we like (besides selling it), so it seems like we have the green light for that list.
To add to this, we need to look into what variables we want to investigate and why we're investigating them. In this issue, I'll start a log of the different angles I've tried (e.g., comparing BuiltWith results vs. Tranco, using a VPN in California, etc.) so that we can refer back to it in the future.
I ran a quick scrape of the top 100 sites on BuiltWith (without a VPN). Two interesting things: there were significantly fewer HumanCheck errors, and significantly more data was being gathered. I'm going to keep looking at the BuiltWith list for now, trying different domains, and later switch my location with the VPN.
Here are some results from running just the .gov sites from BuiltWith (no VPN):
Looking at the differences between US and non-US government sites might be interesting.
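If we go down that road, a rough way to split a domain list into US .gov sites versus sites that look like non-US government domains could be something like the Python sketch below; the suffix list is purely illustrative and far from complete.

```python
# Sketch: classify domains as US .gov, apparent non-US government
# (e.g., *.gov.uk, *.gov.au), or other.
NON_US_GOV_MARKERS = ("gov.uk", "gov.au", "gov.br", "gc.ca")

def classify_gov(domains):
    us_gov, non_us_gov, other = [], [], []
    for d in domains:
        d = d.lower()
        if d.endswith(".gov"):
            us_gov.append(d)
        elif any(d.endswith(m) for m in NON_US_GOV_MARKERS):
            non_us_gov.append(d)
        else:
            other.append(d)
    return us_gov, non_us_gov, other
```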
I ran a scrape of ~100 of the top sites from BuiltWith that were .edu, both with and without a VPN. The VPN was set to Los Angeles, California.
Note for later: hss.edu would cause the crawler to crash entirely, so for the time being, be sure to remove it from any crawl list.
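One way to make sure that exclusion doesn't get forgotten is to strip known crash-prone domains automatically before handing a list to the crawler; a tiny sketch (the blocklist currently containing only hss.edu) could look like this.

```python
# Sketch: drop domains known to crash the crawler from any crawl list.
KNOWN_BAD = {"hss.edu"}

def drop_known_bad(domains):
    return [d for d in domains if d.lower() not in KNOWN_BAD]
```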
Here are my suggestions for the crawl lists:
In terms of actual lists, here are my two suggestions (the actual lists will be attached at the bottom). First option: BuiltWith top 2,000:
Some downsides:
Second option (my preferred): I think this list is better because it provides us with the solid foundation of BuiltWith, and then on top of that, we get sites that are likely to be used by everyday users, such as shopping sites and social media, among others. This way, we get not only a good spread of TLDs but also good coverage of the different ways in which people use the internet. I've attached both lists here.
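Mechanically, the second option is just a merge of the BuiltWith base with hand-picked "everyday use" sites; a minimal Python sketch of that merge (deduplicating while preserving order) is below, with the example extras being placeholders rather than the actual additions.

```python
# Sketch: append hand-picked everyday-use sites to a BuiltWith base list,
# removing duplicates while keeping the original ordering.
def merge_lists(base, extras):
    seen, merged = set(), []
    for domain in list(base) + list(extras):
        d = domain.lower().strip()
        if d and d not in seen:
            seen.add(d)
            merged.append(d)
    return merged

# e.g. merge_lists(builtwith_top, ["amazon.com", "etsy.com", "reddit.com"])
```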
Thanks, @dadak-dom! 1. Here is how I see it at the moment
2. Questions
I can start putting together a suggested 1,000 sites from this list, yes 👍
That makes sense. It seems like we feel much more comfortable with the Tranco list, so I'll start working with that (as above).
I'll look into this as well 👍
@SebastianZimmeck Here's an idea for the locations we could use:
With this list, I was trying to get a nice spread of privacy laws and locations. You can see I've kept it fairly English-speaking, but I think we can swap out some locations if we want greater diversity. The Tranco list seems to do a better job of creating a diverse pool of websites; it still has an English skew, but not as much as BuiltWith, I don't think. I've also attached the Tranco list (with modifications) that you asked for.
Nice work, @dadak-dom! As we discussed today in our meeting, let's go for five locations:
For each location we crawl with a generic top 1,000 list that is the same everywhere. Then, we have a specific top 1,000 list depending on the location (e.g., Brazil would be the top 1,000 .br country domains). This will give us comparability across the set of locations but also allow us to capture some location-specific results. For non-English-speaking countries, we will first need to spot-check whether the returned Privacy Pioneer results are good, i.e., whether the analysis works even if there are Portuguese words intermingled in the HTTP messages for the Brazil analysis. Since all US states will have the same location list (unless we use state-specific lists; not sure how to do that, the whois database or BuiltWith maybe?), we will have some more locations for the 10,000-site budget. So, we could also think of adding one, two, or three more countries/states to our list. An Asian country, maybe? Texas?
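As a rough sketch of how the two lists per location could be derived from one ranked domain list, something like the following Python could work; the ranked input and the country-code TLD argument are assumptions for illustration, not the final methodology.

```python
# Sketch: build the shared generic top-1,000 plus a location-specific
# top-1,000 of country-code domains (e.g., ".br" for Brazil).
def build_location_lists(ranked_domains, cctld, n=1000):
    generic = ranked_domains[:n]
    specific = [d for d in ranked_domains
                if d.lower().endswith("." + cctld)][:n]
    return generic, specific

# e.g. build_location_lists(tranco_domains, "br")
```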
@SebastianZimmeck For the country-specific lists, is there any reason why we can't use .com for the US? Based on what I could find, the US claims to have control over the domain, so we could argue in favor of that. It would also make more sense than .us, since so few sites use .us compared to .com. What do you think?
@dadak-dom, yes, in general I see no strong reason why not. One minor point is that that country-specific list would be close to the generic list. But if the reality is that the US dominates the top websites, then that is what it is. A second point is that we used .us as the country-specific list for the ML training data. But again, in my mind, this is not a reason why we couldn't switch to .com now. So, unless I am missing something, yes, let's switch to .com.
@danielgoldelman Here's what I could gather for the "testing" I was assigned:
Brazil summary: From what I could tell, PP definitely works on certain sites, while on others it finds nothing. Everything it did find seemed to come from servers operating in English, though, so maybe it can't find any requests with Portuguese. This would probably need to be investigated further.
Ukraine sites used:
Ukraine summary: Similar to Brazil. Everything PP finds seems normal, so it seems more likely that, if there is a problem, it is a problem of not detecting requests that PP should (a false negative, I believe?).
I will look into other countries soon. In general, it looks like PP works, but exactly how effectively, I'm not sure.
It would be great if you could make a call, @dadak-dom. I'd say, if about 10% of a set of foreign-language sites that we run fail to produce analysis results, we should not use that country. So, which countries clear that threshold?
Of the countries I have tested (Australia, Ukraine, Brazil, Ireland, and Singapore), I think we could use Ireland and Ukraine, as they had the fewest sites with no results. Based on my results, I do not think that we should use Brazil, Singapore, or Australia. None of the countries had less than 10% failure, but that could be due to a small sample. If this seems alright, I can make a list for Ukraine TLDs before the crawl. |
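For reference, applying that ~10% cutoff is just a per-country failure-rate calculation; a small Python sketch is below, with the example counts being placeholders rather than the actual test numbers.

```python
# Sketch: given per-country (sites tested, sites with no PP results),
# flag the countries that stay under the failure threshold.
def countries_under_threshold(results, threshold=0.10):
    ok = []
    for country, (tested, no_results) in results.items():
        if tested and no_results / tested < threshold:
            ok.append(country)
    return ok

# e.g. countries_under_threshold({"Ireland": (20, 3), "Brazil": (20, 8)})
```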
Thanks, @dadak-dom!
A site can have a lot of results or just a few. Either is OK. What matters is whether the analysis is correct on the results that are available, if any. So, take a look at the Privacy Pioneer analysis results and then try to manually evaluate whether they are correct, i.e., evaluate the ground truth. For example, you can use the browser developer tools and check manually (@danielgoldelman can provide more info on how to do a ground truth analysis). I checked the first three Ukrainian sites (https://sinoptik.ua/, https://www.olx.ua/uk/, https://www.pravda.com.ua/). None of them had locations or personal data. Can you test for locations (ZIP code, region, latitude, longitude, street address) and personal data (email address, phone number, custom keywords)? Those are much harder tasks than tracking and monetization because the latter just use deterministic techniques (e.g., rules matching URLs); locations use our ML model.
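One possible way to do that spot check, assuming you export the site's network traffic from the browser developer tools as a HAR file, is to search the request URLs and bodies for the values Privacy Pioneer should have flagged; the candidate values below are placeholders for whatever applies to your own test machine, and this is only a rough aid, not the official ground-truth procedure.

```python
# Sketch: scan a HAR export for planted personal data values
# (ZIP code, latitude/longitude, email, phone number).
import json

CANDIDATES = ["06459", "41.55", "-72.65", "test@example.com", "8605551234"]

def find_candidates(har_path):
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    hits = []
    for e in entries:
        req = e["request"]
        body = (req.get("postData") or {}).get("text", "")
        haystack = req["url"] + " " + body
        for value in CANDIDATES:
            if value in haystack:
                hits.append((value, req["url"]))
    return hits
```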
If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English keywords, like a Brazilian city (possibly with non-English diacritics, as in "São Paulo") or ZIP code; that way we know PP works both:
Maybe visiting with a VPN based in the country in question could accomplish this?
Absolutely! @dadak-dom and @danielgoldelman, could you take care of that?
The idea is to crawl 525 location-specific sites and 525 general sites for each of the following countries and US states, i.e., 5,250 location-specific sites and 5,250 general sites across the ten locations (10,500 total):
Okay, so here's what it looks like in terms of the foreign languages: if we are concerning ourselves with custom, general keywords, then we might want to replace Spain, Ukraine, and Brazil. By this, I mean crawling with an instance of PP that is on the lookout for a custom keyword that has an accent. If not, I think we can keep Brazil and Spain. However, I think that Ukraine needs to be replaced either way. It doesn't look like PP knows how to handle the different alphabet, so it would flood the extension with false positives for keywords. For replacing Ukraine, I would suggest the following three countries; @SebastianZimmeck, if you could let me know what you think, that'd be great.
@dadak-dom Do you happen to have an example of a site and keyword with which the issue could be replicated? Regardless of whether or not we end up implementing general keywords for the crawl, it's probably worth looking into potential fixes for PP.
@JoeChampeau That makes sense. For the accents, an example would be to go to sodexobeneficios.com.br and search your keyword in the search bar (for example, my keyword was "hollà", and PP would identify it as "holl&"; if I did something like "hollàcom", then PP wouldn't find anything). For Ukrainian, I would translate something like "hello" and paste it into the search bar of https://sinoptik.ua/. Once I had a keyword in Ukrainian, PP would find a bunch of keywords that didn't actually exist. If I remember correctly, it would claim that I had a keyword "reqU", and it would find that keyword in a bunch of requests. Hopefully this helps.
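One possible explanation for the "hollà" → "holl&" behavior is encoding: an accented keyword rarely appears verbatim in an HTTP request, since it is usually UTF-8 percent-encoded, so a plain substring match on the raw keyword misses it. The Python sketch below only illustrates that hypothesis; it is not how Privacy Pioneer actually matches keywords.

```python
# Sketch: an accented keyword never appears verbatim in a typical request
# URL; only its percent-encoded form does.
from urllib.parse import quote

keyword = "hollà"
encoded = quote(keyword)  # 'holl%C3%A0' — UTF-8 percent-encoding

request_url = "https://example.com.br/search?q=" + encoded
print(keyword in request_url)   # False: the raw accented keyword is absent
print(encoded in request_url)   # True: only the encoded form is present
```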
OK, let's remove Brazil, Spain, and Ukraine. Here is a new list:
@dadak-dom, can you check:
@JoeChampeau, maybe take a shallow look into the character issue @dadak-dom describes. If this is an easy fix or an implementation mistake, we can fix it. But it's probably not worth spending a huge amount of time on it.
@SebastianZimmeck Just looked into your questions, and here's what I could gather:
I'll get started on Canada and Germany, and if you could let me know a preference for the third, that would be great. Maybe France? It doesn't look like we have any more Asian countries to choose from. |
OK, then let's replace it.
France would be good. But possibly there are also issues with accents there. If that is the case, let's pick Florida (US) to have one US location without a privacy law.
We are using the following list:
The reason is that we are not testing for keywords, emails, or phone numbers. Location detection should be fine even for non-English sites.
@dadak-dom, where in the Google Drive are the lists of sites to crawl? They do not seem to be in the Web_Crawl_Site_List folder. For example, I do not see Australia there. (cc'ing @atlasharry, @PattonYin, @ananafrida)
Looks like the site lists are in this repo. But Ireland is missing. @dadak-dom? If so, what are the other lists in the Google Drive? |
Also, where is the final methodology for creating the crawl lists (and test lists)? |
@SebastianZimmeck Yes, the site lists are in this repo. The lists in the Google Drive detail which sites were removed and why. I wrote down a methodology that was used when we were still using VPNs, but other than connecting to a VPN, the process remains the same. @atlasharry should have more details about test list methodology, since he made the most recent one (and also the Korea list, I believe). |
OK, thanks, @dadak-dom! Here is the final list:
Also, the generic list we apply to all locations is the United States list. |