
Create Manually Curated List of Sites to Crawl #7

Closed
JoeChampeau opened this issue Oct 27, 2023 · 34 comments
Labels: discussion (Let's talk about this)

@JoeChampeau
Collaborator

  1. What set of websites do we want to crawl? How large should this set be? A selection from the Tranco list is always a good option.
  2. Do we want to do region-specific crawls using a VPN?
@dadak-dom
Collaborator

As per our last meeting, I've been doing some preliminary research for the sites we should crawl, and I've found some options that we could consider.

  • Tranco list: This seems to be the dataset you have used in the past, or at least are familiar with. It also tries to correct for the shortcomings of the other lists, so it might be a solid all-around choice.
  • BuiltWith: Another option suggested in the meeting. It claims to be "more focused on businesses that implement other businesses technology", so this might be a good dataset if we want to investigate how customers are being tracked. Depending on what we value for our research, this could be a good choice.
  • DomCop: This list uses data from the Open PageRank Initiative, which in turn draws on Common Crawl and Common Search. From their page, I couldn't really discern what they do differently from the others, but I wanted to list them as an open-source option.
  • Cisco Umbrella: Presents itself as a competitor to the now-defunct Alexa list, "based on passive DNS usage across [Cisco's] global network".
  • Majestic Million: Defined as the "million domains we find with the most referring subnets". This list also feeds into the Tranco list.

Those are just a few of the options available; most of the others I've seen either require payment, only let you view the top 50 per category, or don't offer a .csv download.

@SebastianZimmeck
Member

SebastianZimmeck commented Nov 8, 2023

What about a custom list? The problem with Tranco is that it has .gov, .org, and other sites where probably not much is going on in terms of data collection and sharing.

  • How many potentially irrelevant sites are estimated to be in the Tranco top 10,000? Can you run 100 sites or so and also spot-check?
  • Can/should we filter out non-relevant sites from Tranco (or some other list)?
  • BuiltWith (and others) are likely paid, right? Who gives us a 10,000-domain list for free?

Possibly, we can start with Tranco and adapt it.
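If we adapt Tranco, the filtering itself should be straightforward, along these lines (a minimal sketch, assuming the standard rank,domain Tranco CSV format; the TLD denylist is just a first guess):

```python
import csv

# TLDs where probably not much data collection/sharing is going on (a guess)
DENYLIST = (".gov", ".org", ".edu", ".mil")

def filter_tranco(path, limit=10_000):
    """Yield Tranco domains up to `limit`, skipping denylisted TLDs."""
    with open(path, newline="") as f:
        for rank, domain in csv.reader(f):
            if int(rank) > limit:
                break
            if not domain.endswith(DENYLIST):
                yield domain

kept = list(filter_tranco("tranco.csv"))  # hypothetical filename
print(f"{len(kept)} of the top 10,000 sites kept")
```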

@dadak-dom
Collaborator

I was able to download the BuiltWith list for free, which has their top million sites. It also says on this page that we can use it however we like (besides selling it), so it seems we have the green light for that list.
I took a quick look at both the BuiltWith and Tranco lists, and both have ".edu", ".gov", etc., so we'll probably have to adapt whichever list we choose. I'll try running a quick scrape with the Tranco list tomorrow and see what happens.

@dadak-dom changed the title from "Determine what sites to crawl and how" to "Create Manually Curated List of Sites to Crawl" on Nov 10, 2023
@dadak-dom
Collaborator

  1. What set of websites do we want to crawl? How large should this set be? A selection from the Tranco list is always a good option.

  2. Do we want to do region-specific crawls using a VPN?

To add on to this, we need to look into what variables we want to investigate and why we're investigating them. In this issue, I'll start a log of different angles I've tried (e.g. comparing Builtwith results vs. Tranco, using a VPN in California, etc.) so that in the future, we can refer back.

@dadak-dom
Collaborator

I ran a quick scrape of the top 100 sites on BuiltWith (without a VPN). Two things stood out: there were significantly fewer HumanCheck errors, and significantly more data was gathered. I'm going to keep looking at the BuiltWith list for now, trying different domains and later switching my location with the VPN.
Here's the data I gathered, in case anyone wanted to see:
nov13run.txt

@dadak-dom
Collaborator

Here are some results from just running sites with .gov from builtwith (no vpn):

  • I noticed there were a lot of government sites for other countries, especially the UK and Brazil. Are these sites we care about, or only US government sites?
  • A little more than half of all snippets came from sites ending in .gov proper (so not .gov.uk, for example). However, there were about twice as many .gov-only sites, which might imply that PP finds more evidence per site outside the US; this would need to be investigated further.

Looking at the differences between US and non-US government sites might be an interesting thing to look into.
Here's the data from this scrape:
nov14run.csv
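For documentation, the .gov split above can be computed roughly like this (a sketch only; it assumes the run CSV has a column holding each snippet's site domain, here called "site", which is hypothetical):

```python
import csv
from collections import Counter

counts = Counter()
with open("nov14run.csv", newline="") as f:
    for row in csv.DictReader(f):
        domain = row["site"]  # hypothetical column name
        # "x.gov.uk" does not end with ".gov", so this isolates US .gov sites
        counts["us_gov" if domain.endswith(".gov") else "other_gov"] += 1

print(counts)
```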

@dadak-dom
Collaborator

I ran a scrape of ~100 of the top sites from BuiltWith that were .edu, both with and without a VPN. The VPN was set to Los Angeles, California.

  • It seems that more evidence was gathered when not using a VPN (i.e., with the location set to Connecticut), so comparing VPN and non-VPN runs should give interesting results.
  • Judging by the amount of evidence gathered, we should definitely consider using .edu sites in the final list.
    Data:
    nov16runVPN.csv
    nov16runNO_VPN.csv

Note for later: hss.edu would cause the crawler to crash entirely, so for the time being, be sure to remove it from any crawl list.

@dadak-dom
Collaborator

Here are my suggestions for the crawl lists.
In terms of locations, my suggestion is the following:

  • CT (no VPN)
  • California
  • UK
  • Hong Kong
  • South Africa

As Joe pointed out in the last meeting, we have to be aware of what languages each country uses. This option skews heavily toward the anglophone world, but that might make sense if we are looking at sites primarily used by English speakers.
There is also the fact that our VPN is rather limited when it comes to locations in Africa and South America (e.g., South Africa is the only African server and Brazil the only one in South America). A possible modification would be to swap out one location for a country in the EU, so we could compare the UK with countries in the union.

In terms of actual lists, here are my two suggestions (the actual lists will be attached at the bottom):

BuiltWith top 2000:
After some modifications, I think this is a strong list because of the following:

  • Smaller TLDs are roughly representative of what you find on the Internet, e.g., .edu makes up about 3% of the list.
  • The list was generated based on spending on third-party services, so there will definitely be data for us on these sites (I confirmed this by running my own mini-crawls).

Some downsides:

  • Judging by the site names and what they're about, these tend to be webpages for companies that the average consumer is probably not visiting, which could limit how applicable our data is to the average internet surfer.
  • .com seems to be over-represented (about 75% of the list, although that is expected when looking at the most popular sites, so I don't think it's a big deal; a quick way to check these shares is sketched below).
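TLD shares like these can be checked with a few lines (a minimal sketch; it assumes a plain-text list with one domain per line, such as the attached sugg1builtwith.txt):

```python
from collections import Counter

with open("sugg1builtwith.txt") as f:  # one domain per line
    domains = [line.strip() for line in f if line.strip()]

tlds = Counter(d.rsplit(".", 1)[-1] for d in domains)
for tld, n in tlds.most_common(10):
    print(f".{tld}: {100 * n / len(domains):.1f}%")
```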

Second option (my preference)
BuiltWith + Majestic Million:
I made this list by combining the top 1,000 of each list (Majestic Million ranks sites by the number of referring subnets).

I think this list is better because it provides us with the solid foundation of Builtwith, and then on top of that, we get sites that are likely to be used by everyday users, such as shopping sites and social media, among others. This way, we get not only a good spread of TLDs, but also good coverage of the different ways in which people use the internet.
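For reference, such a combination can be built along these lines (a sketch; the input filenames are hypothetical, I'm assuming each list has been reduced to one domain per line, and duplicates are dropped while preserving rank order):

```python
def top_n(path, n=1000):
    """Read the first n domains from a one-domain-per-line file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()][:n]

# dict keys dedupe while preserving insertion (i.e., rank) order
combined = list(dict.fromkeys(
    top_n("builtwith.txt") + top_n("majestic.txt")  # hypothetical filenames
))

with open("sugg2combo.txt", "w") as f:
    f.write("\n".join(combined))
```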

I've attached both lists here.
sugg1builtwith.txt
sugg2combo.txt

@SebastianZimmeck
Member

Thanks, @dadak-dom!

1. Here is how I see it at the moment:

| Location | VPN | Privacy Law | Official Language | Connection Strength (per Mullvad) |
| --- | --- | --- | --- | --- |
| Middletown, Connecticut, US | No | CTDPA | English | N/A |
| Los Angeles, California, US | Yes | CCPA | English | 10 Gbps |
| Miami, Florida, US | Yes | None | English | 10 Gbps |
| Dublin, Ireland | Yes | GDPR, ePrivacy Directive | English | 10 Gbps |
  • We use Connecticut without VPN just in case the VPN use makes a difference.
  • Using the VPN, we have one state with a privacy law (California) and one without (Florida). Maybe, we see a difference.
  • I would rather pick Dublin, Ireland over London, UK because the UK is no longer part of the EU. Thus, the GDPR does not apply directly; there is a UK-GDPR, but that may introduce some peculiarities. A straight EU country would be better.
  • Then, I am not sure, South Africa, Singapore, or Hong Kong. Hong Kong may be impacted by China. So, maybe Singapore or South Africa.

2. Questions

  • Can we use the Tranco List (possibly, omitting some URLs based on our own criteria)?
  • @dadak-dom, you imply that we are using the same list for all. That makes sense in terms of comparability. Should we use a customized list per VPN location to make the set of sites more relevant for the locations? I see comparability as more important at the moment.
  • Is there an explanation of the different BuiltWith categories ("TechSpend" or "Traffic")? (I think no; calling @katehausladen.) I am a bit hesitant of making arguments based on methodologies that we do not understand or have information about how they were applied. At a minimum, we would need to acknowledge this point as a limitation in our paper.
  • Instead of 5 locations with 2,000 sites each should we do 10 locations with 1,000 sites each? That way, we could get a more comprehensive picture of the different privacy laws. I would also think that 2,000 would not give twice the insight that 1,000 sites would give. If we opt for 10 locations, what would those be?

@dadak-dom
Collaborator

Can we use the Tranco List (possibly, omitting some URLs based on our own criteria)?

I can start making a suggestion of 1000 sites from this list, yes 👍

Is there an explanation of the different BuiltWith categories ("TechSpend" or "Traffic")? (I think no; calling @katehausladen.) I am a bit hesitant of making arguments based on methodologies that we do not understand or have information about how they were applied. At a minimum, we would need to acknowledge this point as a limitation in our paper.

That makes sense. It seems like we feel much more comfortable with the Tranco list, so I'll start working with that (as above).

Instead of 5 locations with 2,000 sites each should we do 10 locations with 1,000 sites each? That way, we could get a more comprehensive picture of the different privacy laws. I would also think that 2,000 would not give twice the insight that 1,000 sites would give. If we opt for 10 locations, what would those be?

I'll look into this as well 👍

@dadak-dom
Collaborator

@SebastianZimmeck Here's an idea for the locations we could use:

| Location | VPN | Privacy Law | Language | Connection Strength |
| --- | --- | --- | --- | --- |
| Miami, Florida | Yes | None | English | 10 Gbps |
| Los Angeles, California | Yes | CCPA | English | 10 Gbps |
| London, UK | Yes | UK-GDPR | English | 10-20 Gbps |
| Dublin, Ireland | Yes | GDPR, ePrivacy Directive | English | 10 Gbps |
| Kyiv, Ukraine | Yes | On Protection of Personal Data | Ukrainian | 10 Gbps |
| Johannesburg, South Africa | Yes | Protection of Personal Information Act | English (among others) | 10 Gbps |
| Singapore | Yes | Personal Data Protection Act | English (among others) | 10 Gbps |
| Melbourne, Australia | Yes | Privacy Act 1988 | English | 10 Gbps |
| Auckland, New Zealand | Yes | The Privacy Act 2020 | English | 10 Gbps |
| Sao Paulo, Brazil | Yes | LGPD | Portuguese | 1 or 10 Gbps, depending on server |

With this list, I was trying to get a good spread of privacy laws and locations. You can see I've kept it fairly English-speaking, but we can swap out some locations if we want greater diversity. The Tranco list seems to do a better job of creating a diverse pool of websites; it still has an English skew, but not as much as BuiltWith, I don't think.

I've also attached the Tranco list (with modifications) that you asked for.

sugg3tranco.txt

@SebastianZimmeck
Member

Nice work, @dadak-dom!

As we discussed today in our meeting, let's go for five locations:

  • Connecticut
  • California
  • Florida
  • Dublin, Ireland
  • Brazil

For each we crawl with a generic top 1,000 list that is the same for each location. Then, we have a specific top 1,000 list depending on location (e.g., Brazil would be the top 1,000 .br country domains). This will give us comparability across the set of locations but also allow us to capture some location-specific results.
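A location-specific list could be pulled from a master ranking along these lines (a minimal sketch, assuming a rank,domain CSV like Tranco's; note that ".br" also matches multi-label domains such as example.com.br):

```python
import csv

def country_top(path, cctld, n=1000):
    """Take the first n ranked domains ending in the given ccTLD."""
    picked = []
    with open(path, newline="") as f:
        for rank, domain in csv.reader(f):
            if domain.endswith(cctld):
                picked.append(domain)
                if len(picked) == n:
                    break
    return picked

brazil_list = country_top("tranco.csv", ".br")  # hypothetical filename
```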

We will first need to spot-check for non-English speaking countries if the returned Privacy Pioneer results are good, i.e., the analysis works even if there are partially intermingled Portuguese words in the HTTP messages for the Brazil analysis.

Since all US states will have the same location list (unless we use state-specific lists; not sure how to create those, maybe via the whois database or BuiltWith?), we will have some locations to spare within the 10,000-site budget. So, we could also think of adding one, two, or three more countries/states to our list. An Asian country, maybe? Texas?

@dadak-dom
Collaborator

@SebastianZimmeck For the country-specific lists, is there any reason why we can't use .com for the US? Based on what I could find, the US claims to have control over the domain, so we could argue in favor of that. It would also make more sense than .us, since so few sites use .us compared to .com. What do you think?

@SebastianZimmeck
Member

@dadak-dom, yes, in general I see no strong reason why not. One minor counterpoint is that the country-specific list would be close to the generic list. But if the reality is that the US dominates the top websites, then that is what it is. A second point is that we used .us as the country-specific list for the ML training data. But again, in my mind, this is not a reason why we couldn't switch to .com now. So, unless I am missing something, yes, let's switch to .com.

@dadak-dom
Collaborator

@danielgoldelman Here's what I could gather for the "testing" I was assigned:
Brazil sites used (for documentation purposes):
https://uol.com.br
https://shopee.com.br
https://www.amazon.com.br/
https://www.gov.br/pt-br
https://olx.com.br
https://mercadolivre.com.br
https://terra.com.br
https://caixa.gov.br
https://acesso.gov.br
https://www.magazineluiza.com.br/

Brazil summary: From what I could tell, PP definitely works on certain sites, while on others it finds nothing. Everything it did find seemed to come from servers responding in English, though, so maybe it can't find anything in requests containing Portuguese. This would need to be investigated further.

Ukraine sites used:
https://sinoptik.ua/
https://www.olx.ua/uk/
https://www.pravda.com.ua/
https://prom.ua/
https://tsn.ua/
https://24tv.ua/
https://epicentrk.ua/
https://alerts.in.ua/
https://www.unian.ua/
https://tabletki.ua/

Ukraine summary: Similar to Brazil. Everything PP finds seems normal, so if there is a problem, it is more likely a failure to detect requests that PP should flag (a false negative, I believe).

I will look into other countries soon. In general, it looks like PP works, but exactly how effectively, I'm not sure.

@SebastianZimmeck
Member

Everything it did find seemed to come from servers responding in English, though, so maybe it can't find anything in requests containing Portuguese. This would need to be investigated further.

It would be great if you can make a call, @dadak-dom. I'd say, if about 10% or more of a set of foreign-language sites fail to produce analysis results, we should not use that country. So, which countries clear that threshold?

@dadak-dom
Collaborator

Of the countries I have tested (Australia, Ukraine, Brazil, Ireland, and Singapore), I think we could use Ireland and Ukraine, as they had the fewest sites with no results. Based on my results, I do not think we should use Brazil, Singapore, or Australia. None of the countries had less than 10% failure, but that could be due to the small sample. If this seems alright, I can make a list of Ukrainian TLDs before the crawl.

@SebastianZimmeck
Member

Thanks, @dadak-dom!

as they had the fewest sites with no results

A site can have a lot or just a few results. Either is OK. What matters is whether the analysis is correct on the results that are available, if any. So, take a look at the Privacy Pioneer analysis results and then try to manually evaluate whether they are correct, i.e., evaluate the ground truth. For example, you can use the browser developer tools and manually check (@danielgoldelman can provide more info on how to do a ground truth analysis).

I checked the first three Ukrainian sites (https://sinoptik.ua/, https://www.olx.ua/uk/, https://www.pravda.com.ua/). None of them had locations or personal data.

Can you test for locations (ZIP code, region, latitude, longitude, street address) and personal data (email address, phone number, custom keywords)? Those are much harder tasks than tracking and monetization, which use only deterministic techniques (e.g., rules matching URLs); locations use our ML model.
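For the spot check itself, something along these lines could work alongside the browser developer tools (a sketch only; selenium-wire is my assumption, not necessarily what our crawler uses, and the planted values are hypothetical):

```python
from seleniumwire import webdriver  # pip install selenium-wire

# Hypothetical planted values: a Middletown, CT ZIP code and coordinates
PLANTED = ["06457", "41.5565", "-72.6517"]

driver = webdriver.Firefox()
driver.get("https://sinoptik.ua/")

# Grep every outgoing request for the planted values, then compare
# the hits against what Privacy Pioneer flagged
for request in driver.requests:
    body = (request.body or b"").decode("utf-8", errors="replace")
    for value in PLANTED:
        if value in request.url or value in body:
            print(request.url, "->", value)

driver.quit()
```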

@JoeChampeau
Collaborator Author

JoeChampeau commented Dec 28, 2023

If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English language keywords, like a Brazilian city (possibly with non-English diacritics, as in "São Paulo") or ZIP code, so that we know PP works both:

  1. in non-English contexts (on non-English sites), and
  2. when targeting non-English data (like "São Paulo").

Maybe visiting with a VPN based in the country in question could accomplish this?
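One subtlety worth folding into that test (a hypothetical illustration, not an observed PP bug): the same accented string can arrive in composed or decomposed Unicode form, and an exact string match fails across the two even though they render identically:

```python
import unicodedata

city_nfc = "São Paulo"                             # "ã" as one code point
city_nfd = unicodedata.normalize("NFD", city_nfc)  # "a" + combining tilde

print(city_nfc == city_nfd)                                # False
print(unicodedata.normalize("NFC", city_nfd) == city_nfc)  # True
```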

@SebastianZimmeck
Member

If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English language keywords

Absolutely! @dadak-dom and @danielgoldelman, could you take care of that?

danielgoldelman pushed a commit that referenced this issue Dec 28, 2023
@SebastianZimmeck
Member

The idea is to crawl 525 location-specific sites (total location-specific sites 5,250) and 525 general sites (total general sites 5,250) for the following countries and US states (total 10,500):

  • Australia
  • Brazil
  • Ireland
  • New Zealand
  • Spain
  • Singapore
  • South Africa
  • Ukraine
  • US - California (same location-specific sites as US - Colorado)
  • US - Colorado (same location-specific sites as US - California)

@dadak-dom
Collaborator

If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English language keywords

Absolutely! @dadak-dom and @danielgoldelman, could you take care of that?

Okay, so here's what it looks like in terms of the foreign languages:
Accents didn't seem to have any impact on the detection of locations. I compared what Privacy Pioneer found against all the requests, and it didn't look like the extension was missing anything, even when the city name had an accent. Keywords were a different story, however. I'm not entirely sure why, but PP would replace accents with some other character, so it could not accurately find custom keywords containing accents. So here are my suggestions, depending on how we want to do the crawl.

If we are concerning ourselves with custom, general keywords, then we might want to replace Spain, Ukraine, and Brazil. By this, I mean crawling with an instance of PP that is on the lookout for a custom keyword containing an accent. If not, I think we can keep Brazil and Spain.

However, I think that Ukraine needs to be replaced regardless. PP doesn't seem to know how to handle the different alphabet, so Cyrillic keywords would flood the extension with false positives.

To replace Ukraine, I would suggest the following three countries; @SebastianZimmeck, if you could let me know what you think, that'd be great.

  • Canada (they seem to have a relatively new privacy law, from 2020)
  • France (would be another EU country)
  • Switzerland (also has a relatively new law from 2020)

@JoeChampeau
Collaborator Author

@dadak-dom Do you happen to have an example of a site and keyword with which the issue can be replicated? Regardless of whether we end up implementing general keywords for the crawl, it's probably worth looking into potential fixes for PP.

@dadak-dom
Collaborator

@JoeChampeau That makes sense. For the accents, an example would be going to sodexobeneficios.com.br and searching for your keyword in the search bar (for example, my keyword was "hollà", and PP would identify it as "holl&"; if I searched something like "hollàcom", then PP wouldn't find anything).

For Ukrainian, I would translate something like "hello" and paste it into the search bar of https://sinoptik.ua/. Once I had a keyword in Ukrainian, PP would find a bunch of keywords that didn't actually exist. If I remember correctly, it claimed I had a keyword "reqU" and found that keyword in a bunch of requests.

Hopefully this helps.
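To make the failure mode concrete, here is a minimal reproduction of what I think is happening (the request bodies below are made up; the point is that a naive substring match misses encoded forms of the keyword):

```python
import html
from urllib.parse import unquote

keyword = "hollà"
percent_body = "q=holl%C3%A0&lang=pt"  # accent percent-encoded as UTF-8
entity_body = "q=holl&agrave;com"      # accent as an HTML entity ("holl&...")

print(keyword in percent_body)                # False: naive match misses it
print(keyword in unquote(percent_body))       # True after percent-decoding
print(keyword in html.unescape(entity_body))  # True after entity-decoding
```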

@SebastianZimmeck
Member

OK, let's remove Brazil, Spain, and Ukraine. Here is a new list:

  • Australia (EN)
  • Canada (EN)
  • Germany
  • Hong Kong
  • Ireland (EN)
  • New Zealand (EN)
  • Singapore (EN)
  • South Africa
  • US - California (same location-specific sites as US - Colorado) (EN)
  • US - Colorado (same location-specific sites as US - California) (EN)

@dadak-dom, can you check:

  • Does Hong Kong have English sites or Chinese? I assume Chinese would not work. If Hong Kong does not work, we can replace it.
  • Do keywords work on non-English non-special character sites? E.g., Germany?
  • Does South Africa work? They may also have some non-English sites.

@JoeChampeau, maybe take a shallow look into the character issue @dadak-dom describes. If this is an easy fix or implementation mistake, we can fix it. But probably not worth it to spend a huge amount of time on it.

@dadak-dom
Collaborator

@SebastianZimmeck Just looked into your questions, and here's what I could gather:

  • From a cursory glance, Hong Kong looks like it has a mix of English and Chinese sites. When I make the list, I could potentially just remove the Chinese sites, but I think it might make more sense to just replace it.
  • Keywords without special characters seem to work just fine, so Germany should be good to go
  • I believe that all the South Africa sites are in English. I checked keywords on a few of those, and they seemed to work as usual.

I'll get started on Canada and Germany, and if you could let me know a preference for the third, that would be great. Maybe France? It doesn't look like we have any more Asian countries to choose from.

@SebastianZimmeck
Member

From a cursory glance, Hong Kong looks like it has a mix of English and Chinese sites. When I make the list, I could potentially just remove the Chinese sites, but I think it might make more sense to just replace it.

OK, then let's replace it.

I'll get started on Canada and Germany, and if you could let me know a preference for the third, that would be great. Maybe France?

France would be good. But possibly there are also issues with accents. If that is the case, let's pick Florida US to have one US location without a privacy law.

@SebastianZimmeck
Member

One more point, Germany has ä etc. Not sure if that makes a difference.

Also, in general, which sites we select and how they will work depends on what we are going to test, i.e., the testing protocol (#9). Is testing the keywords (#12) even part of the protocol?

@SebastianZimmeck
Member

We are using the following list:

  • Australia
  • Brazil
  • Ireland
  • New Zealand
  • Spain
  • Singapore
  • South Africa
  • Ukraine
  • US - California (same location-specific sites as US - Colorado)
  • US - Colorado (same location-specific sites as US - California)

The reason is that we are not testing for keywords, emails, and phone numbers. Location should be good even for non-English sites.

@SebastianZimmeck
Member

@dadak-dom, where in the Google Drive are the lists of sites to crawl?

It does not seem to be the Web_Crawl_Site_List folder. For example, I do not see Australia there.

(cc'ing @atlasharry, @PattonYin, @ananafrida)

@SebastianZimmeck
Member

SebastianZimmeck commented Sep 20, 2024

Looks like the site lists are in this repo. But Ireland is missing. @dadak-dom?

If so, what are the other lists in the Google Drive?

@SebastianZimmeck
Member

Also, where is the final methodology for creating the crawl lists (and test lists)?

@dadak-dom
Collaborator

@SebastianZimmeck Yes, the site lists are in this repo. The lists in the Google Drive detail which sites were removed and why.

I wrote down a methodology that was used when we were still using VPNs, but other than connecting to a VPN, the process remains the same. @atlasharry should have more details about test list methodology, since he made the most recent one (and also the Korea list, I believe).

@SebastianZimmeck
Member

SebastianZimmeck commented Sep 20, 2024

OK, thanks, @dadak-dom! Here is the final list:

  1. Australia
  2. Brazil
  3. Canada
  4. Germany
  5. India
  6. Singapore
  7. South Africa
  8. South Korea
  9. Spain
  10. United States (Los Angeles, California)

Also, the generic list we apply to all locations is the United States list.
