
Test crawler performance #9

Open
SebastianZimmeck opened this issue Dec 1, 2023 · 61 comments

@SebastianZimmeck
Member

Before we start the crawl, we need to test the crawler's performance. So, we need to compare the manually observed ground truth with the analysis results. We probably need a 100-site test set.

  • How do we select the test set sites given the different locations and states (issue Create Manually Curated List of Sites to Crawl #7) so that we have good test coverage?
  • One issue is that different loads of a site may lead to different trackers, etc., being detected. So, we need to look at the ground truth and analysis results for exactly the same site load. So, maybe, just load one site, get both the ground truth and analysis results, and check?
  • We need to document all of that.

(@JoeChampeau and @jjeancharles feel free to participate here as well.)

@SebastianZimmeck SebastianZimmeck added the testing An issue related to testing label Dec 1, 2023
@SebastianZimmeck
Member Author

Where are we with the testing protocol, @danielgoldelman?

@danielgoldelman
Collaborator

danielgoldelman commented Jan 11, 2024

Preliminary testing protocol

  1. Run the crawl and collect the data.

  2. Separate the PP data from the entries data.

For PP data:

  1. Create a spreadsheet for each root URL.

  2. Log every piece of data into the spreadsheet with everything PP gives us, separated by PP data type.

For all HTTP request data:

  1. Create a spreadsheet for each root URL.

  2. Do the most generic string matching with the values we are looking for (see the sketch after this list). Note: we will have lists of keywords per VPN, we can get the ipinfo location while using the VPN by going to their site, and we can find monetization labels within the HTTP requests. For example, if the ZIP code should be 10001, instead of a regex of \D10001\D, we look for just the string 10001. For every single key we could be looking for, we run it on the HTTP requests gathered. Collate these possible data-stealing requests.

  3. Go through every HTTP request and label it, adding to the spreadsheet when necessary.
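
A minimal sketch of what the string matching in step 2 could look like, assuming the collected HTTP requests are available as an array of strings; the function name, keyword values, and data shape are illustrative, not the actual tooling:

// Illustrative only: flag any collected HTTP request that contains one of the
// keywords we are looking for (e.g., the plain string "10001" rather than a
// regex like \D10001\D).
const keywords = ["10001", "New York", "40.7128", "-74.0060"]; // example per-VPN keyword list

function findCandidateRequests(httpRequests, keys) {
  const hits = [];
  for (const request of httpRequests) {
    // Plain substring search, deliberately more generic than a strict regex.
    const matched = keys.filter((key) => request.includes(key));
    if (matched.length > 0) {
      hits.push({ request, matched }); // collate these for manual labeling in the spreadsheet
    }
  }
  return hits;
}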

Now to bring both together:


  1. We have the two spreadsheet documents now. Time to classify.

  2. Potentially in a new spreadsheet, place first all HTTP requests that occur in both the PP data and the full set of HTTP requests, then all that only occurred in PP, then all that only occurred in the HTTP requests.
  3. Perform the classification.

@SebastianZimmeck
Member Author

@danielgoldelman, can you reformat and reorder your comment? The order is very hard to follow, there are multiple numbers 1 and 2 after each other, etc.

@danielgoldelman
Collaborator

@SebastianZimmeck sorry, the original comment was written on GitHub mobile, so formatting was hard to check. Changes made above.

@SebastianZimmeck
Member Author

@danielgoldelman and @dadak-dom, please carefully read sections 3.2.2 and 3.3 (in particular, 3.3.2) of our paper. We can re-use much of the approach there. I do not think that we need an annotation of the ground truth, but both of you should check the ground truth (for whatever our definition of ground truth is) and come to the same conclusion.

We have to create a testing protocol along the lines of the following:

  1. Select the set of analysis functionality that we are testing and how

    • By default all analysis functionalities
    • But how are we going to test for keywords, for example? How for email addresses, how for phone numbers, ...?
  2. Pick a set of websites to test

    • How many? Probably, 100 to 200. We need some reasonable standard deviation. For example, it is meaningless to test a particular analysis functionality for just one site because a successful test would not allow us to extrapolate and claim that we are successful for, say, 1,000 sites in our crawl set with that functionality. So, we need, say, 10 sites successfully analyzed to make that claim. Can you solidify that? What is the statistical significance of 10 sites? (See the rough confidence-interval sketch at the end of this comment.) @JoeChampeau can help with the statistics. We should have some statistical power along the lines of "with 95% confidence our analysis of latitude is within the bounds of a 10% error rate" (e.g., if we detect 1,000 sites having a latitude with 95% confidence the real result is between 900 and 1,100 sites).
    • Which sites to select? Again, the selected set should allow us to make the claim that if an analysis functionality works properly on the test set, it also works for the large set of sites that we crawl. So, we would need to pick a diverse set of sites covering every analysis functionality for each region that we cover. There should be no bias. For example, there will be no problems for monetization categories because they occur so frequently, but how do we ensure, e.g., that there is a meaningful number of sites that collect latitudes? Maybe, pick map sites from somewhere? How do we pick sites for keywords (assuming we are analyzing keywords)?
    • Are we using the same test set of sites for each country/state? Yes, no, is some overlap OK, is it harmful, is it good, ...?
    • How are we selecting sites randomly? Use random.org.
    • We can't select any sites that we used for preliminary testing, i.e., validation. So, which are the sites, if any, that need to be excluded? If we randomly select an excluded site, how do we pick a new one? Maybe, just the next one on a given list.
  3. Running the test

    • Are we testing one site at a time or running the complete test set? If we do the former, we need to record all site data (and be absolutely sure that there are no errors and nothing omitted in the recording). We need to get both the analysis results and the ground truth data at the same time. The reason is that when we load a site multiple times, there is a good chance that not all trackers and other network connections are identical for both loads. So, the analysis results could diverge from the ground truth if the latter is based on a different load. We need to check the ground truth for the exact site load from which we got the analysis results. The alternative to a complete test set crawl is to do the analysis for one site at a time, i.e., visit a site, record the PP analysis results, use browser developer tools (and other tools, as necessary) to check the ground truth, record the evidence, record the ground truth evidence and result, then analyze the next site and so on. So, we would be doing multiple crawls of one site.
    • We will also need to change the VPN for every different location.
    • Who is going to run the test? @JoeChampeau has the computer. Is it you, @dadak-dom or @danielgoldelman? Both the PP analysis results and the ground truth should be checked by two people independently. This seems easier if only one test set crawl is done as opposed to the site-by-site approach.
  4. Ground truth analysis

    • How do we analyze the ground truth? Per your comment above, @danielgoldelman, I take it that we do string matching in HTTP messages. Is that a reliable indicator? Maybe, we would also need to look at, say, browser API usage for latitude, i.e., the browser prompting the user to allow location access. What are the criteria to reliably analyze the ground truth? This can be different for our different functionalities.

These questions cannot be answered in the abstract. @danielgoldelman and @dadak-dom, please play around with some sites for each analysis functionality and come up with a protocol to analyze it. For each functionality you need to be convinced that you can reliably identify true positives (and exclude false positives and false negatives). In other words, please do some validation tests.
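
As referenced in item 2 above, here is a rough, illustrative calculation (not a final methodology) of why roughly 10 positive sites per functionality is about the minimum: a normal-approximation 95% confidence interval for a detection rate observed on n test sites. All numbers are made up.

const z = 1.96; // ~95% confidence

function confidenceInterval95(successes, n) {
  const p = successes / n;
  const margin = z * Math.sqrt((p * (1 - p)) / n);
  return [Math.max(0, p - margin), Math.min(1, p + margin)];
}

// With 10 sites and 9 correct detections, the interval is very wide:
console.log(confidenceInterval95(9, 10));   // ≈ [0.71, 1.00]
// With 100 sites and 90 correct detections, it tightens considerably:
console.log(confidenceInterval95(90, 100)); // ≈ [0.84, 0.96]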

@dadak-dom
Collaborator

Who is going to run the test?

Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me? I thought it would make sense since that's the computer that we will use to run the actual crawl. That way, we could avoid any potential issues arising when switching between Windows and Mac. Just a thought.

@SebastianZimmeck , the way I understand it, we will end up with three different site lists for each country (please correct me if I'm wrong)

  1. Validation (what Daniel and I are doing now)
  2. Test set (what we're preparing for and will soon be running)
  3. The actual crawl list.

We cannot have any overlap between the validation and the test set, but can the test set (and/or the validation) be derived from the actual crawl list? I would need to know this before I start making any lists for the test set.

@SebastianZimmeck
Member Author

Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me?

It certainly makes sense, but that would depend on whether @JoeChampeau has time, as the task was originally @danielgoldelman's. (Given our slow speed, the point may more or less resolve itself since we will all be back on campus soon anyway.)

(please correct me if I'm wrong)

All correct.

but can the test set (and/or the validation) be derived from the actual crawl list?

Yes, the validation and test set can be derived from the crawl list.

@dadak-dom
Collaborator

I have added my proposed crawl testing lists to the branch connected with this issue (issue-9). Here was my procedure:

  1. For each country that we will crawl, create a new .csv file.
  2. Go to random.org and have it generate a list of random integers from 1 to 525.
  3. Take the first six integers and find the matching URLs from the general list.
  4. Regenerate the random integers and find the six matching URLs from the country-specific list.
  5. If there seems to be a bias for one functionality, throw the list out and try again. (Or if there is any overlap with sites that were used for validation; luckily, this was never the case for me.)
  6. Repeat the process for each location we will crawl, so ten times total.

With point 5 I tried my best to include a fair share of sites that take locations, as monetization was easy to come by. @SebastianZimmeck let me know if any changes need to be made.

@SebastianZimmeck
Member Author

OK, sounds good!

So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.

With point 5 I tried my best to include a fair share of sites that take locations

How did you make the guess that a site takes locations?

@dadak-dom
Collaborator

So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.

Yes, 120 sites total.

How did you make the guess that a site takes locations?

A couple of ways, e.g. visiting the site and seeing if it requests the location from the browser, or if PP detects a location, or if I know from my own browsing that the site would take locations.

@SebastianZimmeck
Member Author

OK, sounds good!

Feel free to go ahead with that test set then. As we discussed yesterday, maybe the performance is good. Otherwise we call that set a validation set, pick a new test set, and repeat the test (after fixing any shortcomings with the crawler and/or extension).

One important point: the PP analysis needs to be set up exactly as it would be in the real crawl, i.e., with a VPN and as a crawl, not just the extension. Though, it does not need to be on the crawl computer.

@dadak-dom
Collaborator

One more thing: I noticed this morning that there are a lot of sites in the general list that redirect to sites that are already on the list. Can't believe I didn't catch that sooner, so I'll fix that ASAP. Just to be safe, I'll also redo the general list part of the test set.

@SebastianZimmeck
Member Author

Great!

@dadak-dom
Collaborator

@SebastianZimmeck I'm compiling the first round of test data, but so far I'm not finding as many location requests as I'd like. You mention in one of the comments above that it might be worthwhile to make a list of, say, map sites. If I were to make a test list of sites with the clear intention of finding location requests, how can I make it random? Would it be valid to find, for example, a list of 200 map sites (not necessarily from the lists that we have), and pick randomly from that? If not, what are some valid strategies?

@SebastianZimmeck
Member Author

what are some valid strategies?

Just map sites would probably be too narrow of a category. There may be techniques that are map site-specific. In that case our test set would only claim that we are good at identifying locations on map sites. So, we need more categories of sites, ideally, all categories of sites that typically get people's location.

Here is a starting point: Can you give some examples of websites that use geolocation to target local customers? So, the categories mentioned there, plus map sites, plus any other category of site that you found in your tests that collect location data. Maybe, there are generic lists (Tranco, BuiltWith, ...) that have categories of sites. Compile a list out of those and then randomly pick from them. That may be an option, but maybe you have a better idea.

So, maybe our test set is comprised of two parts:

  1. Location test set
  2. Monetization and Tracking test set

Maybe, it even has three parts if tracking pixel, browser fingerprinting, and/or IP address collection (the Tracking categories) are also rare. Then, we would also need to do a more intricate test set construction for the Tracking categories as well. I would expect no shortage of sites with Monetization.

There are no hard rules for testing. The overall question is:

What test would convince you that the crawl results are correct? (as to lat/lon, IP address, ... )

What arguments could someone make if they wanted to punch a hole in our claim that the analysis results demonstrate our crawl results are correct? Some I can think of: too small a test set; not enough breadth in the test set, i.e., not covering all the techniques that we use or types of sites; sites not randomly selected, i.e., biased towards sites we know work ... (maybe there are more).

I would think we need at least 100 sites in the test set overall and generally not less than 10 sites for each practice we detect (lat/lon, tracking pixel, ...). Anything less has likely not enough statistical power and would not convince me.

@dadak-dom
Collaborator

I've just added the lists and data that we can use for the first go at testing. A couple things to note:

  1. I managed to get a set where PP detected at least 10 of nearly every analysis functionality we were looking for, except for Zip Code and Lat/Long. My theory is that using the VPN makes it harder for sites to take this information, and so there are no requests with this information for PP to find. Of course, this will only be verified after testing fully, but I wanted to raise the possibility that these two analysis functions may not be possible with the setup we are going with. Just from the number of sites, it's strange that none of them took lat/long, and yet many took region and city. I also did a quick test where I found a site I knew would take lat/long or zip code, and visited it without a VPN to make sure PP found those things. I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City. The good news is that Region and City seem to pop up quite a bit, so I believe we should have no problem testing for them.
  2. For documentation, here was my procedure for generating the lists:
  • I had two pools of sites: one was a mixture of the top sites of different categories, as well as sites that I had encountered in my personal browsing. The other set was the top 100 sites from the BuiltWith list.
  • For each list, generate a set of random integers. Use the corresponding row number for the site that you'll select.
  • Take six sites from the mixture, and six sites from the pre-compiled list, for each country that we crawl.
  • In theory, you should have 12 sites. However, a few of them were bound to crash, and so as long as fewer than two crashed for a given crawl, I thought it was ok, since we'll still have over 100 sites. So you should have 10-12 sites per country/state.
  • If you generate a random integer that corresponds to a URL that was already taken, use the next available URL.

When crawling, I made sure that I was connected to the corresponding VPN for each list, i.e. when crawling using the South Africa list, I was connected to South Africa.
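
For documentation, a small illustrative sketch of the selection rule described above (random row numbers, with excluded or already-taken URLs replaced by the next available one). The function and variable names are assumptions, not the actual scripts used, and it assumes the list is larger than the exclusions:

function pickSites(urls, excludedUrls, randomInts, count = 6) {
  const chosen = [];
  for (const randomInt of randomInts) {
    if (chosen.length === count) break;
    let idx = (randomInt - 1) % urls.length; // random.org integers are 1-based
    // If the URL is excluded (e.g., used for validation) or already taken,
    // use the next available URL.
    while (excludedUrls.has(urls[idx]) || chosen.includes(urls[idx])) {
      idx = (idx + 1) % urls.length;
    }
    chosen.push(urls[idx]);
  }
  return chosen;
}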

@SebastianZimmeck
Member Author

Good progress, @dadak-dom!

I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City.

Not having lat/long would be substantial. Can you try playing around with the Mullvad VPN settings?

[Screenshot: Mullvad VPN settings]

Can you try allowing as much as possible? Our goal is to have the sites trigger as much of their tracking functionality as possible.

Also, while I assume that the issue is not related to Firefox settings since you get lat/long with presumably the same settings in Firefox with VPN and Firefox without VPN, we should also set the Firefox settings to allow as much as possible.

Maybe, also try a different VPN. What happens with the Wesleyan VPN, for example?

The bottom line: Try to think of ways to get the lat/long to show up.

@dadak-dom
Collaborator

I messed around with the settings for both Firefox Nightly and Mullvad, no luck there.

I've tried crawling and regularly browsing with both Mullvad and the Wesleyan VPN. I was able to get Wesleyan VPN to show coarse location when browsing, but not when crawling. Under Mullvad, coarse/fine location never shows up.

However, when trying to figure this out, I noticed something that may be of interest. Per the Privacy Pioneer readme, the location value that PP uses to look for lat/long in HTTP requests is taken from the Geolocation API. Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.
Maybe something strange is going on with my machine, so to check what I did, I encourage anyone to try the following:

  1. Without a VPN connection, visit any website.
  2. Paste the following code into your developer console:

const options = {
  enableHighAccuracy: true,
  timeout: 5000,
  maximumAge: 0,
};

function success(pos) {
  const crd = pos.coords;

  console.log("Your current position is:");
  console.log(`Latitude : ${crd.latitude}`);
  console.log(`Longitude: ${crd.longitude}`);
  console.log(`More or less ${crd.accuracy} meters.`);
}

function error(err) {
  console.warn(`ERROR(${err.code}): ${err.message}`);
}

navigator.geolocation.getCurrentPosition(success, error, options);

  3. Compare this value to what ipinfo.io gives you by visiting ipinfo.io (without a VPN, they should be roughly the same)
  4. Now do steps 2 and 3 while connected to a VPN in a different country

When I do these steps, I end up with a different value for ipinfo, but the value from the Geolocation API stays the same (the above code is set to not use a cached position, i.e., maximumAge: 0).
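
For convenience, the same comparison can be scripted in the console; a rough sketch, assuming ipinfo.io's JSON endpoint (which returns a "loc" field as "lat,long") is reachable cross-origin from the page:

async function compareLocations() {
  // IP-based location as seen by ipinfo.io
  const ipinfo = await fetch("https://ipinfo.io/json").then((r) => r.json());
  const [ipLat, ipLon] = ipinfo.loc.split(",").map(Number);

  // Browser Geolocation API location (no cached position)
  navigator.geolocation.getCurrentPosition(
    (pos) => {
      console.log("Geolocation API:", pos.coords.latitude, pos.coords.longitude);
      console.log("ipinfo.io:      ", ipLat, ipLon);
    },
    (err) => console.warn(err.message),
    { enableHighAccuracy: true, timeout: 5000, maximumAge: 0 }
  );
}

compareLocations();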
I then looked at location evidence that PP collected for crawls I did when connected to other countries. Sure enough, PP would find the region and city, because that info is provided by ipinfo. However, PP would miss the lat/long that was in the same request, most likely because the geolocation API is feeding it a different value, and so PP is looking for something else.

However, this doesn't explain why PP doesn't generate entries for coarse and fine location when crawling without a VPN. From looking at the ground truth of some small test crawls, there clearly are latitudes and longitudes of the user being sent, but for some reason PP doesn't flag them. @danielgoldelman , maybe you have some idea as to what is going on? This doesn't seem to be a VPN issue as I initially thought.

@danielgoldelman
Collaborator

Interesting. I was having different experiences, @dadak-dom ... lat and lng seemed to be accurately obtained when performing the crawls before. Have you modified the .ext file?

@dadak-dom
Collaborator

No, I didn't make any changes to the .ext file, @danielgoldelman . Was I supposed to?

@SebastianZimmeck
Member Author

Good progress, @dadak-dom!

Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.

When I do these steps, I end up with a different value for ipinfo, but the value from geolocation API stays the same

Hm, is this even a larger issue not related to the VPN? In other words, even in the non-VPN scenario do we have a bug that the location is not properly updated? This is the first point we should check. (Maybe, going to a cafe or other place with WiFi can be used to get a second location to test.)

What is not clear to me is that when we crawled with different VPN locations for constructing our training/validation/test set, we got instances of all location types. So, I am not sure what has changed since then.

@danielgoldelman, can you look into that?

@dadak-dom
Collaborator

I forgot to use the hashtag in my most recent commit, but @danielgoldelman and I seem to have solved the lat/long issue. Apparently, the browser that Selenium created did not have the geo.provider.network.url preference set, so the extension wasn't able to evaluate a lat or long when crawling. My most recent commit to issue-9 should fix this, but this should be applied to the main crawler as well. Hopefully, this means that we can get started with gathering test data and testing.
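
For reference, a minimal sketch of what setting this preference could look like, assuming a Node selenium-webdriver setup; the provider URL and the permission preference are placeholders/assumptions, not necessarily what the crawler actually uses:

const { Builder } = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");

const options = new firefox.Options();
// Placeholder endpoint; point this at the network geolocation provider the crawler relies on.
options.setPreference("geo.provider.network.url", "https://location.example.com/v1/geolocate");
// Auto-grant geolocation prompts so sites (and the extension) can read a position.
options.setPreference("permissions.default.geo", 1);

(async () => {
  const driver = await new Builder()
    .forBrowser("firefox")
    .setFirefoxOptions(options)
    .build();
  await driver.get("https://example.com");
  // ... crawl / extension analysis happens here ...
  await driver.quit();
})();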

@danielgoldelman
Collaborator

Additionally, we have run the extension as if we were the computer, and compared our results for lat/lng with what we would expect the crawl to reasonably find. This approach worked! We used the preliminary validation set we designated earlier on, so this claim should be supported via further testing when we perform the performance metric crawl, but on first approach the crawl is working as intended for lat/lng.

@SebastianZimmeck
Member Author

Great! Once you think the crawler and analysis works as expected, feel free to move to the test set.

dadak-dom added a commit that referenced this issue Jul 8, 2024
dadak-dom added a commit that referenced this issue Jul 8, 2024
@SebastianZimmeck SebastianZimmeck added the omnibus An issue that covers multiple connected (smaller) sub-issues label Jul 9, 2024
@SebastianZimmeck
Member Author

SebastianZimmeck commented Jul 9, 2024

Here is the overall plan. This issue is an omnibus testing issue.

@dadak-dom has opened a separate issue for comparing the Crawler + VM results vs the local results (#49).

There is also a dedicated issue for performing the South Korea test (#54).

@dadak-dom
Collaborator

Here are the results from my testing. As discussed, @atlasharry and I will look into possible explanations for lat/long and zipCode behaviors.
[Image: test results]

@dadak-dom
Collaborator

Quick update from my end:
Initially, I was going to feed the false negative results from my test run on the VM into both Daniel's final model and the model that I was tinkering with, with the expectation that the original PyTorch model would incorrectly classify the snippets, and maybe we'd see some improvements in the tuned model. Interestingly, feeding the PyTorch model the mislabeled snippets actually resulted in correct results for nearly all the snippets I tried. In other words, there doesn't seem to be a problem with the original PyTorch model.
However, manually feeding the snippets into the TensorFlowJS model produced the same results as the test (false negative). I've confirmed with Daniel that there was no performance drop in converting from PyTorch to TensorFlow, so the only other thing I could think of is the known issue regarding the conversion from TF to TFJS. I think my next step will be to try replicating Daniel's conversion and seeing if I can get anywhere with that.

@SebastianZimmeck
Member Author

Interestingly, feeding the PyTorch model the mislabeled snippets actually resulted in correct results for nearly all the snippets I tried. ... I've confirmed with Daniel that there was no performance drop in converting from PyTorch to TensorFlow, so the only other thing I could think of is the known issue (tensorflow/tfjs#8025) regarding the conversion from TF to TFJS.

That is a very good point!

It is possible that the performance drop, at least some of it, has to do with the conversion from PyTorch/TensorFlow (in Python) to TensorFlow (in JavaScript). However, why are we seeing a bigger drop than we saw earlier?

From our paper:

PyTorch/TensorFlow (in Python) results:

[Screenshot: PyTorch/TensorFlow (Python) results table from the paper]

TensorFlow (in JavaScript) results:

[Screenshot: TensorFlow (JavaScript) results table from the paper]

For example, latitude dropped from a recall of 0.97 to 0.82. However, the current results above show a drop to a recall of 0.5 (I am assuming fineLocation is latitude or longitude. As a side note, can we also test for latitude and longitude and not fineLocations as results for latitude and longitude can differ?)

The only reason I can think of why that is the case is that the current test set, more specifically, the instances of the current test set for which we get the false negative location results, are different compared to the original test set and/or we have more of those instances now.

So, can we run the current test set on the version of the code at the time?

If we are able to replicate @danielgoldelman's testing procedure with the code at that time but with our current test set, we should be getting results identical or at least close to what we are getting now with the new code and the current test set (if the conversion is the issue).

I am making this suggestion as we discussed that it is difficult to replicate running the test set at the time on our new code. That would also be an option if possible. In that case the old code and new code should return identical or at least very similar results.

I think my next step will be to try replicating Daniel's conversion and seeing if I can get anywhere with that.

That is also worthwhile to get a better understanding in general. Maybe, there is something we are currently not thinking of.

@SebastianZimmeck
Member Author

SebastianZimmeck commented Sep 21, 2024

1. Current Status of the Model

It works! @atlasharry and @PattonYin made some great progress! As it turns out, there is nothing wrong with the model. Both the (1) PyTorch/TensorFlow Python and (2) TensorFlow.js versions perform in line with the test results reported in the paper. @atlasharry used the model from Hugging Face that corresponds to our GitHub served model and fixed a small issue in the conversion and also slightly re-tuned the parameters (which, however, did not make much of a difference).

There were a handful of incorrect classifications when testing on the 30+ additional test instances that @dadak-dom created. However, the validity of our original test set stands. The 30+ instances test could have just been an unlucky pick of test set sites or there was something different about those test instances as those were created manually and not according to the prior process.

@PattonYin and @atlasharry, please feel free to include additional information here, if there is anything important to add, to conclude this point.

2. Accuracy Testing

As we know that the model is classifying accurately, we can finally start testing the accuracy of the crawler (by "crawler" I mean the crawler including model, extension and VM).

First, here is the 100-site test set, and here is the methodology for how the test set was constructed.

2.1 Test Set

Very important: @atlasharry, @PattonYin, and @ananafrida, if you do any testing on your own, please do not use any of the sites in the test set. The crawler is not allowed to see any test instances to guarantee an unbiased test. If you have used any sites from the test set, we need to randomly replace those sites with unseen ones per the described methodology.

2.2 Two Analyses

As I see it, we need to perform two analyses:

  1. Just do a normal evaluation. In other words, just as we ran the extension (with the model) on its own, we just need to run the crawler checking the ground truth against the analysis data.
  2. Compare crawler (i.e., crawler including model, extension and VM) results with normal (i.e., just extension/model) results. The point of this exercise is to make a valid claim that our crawler results are actually similar to what a normal user experiences.

For both steps, we should calculate precision, recall, and F1.

Important: We should only calculate these scores from the positive instances and not the negative instances (i.e., not use weighted scores).
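
For reference, a minimal sketch of the score calculation on positive instances only (true negatives are ignored); the counts in the example are made up:

function scores(tp, fp, fn) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Example: 45 correct detections, 5 spurious detections, 10 missed instances.
console.log(scores(45, 5, 10)); // { precision: 0.9, recall: ≈0.818, f1: ≈0.857 }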

2.3 How to Do the Crawler Analysis?

Now, while we have no problem performing the first analysis (just extension/model) on the same test set run, i.e., evaluating the ground truth and analysis data based on the same test set run, this is naturally not possible for the second analysis (i.e., running the crawler vs. running just the extension/model are necessarily two different runs). So, this brings us naturally to the problem: how can we distinguish natural fluctuations in site loads in different runs from differences caused by adding our crawler infrastructure?

I think if we run each test, say, three times, we get a sense of the natural differences of each type of run and can distinguish those from the crawler-created differences. In other words, run just the extension/model three times on the test set and check their differences. Then, run the crawler three times and check their differences. Now, averaging out the extension/model runs and the crawler runs, do these averaged runs look very different from one another? So far the theory. I hope it works in practice. 😄

Now, next question: How can we get a good comparison? We would need to be physically in a place of a VM location and run the extension/model as a normal user. Since I am at a workshop in the Bay Area on November 7, I can do that. I can run just the extension/model (without crawler and VM) and collect the ground truth data (and extension data) as a real California user. Since that is still a few weeks out, the November 7 date should not stop us from already doing the crawl when we are ready. We probably have a sufficient intuitive understanding of whether the accuracy is good. Even if the November 7 test turns out to be not good, we can still re-crawl. 10K sites is "small big data" and can be done fairly quickly.

@SebastianZimmeck
Member Author

@atlasharry and @PattonYin found that while the model works as expected, there is still an issue with the analysis performed by the extension, i.e., the model results do not seem to be properly processed by the extension.

We discussed that @PattonYin and @atlasharry will:

  1. Run the analysis on the complete test set and compare the results against ground truth results to determine the magnitude of the issue. So far there has only been one instance that is not processed correctly. Are there more?
  2. The results from task 1 will tell us to which extent we should address the issue, if at all. If there are more instances of incorrect processing by the extension, the set of instances can help us to pinpoint the issue. E.g., what do these instances have in common?

@SebastianZimmeck
Member Author

@PattonYin and @atlasharry found that the incorrect results produced by the extension indeed need to be addressed. It is a bigger issue.

There are two points to note:

  1. The classification suffers from low recall; precision is not a problem
  2. Errors occur more when numbers are involved (latitude, longitude, ZIP code) as opposed to alphabetical character strings (region, city)

@atlasharry and @PattonYin will look into the extension code, e.g., logging the process at various stages. @PattonYin mentioned that the search functionality may not run correctly.

@SebastianZimmeck
Member Author

@atlasharry and @PattonYin have resolved the issue of the incorrect classifications.

The issue was caused by an incorrect implementation of escape sequences. In particular, there was one line of code in the pre-processing of the HTTP messages that added multiple backslashes to large parts of a message. This pre-processing happened right before a message was passed to the model for classification and caused mis-classifications.

It is not fully clear why this issue did not come up during the Privacy Pioneer extension testing for the PETS paper. One reason could be that the test instances did not have (as much) escaping as our current test set.

@PattonYin, can you link the file and line of code here?

@atlasharry and @PattonYin, please also add relevant details here, if any.

@SebastianZimmeck
Member Author

We will proceed as follows:

  1. Implement the pre-processing fix in both extension and crawler
  2. Create a new test set. @atlasharry will also add to the test set creation protocol how we selected the sites for the new test set
  3. Once we have the test set, we start with the first test phase (under "Just do a normal evaluation. ..." above). For that, we need to crawl the sites and also manually check the underlying data of the test run to tell whether the classifications are correct. We should also save the data (i.e., HTTP messages) for later reference.

This week, we will do 1 and 2 and prepare 3. So that by Friday next week we can do 3.

@PattonYin
Member

PattonYin commented Oct 25, 2024


@atlasharry and @PattonYin have resolved the issue of the incorrect classifications.

The issue was caused by an incorrect implementation of escape sequences. In particular, there was one line of code in the pre-processing of the HTTP messages that added multiple backslashes to large parts of a message. This pre-processing happened right before a message was passed to the model for classification and caused mis-classifications.

It is not fully clear why this issue did not come up during the Privacy Pioneer extension testing for the PETS paper. One reason could be that the test instances did not have (as much) escaping as our current test set.

@PattonYin, can you link the file and line of code here?

@atlasharry and @PattonYin, please also add relevant details here, if any.

Sure, we added a preprocessing step to the input with the following line of code:
const input_cleaned = input.replace(/\\+\"/g, '\\"');

The key issue we identified is that the number of backslashes preceding the double quotation mark (") directly affects the model's performance. As illustrated in the example, the second snippet containing 2 backslashes + quotation mark \\" is tokenized differently (adding token 1032) compared to the case with 1 backslash + quotation mark \", leading to different predictions. To address this, the most straightforward solution is to replace any number of backslashes + quotation mark with 1 backslash + quotation mark.

The explanation for this behavior lies in how the backslash is used as an escape character. When a quotation mark needs to appear within a string, the backslash signals that the quotation mark is part of the string rather than an indicator of its end. Additionally, when this string is saved to files like JSON, another backslash is added as part of the escape sequence. Therefore, two backslashes are used to represent a single backslash in the stored string, compounding the number of backslashes introduced.

Since these additional backslashes don't alter the meaning of the text, they can be safely removed during preprocessing.
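
A small, self-contained illustration of the effect (the input string is made up):

// The raw snippet contains a run of backslashes before each quotation mark.
const raw = '{\\\\\\"lat\\\\\\": 41.55, \\\\\\"lon\\\\\\": -72.65}';
// Normalize any number of backslashes + quotation mark to a single \" before tokenization.
const cleaned = raw.replace(/\\+\"/g, '\\"');
console.log(raw);     // {\\"lat\\": 41.55, \\"lon\\": -72.65}
console.log(cleaned); // {\"lat\": 41.55, \"lon\": -72.65}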

Tokenized string: [image]
String before tokenization: [image]

@atlasharry
Member

Here is the updated version with the input clean-up in Privacy Pioneer.

Adding to Patton's point, different numbers of backslashes would result in different outputs from the tokenizer. For example:
For the string literals in the snippet, \" would result in a tokenized input [1032 1000], while \\\" would be [1032 1032 1032 1000]. This difference, caused by the incorrect implementation of escape sequences, could possibly shift the model's attention or change the importance that it assigns to certain parts of the input.

@atlasharry
Member

This is the sheet that we use to track the second crawl analysis. We haven't calculated the scores yet, but some intermediary results/observations are:

Among those sites that collect the user's location, the crawler + extension performs pretty well on prediction. The few cases that lead to a false negative are:

  • The location information the extension chooses to match (called from the ipinfo API) does not usually match the sites' location information. E.g., ipinfo identifies the VM's ZIP code as 90009, while some sites identify the ZIP code as 90060.
  • For sites that use more than one API to collect the users' location, the extension can only identify one among all. (We are still investigating why this happens.)

We will continue doing the score calculations and analysis.

@SebastianZimmeck
Member Author

Thanks, @atlasharry (and @PattonYin)!

A few suggestions:

  • Can you identify in the Sheet what the ground truth values are and what the analysis results are?
  • Can you include in the Sheet logic for calculating precision, recall, and F1?
  • Generally, it is not very clear what is what in the Sheet. Maybe, add a tab with explanations and/or more meaningful tab/column/row/etc. names.

Ideally, we want to have the complete results by Tuesday.

Also, if you can prepare the protocol for next Wednesday when I am in California for the second part of the test, that would be good. At that time, we have to be sure it works, which is why it would be good if you also test it yourself (from CT; can somebody who does not know the details of your testing follow it and run the test?).

@SebastianZimmeck
Member Author

SebastianZimmeck commented Nov 5, 2024

As there were still some incorrect results, @PattonYin and @atlasharry identified the following issue, as @PattonYin describes:

Sebastian Zimmeck, we checked the data flow in the extension and found that the LengthHeuristic and the reqURL are the main causes.

Length:
According to the existing code and paper, we won't analyze any HTTP message exceeding 100,000 characters. And our analysis revealed that, in many FN cases, the location is not identified because the message containing it exceeds the 100,000-character limit.
To fix this, we're thinking about changing the heuristic a bit by having it analyze the first 100,000 characters of the message.
But I'm a little worried about the "consistency" issue, since according to the paper, we will directly skip such messages.

reqURL:
However, removing that heuristic doesn't resolve the issue. Although the snippets are now extracted, the model frequently predicts false.
We found this is because of the reqURL appended at the beginning of the text snippet. After removing the reqUrl, the model predicts correctly. (Please check Image 1.)

Because of this, we're thinking about 2 potential fixes: 1. modify the lengthHeuristic so that it will still analyze the first 100,000 characters and 2. remove the reqUrl at the beginning. Only when both are implemented does the model successfully identify the "region". (Please check Image 2.)

Just want to confirm if these fixes are acceptable. If these 2 fixes are okay, we can run a crawler test immediately, and see if that improves the model performance.

Additional note: I believe we can safely remove the reqUrl because this string came from JSON.stringify (please check Image 3), which has nothing to do with the snippet itself.

[Image 1]

[Image 2]

[Image 3]

Thus, we decided to fix these two issues. (Notably, different from what I thought, responses with 100,000+ characters were dropped instead of having their first 100,000 characters analyzed; the character limit is less of an issue for requests, as the code is specific to responses, because there are not many requests with 100,000+ characters.)
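
As a rough sketch of the adjusted length heuristic (names are illustrative, not the actual extension code): instead of skipping responses over the limit, only their first 100,000 characters are passed on for analysis.

const MAX_ANALYZED_CHARS = 100000;

function prepareResponseForAnalysis(responseBody) {
  if (typeof responseBody !== "string") return "";
  // Previously the whole message was skipped when it exceeded the limit;
  // now we analyze just its first MAX_ANALYZED_CHARS characters.
  return responseBody.slice(0, MAX_ANALYZED_CHARS);
}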

@SebastianZimmeck
Member Author

Per @atlasharry, we make the following modification:

Hi Sebastian Zimmeck, we have done some initial crawls with these new changes, and we realized removing the requestUrl would indeed cause some unintended consequences. For example, for a requestUrl "https://recsys-engine-prd.smiles.com.br/v2/recommendation?place=home_1&id=59fdc7db-8ab9-40bf-8645-74d9c7ed4591&lat=41.5551982&lon=-72.6519254&trace_id=82f83e85-5e39-42a6-92e8-c3f6ba9a0154&request_id=b65f013e-2aa0-411f-a92a-e66156dfb4fa" the URL itself contains the lon and lat. Removing the URL would skip some location information.
To fix that, we use a seemingly redundant but useful approach: separately feed the requestUrl, requestBody, responseData, and a combination of all three (the combined data of all three is how the extension originally did it) to the model. In this way, we have fixed the original issue and avoided unintended consequences by providing redundancy. After this change, we also tested the websites where the extension originally failed to identify location information; they are now all good.

[Image]
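
A hedged sketch of the redundant-input idea (model.classify and the field names are assumptions, not the extension's actual API): each part of the message and the combined text are classified separately, and a label counts as detected if any part triggers it.

async function classifyWithRedundancy(model, requestUrl, requestBody, responseData) {
  const parts = {
    requestUrl,
    requestBody,
    responseData,
    combined: `${requestUrl} ${requestBody} ${responseData}`, // how the extension originally fed the model
  };
  const results = {};
  for (const [name, text] of Object.entries(parts)) {
    results[name] = await model.classify(text); // assumed classifier interface returning a boolean
  }
  // A location label is reported if any of the four inputs is classified positive.
  const detected = Object.values(results).some(Boolean);
  return { detected, results };
}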

Per @atlasharry, we also make the following improvement:

Another potential improvement we could make is to update the model with the new model I trained at the beginning of the semester. The original model seems to be slightly overfitting the training set due to the training parameters and steps that were set. Therefore, in cases like the picture shows, if the snippet has some irrelevant data that contains mostly random characters (which it never sees in the training set), our original model would classify it as False. The new model would not have this problem. We have tested more than 10 FN snippets which the original model failed to identify, and the new model correctly classifies all of them.

[Image]

@SebastianZimmeck
Member Author

By next week @PattonYin and @atlasharry will perform the analysis as described under 2.2.1 above and present the performance results.

@atlasharry made the most excellent observation that we may not need to perform the analysis under 2.2.2 because PP's mechanism for identifying all location types (lat, ZIP, city, etc.) relies on the IP address, and that mechanism works the same way in every geographic location regardless of the specific IP address. Whether PP uses a VM IP address or a real IP address does not matter.
