Test crawler performance #9
Where are we with the testing protocol, @danielgoldelman? |
Preliminary testing protocol
For pp data:
For all http request data:
Now to bring both together:
|
@danielgoldelman, can you reformat and reorder your comment? The order is very hard to follow, there are multiple numbers 1 and 2 after each other, etc. |
@SebastianZimmeck sorry, the original comment was written on GitHub mobile, so formatting was hard to check. Changes made above. |
@danielgoldelman and @dadak-dom, please carefully read sections 3.2.2 and 3.3 (in particular, 3.3.2) of our paper. We can re-use much of the approach there. I do not think that we need an annotation of the ground truth, but both of you should check the ground truth (for whatever our definition of ground truth is) and come to the same conclusion. We have to create a testing protocol along the lines of the following:
These questions cannot be answered in the abstract. @danielgoldelman and @dadak-dom, please play around with some sites for each analysis functionality and come up with a protocol to analyze it. For each functionality you need to be convinced that you can reliably identify true positives (and exclude false positives and false negatives). In other words, please do some validation tests. |
Would it make sense if @JoeChampeau runs the test and then hands the data over to Daniel and me? I thought it would make sense since that's the computer that we will use to run the actual crawl. That way, we could avoid any potential issues arising when switching between Windows and Mac. Just a thought. @SebastianZimmeck, the way I understand it, we will end up with three different site lists for each country (please correct me if I'm wrong).
|
It certainly makes sense, but that would depend on whether @JoeChampeau has time, as the task was originally @danielgoldelman's. (Given our slow speed, the point may more or less resolve itself since we will all be back on campus soon anyway.)
All correct.
Yes, the validation and test set can be derived from the crawl list. |
I have added my proposed crawl testing lists to the branch connected with this issue (issue-9). Here was my procedure:
With point 5 I tried my best to include a fair share of sites that take locations, as monetization was easy to come by. @SebastianZimmeck let me know if any changes need to be made. |
OK, sounds good! So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.
How did you make the guess that a site takes locations? |
Yes, 120 sites total.
A couple of ways, e.g. visiting the site and seeing if it requests the location from the browser, or if PP detects a location, or if I know from my own browsing that the site would take locations. |
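For reference, a minimal sketch of how a 120-site test set along these lines could be drawn: 6 randomly picked sites from the general list plus 6 from each country/state-specific list for each of the 10 countries/states. The list contents and variable names are placeholders, not the actual crawl lists:

```js
// Sketch: draw 6 general + 6 country-specific sites per country/state.
function sampleWithoutReplacement(pool, n) {
  const picked = [];
  while (picked.length < n && pool.length > 0) {
    const i = Math.floor(Math.random() * pool.length);
    picked.push(pool.splice(i, 1)[0]); // mutates the pool so no site repeats
  }
  return picked;
}

const generalList = ["general-example-1.com" /* , ... */];
const countryLists = {
  "South Africa": ["za-example-1.co.za" /* , ... */],
  // ... 9 more countries/states
};

const generalPool = [...generalList]; // shared across regions, no repeats
const testSet = Object.entries(countryLists).flatMap(([region, list]) => [
  ...sampleWithoutReplacement(generalPool, 6).map((site) => ({ region, site, source: "general" })),
  ...sampleWithoutReplacement([...list], 6).map((site) => ({ region, site, source: "country-specific" })),
]);
console.log(testSet.length); // 12 sites per region, 120 total for 10 regions
```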
OK, sounds good! Feel free to go ahead with that test set then. As we discussed yesterday, maybe the performance is good. Otherwise, we call that set a validation set, pick a new test set, and repeat the test (after fixing any shortcomings with the crawler and/or extension). One important point: the PP analysis needs to be set up exactly as it would be in the real crawl, i.e., with a VPN and an actual crawl, not just the extension. Though, it does not need to be on the crawl computer. |
One more thing: I noticed this morning that there are a lot of sites in the general list that redirect to sites that are already on the list. Can't believe I didn't catch that sooner, so I'll fix that ASAP. Just to be safe, I'll also redo the general list part of the test set. |
Great! |
@SebastianZimmeck I'm compiling the first round of test data, but so far I'm not getting as many location requests found as I'd like. You mention in one of the comments above that it might be worthwhile to make a list of, say, map sites. If I were to make a test list of sites with the clear intention of finding location requests, how can I make it random? Would it be valid to find, for example, a list of 200 map sites (not necessarily from the lists that we have), and pick randomly from that? If not, what are some valid strategies? |
Just map sites would probably be too narrow a category. There may be techniques that are map site-specific. In that case our test set would only claim that we are good at identifying locations on map sites. So, we need more categories of sites, ideally, all categories of sites that typically get people's location. Here is a starting point: Can you give some examples of websites that use geolocation to target local customers? So, the categories mentioned there, plus map sites, plus any other category of site that you found in your tests that collects location data. Maybe, there are generic lists (Tranco, BuiltWith, ...) that have categories of sites. Compile a list out of those and then randomly pick from them. That may be an option, but maybe you have a better idea. So, maybe our test set is composed of two parts:
Maybe, it even has three parts if tracking pixel, browser fingerprinting, and/or IP address collection (the Tracking categories) are also rare. Then, we would also need to do a more intricate test set construction for the Tracking categories as well. I would expect no shortage of sites with Monetization. There are no hard rules for testing. The overall question is: What test would convince you that the crawl results are correct? (as to lat/lon, IP address, ... ) What arguments could someone make if they wanted to punch a hole in our claim that the analysis results demonstrate our crawl results are correct? Some I can think of: too small test set, not enough breadth in the test set, i.e., not covering all the techniques that we use or types of sites, sites not randomly selected, i.e., biased towards sites we know work ... (maybe there are more). I would think we need at least 100 sites in the test set overall and generally not less than 10 sites for each practice we detect (lat/lon, tracking pixel, ...). Anything less has likely not enough statistical power and would not convince me. |
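To make the "at least 10 sites per practice" requirement concrete, here is a small sketch of a coverage check that could be run over whatever ground-truth sheet we assemble. The practice labels and data shape are illustrative assumptions, not Privacy Pioneer's exact category names:

```js
// Sketch: warn if any detected practice has fewer than 10 ground-truth sites.
const MIN_SITES_PER_PRACTICE = 10;
const practices = ["latLng", "zipCode", "city", "ipAddress", "trackingPixel", "fingerprinting"];

function checkTestSetCoverage(groundTruth) {
  // groundTruth: [{ site: "example.com", practices: ["latLng", "ipAddress"] }, ...]
  for (const practice of practices) {
    const count = groundTruth.filter((entry) => entry.practices.includes(practice)).length;
    if (count < MIN_SITES_PER_PRACTICE) {
      console.warn(`Only ${count} test sites exhibit ${practice}; the test set needs more.`);
    }
  }
}
```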
I've just added the lists and data that we can use for the first go at testing. A couple things to note:
When crawling, I made sure that I was connected to the corresponding VPN for each list, i.e. when crawling using the South Africa list, I was connected to South Africa. |
Good progress, @dadak-dom!
Not having lat/long would be substantial. Can you try playing around with the Mullvad VPN settings? Can you try allowing as much as possible? Our goal is to have the sites trigger as much of their tracking functionality as possible. Also, while I assume that the issue is not related to Firefox settings since you get lat/long with presumably the same settings in Firefox with VPN and Firefox without VPN, we should also have the Firefox settings as allowing as much as possible. Maybe, also try a different VPN. What happens with the Wesleyan VPN, for example? The bottom line: Try to think of ways to get the lat/long to show up. |
I messed around with the settings for both Firefox Nightly and Mullvad, no luck there. I've tried crawling and regularly browsing with both Mullvad and the Wesleyan VPN. I was able to get Wesleyan VPN to show coarse location when browsing, but not when crawling. Under Mullvad, coarse/fine location never shows up. However, when trying to figure this out, I noticed something that may be of interest. Per the Privacy Pioneer readme, the location value that PP uses to look for lat/long in HTTP requests is taken from the Geolocation API. Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.
```js
const options = {
  maximumAge: 0, // do not use a cached position
};

function success(pos) {
  console.log("Your current position is:");
  // log the coordinates reported by the Geolocation API
  console.log(`Latitude: ${pos.coords.latitude}, Longitude: ${pos.coords.longitude}`);
}

function error(err) {
  console.warn(`ERROR(${err.code}): ${err.message}`);
}

navigator.geolocation.getCurrentPosition(success, error, options);
```
When I do these steps, I end up with a different value for ipinfo, but the value from the Geolocation API stays the same (the above code should be set to not use a cached position, i.e., maximumAge: 0). However, this doesn't explain why PP doesn't generate entries for coarse and fine location when crawling without a VPN. From looking at the ground truth of some small test crawls, there clearly are latitudes and longitudes of the user being sent, but for some reason PP doesn't flag them. @danielgoldelman, maybe you have some idea as to what is going on? This doesn't seem to be a VPN issue as I initially thought. |
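As a quick way to reproduce this comparison, here is a sketch that logs both location sources side by side; run it in the browser console with and without the VPN connected. It assumes the public ipinfo.io/json endpoint (which returns a "loc" field as "lat,long") and is not part of Privacy Pioneer's code:

```js
// Compare the browser Geolocation API value with the IP-based location.
navigator.geolocation.getCurrentPosition(
  (pos) => console.log("Geolocation API:", pos.coords.latitude, pos.coords.longitude),
  (err) => console.warn("Geolocation error:", err.message),
  { maximumAge: 0 } // force a fresh position instead of a cached one
);

fetch("https://ipinfo.io/json")
  .then((res) => res.json())
  .then((info) => console.log("IP-based location (ipinfo):", info.loc));
```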
Interesting. I was having different experiences, @dadak-dom ... lat and lng seemed to be accurately obtained when performing the crawls before. Have you modified the .ext file? |
No, I didn't make any changes to the .ext file, @danielgoldelman . Was I supposed to? |
Good progress, @dadak-dom!
Hm, is this an even larger issue not related to the VPN? In other words, even in the non-VPN scenario, do we have a bug where the location is not properly updated? This is the first point we should check. (Maybe, going to a cafe or other place with WiFi can be used to get a second location to test.) What is not clear to me is that when we crawled with different VPN locations for constructing our training/validation/test set, we got instances of all location types. So, I am not sure what has changed since then. @danielgoldelman, can you look into that? |
I forgot to use the hashtag in my most recent commit, but @danielgoldelman and I seem to have solved the lat/long issue. Apparently the browser that Selenium created did not have the |
Additionally, we have run the extension as if we were the computer, and compared our results for lat/lng with what we would expect the crawl to reasonably find. This approach worked! We used the preliminary validation set we designated earlier on, so this claim should be supported via further testing when we perform the performance metric crawl, but on first approach the crawl is working as intended for lat/lng. |
Great! Once you think the crawler and analysis works as expected, feel free to move to the test set. |
Here is the overall plan. This issue is an omnibus testing issue. @dadak-dom has opened a separate issue for comparing the Crawler + VM results vs the local results (#49). There is also a dedicated issue for performing the South Korea test (#54). |
Here are the results from my testing. As discussed, @atlasharry and I will look into possible explanations for lat/long and zipCode behaviors. |
Quick update from my end: |
That is a very good point! It is possible that the performance drop, at least some of it, has to do with the conversion from PyTorch/TensorFlow (in Python) to TensorFlow (in JavaScript). However, why are we seeing a bigger drop than we saw earlier? From our paper: PyTorch/TensorFlow (in Python) results: TensorFlow (in JavaScript) results: For example, latitude dropped from a recall of 0.97 to 0.82. However, the current results above show a drop to a recall of 0.5 (I am assuming fineLocation is latitude or longitude. As a side note, can we also test for latitude and longitude and not fineLocations as results for latitude and longitude can differ?) The only reason I can think of why that is the case is that the current test set, more specifically, the instances of the current test set for which we get the false negative location results, are different compared to the original test set and/or we have more of those instances now. So, can we run the current test set on the version of the code at the time? If we are able to replicate @danielgoldelman's testing procedure with the code at that time but with our current test set, we should be getting results identical or at least close to what we are getting now with the new code and the current test set (if the conversion is the issue). I am making this suggestion as we discussed that it is difficult to replicate running the test set at the time on our new code. That would also be an option if possible. In that case the old code and new code should return identical or at least very similar results.
That is also worthwhile to get a better understanding in general. Maybe, there is something we are currently not thinking of. |
1. Current Status of the Model

It works! @atlasharry and @PattonYin made some great progress! As it turns out, there is nothing wrong with the model. Both the (1) PyTorch/TensorFlow Python and (2) TensorFlow.js versions perform in line with the test results reported in the paper. @atlasharry used the model from Hugging Face that corresponds to our GitHub-served model, fixed a small issue in the conversion, and also slightly re-tuned the parameters (which, however, did not make much of a difference). There were a handful of incorrect classifications when testing on the 30+ additional test instances that @dadak-dom created. However, the validity of our original test set stands. The 30+ instances test could have just been an unlucky pick of test set sites, or there was something different about those test instances as those were created manually and not according to the prior process. @PattonYin and @atlasharry, please feel free to include additional information here, if there is anything important to add, to conclude this point.

2. Accuracy Testing

As we know that the model is classifying accurately, we can finally start testing the accuracy of the crawler (by "crawler" I mean the crawler including model, extension, and VM). First, here is the 100-site test set, and here is the methodology for how the test set was constructed.

2.1 Test Set

Very important: @atlasharry, @PattonYin, and @ananafrida, if you do any testing on your own, please do not use any of the sites in the test set. The crawler is not allowed to see any test instances to guarantee an unbiased test. If you have used any sites from the test set, we need to randomly replace those sites with unseen ones per the described methodology.

2.2 Two Analyses

As I see it, we need to perform two analyses:
For both steps, we should calculate precision, recall, and F1. Important: We should only calculate these scores from the positive instances and not the negative instances (i.e., not use weighted scores).

2.3 How to Do the Crawler Analysis?

Now, while we have no problem performing the first analysis (just extension/model) on the same test set run, i.e., evaluating the ground truth and analysis data based on the same test set run, this is naturally not possible for the second analysis (i.e., running the crawler vs. running just the extension/model are necessarily two different runs). So, this brings us naturally to the problem: how can we distinguish natural fluctuations in site loads in different runs from differences caused by adding our crawler infrastructure? I think if we run each test, say, three times, we get a sense of the natural differences of each type of run and can distinguish those from the crawler-created differences. In other words, run just the extension/model three times on the test set and check their differences. Then, run the crawler three times and check their differences. Now, averaging out the extension/model runs and the crawler runs, do these averaged runs look very different from one another? So far the theory. I hope it works in practice. 😄 Now, next question: How can we get a good comparison? We would need to be physically in a place of a VM location and run the extension/model as a normal user. Since I am at a workshop in the Bay Area on November 7, I can do that. I can run just the extension/model (without crawler and VM) and collect the ground truth data (and extension data) as a real California user. Since that is still a few weeks out, the November 7 date should not stop us from already doing the crawl when we are ready. We probably have a sufficient intuitive understanding of whether the accuracy is good. Even if the November 7 test turns out to be not good, we can still re-crawl. 10K sites is "small big data" and can be done fairly quickly. |
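For concreteness, here is a minimal sketch of the scoring, computed over the positive instances only as described above; the counts in the usage line are made-up placeholders, not results:

```js
// Precision, recall, and F1 from true/false positive and false negative counts.
function positiveClassScores(truePositives, falsePositives, falseNegatives) {
  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Hypothetical example for one practice (e.g., lat/lng) on a 100-site test set:
console.log(positiveClassScores(41, 5, 9)); // { precision: ~0.89, recall: 0.82, f1: ~0.85 }
```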
@atlasharry and @PattonYin found that while the model works as expected, there is still an issue with the analysis performed by the extension, i.e., the model results do not seem to be properly processed by the extension. We discussed that @PattonYin and @atlasharry will:
|
@PattonYin and @atlasharry found that the incorrect results produced by the extension need to be addressed indeed. It is a bigger issue. There are two points to note:
@atlasharry and @PattonYin will look into the extension code, e.g., logging the process at various stages. @PattonYin mentioned that the search functionality may not run correctly. |
@atlasharry and @PattonYin have resolved the issue of the incorrect classifications. The issue was caused by an incorrect implementation of escape sequences. In particular, there was one line of code in the pre-processing of the HTTP messages that added multiple backslashes to large parts of a message. This pre-processing happened right before a message was passed to the model for classification and caused mis-classifications. It is not fully clear why this issue did not come up during the Privacy Pioneer extension testing for the PETS paper. One reason could be that the test instances did not have (as much) escaping as our current test set. @PattonYin, can you link the file and line of code here? @atlasharry and @PattonYin, please also add relevant details here, if any. |
We will proceed as follows:
This week, we will do 1 and 2 and prepare 3, so that by Friday next week we can do 3. |
Sure, we added a preprocessing step to the input with the following line of code: The key issue we identified is that the number of backslashes preceding the double quotation mark (`"`) keeps growing each time a message is escaped and stored. The explanation for this behavior lies in how the backslash is used as an escape character. When a quotation mark needs to appear within a string, the backslash signals that the quotation mark is part of the string rather than an indicator of its end. Additionally, when this string is saved to files like JSON, another backslash is added as part of the escape sequence. Therefore, two backslashes are used to represent a single backslash in the stored string, compounding the number of backslashes introduced. Since these additional backslashes don't alter the meaning of the text, they can be safely removed during preprocessing. |
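The actual line of code is not reproduced above; as a stand-in, here is a hedged sketch of the kind of clean-up being described (hypothetical helper name, not Privacy Pioneer's actual implementation): collapse the accumulated escape backslashes so the model sees the underlying text.

```js
// Sketch: strip escape backslashes introduced by repeated serialization
// before passing the HTTP message text to the model.
function stripRedundantEscapes(messageText) {
  return messageText
    .replace(/\\+"/g, '"')     // \" , \\\" , \\\\\\\" ... -> "
    .replace(/\\{2,}/g, "\\"); // collapse runs of backslashes to a single one
}

// Example: an escaped JSON fragment becomes readable again.
console.log(stripRedundantEscapes('{\\\"latitude\\\": 41.55, \\\"longitude\\\": -72.65}'));
// -> {"latitude": 41.55, "longitude": -72.65}
```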
Here is the updated version with the input clean-up in Privacy Pioneer. Adding to Patton's point, different numbers of backslashes would result in different outputs in the tokenizer. For example: |
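The original example was not preserved here; the following is a hedged illustration of the general point. Each round of escaping produces a literally different string, so a subword tokenizer splits it into a different token sequence:

```js
// Each JSON.stringify round adds another layer of escaping in front of the quotes.
const raw = '{"latitude": 41.55}';
const escapedOnce = JSON.stringify(raw);          // "{\"latitude\": 41.55}"
const escapedTwice = JSON.stringify(escapedOnce); // "\"{\\\"latitude\\\": 41.55}\""

console.log(raw, escapedOnce, escapedTwice);
// The three strings contain ", \" and \\\" respectively, so the tokenizer
// produces a different token sequence for each, even though the underlying
// latitude value is the same.
```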
This is the sheet that we use to track the second crawl analysis. We haven't calculated the scores yet, but some intermediary results/observations we have are: Among those sites that collect the user's location, the crawler+extension performs pretty well on prediction. The few cases that lead to a false negative are that
We will continue doing the score calculations and analysis. |
Thanks, @atlasharry (and @PattonYin)! A few suggestions:
Ideally, we want to have the complete results by Tuesday. Also, if you can prepare the protocol for next Wednesday when I am in California for the second part of the test, that would be good. At that time, we have to be sure it works, which is why it would be good if you also test it yourself (from CT; can somebody who does not know the details of your testing follow it and run the test?). |
As there were still some incorrect results, @PattonYin and @atlasharry identified the following issue, as @PattonYin describes:
Thus, we decided to fix these two issues (especially, different from what I thought, responses with 100,000+ characters were dropped instead of having their first 100,000 characters analyzed; the character limit is less of an issue for requests, as the code is specific to responses and there are not many requests with 100,000+ characters).
Per @atlasharry, we make the following modification:
Per @atlasharry, we also make the following improvement:
|
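As an illustration of the response character-limit fix described above, here is a minimal sketch (hypothetical constant and function names, not the actual Privacy Pioneer code): instead of skipping a 100,000+ character response entirely, only its first 100,000 characters are passed on for analysis.

```js
const MAX_ANALYZED_CHARS = 100000;

// Previously, over-limit responses were dropped before classification;
// now the first MAX_ANALYZED_CHARS characters are kept and analyzed.
function prepareResponseForAnalysis(responseText) {
  return responseText.length > MAX_ANALYZED_CHARS
    ? responseText.slice(0, MAX_ANALYZED_CHARS)
    : responseText;
}
```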
By next week @PattonYin and @atlasharry will perform the analysis as described under 2.2.1 above and present the performance results. @atlasharry made the most excellent observation that we may not need to perform the analysis under 2.2.2 because PP's mechanism for identifying all location types (lat, ZIP, city, etc.) is reliant on the IP address, and that mechanism works the same way in every geographic location regardless of the specific IP address. Whether PP uses a VM IP address or a real IP address does not matter. |
Before we start the crawl, we need to test the crawler's performance. So, we need to compare the manually observed ground truth with the analysis results. We probably need a 100-site test set.
(@JoeChampeau and @jjeancharles feel free to participate here as well.)