Recognize arguments in URL paths #28

Open
mthuurne opened this issue Aug 3, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

mthuurne commented Aug 3, 2021

If a site uses URLs such as items/123/detail, the item ID 123 in the path should be recognized as an argument (as opposed to a unique page), so that the requests-per-page limit can be applied to it. Otherwise, spidering an application with a large database behind it takes forever.
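To make the idea concrete, here is a minimal sketch of one possible heuristic. It assumes that purely numeric or UUID-like segments are arguments, which is only a guess about how a site names its paths; the names `path_template` and `limit_per_template` are made up for illustration and are not part of the project.

```python
import re
from collections import defaultdict

# Assumption: a segment is an argument if it is all digits or looks like a UUID.
_ARG_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
)


def path_template(path: str) -> str:
    """Replace argument-like segments with a '{arg}' placeholder."""
    return "/".join(
        "{arg}" if _ARG_SEGMENT.fullmatch(segment) else segment
        for segment in path.split("/")
    )


def limit_per_template(paths, max_per_template):
    """Keep at most max_per_template paths per template, in input order."""
    counts = defaultdict(int)
    kept = []
    for path in paths:
        template = path_template(path)
        if counts[template] < max_per_template:
            counts[template] += 1
            kept.append(path)
    return kept
```

With a limit of 2, `items/1/detail`, `items/2/detail` and `items/3/detail` all share the template `items/{arg}/detail`, so only the first two would be requested.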

mthuurne added the enhancement label Aug 4, 2021

mthuurne commented Sep 8, 2021

If such auto-detection is not feasible, an alternative approach would be to build a site map and randomly pick a path from the tree to check, with a limit on the total number of checks.
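A rough sketch of the site map approach, under the assumption that the map is a tree keyed by path segment and that a pick walks from the root to a leaf, choosing uniformly among the children at each level. All function names here are hypothetical, and pages that are also inner nodes are never picked in this simplified version.

```python
import random


def build_site_map(paths):
    """Build a nested dict tree; each node maps a segment to its child node."""
    root = {}
    for path in paths:
        node = root
        for segment in path.strip("/").split("/"):
            if segment:
                node = node.setdefault(segment, {})
    return root


def pick_path(tree, rng):
    """Walk down from the root, picking one child uniformly at each level."""
    segments = []
    node = tree
    while node:
        segment = rng.choice(sorted(node))
        segments.append(segment)
        node = node[segment]
    return "/".join(segments)


def sample_paths(paths, max_checks, seed=None):
    """Return the distinct paths chosen in at most max_checks picks."""
    rng = random.Random(seed)
    tree = build_site_map(paths)
    if not tree:
        return set()
    return {pick_path(tree, rng) for _ in range(max_checks)}
```

Because each level is sampled uniformly, a subtree with thousands of item pages gets the same chance at its parent as a sibling with a single page, which is the point of picking from the tree rather than from a flat list.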

As a quick hack, I tried randomly picking from a list of discovered URLs, but such a list quickly becomes dominated by pages reachable from the first few picks, so the checked pages may not be representative of the whole site.

mthuurne commented Sep 9, 2021

The per-page query limit (see #18) could become a per-node limit instead, so it would apply to inner nodes as well. Such a limit can be set high enough that it wouldn't be hit when there are only static subpaths. Then we wouldn't have to guess about the meaning of the path name, which simplifies things a lot.
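One possible reading of a per-node limit, sketched below: every node in the path tree may have at most a fixed number of distinct children explored, with no attempt to decide which segments are arguments. The class name `NodeLimiter` and its interface are assumptions made for this example only.

```python
from collections import defaultdict


class NodeLimiter:
    """Cap the number of distinct child segments explored under each node."""

    def __init__(self, max_children_per_node):
        self.max_children = max_children_per_node
        # Maps a parent node (tuple of segments) to the child segments seen.
        self._children = defaultdict(set)

    def allow(self, path):
        """Return True and record the path if no node on it is over budget."""
        segments = [s for s in path.strip("/").split("/") if s]
        # First pass: check every level without recording anything.
        parent = ()
        for segment in segments:
            seen = self._children[parent]
            if segment not in seen and len(seen) >= self.max_children:
                return False
            parent += (segment,)
        # Second pass: the path is accepted, so record each segment.
        parent = ()
        for segment in segments:
            self._children[parent].add(segment)
            parent += (segment,)
        return True
```

With the limit set well above the number of static subpaths a node realistically has, only nodes with database-driven children such as `items/<id>` would ever hit it.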

The similarity between queries and inner nodes doesn't have to end there: we can also generate synthetic requests for inner nodes and check whether we get either an OK (2xx) or a client error (4xx) result.
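A small sketch of what such synthetic requests could look like, assuming that one segment at a time is replaced by a made-up value and that the caller supplies a `fetch_status` function returning the HTTP status code. The placeholder value and all names are hypothetical.

```python
def synthetic_paths(path, placeholder="synthetic-segment"):
    """Yield variants of path with one segment replaced by a made-up value."""
    segments = path.strip("/").split("/")
    for index in range(len(segments)):
        yield "/".join(segments[:index] + [placeholder] + segments[index + 1:])


def is_acceptable(status):
    """An OK (2xx) or client error (4xx) status counts as a sane response."""
    return 200 <= status < 300 or 400 <= status < 500


def check_inner_nodes(path, fetch_status):
    """Return the synthetic paths whose status was neither 2xx nor 4xx."""
    return [
        candidate
        for candidate in synthetic_paths(path)
        if not is_acceptable(fetch_status(candidate))
    ]
```

For items/123/detail this would try three variants, one per segment; a 500 response for any of them suggests the application does not handle an unknown value at that position.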
