Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Decouple HTTP client #2661

Open
wants to merge 25 commits into
base: master
Choose a base branch
from
Open

refactor: Decouple HTTP client #2661

wants to merge 25 commits into from

Conversation

janbuchar
Copy link
Contributor

@janbuchar janbuchar commented Sep 6, 2024

See https://gist.github.com/janbuchar/3a4724927de2c3a0bb16c46bb5940236 for an example curl-impersonate client.

The following got-scraping options were ignored (they will still work, but they're not part of the new interface):

  • decompress,
  • resolveBodyOnly,
  • allowGetBody,
  • dnsLookup,
  • dnsCache,
  • dnsLookupIpVersion,
  • retry,
  • hooks,
  • parseJson,
  • stringifyJson,
  • request,
  • cache,
  • cacheOptions,
  • http2
  • https
  • agent
  • localAddress
  • createConnection
  • pagination
  • setHost
  • maxHeaderSize
  • methodRewriting
  • enableUnixSockets
  • context

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 6, 2024
@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 6, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@barjin
Copy link
Contributor

barjin commented Sep 30, 2024

So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

In sendRequest, we would then translate from the got-scraping interface to the independent fetch API (and, in case of GotScrapingClient, later from fetch API back to the got-scraping calls).

@barjin
Copy link
Contributor

barjin commented Sep 30, 2024

Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

@janbuchar
Copy link
Contributor Author

So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Well, yeah, since got is part of our public API, we cannot completely remove it 🤷. We'll see what our default http client for 4.0 will be.

Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

Not really, but I'd prefer to implicitly support got stuff that won't be part of the http client interface (e.g., parseJson or cacheOptions, and this way, we can do it kinda easily with index signatures on http request/response. It would be doable if we used a fetch-like interface, but a bit more difficult.

Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

Well, I didn't notice any, so I just look at the nearest files and do my best effort.

@B4nan
Copy link
Member

B4nan commented Sep 30, 2024

Historically things were snake case, but I believe nobody from the current team is in favor of that (I never like this). I would go with kebab case for anything new, and unify in v4 maybe (or sooner, I don't think it has any effect downstream, we don't allow deep imports anywhere).

@janbuchar janbuchar marked this pull request as ready for review October 4, 2024 15:34
/**
* HTTP Request as accepted by {@apilink BaseHttpClient} methods.
*/
export interface HttpRequest<TResponseType extends keyof ResponseTypes = 'text'> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not want to keep this. If people use responseType in current sendRequest, the type is incorrect anyway. And I'd guess that using e.g., gotScraping({...}).json() is more prevalent anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP client switching
3 participants