refactor: Decouple HTTP client #2661
Open
janbuchar wants to merge 25 commits into master from decouple-http-client
+570 −89
Changes from 4 commits
25 commits
a995220 Introduce BaseHttpClient interface (janbuchar)
3c638ef Add GotScrapingHttpClient (janbuchar)
34043cb Use the http client in BasicCrawler (janbuchar)
e3846bc Finalize using HttpClient in send_request (janbuchar)
374e973 Lint (janbuchar)
12b7eb5 Format (janbuchar)
fb01709 Simplify types in cookie-related code (janbuchar)
8e3709a Merge remote-tracking branch 'origin/master' into decouple-http-client (janbuchar)
07da1f9 Decouple got-scraping from HttpCrawler (janbuchar)
c037552 Adapt FileDownload class (janbuchar)
936d064 Add httpClient to validation schema (janbuchar)
1f42f86 Lint (janbuchar)
d80641e Fix type of context.sendRequest (janbuchar)
448b1ec Make BasicHttpClient an interface (janbuchar)
66c0dae Remove niche properties from the response type (janbuchar)
9330017 Adjust cookie jar type (janbuchar)
b97f5a6 Extract sendRequest from BasicCrawler (janbuchar)
d0a1863 Handle searchParams before delegating to http client (janbuchar)
c72e81a Handle json and form directly in sendRequest (janbuchar)
8034dab Refactor sendRequest, handle username/password (janbuchar)
0e62fd2 Unused import (janbuchar)
12b94c6 Mistake (janbuchar)
2ccd065 Merge remote-tracking branch 'origin/master' into decouple-http-client (janbuchar)
e97084f Lint (janbuchar)
7f22ad5 Add missing docblocks (janbuchar)
@@ -0,0 +1,139 @@
import type { Readable } from 'stream';
import { FormDataLike } from './form_data_like';

type Timeout =
    | {
          lookup: number;
          connect: number;
          secureConnect: number;
          socket: number;
          send: number;
          response: number;
      }
    | { request: number };

type Method =
    | 'GET'
    | 'POST'
    | 'PUT'
    | 'PATCH'
    | 'HEAD'
    | 'DELETE'
    | 'OPTIONS'
    | 'TRACE'
    | 'get'
    | 'post'
    | 'put'
    | 'patch'
    | 'head'
    | 'delete'
    | 'options'
    | 'trace';

export type ResponseTypes = {
    'json': unknown;
    'text': string;
    'buffer': Buffer;
};

type ToughCookieJar = {
    getCookieString: ((
        currentUrl: string,
        options: Record<string, unknown>,
        callback: (error: Error | null, cookies: string) => void,
    ) => void) &
        ((url: string, callback: (error: Error | null, cookieHeader: string) => void) => void);
    setCookie: ((
        cookieOrString: unknown,
        currentUrl: string,
        options: Record<string, unknown>,
        callback: (error: Error | null, cookie: unknown) => void,
    ) => void) &
        ((rawCookie: string, url: string, callback: (error: Error | null, result: unknown) => void) => void);
};

type PromiseCookieJar = {
    getCookieString: (url: string) => Promise<string>;
    setCookie: (rawCookie: string, url: string) => Promise<unknown>;
};

// Omitted (https://github.com/sindresorhus/got/blob/main/documentation/2-options.md):
//  - decompress,
//  - resolveBodyOnly,
//  - allowGetBody,
//  - dnsLookup,
//  - dnsCache,
//  - dnsLookupIpVersion,
//  - retry,
//  - hooks,
//  - parseJson,
//  - stringifyJson,
//  - request,
//  - cache,
//  - cacheOptions,
//  - http2
//  - https
//  - agent
//  - localAddress
//  - createConnection
//  - pagination
//  - setHost
//  - maxHeaderSize
//  - methodRewriting
//  - enableUnixSockets
//  - context
export interface HttpRequest<TResponseType extends keyof ResponseTypes = 'text'> {
    [k: string]: unknown; // TODO BC with got - remove in 4.0

    url: string | URL;
    method?: Method;
    searchParams?: string | URLSearchParams | Record<string, string | number | boolean | null | undefined>;
    signal?: AbortSignal;
    headers?: Record<string, string | string[] | undefined>;
    body?: string | Buffer | Readable | Generator | AsyncGenerator | FormDataLike;
    form?: Record<string, string>;
    json?: unknown;

    username?: string;
    password?: string;

    cookieJar?: ToughCookieJar | PromiseCookieJar;
    followRedirect?: boolean | ((response: any) => boolean); // TODO BC with got - specify type better in 4.0
    maxRedirects?: number;

    timeout?: Partial<Timeout>;

    encoding?: BufferEncoding;
    responseType?: TResponseType;
    throwHttpErrors?: boolean;

    // from got-scraping Context
    proxyUrl?: string;
    headerGeneratorOptions?: Record<string, unknown>;
    useHeaderGenerator?: boolean;
    headerGenerator?: {
        getHeaders: (options: Record<string, unknown>) => Record<string, string>;
    };
    insecureHTTPParser?: boolean;
    sessionToken?: object;
}

export interface HttpResponse<TResponseType extends keyof ResponseTypes = keyof ResponseTypes> {
    [k: string]: any; // TODO BC with got - remove in 4.0

    request: HttpRequest<TResponseType>;

    redirectUrls: URL[];
    url: string;

    ip?: string;
    statusCode: number;

    body: ResponseTypes[TResponseType];
}

export abstract class BaseHttpClient {
    abstract sendRequest<TResponseType extends keyof ResponseTypes = 'text'>(
        request: HttpRequest<TResponseType>,
    ): Promise<HttpResponse<TResponseType>>;
}
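
The abstract `sendRequest` method is the single extension point of `BaseHttpClient`, so an alternative client only has to map an `HttpRequest` to an `HttpResponse`. The following is a minimal sketch of that idea, not code from this PR: the types are simplified local stand-ins for the ones in the diff above, and `StaticHttpClient` is a hypothetical name for a client that serves canned bodies instead of touching the network (the kind of thing one might use in tests).

```typescript
// Simplified local stand-ins for the types in the diff above (not the real exports).
type ResponseTypes = { json: unknown; text: string; buffer: Buffer };

interface HttpRequest<T extends keyof ResponseTypes = 'text'> {
    url: string | URL;
    responseType?: T;
}

interface HttpResponse<T extends keyof ResponseTypes = keyof ResponseTypes> {
    request: HttpRequest<T>;
    redirectUrls: URL[];
    url: string;
    statusCode: number;
    body: ResponseTypes[T];
}

abstract class BaseHttpClient {
    abstract sendRequest<T extends keyof ResponseTypes = 'text'>(
        request: HttpRequest<T>,
    ): Promise<HttpResponse<T>>;
}

// Hypothetical client: looks the URL up in an in-memory map of text bodies.
class StaticHttpClient extends BaseHttpClient {
    private readonly pages: Map<string, string>;

    constructor(pages: Map<string, string>) {
        super();
        this.pages = pages;
    }

    override async sendRequest<T extends keyof ResponseTypes = 'text'>(
        request: HttpRequest<T>,
    ): Promise<HttpResponse<T>> {
        const url = request.url.toString();
        const text = this.pages.get(url);
        const found = text !== undefined;
        const raw = text ?? '';

        // Convert the canned text according to the requested response type.
        let body: unknown = raw;
        if (request.responseType === 'json') body = found ? JSON.parse(raw) : null;
        if (request.responseType === 'buffer') body = Buffer.from(raw);

        return {
            request,
            redirectUrls: [],
            url,
            statusCode: found ? 200 : 404,
            // The cast is needed because the compiler cannot relate the runtime
            // branches above to the generic parameter T.
            body: body as ResponseTypes[T],
        };
    }
}
```

Calling `sendRequest({ url, responseType: 'json' })` on such a client yields a response whose `body` is typed `unknown`, while omitting `responseType` falls back to the `'text'` default and a `string` body.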
@@ -0,0 +1,67 @@
/**
 * This is copied from https://github.com/octet-stream/form-data-encoder
 */

interface FileLike {
    /**
     * Name of the file referenced by the File object.
     */
    readonly name: string;
    /**
     * Returns the media type ([`MIME`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types)) of the file represented by a `File` object.
     */
    readonly type: string;
    /**
     * Size of the file parts in bytes
     */
    readonly size: number;
    /**
     * The last modified date of the file as the number of milliseconds since the Unix epoch (January 1, 1970 at midnight). Files without a known last modified date return the current date.
     */
    readonly lastModified: number;
    /**
     * Returns a [`ReadableStream`](https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream) which upon reading returns the data contained within the [`File`](https://developer.mozilla.org/en-US/docs/Web/API/File).
     */
    stream(): ReadableStream<Uint8Array> | AsyncIterable<Uint8Array>;
    readonly [Symbol.toStringTag]?: string;
}

/**
 * A `string` or `File` that represents a single value from a set of `FormData` key-value pairs.
 */
type FormDataEntryValue = string | FileLike;
/**
 * This interface reflects minimal shape of the FormData
 */
export interface FormDataLike {
    /**
     * Appends a new value onto an existing key inside a FormData object,
     * or adds the key if it does not already exist.
     *
     * The difference between `set()` and `append()` is that if the specified key already exists, `set()` will overwrite all existing values with the new one, whereas `append()` will append the new value onto the end of the existing set of values.
     *
     * @param name The name of the field whose data is contained in `value`.
     * @param value The field's value. This can be [`Blob`](https://developer.mozilla.org/en-US/docs/Web/API/Blob)
      or [`File`](https://developer.mozilla.org/en-US/docs/Web/API/File). If none of these are specified the value is converted to a string.
     * @param fileName The filename reported to the server, when a Blob or File is passed as the second parameter. The default filename for Blob objects is "blob". The default filename for File objects is the file's filename.
     */
    append(name: string, value: unknown, fileName?: string): void;
    /**
     * Returns all the values associated with a given key from within a `FormData` object.
     *
     * @param {string} name A name of the value you want to retrieve.
     *
     * @returns An array of `FormDataEntryValue` whose key matches the value passed in the `name` parameter. If the key doesn't exist, the method returns an empty list.
     */
    getAll(name: string): FormDataEntryValue[];
    /**
     * Returns an [`iterator`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols) allowing to go through the `FormData` key/value pairs.
     * The key of each pair is a string; the value is a [`FormDataValue`](https://developer.mozilla.org/en-US/docs/Web/API/FormDataEntryValue).
     */
    entries(): IterableIterator<[string, FormDataEntryValue]>;
    /**
     * An alias for FormDataLike#entries()
     */
    [Symbol.iterator](): IterableIterator<[string, FormDataEntryValue]>;
    readonly [Symbol.toStringTag]?: string;
}
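
The docblocks above describe `append()` as accumulating values under a key and `getAll()` as returning an empty list for unknown keys. A rough sketch of an object with that behavior follows; it assumes a string-only subset of the interface (no `FileLike` values), and both the local `StringFormDataLike` interface and the `SimpleFormData` class are hypothetical names that appear neither in this PR nor in form-data-encoder.

```typescript
// String-only stand-in for the FormDataLike shape above (FileLike values omitted).
interface StringFormDataLike {
    append(name: string, value: unknown, fileName?: string): void;
    getAll(name: string): string[];
    entries(): IterableIterator<[string, string]>;
    [Symbol.iterator](): IterableIterator<[string, string]>;
}

// Hypothetical in-memory implementation backed by an ordered list of pairs.
class SimpleFormData implements StringFormDataLike {
    private readonly fields: [string, string][] = [];

    // Non-string values are converted to strings, as the docblock describes.
    append(name: string, value: unknown): void {
        this.fields.push([name, String(value)]);
    }

    // Every value stored under `name`, in insertion order; empty if none.
    getAll(name: string): string[] {
        return this.fields.filter(([key]) => key === name).map(([, value]) => value);
    }

    entries(): IterableIterator<[string, string]> {
        return this.fields.values();
    }

    // Alias for entries(), so the object works with for...of and Array.from.
    [Symbol.iterator](): IterableIterator<[string, string]> {
        return this.entries();
    }
}

const form = new SimpleFormData();
form.append('tag', 'a');
form.append('tag', 'b'); // second value under the same key is kept, not overwritten
form.append('page', 2);
```

Because the interface is structural, any object exposing these four members can be passed where the shape is expected, without extending a particular `FormData` class.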
packages/core/src/http_clients/got_scraping_http_client.ts
22 changes: 22 additions & 0 deletions
Check failure on line 1 in packages/core/src/http_clients/got_scraping_http_client.ts (GitHub Actions / Lint)
@@ -0,0 +1,22 @@
import { BaseHttpClient, HttpRequest, HttpResponse, ResponseTypes } from './base_http_client';

import { gotScraping } from '@crawlee/utils';

export class GotScrapingHttpClient extends BaseHttpClient {
    override async sendRequest<TResponseType extends keyof ResponseTypes>(
        request: HttpRequest<TResponseType>,
    ): Promise<HttpResponse<TResponseType>> {
        const gotResult = await gotScraping({
            ...request,
            retry: {
                limit: 0,
                ...(request.retry as Record<string, unknown> | undefined),
            },
        });

        return {
            ...gotResult,
            body: gotResult.body as ResponseTypes[TResponseType],
            request: { url: request.url, ...gotResult.request },
        };
    }
}
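
The `retry` option in `GotScrapingHttpClient` relies on object-spread ordering: `limit: 0` is written first, so it acts as a default that any caller-supplied `retry` object can override. The snippet below isolates that pattern as a sketch; `RetryOptions` and `withRetryDefaults` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical option shape; only `limit` matters for this illustration.
type RetryOptions = { limit?: number; [key: string]: unknown };

// Mirrors the spread used in the diff: defaults first, caller options last.
function withRetryDefaults(retry?: RetryOptions): RetryOptions {
    // Spreading `undefined` adds no properties, so the default survives
    // when the caller passes nothing.
    return { limit: 0, ...retry };
}

const defaults = withRetryDefaults();
const overridden = withRetryDefaults({ limit: 3 });
const extended = withRetryDefaults({ methods: ['GET'] });
```

One caveat of this pattern is that an explicit `{ limit: undefined }` also overrides the default, since spread copies own properties regardless of their value.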
@@ -0,0 +1,2 @@
export * from './base_http_client';
export * from './got_scraping_http_client';
We may not want to keep this. If people use `responseType` in current `sendRequest`, the type is incorrect anyway. And I'd guess that using e.g., `gotScraping({...}).json()` is more prevalent anyway.