feat: add iframe expansion to parseWithCheerio in browsers #2542

Merged 6 commits on Jun 20, 2024
2 changes: 2 additions & 0 deletions packages/browser-crawler/src/internals/browser-crawler.ts
@@ -342,6 +342,7 @@ export abstract class BrowserCrawler<
        persistCookiesPerSession: ow.optional.boolean,
        useSessionPool: ow.optional.boolean,
        proxyConfiguration: ow.optional.object.validate(validators.proxyConfiguration),
        ignoreShadowRoots: ow.optional.boolean,
    };

    /**
@@ -370,6 +371,7 @@ export abstract class BrowserCrawler<
            failedRequestHandler,
            handleFailedRequestFunction,
            headless,
            ignoreShadowRoots,
            ...basicCrawlerOptions
        } = options;

@@ -600,6 +600,28 @@ export async function saveSnapshot(page: Page, options: SaveSnapshotOptions = {}
export async function parseWithCheerio(page: Page, ignoreShadowRoots = false): Promise<CheerioRoot> {
    ow(page, ow.object.validate(validators.browserPage));

    if (page.frames().length > 1) {
        const frames = await page.$$('iframe');

        await Promise.all(
            frames.map(async (frame) => {
                const iframe = await frame.contentFrame();

                if (iframe) {
                    const contents = await iframe.content();

                    await frame.evaluate((f, c) => {
                        const replacementNode = document.createElement('div');
                        replacementNode.innerHTML = c;
                        replacementNode.className = 'crawlee-iframe-replacement';

                        f.replaceWith(replacementNode);
                    }, contents);
                }
            }),
        );
    }

    const html = ignoreShadowRoots
        ? null
        : ((await page.evaluate(`(${expandShadowRoots.toString()})(document)`)) as string);
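For illustration, the net effect of the iframe expansion on the markup handed to Cheerio can be modelled on plain strings. This is a simplified sketch, not the code from this PR: `stripDocumentWrappers`, `modelIframeExpansion`, and `collectHeadings` are hypothetical helpers, regular expressions stand in for the browser's HTML parser, and the sample markup is a shortened version of the fixtures added in test/shared/_helper.ts.

```typescript
// Simplified string-level model of the iframe expansion. The real change runs in
// the browser and relies on the HTML parser; this only mimics the outcome.

/** Approximates `div.innerHTML = html` for a full document: wrapper tags go, children stay. */
function stripDocumentWrappers(html: string): string {
    return html
        .replace(/<!DOCTYPE[^>]*>/gi, '')
        .replace(/<\/?(html|head|body)(\s[^>]*)?>/gi, '')
        .trim();
}

/** Replaces each `<iframe src="...">` with a div holding the framed document's markup. */
function modelIframeExpansion(outerHtml: string, framesBySrc: Record<string, string>): string {
    const iframePattern = /<iframe\s+src="([^"]*)"[^>]*>[\s\S]*?<\/iframe>/gi;
    return outerHtml.replace(iframePattern, (match: string, src: string) => {
        const inner = framesBySrc[src];
        // No content known for this frame: keep the iframe element untouched.
        if (inner === undefined) return match;
        return `<div class="crawlee-iframe-replacement">${stripDocumentWrappers(inner)}</div>`;
    });
}

/** Collects the text of every attribute-free <h1> in document order. */
function collectHeadings(html: string): string[] {
    const pattern = /<h1>([^<]*)<\/h1>/g;
    const found: string[] = [];
    let match: RegExpExecArray | null = pattern.exec(html);
    while (match !== null) {
        found.push(match[1]);
        match = pattern.exec(html);
    }
    return found;
}

const outside = '<html><body><h1>Outside iframe</h1><iframe src="./inside-iframe"></iframe></body></html>';
const inside = '<html><head><title>In iframe</title></head><body><h1>In iframe</h1></body></html>';

const expanded = modelIframeExpansion(outside, { './inside-iframe': inside });
console.log(collectHeadings(expanded)); // [ 'Outside iframe', 'In iframe' ]
```

Because the framed document's children are inlined where the iframe used to be, a single `$('h1')` query sees both headings in document order, which is what the new tests assert.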
22 changes: 22 additions & 0 deletions packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts
@@ -191,6 +191,28 @@ export async function injectJQuery(page: Page, options?: { surviveNavigations?:
export async function parseWithCheerio(page: Page, ignoreShadowRoots = false): Promise<CheerioRoot> {
    ow(page, ow.object.validate(validators.browserPage));

    if (page.frames().length > 1) {
Contributor

Do we really need to duplicate this function?

Contributor Author

The thing is, @crawlee/playwright and @crawlee/puppeteer are separate packages, so we would have to create a new package for this shared code (no other crawlee package depends on, or can depend on, playwright or puppeteer(?)).

I see that these two are verbatim copies, but that's only because here we're using the subset of the Playwright and Puppeteer interfaces that happens to be identical. Other utils methods differ between the two. I like to think of these as "platform"-specific ports of the same features.

Contributor

Couldn't it be put in @crawlee/browser-crawler somehow?

Contributor Author

Because of what I mentioned above, it would be very awkward - see here:

export async function extractUrlsFromPage(
    // eslint-disable-next-line @typescript-eslint/ban-types
    page: { $$eval: Function },
    selector: string,
    baseUrl: string,
): Promise<string[]> {

Or here:

export interface CommonPage {
    close(...args: unknown[]): Promise<unknown>;
    url(): string | Promise<string>;
}

Dependency injection... or something, I guess.

With this as an alternative, I'm more than happy to have "duplicate" separate implementations for both libraries.

Contributor

Hm, I guess you'd have to write quite a lot of boilerplate types. I guess I'm equally unhappy with both approaches.

Member

The @crawlee/browser package has optional peer dependencies on both playwright and puppeteer, so you can certainly have code in it that works with both of them. But to do that without hacks like ts-ignore comments and dynamic imports, you would need to introduce separate exports for each library that would not be exported from the root index file. Probably not worth it now.
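
The structural-typing alternative discussed in this thread could look roughly like the sketch below. It is an illustration under stated assumptions, not code from crawlee: `FrameLike`, `IframeHandleLike`, `PageLike`, `ReplaceIframe`, and `expandIframes` are hypothetical names, and the browser-side replacement step is injected as a callback so that the block runs against in-memory stand-ins instead of a real browser.

```typescript
// Hypothetical structural types (not part of crawlee): only the members the
// iframe expansion touches, so one helper could accept either library's objects.
interface FrameLike {
    content(): Promise<string>;
}

interface IframeHandleLike {
    contentFrame(): Promise<FrameLike | null>;
}

interface PageLike<H extends IframeHandleLike> {
    frames(): unknown[];
    $$(selector: string): Promise<H[]>;
}

// Injected by each library-specific package. In the PR this step is the
// `frame.evaluate(...)` call that swaps the iframe for a replacement div.
type ReplaceIframe<H> = (handle: H, html: string) => Promise<void>;

/** Expands every iframe that has a content frame and returns how many were replaced. */
async function expandIframes<H extends IframeHandleLike>(
    page: PageLike<H>,
    replaceIframe: ReplaceIframe<H>,
): Promise<number> {
    // The main frame is itself listed by `frames()`, so one frame means no iframes.
    if (page.frames().length <= 1) return 0;

    const handles = await page.$$('iframe');
    const results = await Promise.all(
        handles.map(async (handle) => {
            const frame = await handle.contentFrame();
            if (!frame) return false; // no content frame: leave the element alone
            await replaceIframe(handle, await frame.content());
            return true;
        }),
    );
    return results.filter(Boolean).length;
}

// In-memory stand-ins: one iframe with content, one without a content frame.
const replaced: string[] = [];
const loadable: IframeHandleLike = {
    contentFrame: async () => ({ content: async () => '<h1>In iframe</h1>' }),
};
const detached: IframeHandleLike = { contentFrame: async () => null };
const fakePage: PageLike<IframeHandleLike> = {
    frames: () => [{}, {}],
    $$: async (_selector: string) => [loadable, detached],
};

const expansion = expandIframes(fakePage, async (_handle, html) => {
    replaced.push(html);
});
expansion.then((count) => console.log(count, replaced));
```

Each library-specific package would still have to supply its own `replaceIframe` callback and keep these types in sync with the real Playwright and Puppeteer interfaces, which is the kind of extra typing and wiring the reviewers weighed against plain duplication.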

        const frames = await page.$$('iframe');

        await Promise.all(
            frames.map(async (frame) => {
                const iframe = await frame.contentFrame();

                if (iframe) {
                    const contents = await iframe.content();

                    await frame.evaluate((f, c) => {
                        const replacementNode = document.createElement('div');
                        replacementNode.innerHTML = c;
                        replacementNode.className = 'crawlee-iframe-replacement';

                        f.replaceWith(replacementNode);
                    }, contents);
                }
            }),
        );
    }

    const html = ignoreShadowRoots
        ? null
        : ((await page.evaluate(`(${expandShadowRoots.toString()})(document)`)) as string);
18 changes: 18 additions & 0 deletions test/core/playwright_utils.test.ts
@@ -159,6 +159,24 @@ describe('playwrightUtils', () => {
        }
    });

    test('parseWithCheerio() iframe expansion works', async () => {
        const browser = await launchPlaywright(launchContext);

        try {
            const page = await browser.newPage();
            await page.goto(new URL('/special/outside-iframe', serverAddress).toString());

            const $ = await playwrightUtils.parseWithCheerio(page);

            const headings = $('h1')
                .map((i, el) => $(el).text())
                .get();
            expect(headings).toEqual(['Outside iframe', 'In iframe']);
        } finally {
            await browser.close();
        }
    });

    describe('blockRequests()', () => {
        let browser: Browser = null;
        beforeAll(async () => {
18 changes: 18 additions & 0 deletions test/core/puppeteer_utils.test.ts
@@ -160,6 +160,24 @@ describe('puppeteerUtils', () => {
        }
    });

    test('parseWithCheerio() iframe expansion works', async () => {
        const browser = await launchPuppeteer(launchContext);

        try {
            const page = await browser.newPage();
            await page.goto(new URL('/special/outside-iframe', serverAddress).toString());

            const $ = await puppeteerUtils.parseWithCheerio(page);

            const headings = $('h1')
                .map((i, el) => $(el).text())
                .get();
            expect(headings).toEqual(['Outside iframe', 'In iframe']);
        } finally {
            await browser.close();
        }
    });

    describe('blockRequests()', () => {
        let browser: Browser = null;
        beforeAll(async () => {
30 changes: 30 additions & 0 deletions test/shared/_helper.ts
@@ -172,6 +172,28 @@ console.log('Hello world!');
</div>
</body>
</html>`,
outsideIframe: `
<!DOCTYPE html>
<html>
<head>
<title>Outside iframe</title>
</head>
<body>
<h1>Outside iframe</h1>
<iframe src="./inside-iframe"></iframe>
</body>
</html>`,
insideIframe: `
<!DOCTYPE html>
<html>
<head>
<title>In iframe</title>
</head>
<body>
<h1>In iframe</h1>
<p>Some content from inside of an iframe.</p>
</body>
</html>`,
};

export async function runExampleComServer(): Promise<[Server, number]> {
@@ -268,6 +290,14 @@ export async function runExampleComServer(): Promise<[Server, number]> {
        special.get('/cloudflareBlocking', async (_req, res) => {
            res.type('html').status(403).send(responseSamples.cloudflareBlocking);
        });

        special.get('/outside-iframe', (_req, res) => {
            res.type('html').send(responseSamples.outsideIframe);
        });

        special.get('/inside-iframe', (_req, res) => {
            res.type('html').send(responseSamples.insideIframe);
        });
    })();

    // "cacheable" site with one page, scripts and stylesheets
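
The fixture relies on ordinary relative URL resolution: the iframe's `src="./inside-iframe"` is resolved against the outer page's URL, which is why the test only needs to navigate to `/special/outside-iframe`. A minimal sketch with the standard `URL` class, assuming a made-up `serverAddress` value (the test suite supplies the real one):

```typescript
// `serverAddress` is a placeholder for illustration; the tests use their own value.
const serverAddress = 'http://localhost:8080';

// The URL the tests navigate to.
const outerUrl = new URL('/special/outside-iframe', serverAddress);

// The iframe's relative `src` is resolved against the outer page's URL.
const innerUrl = new URL('./inside-iframe', outerUrl);

console.log(outerUrl.toString()); // http://localhost:8080/special/outside-iframe
console.log(innerUrl.pathname); // /special/inside-iframe
```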