docs: improve crawlee seo ranking #2472

Merged: 7 commits, May 17, 2024
Changes from 6 commits

2 changes: 1 addition & 1 deletion docs/examples/crawl_sitemap.mdx
@@ -12,7 +12,7 @@ import CheerioSource from '!!raw-loader!roa-loader!./crawl_sitemap_cheerio.ts';
import PuppeteerSource from '!!raw-loader!roa-loader!./crawl_sitemap_puppeteer.ts';
import PlaywrightSource from '!!raw-loader!roa-loader!./crawl_sitemap_playwright.ts';

- This example downloads and crawls the URLs from a sitemap, by using the <ApiLink to="utils/class/Sitemap">`Sitemap`</ApiLink> utility class provided by the <ApiLink to="utils">`@crawlee/utils`</ApiLink> module.
+ This example builds a sitemap crawler which downloads and crawls the URLs from a sitemap, by using the <ApiLink to="utils/class/Sitemap">`Sitemap`</ApiLink> utility class provided by the <ApiLink to="utils">`@crawlee/utils`</ApiLink> module.

<Tabs groupId="crawler-type">

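For context on what the sitemap example in this hunk does, here is a minimal, dependency-free sketch of the first step: collecting page URLs from a sitemap document. It is not Crawlee's `Sitemap` class and does not reflect its implementation; `extractSitemapUrls` and the sample XML are hypothetical, and a regular expression stands in for a real XML parser.

```typescript
// Hypothetical helper (an assumption for illustration, not part of Crawlee):
// collect the text of every <loc> element in a sitemap XML string.
// A regular expression is used instead of a real XML parser, so XML entities
// such as &amp; are not decoded and CDATA sections are not handled.
function extractSitemapUrls(xml: string): string[] {
    const urls: string[] = [];
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
    let match: RegExpExecArray | null;
    while ((match = locPattern.exec(xml)) !== null) {
        urls.push(match[1]);
    }
    return urls;
}

// Made-up sitemap content, used only to exercise the helper.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/docs </loc></url>
</urlset>`;

const sitemapUrls = extractSitemapUrls(sampleSitemap);
console.log(sitemapUrls);
```

In the documented example, the resulting URL list is what gets handed to the crawler; the sketch only shows where that list could come from.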
6 changes: 3 additions & 3 deletions docs/examples/crawler-plugins/index.mdx
@@ -13,7 +13,7 @@ import PlaywrightExtraSource from '!!raw-loader!roa-loader!./playwright-extra.ts
[`puppeteer-extra`](https://www.npmjs.com/package/puppeteer-extra) and [`playwright-extra`](https://www.npmjs.com/package/playwright-extra) are community-built
libraries that bring in a plugin system to enhance the usage of [`puppeteer`](https://www.npmjs.com/package/puppeteer) and
[`playwright`](https://www.npmjs.com/package/playwright) respectively (bringing in extra functionality, like improving stealth for
- example by using the [`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth) plugin).
+ example by using the [`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth) Puppeteer Stealth plugin).
Review comment (Member): It's weird to mention the name after the package name; it feels quite random to me and is inconsistent with the change on line 26.


:::tip Available plugins

@@ -23,15 +23,15 @@ For [`playwright`](https://www.npmjs.com/package/playwright), please see [`playw

:::

- In this example, we'll show you how to use the [`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth) plugin
+ In this example, we'll show you how to use the Puppeteer Stealth [(`puppeteer-extra-plugin-stealth`)](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth) plugin
to help you avoid bot detections when crawling your target website.

<Tabs>
<TabItem value="puppeteer" label="Puppeteer & puppeteer-extra" default>

:::info Before you begin

- Make sure you've installed the `puppeteer-extra` and `puppeteer-extra-plugin-stealth` packages via your preferred package manager
+ Make sure you've installed the Puppeteer Extra (`puppeteer-extra`) and Puppeteer Stealth plugin(`puppeteer-extra-plugin-stealth`) packages via your preferred package manager

```bash
npm install puppeteer-extra puppeteer-extra-plugin-stealth
2 changes: 1 addition & 1 deletion docs/examples/http_crawler.mdx
@@ -7,7 +7,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
import ApiLink from '@site/src/components/ApiLink';
import HttpCrawlerSource from '!!raw-loader!roa-loader!./http_crawler.ts';

- This example demonstrates how to use <ApiLink to="http-crawler/class/HttpCrawler">`HttpCrawler`</ApiLink> to crawl a list of URLs from an external file, load each URL using a plain HTTP request, and save HTML.
+ This example demonstrates how to use <ApiLink to="http-crawler/class/HttpCrawler">`HttpCrawler`</ApiLink> to build a crawler that crawls a list of URLs from an external file, load each URL using a plain HTTP request, and save HTML.

<RunnableCodeBlock className="language-js" type="cheerio">
{HttpCrawlerSource}
5 changes: 3 additions & 2 deletions docs/examples/http_crawler.ts
@@ -35,8 +35,8 @@ const crawler = new HttpCrawler({
// Store the results to the dataset. In local configuration,
// the data will be stored as JSON files in ./storage/datasets/default
await Dataset.pushData({
- url: request.url,
- body,
+ url: request.url, // URL of the page
+ body, // HTML code of the page
});
},

@@ -47,6 +47,7 @@ const crawler = new HttpCrawler({
});

// Run the crawler and wait for it to finish.
+ // It will crawl a list of URLs from an external file, load each URL using a plain HTTP request, and save HTML
await crawler.run([
'https://crawlee.dev',
]);
6 changes: 3 additions & 3 deletions docs/guides/cheerio_crawler.mdx
@@ -11,7 +11,7 @@ import ApiLink from '@site/src/components/ApiLink';

## What is Cheerio

- [Cheerio](https://www.npmjs.com/package/cheerio) is essentially [jQuery](https://jquery.com/) for Node.js. It offers the same API, including the familiar `$` object. You can use it, as you would use jQuery for manipulating the DOM of an HTML page. In crawling, you'll mostly use it to select the needed elements and extract their values - the data you're interested in. But jQuery runs in a browser and attaches directly to the browser's DOM. Where does `cheerio` get its HTML? This is where the `Crawler` part of <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> comes in.
+ [Cheerio](https://cheerio.js.org/) is essentially [jQuery](https://jquery.com/) for Node.js. It offers the same API, including the familiar `$` object. You can use it, as you would use jQuery for manipulating the DOM of an HTML page. In crawling, you'll mostly use it to select the needed elements and extract their values - the data you're interested in. But jQuery runs in a browser and attaches directly to the browser's DOM. Where does `cheerio` get its HTML? This is where the `Crawler` part of <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> comes in.

## How the crawler works

@@ -23,7 +23,7 @@ Modern web pages often do not serve all of their content in the first HTML respo

:::

- Once the page's HTML is retrieved, the crawler will pass it to [Cheerio](https://www.npmjs.com/package/cheerio) for parsing. The result is the typical `$` function, which should be familiar to jQuery users. You can use the `$` function to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data.
+ Once the page's HTML is retrieved, the crawler will pass it to [Cheerio](https://github.com/cheeriojs/cheerio) for parsing. The result is the typical `$` function, which should be familiar to jQuery users. You can use the `$` function to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data.

Example use of Cheerio and its `$` function in comparison to browser JavaScript:

@@ -41,7 +41,7 @@ $('[href]')

:::note

- This is not to show that Cheerio is better than plain browser JavaScript. Some might actually prefer the more expressive way plain JS provides. Unfortunately, the browser JavaScript methods are not available in Node.js, so Cheerio is your best bet to do the parsing in Node.
+ This is not to show that Cheerio is better than plain browser JavaScript. Some might actually prefer the more expressive way plain JS provides. Unfortunately, the browser JS methods are not available in Node.js, so Cheerio is your best bet to do the parsing in Node.js.
Review comment (Member): This one looks weird; we should generally refer to JavaScript as JavaScript and not just JS.


:::

2 changes: 1 addition & 1 deletion website/src/components/Highlights.jsx
@@ -10,7 +10,7 @@ const FeatureList = [
<>
We believe websites are best scraped in the language they're written in. Crawlee <b>runs on Node.js
and it's <a href="https://crawlee.dev/docs/guides/typescript-project">built in TypeScript</a></b> to improve code completion in your IDE,
- even if you don't use TypeScript yourself.
+ even if you don't use TypeScript yourself. Crawlee supports both TypeScript and JavaScript crawling.
</>
),
},