docs: add-blog #2539

Merged
merged 5 commits into from
Jun 19, 2024

Conversation

souravjain540
Collaborator

Adding new blog to Crawlee Blog

@souravjain540 souravjain540 self-assigned this Jun 14, 2024
@souravjain540
Collaborator Author

Cc: @janbuchar / @barjin

Contributor

@janbuchar janbuchar left a comment

Nice article. I found several minor issues worth addressing. It would also be nice to:

  1. link to a repo with the complete source code, ideally hosted on Apify's GitHub
  2. run all code snippets through a code formatter


## Prerequisites

To use Crawlee, you need to have Node.js 16 or higher version.

Contributor

Suggested change
To use Crawlee, you need to have Node.js 16 or higher version.
To use Crawlee, you need to have Node.js 16 or newer.


You can install the latest version of Node.js from the [official website](https://nodejs.org/en/). This great [Node.js installation guide](https://blog.apify.com/how-to-install-nodejs/) gives you tips to avoid issues later on.

## Creating React app

Contributor

Suggested change
## Creating React app
## Creating a React app

```
npx create-vite@latest
```

You can check out the [Vite Docs](https://vitejs.dev/guide/) to create a React app.

Contributor

Suggested change
You can check out the [Vite Docs](https://vitejs.dev/guide/) to create a React app.
You can check out the [Vite Docs](https://vitejs.dev/guide/) for more details on how to create a React app.


Additionally, Crawlee supports headless browser libraries like [Playwright](https://playwright.dev/) and [Puppeteer](https://pptr.dev/) for scraping of websites that are JavaScript-rendered.

After installing the libraries, it’s time to create the scraper code.

Contributor

Wait, which ones do I need? Netflix is an SPA, so I'll need Playwright or Puppeteer, right? Which one do we want?

Collaborator Author

Netflix is an SPA, but in this use case it works well with CheerioCrawler.

Contributor

Then we should just remove this or reword it so that it's clear that Playwright will not be necessary.

Comment on lines 118 to 129
const allShows = [];
let genreShows = [];
shows.forEach((show) => {
  genreShows.push(show);
  if (genreShows.length === 40) {
    allShows.push(genreShows);
    genreShows = [];
  }
});
if (genreShows.length > 0) {
  allShows.push(genreShows);
}

Contributor

Instead of relying on the shows array to be sorted by genre and having exactly 40 items, I'd make a Map with the genre as key and an array of show titles as value.
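
A minimal sketch of that Map-based grouping, for illustration only. It assumes each scraped record carries its own genre, i.e. a hypothetical `{ genre, title }` shape that the article's scraper may not actually produce; `groupShowsByGenre` is likewise a made-up name.

```javascript
// Assumption: `shows` is an array of { genre, title } records.
// This shape is hypothetical and not taken from the article's scraper.
function groupShowsByGenre(shows) {
  const showsByGenre = new Map();
  for (const { genre, title } of shows) {
    // Create the bucket the first time a genre is seen.
    if (!showsByGenre.has(genre)) {
      showsByGenre.set(genre, []);
    }
    showsByGenre.get(genre).push(title);
  }
  return showsByGenre;
}

// Usage: input order and per-genre counts no longer matter.
const grouped = groupShowsByGenre([
  { genre: 'Drama', title: 'A' },
  { genre: 'Comedy', title: 'B' },
  { genre: 'Drama', title: 'C' },
]);
console.log(Object.fromEntries(grouped)); // { Drama: [ 'A', 'C' ], Comedy: [ 'B' ] }
```

Unlike the fixed chunks of 40, this does not break when a genre row has more or fewer items, or when the shows arrive unsorted.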

Contributor

@barjin barjin Jun 18, 2024

Or - even better - capture the structure straight from the HTML. You can call $(element) to query the element's subtree.

const out = $('[data-uia="collections-row"]').map((_, row) => { // one row per genre
  const genre = $(row).find('[data-uia="collections-row-title"]').text(); // the row's (genre) title
  const items = $(row).find('[data-uia="collections-title"]').map((_, el) => $(el).text()).get(); // all titles in the genre

  return { genre, items };
}).get(); // .get() unwraps the Cheerio collection into a plain array

```
npm start
```

After running this command, you will see a `storage` folder with the `key_value_stores/default/results.json` file. The scrapped data will be stored in JSON format in this file.

Contributor

Suggested change
After running this command, you will see a `storage` folder with the `key_value_stores/default/results.json` file. The scrapped data will be stored in JSON format in this file.
After running this command, you will see a `storage` folder with the `key_value_stores/default/results.json` file. The scraped data will be stored in JSON format in this file.

}

function App() {
  const [count, setCount] = useState(null);

Contributor

I think this is not a count, it's an index into an array of shows, isn't it?
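
To illustrate the naming point, here is a small sketch in plain JavaScript (no React) that treats the state value as an index. The names `getSelectedShow` and `selectedShowIndex` are hypothetical, and the `null` initial value mirrors the `useState(null)` call in the excerpt above.

```javascript
// Hypothetical helper: resolves the state value (an index or null) to a show.
function getSelectedShow(shows, selectedShowIndex) {
  // null means nothing has been selected yet, matching useState(null).
  if (selectedShowIndex === null) {
    return null;
  }
  // Guard against an index outside the array.
  return shows[selectedShowIndex] ?? null;
}

const shows = ['Show A', 'Show B'];
console.log(getSelectedShow(shows, null)); // null
console.log(getSelectedShow(shows, 1)); // Show B
```

With a name like `selectedShowIndex`, the state pair in the component would read `[selectedShowIndex, setSelectedShowIndex]`, which says what the number is used for.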

Add the following code in `'scripts'` object:

```
'start': 'node src/scraper.js'
```

Contributor

Suggested change
'start': 'node src/scraper.js'
"start": "node src/scraper.js"
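
The reason for this suggestion is that `package.json` must be valid JSON, and JSON only accepts double-quoted strings. The check below is a small sketch; the helper name `parsesAsJson` is made up for illustration.

```javascript
// Hypothetical helper: reports whether a string is valid JSON.
function parsesAsJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// The suggested form, placed inside a "scripts" object.
const doubleQuoted = '{ "scripts": { "start": "node src/scraper.js" } }';
// The original form with single quotes.
const singleQuoted = "{ 'scripts': { 'start': 'node src/scraper.js' } }";

console.log(parsesAsJson(doubleQuoted)); // true
console.log(parsesAsJson(singleQuoted)); // false
```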

Comment on lines +2 to +5
slug: netflix-show-recommender
title: 'Building a Netflix show recommender using Crawlee and React'
tags: [community]
description: 'Create a Netflix show recommendation system using Crawlee to scrape the data, JavaScript to code, and React to build the front end.'

Contributor

Is this the wording we agreed upon? 😄 If I clicked on an article about recommender systems with scraping (sounds super cool) and got a simple React App, I would be a bit disappointed.

Netflix is a large player in the realm of recommender systems, with the Netflix Prize, their research papers, and stuff... This article is going to have a lot of very strong SEO competition with this name.

Collaborator Author

Yes, we agreed on it. I got the point, and I agree you are right, but this article is not from us. As said in the beginning, it is from one of the community members, and we made it clear that it is not supposed to be that perfect; it is just an app showing something like this can be created through a little work.


:::tip
Before we start this tutorial, we recommend you [visit Crawlee's GitHub](https://github.com/apify/crawlee) and check out the codebase and installation guide. If you like Crawlee, do give us a star.
If you are liking this blog so far, we request you to [give Crawlee a star on GitHub](https://github.com/apify/crawlee), it helps us to reach and help more developers.

Contributor

One paragraph in? That's a bit early to like it.

Comment on lines 169 to 170
genre: genres,
shows: shows,

Contributor

This looks suspicious - storing genres (plural) under genre (singular)

@souravjain540
Collaborator Author

@janbuchar can we please make it live today? I need to give link to marketing for the newsletter :)

To use Crawlee, you need to have Node.js 16 or newer.

:::tip
If you are Crawlee blog so far, we request you to [give Crawlee a star on GitHub](https://github.com/apify/crawlee), it helps us to reach and help more developers.

Contributor

Suggested change
If you are Crawlee blog so far, we request you to [give Crawlee a star on GitHub](https://github.com/apify/crawlee), it helps us to reach and help more developers.
If you like the posts on the Crawlee blog so far, please consider [giving Crawlee a star on GitHub](https://github.com/apify/crawlee), it helps us to reach and help more developers.

@janbuchar
Contributor

> @janbuchar can we please make it live today? I need to give link to marketing for the newsletter :)

I believe so, there is just a handful of comments to resolve.

@souravjain540
Copy link
Collaborator Author

@janbuchar done :)

@janbuchar janbuchar merged commit b5d063b into master Jun 19, 2024
9 checks passed
@janbuchar janbuchar deleted the add-blog branch June 19, 2024 13:36
3 participants