
Node crash on Crawlee running fs.stat on a request_queue lock file #2606

Open
Clearmist opened this issue Aug 7, 2024 · 4 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@Clearmist

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

The crawler, while running, will randomly crash Node. I tried using the experimental option of disabling locking, but it still happens. I doubt this is a permission issue because my user has write permission to this entire directory structure and I've also tried running as administrator.

I'm okay if I don't get the root of this issue fixed. At the least I'd like to know where I can put a try/catch so this error doesn't crash Node and the crawler can continue.

Obviously Node is trying to get file information from a lock file and dies.

node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: EPERM: operation not permitted, stat 'C:\Users\{username}\Repositories\crawler-app\storage\request_queues\2fdd8a2d-a180-48a1-9f36-28d5a2793b36\y0jxi0Gs1ISlI1y.json.lock'] {
  errno: -4048,
  code: 'EPERM',
  syscall: 'stat',
  path: 'C:\\Users\\{username}\\Repositories\\crawler-app\\storage\\request_queues\\2fdd8a2d-a180-48a1-9f36-28d5a2793b36\\y0jxi0Gs1ISlI1y.json.lock'
}
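Until the root cause is found, the kind of stopgap I have in mind is a wrapper that swallows stat failures instead of letting the rejection kill the process. This is only a sketch using plain Node `fs` (the failing call happens inside Crawlee's storage layer, so I can't drop this in directly, and `safeStat` is a name I made up):

```javascript
import { stat } from 'node:fs/promises';

// Hypothetical helper: returns the Stats object, or null when the
// stat fails (EPERM, ENOENT, ...) instead of crashing the process.
async function safeStat(filePath) {
  try {
    return await stat(filePath);
  } catch (error) {
    console.warn(`stat failed for ${filePath}: ${error.code}`);
    return null;
  }
}

// A path that does not exist produces ENOENT, which is caught.
const result = await safeStat('definitely-missing.json.lock');
console.log(result); // null
```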
Reproduction steps

  1. Start a Cheerio crawler instance with a custom request queue name on a Windows machine.

Code sample

import { randomUUID } from 'node:crypto';
import path from 'node:path';
import { app } from 'electron';
import { CheerioCrawler, Configuration, RequestQueue } from 'crawlee';

const alias = randomUUID();

const address = 'https://{testing-address}';

const config = new Configuration({
  storageClientOptions: {
    localDataDirectory: path.join(app.getPath('userData'), 'crawlerStorage'),
  },
});

const requestQueue = await RequestQueue.open(alias);

await requestQueue.addRequest({ url: address });

const options = {
  experiments: {
    // Request locking is enabled by default since 3.10.0.
    // I've tried setting it to false and it still locks request json files.
    requestLocking: false,
  },
  requestQueue,
  ...
};

const crawler = new CheerioCrawler(options, config);

await crawler.run();

Package version

3.11.1

Node.js version

20.10.0

Operating system

Windows 10

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.11.2-beta.17

Other context

No response

@Clearmist Clearmist added the bug Something isn't working. label Aug 7, 2024
@B4nan
Member

B4nan commented Aug 7, 2024

3.11.2-beta.17

Have you really seen that on the latest beta?

// Request locking is enabled by default since 3.10.0.
// I've tried setting it to false and it still locks request json files.
requestLocking: false,

That feature is about something else. What you see are local file locks in the memory storage; they're an implementation detail of working with the file system.

cc @vladfrangu, not sure if #2603 was supposed to help with this one too. Also, doesn't this mean the lock is not acquirable and we are missing a retry somewhere?

@B4nan B4nan added the t-tooling Issues with this label are in the ownership of the tooling team. label Aug 7, 2024
@vladfrangu
Member

That PR wasn't supposed to help with that; they are unrelated things.

I don't think I've ever seen a stat error like that. Also, it looks like the path the user provided isn't being used for the storages, if the stack trace is anything to go by, which semi hints at a wrong variable being passed somewhere?

Also, I'm semi certain we try to lock 3 times before giving up. @Clearmist can you get us a full stack trace please? 🙏
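To illustrate what I mean by a retry: roughly this shape of wrapper around the flaky call. A hypothetical sketch only, not our actual locking code; `retryOnEperm`, the attempt count, and the delays are all made up for illustration:

```javascript
// Sketch: retry a transient EPERM failure a few times before giving up.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryOnEperm(operation, attempts = 3, delayMs = 10) {
  let lastError;
  for (let i = 0; i < attempts; i += 1) {
    try {
      return await operation();
    } catch (error) {
      if (error.code !== 'EPERM') throw error; // only retry the flaky case
      lastError = error;
      await sleep(delayMs * (i + 1)); // linear backoff between attempts
    }
  }
  throw lastError; // all attempts exhausted
}

// Simulated flaky stat: fails twice with EPERM, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) {
    throw Object.assign(new Error('EPERM: operation not permitted'), { code: 'EPERM' });
  }
  return { isFile: () => true };
};

const stats = await retryOnEperm(flaky);
console.log(calls, stats.isFile()); // 3 true
```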

@Clearmist
Author

Clearmist commented Aug 7, 2024

@B4nan, yes I am getting this issue on the @next branch. I was using latest, but moved to @next after seeing the field on the bug report form.

@vladfrangu Good catch about the path being different from what I set using the localDataDirectory option! I hadn't noticed that. I tried running with the default value of that option, and it still failed with the same error.

I have the request_queues directory open and see this. Are the json.lock files supposed to be seen as directories by the host OS? Maybe Node running fs.stat on a directory is the reason for the crash.

[image: screenshot of the request_queues directory, showing the .json.lock entries as folders]

I'd love to get a stack trace, but I tried these three and none of these callbacks were called.

process.on('uncaughtException', (err) => { ... });
process.on('unhandledRejection', (reason, p) => { ... });
process.on('SIGINT', () => { ... });

I even tried wrapping crawler.run() in try/catch.

try {
  await crawler.run();
} catch (error) {
  // ...
}

Do you know of other ways I can generate a full stack trace when the Node process crashes? Maybe somewhere in Crawlee where I can regularly print a stack trace (if that would help).
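For what it's worth, one thing I plan to try next is `process.setUncaughtExceptionCaptureCallback` (or launching Node with `--trace-uncaught`), since the capture callback intercepts the exception before Node's default crash handling. A self-contained sketch that spawns a throwaway child script so the crash is observable from outside:

```javascript
import { spawnSync } from 'node:child_process';
import { writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import path from 'node:path';

// Child script: the capture callback runs instead of Node's default
// "print and die" behavior, so we get the full stack on stderr.
const childSource = `
process.setUncaughtExceptionCaptureCallback((err) => {
  console.error('captured:\\n' + err.stack);
  process.exitCode = 42; // marker so the parent can tell the callback ran
});
setTimeout(() => { throw new Error('simulated EPERM crash'); }, 10);
`;

const scriptPath = path.join(tmpdir(), 'capture-demo.cjs');
writeFileSync(scriptPath, childSource);

const result = spawnSync(process.execPath, [scriptPath], { encoding: 'utf8' });
console.log(result.status); // 42
console.log(result.stderr.includes('simulated EPERM crash')); // true
```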

@Clearmist
Author

Clearmist commented Sep 3, 2024

I updated to 3.11.3 and this issue is still present.

[Error: EPERM: operation not permitted, stat]
{
  errno: -4048,
  code: 'EPERM',
  syscall: 'stat',
  path: 'C:\\Users\\{username}\\Repositories\\crawler-app\\storage\\request_queues\\nasa3d.arc.nasa.gov\\4fC3CInttKDsieR.json.lock'
}

I can see that the path in the error is not where I told crawlee to store the local data. Here is my configuration object:

const config = new Configuration({
  storageClientOptions: {
    localDataDirectory: path.join(app.getPath('userData'), 'crawlerStorage'),
  },
});

What crawlee uses
C:\Users\{username}\Repositories\crawler-app\storage\

What I told it to use
C:\Users\{username}\AppData\Roaming\Electron\crawlerStorage\

The datasets are stored in the right place, but the request_queues are being stored in the incorrect directory.

Also, the .lock files are showing up as directories in Windows 10.
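On the `.lock`-files-as-directories point: from what I can tell, this can be normal. Some file-locking libraries acquire the lock by atomically creating a `*.lock` directory with `mkdir` rather than writing a file, because `mkdir` fails if the path already exists. A stdlib-only sketch of that pattern (my own illustration, not Crawlee's code):

```javascript
import { randomUUID } from 'node:crypto';
import { mkdir, rmdir, stat } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import path from 'node:path';

const lockPath = path.join(tmpdir(), `${randomUUID()}.json.lock`);

// Acquire: mkdir is atomic, so only one process can win the race.
await mkdir(lockPath);
const info = await stat(lockPath);
console.log(info.isDirectory()); // true, the OS shows the lock as a folder

// A second acquisition attempt fails while the lock is held.
const second = await mkdir(lockPath).then(() => 'acquired', (e) => e.code);
console.log(second); // 'EEXIST'

// Release: remove the directory.
await rmdir(lockPath);
```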
