
Read file in chunks instead of reading it in memory at once? #5

Open
wapsi opened this issue Jun 13, 2024 · 2 comments

wapsi commented Jun 13, 2024

Pantastic seems to be a fantastic piece of software! But I benchmarked it a little, and noticed that when it scans a very large file (10 GB+), the Python process uses roughly as much RAM as the file is large. Is that correct, or did I misanalyze it?

I think it would be more efficient if it read the file in 128 KB to 100 MB chunks (maybe that could even be a configurable parameter) instead of reading the whole file into RAM at once. Or, if it already does chunk-based reading, maybe it should free the used memory immediately after scanning a particular chunk (before reading the next one). Of course, in that case, the last few bytes of the previous chunk need to stay in RAM in case a PAN is split between two chunks.
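
For illustration, a minimal sketch of what that kind of chunked reading with a small boundary overlap could look like (the chunk size, overlap length, and scan_chunk callback here are placeholders, not Pantastic's actual API):

    # Hypothetical sketch, not Pantastic's code: read in fixed-size chunks, keeping a
    # short tail from the previous chunk so a PAN straddling a boundary is still seen.
    CHUNK_SIZE = 1024 ** 2   # 1 MB per read; could be exposed as a config option
    OVERLAP = 32             # longer than the longest PAN (19 digits) plus separators

    def scan_file(path, scan_chunk):
        tail = b""
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(CHUNK_SIZE)
                if not chunk:
                    break
                # Matches inside the overlap region may be reported twice and would
                # need de-duplicating by absolute offset.
                scan_chunk(tail + chunk)
                tail = chunk[-OVERLAP:]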

Centurix (Owner) commented

Thanks for trying it out. I suspect your benchmarking is correct and that it uses as many resources as it can get its hands on. The case I was using it for involved a single server whose sole job was scanning a network for PANs, so resource monitoring wasn't a high priority at the time.

From memory, it uses mmap to handle file contents, so I'd say that chunk-reading the file would be a good option to reduce the memory usage without affecting performance too much. There may be some chunk edge overlap necessary for the digit grouping to work correctly, but I can't imagine that being a blocker.

Hmm, just looking at the source, it does appear to already chunk the file reads into 1 MB sections:

    file_buffer = mm.read(1024**2)
    if not file_buffer:
        break

So it may be a case of the application not releasing memory. I'll take a look and see if there are gains to be found somewhere.
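
For reference, a rough sketch of a loop that only ever holds one chunk's worth of bytes on the Python side (scan_chunk is a placeholder, and the closing note about mmap residency is a general observation rather than a confirmed diagnosis):

    # Hypothetical sketch, not the actual Pantastic code: scan each 1 MB chunk and
    # drop the buffer reference before the next read.
    import mmap

    def scan_file(path, scan_chunk):
        with open(path, "rb") as handle:
            with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                while True:
                    file_buffer = mm.read(1024 ** 2)
                    if not file_buffer:
                        break
                    scan_chunk(file_buffer)  # keep only match results, not the bytes
                    del file_buffer          # free the chunk before reading the next one
    # Note: pages of a memory-mapped file stay resident once read until the map is
    # closed or the OS reclaims them, so RSS can still track the file size.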

Centurix self-assigned this Jun 15, 2024

wapsi commented Jun 15, 2024

Ah, good point. Maybe it's about releasing the used memory then, like you suggested. I noticed that when it scans a large file, the memory footprint grows while the file is being read, and once it moves on to the next file, the memory used for the finished file is released.
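
For what it's worth, one way to confirm that pattern is to sample the process RSS around each file; this sketch assumes the third-party psutil package and a placeholder scan_file function:

    # Rough measurement sketch (assumes `pip install psutil`); scan_file is a placeholder.
    import psutil

    def report_rss(paths, scan_file):
        process = psutil.Process()  # the current process
        for path in paths:
            before = process.memory_info().rss
            scan_file(path)
            after = process.memory_info().rss
            print(f"{path}: RSS {before >> 20} MB -> {after >> 20} MB")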
