
Very slow deletion of files in stage 1 #51

Open
dill-shower opened this issue Apr 29, 2024 · 3 comments

Comments

@dill-shower

In the first stage, the script quickly computes the embeddings and proceeds to deletion, but the file deletion itself is very slow: the average speed is about 7 files per second, and roughly 50 thousand files need to be deleted. As a result, removing similar screenshots alone takes several hours.

@cyber-meow
Owner

cyber-meow commented Apr 30, 2024

Hello,

This is not normal. I quickly tested on both my laptop (with a 3070 Ti) and my personal server (with a 4090), and on both I get around 11 it/s. Since by default 16 images are processed per batch (set with detect_duplicate_batch_size), this means over 100 images per second (11 it/s × 16 images per batch ≈ 176 images/s), so processing 6k images is effectively done within a minute. As I am using mobilenetv3 here, speed should not be an issue. The second stage is generally much slower and can indeed take hours (I am not sure whether the cropping from waifuc has been improved since then).

[Screenshot from 2024-04-30 08-16-47]

@dill-shower
Author

dill-shower commented Apr 30, 2024

Perhaps I have not described the problem clearly enough. I am satisfied with the speed of the embedding computation, but after the embeddings are computed and the list of files to delete is built, deleting them is very slow. Perhaps a screenshot will make it clearer.
[screenshot]

12 minutes to delete 6,000 files is a lot, and this particular step takes 80% of my total time.
Technical information:
I am using a fast SSD, and other programs can delete a thousand files per second (but they are not suitable for detecting similar screenshots).
I use WSL. The files are on the Windows SSD. I tried moving them to the WSL file system, but it did not give any meaningful speed gain at this step.
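
For reference, here is a minimal, self-contained sketch (not from the repository) that measures raw os.remove throughput on a given filesystem. Running it inside WSL once against the Windows-mounted path and once against the native WSL filesystem should show whether the bottleneck is the filesystem boundary itself rather than the script:

    import os
    import tempfile
    import time

    # Create 1,000 small dummy files in a temporary directory on the
    # filesystem under test (here: the current working directory).
    tmp_dir = tempfile.mkdtemp(dir=".")
    paths = []
    for i in range(1000):
        path = os.path.join(tmp_dir, f"dummy_{i}.txt")
        with open(path, "w") as f:
            f.write("x")
        paths.append(path)

    # Time only the deletions.
    start = time.perf_counter()
    for path in paths:
        os.remove(path)
    elapsed = time.perf_counter() - start

    print(f"Deleted {len(paths)} files in {elapsed:.2f}s "
          f"({len(paths) / elapsed:.0f} files/s)")
    os.rmdir(tmp_dir)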

@cyber-meow
Owner

cyber-meow commented Apr 30, 2024

Normally this part completes almost instantaneously. It just calls some basic functions, as below:

    import os
    from tqdm import tqdm

    # Delete each flagged image, then any related files
    # reported by get_related_paths for that image.
    for sample_id in tqdm(samples_to_remove):
        img_path, _ = dataset[sample_id]
        os.remove(img_path)
        related_paths = get_related_paths(img_path)
        for related_path in related_paths:
            if os.path.exists(related_path):
                os.remove(related_path)

It is hard to say where the slowdown comes from. You may want to run a profiler to see where the time is spent.
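
For instance, a minimal profiling sketch using the standard library's cProfile; delete_samples here is a hypothetical wrapper around the deletion loop above, not a function from the repository:

    import cProfile
    import pstats

    # Profile the deletion step and report where the time goes.
    with cProfile.Profile() as profiler:
        delete_samples(samples_to_remove, dataset)

    stats = pstats.Stats(profiler)
    stats.sort_stats(pstats.SortKey.CUMULATIVE)
    stats.print_stats(20)  # show the 20 most time-consuming calls

If os.remove itself dominates the cumulative time, the bottleneck is the filesystem rather than anything in the script.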
