
Use dask to speed up SAM algorithm in mineral.py #168

Open
aheermann opened this issue Sep 18, 2019 · 8 comments

@aheermann
Contributor

The SAM algorithm in mineral.py accounts for the majority of the runtime when classifying an image, specifically these loops:

    # for each pixel in the image
    for x in range(M):

        for y in range(N):

            # read the pixel from the file
            pixel = data[x,y]

            # if it is not a no data pixel
            if not numpy.isclose(pixel[0], -0.005) and not pixel[0]==-50:

                # resample the pixel ignoring NaNs from target bands that don't overlap
                # TODO fix spectral library so that bands are in order
                resampled_pixel = numpy.nan_to_num(resample(pixel))

                # calculate spectral angles
                angles = spectral.spectral_angles(resampled_pixel[numpy.newaxis,
                                                                 numpy.newaxis,
                                                                 ...],
                                                  library.spectra)

                # normalize confidence values from [pi,0] to [0,1]
                for z in range(angles.shape[2]):
                    angles[0,0,z] = 1-angles[0,0,z]/math.pi

                # get index of class with largest confidence value
                index_of_max = numpy.argmax(angles)

                # get confidence value of the classified pixel
                score = angles[0,0,index_of_max]

                # classify pixel if confidence above threshold
                if score > threshold:

                    # index from one (after zero for no data)
                    classified[x,y] = index_of_max + 1

                    if scores_file_name is not None:
                        # store score value
                        scored[x,y] = score
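Independent of any parallelization framework, the per-pixel loop above can be collapsed into whole-array operations (SPy's spectral.spectral_angles also accepts a full (M, N, B) cube, so a single call over the image is another option). A minimal pure-NumPy sketch; the function name and shapes below are assumptions for illustration, not part of mineral.py:

```python
import numpy as np

def spectral_angles_vectorized(cube, members):
    """Angles between every pixel and every library spectrum at once.

    cube:    (M, N, B) image cube
    members: (C, B) library spectra
    returns: (M, N, C) spectral angles in radians
    """
    # normalize pixels and endmembers to unit length along the band axis
    cube_norm = cube / np.linalg.norm(cube, axis=2, keepdims=True)
    mem_norm = members / np.linalg.norm(members, axis=1, keepdims=True)
    # cosine of the angle is the dot product of the unit vectors
    cos = np.einsum('mnb,cb->mnc', cube_norm, mem_norm)
    # clip for numerical safety before arccos
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Vectorizing first also makes any later chunk-based parallelization cheaper, since each worker runs array code instead of a Python-level pixel loop.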

Parallelizing this method should substantially reduce runtimes. I think trying the Dask module would be a good starting point:
https://github.com/dask/dask
https://dask.org/
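One way Dask could fit here is to chunk the image cube by rows with dask.array and classify each chunk in parallel via map_blocks. The sketch below is a hypothetical outline under assumed shapes, not pycoal's actual implementation; it folds the angle computation and thresholding from the loop above into a per-chunk function:

```python
import numpy as np
import dask.array as da

def classify_block(block, members, threshold=0.0):
    """Classify one chunk of the cube by maximum SAM confidence.

    block:   (m, n, B) chunk of the image cube
    members: (C, B) library spectra
    returns: (m, n) class indices, 1-based (0 reserved for no data /
             below-threshold pixels)
    """
    b = block / np.linalg.norm(block, axis=2, keepdims=True)
    m = members / np.linalg.norm(members, axis=1, keepdims=True)
    angles = np.arccos(np.clip(np.einsum('mnb,cb->mnc', b, m), -1.0, 1.0))
    scores = 1.0 - angles / np.pi              # normalize [pi, 0] to [0, 1]
    classified = np.argmax(scores, axis=2).astype(np.int16) + 1
    classified[np.max(scores, axis=2) <= threshold] = 0
    return classified

# Chunk by rows, keeping the band axis whole, so chunks classify in parallel:
# cube = da.from_array(data, chunks=(256, data.shape[1], data.shape[2]))
# result = cube.map_blocks(classify_block, members,
#                          drop_axis=2, dtype=np.int16).compute()
```

Because classification is independent per pixel, chunking along the row axis is safe; drop_axis=2 tells Dask that the band axis disappears in the output.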

@lewismc
Member

lewismc commented Sep 18, 2019

Hi @aheermann, can you put this on the agenda for the next meeting? I am really keen to see what your plan is for this. Also, it might be appropriate for us to split this into smaller tasks, as it may end up being a pretty large undertaking.

@lewismc lewismc added this to the 0.6 milestone Sep 18, 2019
@aheermann
Contributor Author

Yep, I'll put it on the agenda. As for the undertaking, our idea was to do some preliminary investigation and trials with this module to see whether it could work. We also have Jonathan and Dennis investigating PyTorch for parallelizing the same code, so that we can move forward with the most appropriate module.

@lewismc
Member

lewismc commented Sep 19, 2019 via email

@lewismc
Member

lewismc commented Sep 27, 2019

Early branch available at https://github.com/capstone-coal/pycoal/tree/dask_trial

@aheermann
Contributor Author

aheermann commented Oct 1, 2019

Thus far we have been working on the SAM algorithm, trying to speed up pixel classification. We have tried several ways of splitting the pixel processing into dask delayed tasks in order to parallelize it, but on the smaller data set we are using, the scheduling overhead has not yet yielded any speedups. We are running on the f180201t01p00r05rdn_e_sc01_ort_img.hdr image, which, using the original master branch as a baseline, takes about 3 hours 25 minutes un-parallelized on my machine.
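The overhead described here is typical when each pixel (or very small batch) becomes its own dask.delayed task; grouping rows into a few coarse tasks amortizes the scheduling cost. A hypothetical sketch of that pattern (the function names are assumptions, not the dask_trial branch code):

```python
import numpy as np
import dask

@dask.delayed
def classify_rows(rows, members):
    """Classify a band of image rows by minimum spectral angle (pure NumPy)."""
    r = rows / np.linalg.norm(rows, axis=2, keepdims=True)
    m = members / np.linalg.norm(members, axis=1, keepdims=True)
    angles = np.arccos(np.clip(np.einsum('mnb,cb->mnc', r, m), -1.0, 1.0))
    # confidence is 1 - angle/pi, so argmax picks the smallest angle
    return np.argmax(1.0 - angles / np.pi, axis=2) + 1

def classify_image(data, members, n_tasks=8):
    # a handful of coarse tasks instead of one task per pixel
    bands = np.array_split(data, n_tasks, axis=0)
    return np.vstack(dask.compute(*[classify_rows(b, members) for b in bands]))
```

With only a few tasks, each doing substantial NumPy work, the per-task overhead becomes negligible relative to the computation, which is where a small data set would start to show gains.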

@lewismc
Member

lewismc commented Oct 1, 2019

@aheermann can you please hyperlink the dataset?
A few more questions:

> which has a baseline un-parallelized runtime

Do you mean the pycoal master branch? If not, then this is not much to worry about, as it is to be expected. Please provide more details. Thanks.

@aheermann
Contributor Author

aheermann commented Oct 17, 2019

Since the last update, dask was temporarily put on hold, as our personal machines were not powerful enough to take advantage of it. Now that we have access to AWS, we will pick work on dask back up. It will be one of several options, including PyTorch (#172) and Joblib (#177), available to users when running pycoal.

@lewismc
Member

lewismc commented Oct 17, 2019

@aheermann got it.
Thinking about the abstraction layer here is an important part of engineering a good solution. Please start thinking about that. It will require you to work with others in the group.

@lewismc lewismc removed this from the 0.6 milestone Nov 26, 2019