
Crash when MS has multiple SPWs with different sizes #93

Closed
SpheMakh opened this issue Nov 12, 2017 · 45 comments

@SpheMakh (Collaborator)

table.getcol() raises an error if SPWs have different sizes. It may be better to first get the number of SPWs, then loop over them when getting the data, or do something similar to how different scans are treated.
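
A minimal sketch of that per-SPW loop (python-casacore; the MS path is illustrative):

from casacore.tables import table

ms = table("my.ms")
for ddid in sorted(set(ms.getcol("DATA_DESC_ID"))):
    # Rows within one DDID/SPW share a channel count, so getcol() succeeds.
    sub = ms.query("DATA_DESC_ID == %d" % ddid)
    data = sub.getcol("DATA")  # shape: (nrow, nchan_spw, ncorr)
    sub.close()
ms.close()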

@Jordatious

I'm also getting this issue. Is it possible to patch?

@o-smirnov (Collaborator)

@Jordatious could you make a small MS for me to test this? Perhaps extract just one scan from your failing MS, and send me that + parset? (Can also put it on ILIFU somewhere where I have access.)

@Jordatious

Let me find out if I can share the data

@Jordatious commented Dec 8, 2020

Hey Oleg, I've copied the (whole) dataset here: /scratch2/users/oms/Jordan/. I stole the config from Ian here, with a few tweaks because of the funny way I'd arranged the columns. I've put this parset and the sbatch file to call CubiCal in the directory too.

@Jordatious

Hey @o-smirnov, can you confirm you have a copy of the data? Please let me know how you go.

@o-smirnov (Collaborator)

Copying it to Rhodes now. Going along at a merry clip of ~80MB/s, so shouldn't be long!

@o-smirnov (Collaborator)

Right, got it.

@o-smirnov (Collaborator)

@Jordatious, do you know how this MS came to be in such an interesting state? The spectral windows actually overlap by one channel at the edges, which is quite difficult to handle consistently, if one wants a solution with a frequency solint that spans across SPW boundaries.

I'll push a fix shortly which will at least allow for one SPW at a time to be solved for.

@Jordatious

Hey @o-smirnov, it was output from the IDIA pipeline, which calls mstransform with an SPW fed in by the user. In this case, as agreed upon at the time by the LADUMA processing WG, I set the SPWs to overlap by 2 MHz, with the following in my config:

spw = '0:880~933MHz,0:960~1012MHz,0:1010~1062MHz,0:1060~1112MHz,0:1110~1163MHz,0:1301~1334MHz,0:1332~1382MHz,0:1380~1420MHz'

From memory, one SPW had an issue and I forgot to fix that before including it in the concat. Generally we don't include this overlap, and I suppose one solution for LADUMA would be to not include the overlap until spectral-line imaging, which is presumably a later split operation.
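
For reference, a hedged sketch of that pipeline step (CASA's mstransform; the file names are illustrative, and the spw string is truncated from the config above):

mstransform(vis='raw.ms', outputvis='laduma.ms', datacolumn='data',
            spw='0:880~933MHz,0:960~1012MHz,0:1010~1062MHz')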

@o-smirnov (Collaborator)

Yeah, it's quite a mission to get the code to support overlapping SPWs. So if it can be avoided, let's try that.

I did push a fix to the issue-93 branch which lets it run with one SPW at a time (via the --sel-ddid option).
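
For example (assuming the usual gocubical entry point; DDID 0 is illustrative), to solve SPW 0 only:

gocubical my.parset --sel-ddid 0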

@Jordatious commented Dec 14, 2020

Hey @o-smirnov, thanks for that! So if I have contiguous SPWs, with nchan differing by 0 or 1 channels in each SPW (but otherwise contiguous), will the default (i.e. without --sel-ddid) still work? Or will I also need to iterate through each SPW? Does this significantly slow down the run-time?

@KshitijT, could we have this version pulled to ilifu?

@o-smirnov (Collaborator)

I see we wrote the code with the anticipation of having multiple SPWs with a different number of channels, as long as the frequency increases monotonically. The initial error you were getting was due to another bit of code added later, which I've now fixed.
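
A quick hedged check of that monotonicity assumption (python-casacore; the MS path is illustrative):

import numpy as np
from casacore.tables import table

spw = table("my.ms::SPECTRAL_WINDOW")
freqs = np.concatenate([spw.getcell("CHAN_FREQ", i) for i in range(spw.nrows())])
spw.close()
# The multi-SPW code path assumes channel frequency increases monotonically.
assert np.all(np.diff(freqs) > 0)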

What I'll do is extract spws 0,2,4,6 from your MS and try it on that. Stay tuned.

@o-smirnov (Collaborator)

OK, doesn't quite cut the mustard. Let me work on another fix...

@cyriltasse (Collaborator)

About the padding thing - might be quite easy to do a flux-dependent padding... For the few facets containing bright sources increase the padding by some factor...

@o-smirnov (Collaborator)

Wrong issue. Good to know you're reading it though. ;)

@o-smirnov (Collaborator)

For the few facets containing bright sources increase the padding by some factor...

Great idea. And move to facet-dependent resolution while we're at it! We're still hijacking the issue though...

@cyriltasse (Collaborator)

Wrong issue. Good to know you're reading it though. ;)

Oh yeah - sorry !!! :)

@bennahugo (Collaborator) commented Dec 15, 2020

Hey @o-smirnov, thanks for that! So if I have contiguous SPWs, with nchan differing by 0 or 1 channels in each SPW (but otherwise contiguous), will the default (i.e. without --sel-ddid) still work? Or will I also need to iterate through each SPW? Does this significantly slow down the run-time?

@KshitijT, could we have this version pulled to ilifu?

@Jordatious be careful with overlapping SPWs. I have done some testing with CASA gaincal with combine='spw' on BDA-averaged data. In that case we have chunked overlapping SPWs at various differing resolutions, into which we bin the baselines based on how much smearing can be tolerated (ska-sa/xova). CASA 5 gaincal does not handle this correctly: the solution intervals it constructs seem to assume some form of regular spacing, because it flags out entire good SPWs in this case. So it is not just CubiCal that cannot operate in this regime. @JSKenyon is working on supporting overlapping SPWs for the BDA case in CubiCal's successor, QuartiCal, and I know he has recently got it to work in this overlapping regime.
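
For context, the failing mode described above has roughly this call shape (a hedged sketch; all parameter values are illustrative):

gaincal(vis='bda.ms', caltable='bda.G', solint='inf', combine='spw')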

@Jordatious

Thanks for the heads-up @bennahugo. It's possible I've already run into this issue; that would be a nice explanation for why my current selfcal run isn't performing as usual! Although we use combine='scan'. What's BDA?

@o-smirnov, I'll test with SPWs that aren't overlapping, but likely with uneven numbers of channels in each SPW. Let me know when that should work here, and which version/branch to use.

@bennahugo (Collaborator) commented Dec 15, 2020 via email

@o-smirnov (Collaborator)

OK, this code was less tested than I thought, but I think the issue-93 branch now has a working version that supports unequal, non-overlapping SPWs. Note that the frequency solution interval cannot be larger than an SPW. @Jordatious, can you test with a GitHub branch directly, or do we need to roll you a Stimela container?

Also, @landmanbester, could you please do me a favour, and run this branch against your Cyg A dataset to check that you have the same results? I'd like to make sure that nothing is broken in the case of equally-sized SPWs.

@landmanbester (Collaborator)

Also, @landmanbester, could you please do me a favour, and run this branch against your Cyg A dataset to check that you have the same results? I'd like to make sure that nothing is broken in the case of equally-sized SPWs.

Sure, I'll rerun and check that I get the same dirty image

@Jordatious

OK, this code was less tested than I thought, but I think the issue-93 branch now has a working version that supports unequal, non-overlapping SPWs. Note that the frequency solution interval cannot be larger than an SPW. @Jordatious, can you test with a GitHub branch directly, or do we need to roll you a Stimela container?

Brilliant! If you can containerise it on ilifu (/software/astro/caracal/), that'd be great!

@o-smirnov (Collaborator)

@KshitijT, I have pushed an image called stimela/cubical:1.5.8-iss93 to quay.io and dockerhub. Could you set @Jordatious up so he can use it?

@landmanbester (Collaborator)

I reran the calibration on Cygnus A data in A config and compared the corrected data from master and the issue-93 branch. The only part of the parset that is different between runs is the name of the output column. There is a maximum absolute difference between the columns of about 0.05 and a max relative difference of about 0.0008. I am not sure if this suffices. Do you want me to look at the difference between the dirty images as well?

@KshitijT (Collaborator)

@KshitijT, I have pushed an image called stimela/cubical:1.5.8-iss93 to quay.io and dockerhub. Could you set @Jordatious up so he can use it?

Let me give it a shot.

@o-smirnov (Collaborator)

Thanks @landmanbester, that's very useful. There were no changes to the solver so there should not be any difference. However, I also added a workaround for casacore/python-casacore#130, so there's a possibility master is being affected by that bug (which would be very interesting to say the least!) Let me make a branch where the workaround is disabled and try your solve with that...

@KshitijT (Collaborator)

@Jordatious, I have put the new Singularity CubiCal image in /software/astro/caracal.

The image name is cubical_frankl.sif. Let me know if there is any problem with running that.

@Jordatious

@KshitijT brilliant, thanks! Hopefully I'll get a chance to test this later this afternoon!

@landmanbester (Collaborator)

Thanks @landmanbester, that's very useful. There were no changes to the solver so there should not be any difference. However, I also added a workaround for casacore/python-casacore#130, so there's a possibility master is being affected by that bug (which would be very interesting to say the least!) Let me make a branch where the workaround is disabled and try your solve with that...

Hold on, sorry, I'm being daft. It is just the phase ambiguity. I thought np.testing.assert_array_almost_equal compares absolute values, but explicitly comparing the absolutes gives a max absolute difference of about 5e-5, so it's all good.
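
A sketch of that check (python-casacore; the second column name is hypothetical, standing in for the master run's output):

import numpy as np
from casacore.tables import table

t = table("cyga.ms")
a = np.abs(t.getcol("CORRECTED_DATA"))        # issue-93 run
b = np.abs(t.getcol("CORRECTED_DATA_OLD"))    # master run (hypothetical name)
t.close()
# Gains are only defined up to a phase, so compare amplitudes, not complex values.
print(abs(a - b).max())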

@o-smirnov (Collaborator)

Ah, an even better outcome. Thanks, @landmanbester!

@Jordatious

So I got this working from start to end for data without the overlapping SPWs. However, the results were quite bad! I suspect it's because the following defaults of @IanHeywood's that I used here mean 256 and 64 MHz, right?

[g]
...
freq-int = 256
...
[de]
...
freq-int = 64

This is larger than the ~30-50 MHz width of each SPW. I could try this again with a width of ~30 MHz, although I'm basically blindly copying the config here, so I'm not sure if the results will be sensible. I also notice freq-chunk = 256. I presume this chunking is how it internally partitions the data, while freq-int is the frequency interval over which it solves. Is that correct?

@JSKenyon (Collaborator) commented Jan 7, 2021 via email

@o-smirnov (Collaborator)

Yep. And the effective freq_int can't be bigger than a chunk, so it would use min(freq_int, freq_chunk).

@Jordatious, if you want one solution per SPW regardless of SPW size, just set freq-int=0, freq-chunk=0. Zero in this case means "full SPW".
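
For reference, a hedged parset fragment implementing that (section placement follows the usual CubiCal layout; adjust to your own config):

[data]
# 0 means the chunk spans the full SPW
freq-chunk = 0

[g]
# 0 means one solution per SPW
freq-int = 0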

Beyond that, there could be many reasons for bad results... one thing to try in terms of bracketing the problem -- run it on a single SPW (via --sel-ddid) and compare the results with the same SPW, but coming from a multi-SPW run. That will tell us if the problem is with my new code, or elsewhere.

@Jordatious

Thanks for that. In that case, since the channel width is ~836 kHz, those intervals are 214.0 and 53.5 MHz. So maybe I should try an interval/chunk of 36 channels (≈30.1 MHz, just under the narrowest SPW), or otherwise 0.

Sorry, it wasn't clear whether you were suggesting I use --sel-ddid or not when the SPWs weren't overlapping. I now suspect I was supposed to, and that overlapping SPWs are totally off the table anyway. So by "multi-SPW run", do you mean without --sel-ddid? That's what I already did.

If I run this again, will it re-write the corrected data column, using the MODEL_DATA and DIR1_DATA columns I already have, to solve again?

@o-smirnov (Collaborator)

What I mean is, if you run with --sel-ddid 0 for instance, it will select a subset of rows corresponding to SPW 0, and only process those. With the default of --sel-ddid None, it will process all SPWs (provided they don't overlap). Since the frequency solution interval is bounded by SPW boundaries anyway, the results for SPW 0 should be identical for both runs. Unless I screwed up.

If I run this again, will it re-write the corrected data column, using the MODEL_DATA and DIR1_DATA columns I already have, to solve again?

Yes, it will overwrite CORRECTED_DATA (or whatever the output column is set to). If you run with --sel-ddid 0, it will only overwrite the parts corresponding to SPW 0.

@Jordatious commented Mar 17, 2021

As hinted in #424, this seems to have worked. However, the peel did not work as expected: there are still a lot of residual artefacts around the source (see attached). The peak goes from 175 mJy to 1.8 mJy, so the source is peeled. So I consider this issue done, and suspect the rest might be down to the zeros I'm finding in the data (see below).

My config is identical to the one from Oxkat, other than the following:

diff my_peel.parset 3GC_peel.parset
4c4
< column = CORRECTED_DATA
---
> column = DATA
6c6
< freq-chunk = 0
---
> freq-chunk = 256
31c31
< column = DATA
---
> column = CORRECTED_DATA
44c44
< list = DIR1_DATA+-MODEL_DATA:MODEL_DATA
---
> list = MODEL_DATA+-DIR1_DATA:DIR1_DATA
197c197
< freq-int = 0
---
> freq-int = 256
236c236
< freq-int = 0
---
> freq-int = 64

So I used all (non-overlapping) SPWs for this, with a frequency interval and chunk size equal to the whole SPW. It reads from corrected and writes to data. I also tried reading from corrected and writing to corrected, which achieves the same result, which is cool. The column trickery is because my DIR1_DATA and MODEL_DATA are flipped.

I've plotted the data for MODEL_DATA and DIR1_DATA, and that reveals some non-flagged zeros in the data, which is weird, and possibly why the data didn't come out nicely. I first tried this with shadeMS, but hit the error here: ratt-ru/shadeMS#94 (issue 94 vs. 93 here!). So I did my own plotting, which is attached (first MODEL_DATA, the problem source, then DIR1_DATA, the whole sky including the problem source). Manually traversing the data, I've found zeros in both columns. It seems the majority are from a single dump, and almost all on antenna 0. I'm still checking the DATA column, but I haven't found any there yet. Any thoughts about these problems, and whether you agree the result should be much better, would be welcome!
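
For the record, a sketch of that zero-hunt (python-casacore; the MS path is illustrative, looping per SPW since cell shapes differ):

import numpy as np
from casacore.tables import table

ms = table("my.ms")
for ddid in sorted(set(ms.getcol("DATA_DESC_ID"))):
    sub = ms.query("DATA_DESC_ID == %d" % ddid)
    for col in ("MODEL_DATA", "DIR1_DATA"):
        zeros = (sub.getcol(col) == 0) & ~sub.getcol("FLAG")  # unflagged zeros only
        rows = zeros.any(axis=(1, 2))
        if rows.any():
            print(ddid, col, np.bincount(sub.getcol("ANTENNA1")[rows]))
    sub.close()
ms.close()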

Assuming this works going forward, I guess this will be pulled into master (once the errors are fixed in #424), and usable from one of the standard Stimela containers, right?

(attached: peel output image, plus the MODEL_DATA and DIR1_DATA plots)

@JSKenyon (Collaborator)

Something seems a bit off. In particular, I am concerned that some of the sources in the left half of the image have worse artefacts after peeling, when one would expect them to be largely unaffected (unless they are part of the peeling model?). Are you certain that your model is correct?

@Jordatious

Exactly. They're not part of the model that should be peeled. Other than the zeros, the model should be good. The sky model is based on what was fed into selfcal, which did sensible things. It's a fairly small sky model, spanning only 5k pixels at 1.5 arcsec per pixel (so ~2 degrees in diameter). Not sure if that's a problem.

The model for the problem source seems sensible, corresponding to a double source with one component brighter than the other, and it's quite smooth. I thought they should be brighter in the plot, so I'm not sure about that yet.

@JSKenyon (Collaborator)

Interesting. It is not impossible that there are calibration artefacts ("ghosts") being introduced by the DD solve, but I wouldn't anticipate it given the solution intervals (though if there are unflagged zeros, I guess that could be an issue). As a sanity check, could you possibly image your various model columns? What was the time interval for the peeled source?

@Jordatious

Well, from an initial look, there are no zeros in the DATA column (over-)written by CubiCal, at least on antenna 0, where they exist for the other columns. Sure, I'm imaging the model column now.

The config largely follows the one from Oxkat, which looks like it uses a 72 second time interval, assuming I've understood this correctly.

[de]
time-int = 72

@IanHeywood (Collaborator)

That's in units of integration times, which are assumed to be 8 seconds each, so 72 × 8 s = 576 s = 9.6 minutes.

@Jordatious

Ok thanks Ian.

I imaged the model column and it came out OK: almost entirely identical to the corrected data from after selfcal, with the problem source still present, as expected. I also imaged DIR1_DATA, but it came out identical to DATA (the column I got CubiCal to write out to), which I did not expect. I suspect tclean doesn't know how to image this custom column and defaults back to DATA; it doesn't say so in the logs, but that would explain it. So I might copy it to a column it already knows about and then image it again.
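
A one-line sketch of that copy (python-casacore TaQL; the path is illustrative, and a TaQL update handles the per-SPW cell shapes):

from casacore.tables import taql
taql("UPDATE my.ms SET CORRECTED_DATA = DIR1_DATA")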

Here is a gif blinking between these two columns.


o-smirnov added a commit that referenced this issue May 12, 2021
* partial fix for #93. Should allow for one spw at a time

* fixes #93. Adds workaround for casacore/python-casacore#130

* added future-fstrings

* correctly named future-fstrings

* ...and install it for all pythons

* back to mainstream sharedarray

* lowered message verbosity

* dropping support for python<3.6

* Issues 427 and 429 (#430)

* Fixes #427

* Fixes #429

* Update __init__.py

Bump version

* Update Jenkinsfile.sh (#447)

* Update Jenkinsfile.sh

Remove python 2 building

* Update Jenkinsfile.sh

* Issue 431 (#432)

* Beginning of nobeam functionality.

* Remove duplicated source provider.

* Change syntax for python 2.7 compatibility.

* Update montblanc version in setup to avoid installation woes.

Co-authored-by: JSKenyon <[email protected]>

* Remove two uses of future_fstrings.

* make pypi compatible (#446)

* make pypi compatible

* error

* Update setup.py

Only support up to v0.6.1 of montblanc

* Update setup.py

Co-authored-by: Benjamin Hugo <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: JSKenyon <[email protected]>

* Update __init__.py

* Fix warning formatting

Co-authored-by: Oleg Smirnov <[email protected]>
Co-authored-by: Benjamin Hugo <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: Gijs Molenaar <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
@Jordatious

@SpheMakh

@Jordatious

Thanks @SpheMakh for looking at this for LADUMA. To briefly clarify the previous post, this is also showing how the other column came out. So this blinks between data, model and corrected. I haven't managed to image the DIR1 column yet, but I can if it's useful. So basically I expected a much better result from this, and I'm wondering if the zeros in the data are causing problems here, although I'm not sure where they're coming from!


JSKenyon added a commit that referenced this issue Apr 13, 2023
* Added different slope modes for slope solvers. Experimental.

* initial rehashing to support a more general slope structure (e.g. iono+delay)

* fixes #428

* fixes #433

* no-X polcal experiments

* added support for arbitrary Jones terms in *first* chain position

* fixed typo

* Noxcal (#450)


* fixed bug when outer chain term was DD and incorrect model was used. Properly implemented scalar updates

* added scalar and unislope update types

* added auto-generated Stimela2 schemas

* added schemas to manifest

* use / separator for stimela schema; omit defaults in schema; fix path

* minor bugfixes

* added stimela_cabs file. Fixed problem with legacy-only flags on output

* added stimela_cabs.yml

* changed / to .

* fixed problem with IFR gain flags overpropagating

* added parset argument

---------

Co-authored-by: Oleg Smirnov <[email protected]>
Co-authored-by: Lexy Andati <[email protected]>
Co-authored-by: Benjamin Hugo <[email protected]>
Co-authored-by: Gijs Molenaar <[email protected]>
o-smirnov added a commit that referenced this issue Feb 12, 2024
* Update __init__.py

* Stimelation (#480)

* fixes to AIPS leakage plotter (more table format support); started an AIPS bandpass plotter but this is WIP

* add check for INDE value in PRTAB leakages

* changed to hierarchical config

---------

Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: Jonathan <[email protected]>
Co-authored-by: Lexy Andati <[email protected]>
Co-authored-by: Benjamin Hugo <[email protected]>
Co-authored-by: Gijs Molenaar <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: Athanaseus Javas Ramaila <[email protected]>