Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing library alias on Mac #134

Closed
1 task done
peastman opened this issue Mar 3, 2022 · 61 comments · Fixed by conda-forge/pytorch-cpu-feedstock#116
Closed
1 task done

Missing library alias on Mac #134

peastman opened this issue Mar 3, 2022 · 61 comments · Fixed by conda-forge/pytorch-cpu-feedstock#116
Labels

Comments

@peastman
Copy link

peastman commented Mar 3, 2022

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

The Mac packages distributed through the default channel include an alias with the name libopenblas.dylib pointing to the library. That allows it to be referenced with standard naming. The conda-forge packages are missing that alias. Instead you have to reference it with the name libopenblas.0.dylib. This causes problems for software that expects the standard name to work. See, for example, numpy/numpy#14165 and https://gitlab.kitware.com/cmake/cmake/-/issues/23291.

Would it be possible to add the missing alias?

Installed packages

# packages in environment at /Users/peastman/miniconda3/envs/test:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h3422bc3_4    conda-forge
ca-certificates           2021.10.8            h4653dfc_0    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libgfortran               5.0.0.dev0      11_0_1_hf114ba7_23    conda-forge
libgfortran5              11.0.1.dev0         hf114ba7_23    conda-forge
libopenblas               0.3.18          openmp_h5dd58f0_0    conda-forge
libzlib                   1.2.11            hee7b306_1013    conda-forge
llvm-openmp               13.0.1               hf3c4609_0    conda-forge
ncurses                   6.3                  hc470f4d_0    conda-forge
openssl                   3.0.0                h3422bc3_2    conda-forge
pip                       22.0.3             pyhd8ed1ab_0    conda-forge
python                    3.9.10          h38ef502_2_cpython    conda-forge
python_abi                3.9                      2_cp39    conda-forge
readline                  8.1                  hedafd6a_0    conda-forge
setuptools                60.9.3           py39h2804cbe_0    conda-forge
sqlite                    3.37.0               h72a2b83_0    conda-forge
tk                        8.6.12               he1e0b03_0    conda-forge
tzdata                    2021e                he74cb21_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h642e427_1    conda-forge
zlib                      1.2.11            hee7b306_1013    conda-forge

Environment info

active environment : test
    active env location : /Users/peastman/miniconda3/envs/test
            shell level : 2
       user config file : /Users/peastman/.condarc
 populated config files : /Users/peastman/.condarc
          conda version : 4.11.0
    conda-build version : not installed
         python version : 3.8.11.final.0
       virtual packages : __osx=12.2=0
                          __unix=0=0
                          __archspec=1=arm64
       base environment : /Users/peastman/miniconda3  (writable)
      conda av data dir : /Users/peastman/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/omnia/osx-arm64
                          https://conda.anaconda.org/omnia/noarch
                          https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/peastman/miniconda3/pkgs
                          /Users/peastman/.conda/pkgs
       envs directories : /Users/peastman/miniconda3/envs
                          /Users/peastman/.conda/envs
               platform : osx-arm64
             user-agent : conda/4.11.0 requests/2.25.1 CPython/3.8.11 Darwin/21.3.0 OSX/12.2
                UID:GID : 503:20
             netrc file : None
           offline mode : False
@peastman peastman added the bug label Mar 3, 2022
@isuruf
Copy link
Member

isuruf commented Mar 3, 2022

It's in openblas package along with the headers.

@isuruf isuruf closed this as completed Mar 3, 2022
@peastman
Copy link
Author

peastman commented Mar 3, 2022

I'm trying to track down a build failure. Perhaps you have insight into what the real problem is? My library uses PyTorch and links to libtorch.dylib. It does not directly use OpenBLAS, but apparently PyTorch does at some level, and I get an error trying to configure with CMake, telling me that libopenblas.dylib doesn't exist. Installing the PyTorch package installs the libopenblas package but not the openblas package. So what is the real problem?

  • Should the PyTorch package list openblas as a dependency?
  • Should the libopenblas package include the alias (the same way the corresponding package on the default channel does)?
  • Should CMake recognize that the correct library name is libopenblas.0.dylib rather than libopenblas.dylib?
  • Something else?

One way or another, the current behavior is that if you install the PyTorch conda package and then use CMake to build something that links to libtorch, you get an error. What is the correct place to fix this?

@isuruf
Copy link
Member

isuruf commented Mar 3, 2022

What's the error exactly?

@peastman
Copy link
Author

peastman commented Mar 3, 2022

make[2]: *** No rule to make target '/Users/peastman/miniconda3/envs/openmm/lib/libopenblas.dylib', needed by 'libOpenMMTorch.dylib'.  Stop.
make[1]: *** [CMakeFiles/Makefile2:190: CMakeFiles/OpenMMTorch.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

@peastman
Copy link
Author

Does anyone have insight on what the correct fix is? I don't know whether the change should be in PyTorch, OpenBLAS, or CMake, but something needs to be changed somewhere.

@isuruf
Copy link
Member

isuruf commented Mar 19, 2022

See #134 (comment)

@peastman
Copy link
Author

See #134 (comment) above. If a user simply installs PyTorch and tries to compile against it, they get a compilation error. The package includes libtorch, which contains the C++ API and is designed to be a library users link their own code to. Using the package in the intended way leads to errors. The fact that it uses OpenBLAS is an internal implementation detail of PyTorch, exactly the sort of thing conda is supposed to take care of for you.

Is there any good reason not to include the symlink in libopenblas? Omitting it creates problems for users. The version of the package on the defaults channel includes it.

@peastman
Copy link
Author

Can anyone else comment on this? The problem is still not resolved. I believe the solution I originally proposed (including the alias in libopenblas) is the correct solution. I'm open to other possibilities, but no one has suggested a better one or pointed out any downside to including the alias. It's a tiny change that saves a lot of problems for users and makes the conda-forge version of the package consistent with the default channel version.

cc @jakirkham @h-vetinari

@isuruf
Copy link
Member

isuruf commented Mar 30, 2022

Is there any good reason not to include the symlink in libopenblas?

Yes, to prevent downstream packages from accidentally linking to them. The symlink and headers are in the development package just like debian, fedora and countless other package managers. Are you suggesting that debian should package the symlink and the headers in the non dev package?

@isuruf
Copy link
Member

isuruf commented Mar 30, 2022

@peastman
Copy link
Author

Again, I'm happy to consider other options. But you still haven't suggested any solution at all to the problem. Let me repeat one more time exactly what the steps are.

  1. Install the pytorch package from CondaForge.
  2. Try to compile a program that links to libtorch.
  3. It fails with a linker error saying that libopenblas.dylib can't be found.

Notice that the user has not installed OpenBLAS. Until they hit this error message, they didn't even know that PyTorch used OpenBLAS. They certainly don't know there are two different packages related to OpenBLAS. Further, this error only happens with recent versions of PyTorch. With older versions the above steps work perfectly, because they require an older version of OpenBLAS than is available on CondaForge. Instead conda installs a package from the defaults channel which does include the alias.

Are you suggesting that debian should package the symlink and the headers in the non dev package?

What does this have to do with Debian? This is an error on Mac.

@isuruf
Copy link
Member

isuruf commented Mar 31, 2022

But you still haven't suggested any solution at all to the problem.

Ask pytorch-cpu-feedstock maintainers to patch the cmake file. It should not be necessary to link to libopenblas.dylib.

What does this have to do with Debian? This is an error on Mac.

It's not unique to Mac. This doesn't manifest in Linux x86, because pytorch uses MKL in Linux x86.

Anyways, this is a pytorch-cpu-feedstock issue and not this feedstock.

@peastman
Copy link
Author

peastman commented Apr 3, 2022

Ask pytorch-cpu-feedstock maintainers to patch the cmake file.

This has nothing to do with a PyTorch CMake file. PyTorch simply calls find_package(BLAS). CMake has built in support for finding a variety of BLAS implementations. When it finds OpenBLAS on Mac it expects there to be a library called libopenblas.dylib. Every other way of installing OpenBLAS includes an alias with that name. The CondaForge package is unique in not including it. It is the one that is not following the convention. I already tried raising an issue with the CMake team and was explicitly told, "This is a bug in the conda package."

@h-vetinari
Copy link
Member

I think the issue is that pytorch has a host-dependency on openblas (hence, with the development headers, library alias, etc.), but the openblas output doesn't have it's own run_export (and therefore just inherits the one from libopenblas).

I think this setup is actually correct from the openblas side, but it sounds like the pytorch feedstock should add a runtime-dependency on openblas; or alternatively, add a devel package (that's suitable for being compiled against) that includes it.

CC @conda-forge/pytorch-cpu

@peastman
Copy link
Author

peastman commented Apr 8, 2022

Can any of the PyTorch package maintainers comment on this?

@hmaarrfk
Copy link

hmaarrfk commented Apr 8, 2022

I think @h-vetinari is correct. PRs welcome.

@isuruf
Copy link
Member

isuruf commented Apr 8, 2022

No, @h-vetinari is not correct. The correct way is to patch pytorch's cmake files to remove the dependence on libopenblas.dylib.

@isuruf
Copy link
Member

isuruf commented Apr 8, 2022

Or make pytorch link to the blas library as PRIVATE instead of PUBLIC.

@peastman
Copy link
Author

peastman commented Apr 8, 2022

The correct way is to patch pytorch's cmake files to remove the dependence on libopenblas.dylib.

Why do you keep saying that? PyTorch's CMake files do not refer to libopenblas. All it does is call find_package(BLAS), and CMake locates a BLAS implementation for it.

@isuruf
Copy link
Member

isuruf commented Apr 8, 2022

All it does is call find_package(BLAS), and CMake locates a BLAS implementation for it.

Yes, but it links to BLAS as a public dependency instead of a private dependency. See https://github.com/pytorch/pytorch/blob/e5bf87963d65d6273a6ec3505a50c07229643606/cmake/Dependencies.cmake#L194. I suggest you start reading https://cmake.org/cmake/help/latest/command/target_link_libraries.html

Please read up on how CMake target files work and have a look at /lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Targets.cmake file.

@isuruf
Copy link
Member

isuruf commented Apr 8, 2022

PyTorch's CMake files do not refer to libopenblas.

This is clearly wrong. Have a look at lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Targets.cmake and also read https://cmake.org/cmake/help/latest/command/export.html#exporting-targets to get an understanding.

@peastman
Copy link
Author

@hmaarrfk what do you think is the right solution? I'm happy to make a PR. I just need to know what it should change.

@hmaarrfk
Copy link

I think isuruf is right (he has very good instincts on this) .

I think you have to hunt down a cmake file that has the word PUBLIC in it where it shouldn't.

Not very easy to find.

@jakirkham
Copy link
Member

When it comes to this kind of hunting, git grep does a lot better that GitHub search IME

@hmaarrfk
Copy link

Don't we have the package you are looking for

https://github.com/conda-forge/openmm-feedstock

Can you follow that recipe?

@jakirkham
Copy link
Member

Peter maintains that feedstock IIUC. The issue is PyTorch (upstream) has something wonky in its CMake files AIUI

@peastman
Copy link
Author

peastman commented May 9, 2022

I've been looking into implementing this. The more I investigate it, the less comfortable I am with the proposed solution.

Here is the relevant PyTorch CMake script. It builds two lists called Caffe2_DEPENDENCY_LIBS and Caffe2_PUBLIC_DEPENDENCY_LIBS. Later in the build process, a different script links anything added to the former as private and anything added to the latter as public. As part of building the list, it looks for many different BLAS implementations and whichever one it finds, it adds the relevant library to Caffe2_PUBLIC_DEPENDENCY_LIBS. This works for all implementations other than OpenBLAS. For example, the MKL conda package includes aliases for all its libraries with no need to install a separate dev package. And of course it also works for any OpenBLAS package other than the CondaForge one.

I could create a patch to the script so that OpenBLAS_LIB would get added to Caffe2_DEPENDENCY_LIBS instead of Caffe2_PUBLIC_DEPENDENCY_LIBS. That might work, though I haven't tested it. But it raises a number of issues.

  • Why are we making it treat OpenBLAS differently from all other BLAS implementations?
  • Why are we hacking PyTorch's build scripts at all?
  • This will be a brittle fix. Future changes to PyTorch's build scripts could easily break the patch.
  • This changes the binary interface for libtorch, which risks breaking other software that uses it.

Instead we could trivially and robustly fix the problem by just adding one alias to the OpenBLAS package which is already present in other distributions of OpenBLAS, including the corresponding package on the conda defaults channel.

@peastman
Copy link
Author

peastman commented Jun 2, 2022

Your assumption that the alias doesn't exist is wrong.

Isn't it?

Download the package and see for yourself. I'm not talking about what happens while building the PyTorch package. I'm talking about what happens when you install PyTorch to use its C++ API, and try to compile your own code that links to it:

  • conda install pytorch
  • Compile your code that links to libtorch.
  • Get a cryptic linker error involving a library you've never heard of.

You're somehow supposed to figure out on your own that there exists a package called openblas, that it's different from the libopenblas package that was automatically installed with PyTorch, somehow track down what the difference between the two packages is (the names give no clue), and finally realize that you need to manually install that second package if you want PyTorch to work correctly.

with Isuru's strong opposition to changing this (and his experience & standing on these topics), it looks like you're fighting an uphill battle on this.

Isuru has not offered a single reason why it would be harmful to add the alias, nor addressed the fact that the alias is present in other comparable packages (such as MKL, or the defaults channel version of libopenblas), nor pointed out any advantage of his suggested change which would be far more complicated, far more work, and far more fragile. In fact, he hasn't replied to anything I've said for nearly two months.

@hmaarrfk
Copy link

hmaarrfk commented Jun 2, 2022

Unfortunately, i really don't have an Mac M1. I am unable to download and see for myself. A log file showing your error would go a long way.

Generally speaking: a one line patch to pytorch is not "hacking" and would likely get accepted upstream.
Generally speaking: from an engineering standpoint, upstream isn't always interested in "integration in an ecosystem" as a first priority. So we often have to do small fixes like one line patches to build systems to get them to work with conda.
Generally speaking: it is common to ship the bare .so files in a different package to ensure that software doesn't accendentallg link to them. This seems to be true even in other distributions like ubuntu or Fedora where devel or dev packages exist.

@hmaarrfk
Copy link

hmaarrfk commented Jun 2, 2022

We have also worked alot with different upstream package providers software authors to help them improve their build systems.

They are, generally speaking, really interested in getting people to use their software. They just don't really test for "conda" at first.

Edit: package provider -> software authors

@peastman
Copy link
Author

peastman commented Jun 3, 2022

I just don't understand where you're coming from. Can you give one single disadvantage to including the alias? Is there any possible use case where it would cause problems? If not, then why is there any disagreement?

And let me emphasize again, this package is the outlier that's violating the conventions of the rest of the ecosystem. If you install the same package from the defaults channel, it includes the alias. If you install a different BLAS from conda-forge, such as MKL, it includes all necessary aliases. But this one doesn't, and that omission causes major problems for users.

Unfortunately, i really don't have an Mac M1. I am unable to download and see for myself.

Just download https://anaconda.org/conda-forge/libopenblas/0.3.20/download/osx-64/libopenblas-0.3.20-openmp_hb3cd9ec_0.tar.bz2. Expand it and see for yourself that it contains a library called libopenblas.0.dylib, but not an alias called libopenblas.dylib pointing to it.

Now download https://anaconda.org/anaconda/libopenblas/0.3.20/download/osx-64/libopenblas-0.3.20-h9a5756b_1.tar.bz2, which is the exact same version of the same package from the defaults channel, and see that it does include the alias.

A log file showing your error would go a long way.

See #134 (comment) above.

@ocefpaf
Copy link
Member

ocefpaf commented Jun 3, 2022

I just don't understand where you're coming from. Can you give one single disadvantage to including the alias? Is there any possible use case where it would cause problems?

You probably missed #134 (comment)

@peastman
Copy link
Author

peastman commented Jun 4, 2022

I saw it. It just doesn't make any sense. No downstream package will link to it unless it's specifically trying to link to it. Linking to a library is not something that happens "accidentally". Every library you link to has to be specified by name. And let me emphasize yet again, this is how every other comparable package works. Including the same package from the default channel. Including other BLAS implementations from conda-forge.

@hmaarrfk
Copy link

hmaarrfk commented Jun 4, 2022

Let me try to recap in a constructive way.

You asked for help, from volunteers, without providing any logs claiming that something was broken. Instead you pointed to how "defaults" does it.

There are reasons why you are choosing conda-forge. Some of the design decisions behind how conda-forge is built is the reason why you find this attractive.

Now, you had 3 core members of conda-forge respond to you in a consistent manner.

In fact, two of those are actively involved in building large packages like pytorch.

In #134 (comment) and #134 (comment) isuru pointed you exactly to the line of code that you would have to suggest changing upstream, and for conda-forge that would address your issues.

I, a maintainer of the pytorch feedstock, would personally accept such a patch, and help you justify it upstream.

We aren't saying we are infallible, but in this case, you got some pretty direct feedback about how to solve this particular issue. Please try to make an attempt to resolve it.

@peastman
Copy link
Author

peastman commented Jun 4, 2022

@hmaarrfk I've been trying very hard to be constructive through this whole discussion, but I'm also finding it incredibly frustrating. It began with Isuru basically dismissing my suggestions with a vague reference to Debian, an argument that made no sense and displayed a lack of understanding of the deep differences between the dependency models in Linux and Python. I nonetheless made a sincere effort to implement the solution he suggested, but quickly concluded it just didn't make sense. I wrote up a technical discussion of the problems with it, which was met with complete silence. It took over three weeks and multiple further posts to get anyone to reply, and when they did, I was basically told, "Isuru doesn't like it so it isn't going to happen." Meanwhile, Isuru has not replied to a single thing I've posted for two months, which makes a productive discussion entirely impossible.

I am 100% willing to do the work to fix the problem. But no one seems interested in what is obviously the correct fix, instead referring back to the earlier suggestion which has serious problems that I already explained in detail. Every time I try to discuss the technical reasons for preferring it, I am met with silence. Looking back over the discussion, I count no less than four times when that happened.

You asked for help, from volunteers, without providing any logs claiming that something was broken.

I did post it, over two months ago. And I provided another link back to it just yesterday.

I am trying really, really hard to fix a serious problem that is impacting my work. I am willing to do the work to fix it. If you say you're willing to accept a PR adding the missing alias, I'll do it immediately and we can close this issue as resolved. But instead I feel I'm just getting ignored, and it's very upsetting.

@hmaarrfk
Copy link

hmaarrfk commented Jun 4, 2022

I'm not really sure if your assessment of the timeline is correct. isuruf replied in a very timely manner at the beginning of the issue.

isuruf likely has the most complete understanding of compilers, conda, and conda-forge of those i've interacted with.

It began with Isuru basically #134 (comment) with a vague reference to Debian

The internet can be dry at times. The balance between "brief" answers and "basically dismissing" is thin. However, he promptly replied to questions.

It took over three weeks and multiple further posts to get anyone to reply

Let me justify why I didn't reply. When asked for additional information, you provided 3 lines of a log. Often it is really important how you setup your information. When asked for the "exact" problem, often it is really valuable to provide full CMake Logs. We could have been more explicit about this request, but the level of experience compiling things yourself often indicates that you are comfortable communicating programming with fully reproducible logs.

I was basically told, "Isuru doesn't like it so it isn't going to happen."

Again. many of us trust isuru's judgement on these things.

Isuru has not replied to a single thing I've posted for two months, which makes a productive discussion entirely impossible

Again, nobody has to reply. if somebody feels like they cannot add anything constructive to the discussion, they often refrain from typing.

obviously the correct fix

I don't think any solution obvious. We often work collaboratively asking others for input and feedback.

I did #134 (comment), over two months ago. And I provided #134 (comment) just yesterday.

These logs are seriously incomplete. For an example on how to ask for external help, please see google/highway#726 where I included as an attachment, log information going all the way back to early cmake commands and the installation of dependencies. These are hard to make without CIs, but you can likely start at when you type

mkdir build
cd build
cmake ..
[...]

and provide us all that information.

I am trying really, really hard to fix a serious problem that is impacting my work.

As much as we want to help, we often won't be fast enough for a mission critical work. Its really up to you, as an engineer, to decide on the risk factors you want to choose and how you want to build your software dependencies.

The infrastructure at conda-forge is flexible enough to allow you to build and upload your own packages. This REALLY helps decouple problems.

If you say you're willing to accept a PR adding the missing alias,

I don't think I can accept a PR to openblas-feedstock since those involved in maintaining it expressed that they do not agree with the assessment that any files are missing from the released packages.

But instead I feel I'm just getting ignored

You've been pointed in the right direction in #134 (comment)

For what its worth, isuru's comments lead me directly to a small patch:
conda-forge/pytorch-cpu-feedstock#116

which I've asked for feedback from upstream developers too
pytorch/pytorch#78883

In summary:

  • things take time.
  • it really helps us if you provide full log information. Full build logs are really useful.
    • That said, isuruf guessed the problem without seeing them. I am not skilled enough to do the same :/
  • Many of us volunteering our time, please be patient.

As a maintainer of the pytorch feedstack, and having studied the suggestions made by isuru, i agree that it is a bug in pytorch, and thus, in the pytorch feedstock. I'm working with other maintainers to address it. And thus I'm going to close this issue.

It may take a few weeks to finish the rebuild given how difficult it is to build pytorch on CIs.

@hmaarrfk hmaarrfk closed this as completed Jun 4, 2022
@peastman
Copy link
Author

peastman commented Jun 5, 2022

When asked for additional information, you provided 3 lines of a log.

What I provided is exactly what I was asked for. Isuru's exact request was, "What's the error exactly?" I provided the exact error. At no point in the entire discussion did anyone ever say, "Your error message was insufficient, please post more details."

I'm not really sure if your assessment of the timeline is correct. isuruf replied in a very timely manner at the beginning of the issue.

I posted the above on May 3. More than two weeks later, on May 18, still no one had replied. I therefore posted again asking for any insight--keeping in mind I had provided all the information I had been asked for up to that point. He replied with a two word message, "See #134", linking back to an earlier one sentence message from before he asked me for the error and before I provided it.

Let me justify why I didn't reply. When asked for additional information, you provided 3 lines of a log.

Tell me precisely what information you want, and I'll be glad to provide it.

Again. many of us trust isuru's judgement on these things.

If he were participating in this discussion that might be a valid point. However, he has not said a single thing in two months, since before I posted a detailed analysis of why his suggested solution was not correct. That was nearly a month ago, and still not a single person has responded to a single one of the technical issues I raised. Not one. There is no discussion going on here. I keep trying to discuss technical issues, and all my attempts are completely ignored. If Isuru wants to join in the technical discussion, he's welcome to. If he doesn't want to that's his choice too. But his choice to ignore the discussion does not mean anything has been resolved.

You've been pointed in the right direction in #134 (comment)

And I replied with #134 (comment) explaining why that was not, in fact, the right direction. Do you agree with the technical points I raised? Disagree? If you disagree, please say so and explain why. If you agree, please say that.

For what its worth, isuru's comments lead me directly to a small patch:

I strongly recommend against that patch. It is not the right solution. It is based on a misunderstanding of the problem. I've already explained the reasons why. I could keep repeating them, or expand on them further if you want to understand the actual issues here. What you're doing there is a band-aid to paper over the fact that the OpenBLAS package is broken and violates the conventions of the Python ecosystem. In the process, it changes the binary interface for libtorch which risks breaking downstream software.

@ngam
Copy link
Contributor

ngam commented Jun 5, 2022

  • This changes the binary interface for libtorch, which risks breaking other software that uses it.

I could keep repeating them, or expand on them further if you want to understand the actual issues here.

Could you explain this binary interface part to me? (As in, explaining how it would change it and thus risk breaking software using it)

--

And, if you want an even more trivial solution, you could follow the very first response in the issue:

It's in openblas package along with the headers.

@peastman
Copy link
Author

peastman commented Jun 5, 2022

And, if you want an even more trivial solution, you could follow the very first response in the issue:

It's in openblas package along with the headers.

That's the problem. it's not the solution!

Could you explain this binary interface part to me? (As in, explaining how it would change it and thus risk breaking software using it)

Consider a downstream C++ code that links to libtorch, and also invokes BLAS routines directly. PyTorch supports many different BLAS implementations, and ensures that an appropriate one will be installed. Its build scripts contain lots of logic to locate different implementations and link to them correctly. You could repeat all that work in your own build system, but why do that when PyTorch has already done it for you? When you link to libtorch, that's guaranteed to bring in an appropriate BLAS that you can invoke with no further effort.

The proposed change would break that. It changes BLAS from public to private. Any package that expects it to be public would no longer compile.

@ngam
Copy link
Contributor

ngam commented Jun 5, 2022

Consider a downstream C++ code that links to libtorch, and also invokes BLAS routines directly. PyTorch supports many different BLAS implementations, and ensures that an appropriate one will be installed. Its build scripts contain lots of logic to locate different implementations and link to them correctly. You could repeat all that work in your own build system, but why do that when PyTorch has already done it for you? When you link to libtorch, that's guaranteed to bring in an appropriate BLAS that you can invoke with no further effort.

But we purposely aren't relying on PyTorch's logic in the feedstock, or at least, we are forcing it in a certain direction --- essentially, all MKL except OpenBLAS for osx-arm64. If I am not mistaken, we do want to set up a matrix at some point to let the user select whichever they want, but there is no point for now given the superiority of MKL and the preference of upstream for it. Anyway, even though I understand your concerns, I am not really sure I see it as a serious problem, in part because we have introduced a global pytorch pin, so packages will have to be rebuilt with each version...

... would no longer compile.

Yes, that's fine, no? Then downstream packages will fix their builds and adjust according to the proposed change.

@ngam
Copy link
Contributor

ngam commented Jun 5, 2022

I also want to add one more thing as someone who's only recently joined this community. I think it is better to keep a friendly, less adversarial, tone. I understand, and sympathize with, your frustration, but just remember everyone here is a volunteer (as far as I understand). There is a higher likelihood of change (albeit incremental and slow) if we support the collective decision-making. Not every single solution will fit neatly and perfectly with what you (or I) think is best, but as long as it is acceptable, we should all get behind it and move the ecosystem forward. We could always revisit and correct course later --- no need to stop everything here for this very single alias.

@peastman
Copy link
Author

peastman commented Jun 5, 2022

If I am not mistaken, we do want to set up a matrix at some point to let the user select whichever they want, but there is no point for now given the superiority of MKL and the preference of upstream for it.

That's a different problem than what I'm talking about. I'm not talking about adding a new feature. I'm referring to existing packages that are out there right now, that link against libtorch. Those packages expect it to publicly link BLAS, because that's what the official PyTorch distribution does. Many of those packages may not even use conda as their primary distribution mechanism. Pip is more popular, and even among conda users, the pytorch channel packages have an order of magnitude more downloads than the conda forge ones. If you break their conda packages, the developers may have little interest in fixing them, especially because the break was entirely caused by the conda-forge libraries being incompatible with the official PyTorch ones.

Yes, that's fine, no? Then downstream packages will fix their builds and adjust according to the proposed change.

Why would you break their packages? What is the benefit of doing that? It will disrupt users' workflows and create extra work for conda-forge volunteers. I certainly would not call that fine!

I think it is better to keep a friendly, less adversarial, tone.

Believe me, any adversarial tone is not intended on my part! On the contrary, I have been consistently trying to keep the discussion entirely focused on technical issues. I'm happy to participate in any collaborative process so long as that process is actually collaborative (i.e. not simply ignoring all attempts to raise technical issues).

@ngam
Copy link
Contributor

ngam commented Jun 6, 2022

Believe me, any adversarial tone is not intended on my part! On the contrary, I have been consistently trying to keep the discussion entirely focused on technical issues. I'm happy to participate in any collaborative process so long as that process is actually collaborative (i.e. not simply ignoring all attempts to raise technical issues).

I believe you. I myself unintentionally sometimes cause a scene trying to help... but I end up being counterproductive... Not saying that you are.

Anyway, how about we try to move forward? Let's not consider your specific proposal here for now, and let's see how we can fix things in the pytorch-feedstock so that your project can more smoothly and productively move forward. If we end up realizing that there is no way around this particular fix, we will come back to it in a new issue.

Good starting point is the PR you commented on, but perhaps you should start a new issue there to lay out your case again (if you'd like). For now, I think the pytorch-feedstock issues/PRs are the better place to discuss issues surrounding this.

@peastman
Copy link
Author

peastman commented Jun 6, 2022

Let me summarize what I see as the choice we're making here. Two possibilities have been proposed.

  1. Patch the PyTorch build scripts to make BLAS private.
  • This risks breaking downstream packages.
  • The change could itself be broken by future changes to PyTorch.
  1. Add the missing alias to OpenBLAS.
  • This does not break anything.
  • There is no chance of any outside change breaking it in the future.
  • It makes this package consistent with all other BLAS packages in the conda ecosystem.

Do you agree with that analysis? Are there any important issues you feel it misses?

@ngam
Copy link
Contributor

ngam commented Jun 6, 2022

Let me summarize what I see as the choice we're making here. Two possibilities have been proposed.

  1. Patch the PyTorch build scripts to make BLAS private.
  • This risks breaking downstream packages.
  • The change could itself be broken by future changes to PyTorch.
  1. Add the missing alias to OpenBLAS.
  • This does not break anything.
  • There is no chance of any outside change breaking it in the future.
  • It makes this package consistent with all other BLAS packages in the conda ecosystem.

Do you agree with that analysis? Are there any important issues you feel it misses?

I think your analysis is fine, but it should be "libopenblas" not openblas; the latter does have the missing alias you are after.

The bigger issue here is thus: Either way, we are trying to accommodate a specific fix for you. The maintainers of openblas think it is best this fix comes from pytorch. A maintainer from pytorch seems to agree that a fix in the pytorch feedstock is doable and the better route. So why not join the effort fixing this in pytorch and see how it goes? If you're not satisfied with the solution afterwards, we can discuss again.

Edit: and this should go under the dangers of the second solution: #134 (comment)

@peastman
Copy link
Author

peastman commented Jun 6, 2022

It looks like we both posted at the same moment. All I can do is give the best advice I can based on the my understanding of the problem. It's not my decision to make. I just wish someone could articulate a clear technical argument for why that decision is being made.

I think your analysis is fine, but it should be "libopenblas" not openblas

In that case, the PyTorch package should list openblas as a dependency. It doesn't.

Either way, we are trying to accommodate a specific fix for you.

On the contrary, it's the exact opposite. I can work around the problem. It's a bit annoying, but manageable. But I'd much rather fix the root problem so other users don't have to deal with it in the future. That's my goal here. I'm not demanding a fix just for me. I'm trying to improve the health of the ecosystem and eliminate future problems for other people.

@ngam
Copy link
Contributor

ngam commented Jun 6, 2022

We then agree on wanting a healthier ecosystem :)

The best technical argument is the issue with packages accidentally depending on it (though you disagreed with that earlier). Unfortunately, I am not the right person to assess this or weigh on it. But, let's take this opportunity to have you be more involved in the pytorch feedstock going forward --- how about that? 😃

@h-vetinari
Copy link
Member

You complain about the amount of time with no responses, let me add another ingredient in the mix that I suspect is affecting this discussion: you are arguing quite forcefully that you're right and everyone else is wrong. That means there are strongly diminishing returns to spend energy engaging with you, and consciously or subconsciously, people will end up engaging less. Despite that, certain people have made an effort (thanks a lot @hmaarrfk!), but your messages show no reflection on what is being said, e.g.

I strongly recommend against that patch. It is not the right solution. It is based on a misunderstanding of the problem. I've already explained the reasons why.

We agree with the patch and disagree with you, and we've also already explained the reasons why.

But to give you one more explanation for this: building an ecosystem worth of packages leads to different kind of concerns and trade-offs. Adding this alias has the potential to lead to many packages silently and wrongly linking to openblas, rather than the very carefully built blas setup in conda-forge that allows switching out blas implementations in the environment.

Two possibilities have been proposed.

  1. Patch the PyTorch build scripts to make BLAS private.
  • This risks breaking downstream packages.
  • The change could itself be broken by future changes to PyTorch.

I think you're exaggerating the impact here. Pytorch should not re-expose blas symbols, as they don't own those in any meaningful way. If packages were relying on that, they should specifically link against a blas implementation. And the patch is neither crazy intrusive nor unmaintainable, and will hopefully even be accepted upstream.

@peastman
Copy link
Author

peastman commented Jun 6, 2022

you are arguing quite forcefully that you're right and everyone else is wrong.

Could you please quote a single place in this thread where I said everyone else is wrong? I realize that may itself sound aggressive, but I feel I am being unjustly attacked, and your statement was itself extremely aggressive. I have literally being struggling for the past three months just to get anyone to reply. My complaint has never been that everyone else was wrong. At any time I would have been happy to hear a reply, even if it was one I disagreed with! If you read every post from when I opened the issue on March 3 all the way to May 31, you will see the only person I ever disagreed with was Isuru, and that's because he's the only person who ever posted anything concrete it was possible to agree or disagree with!

This discussion has gone way off the rails, and perhaps it would be best to just let it die and forget the whole thing. I believe, however, that there's a much deeper and much more serious issue: this community is neither welcoming nor respectful toward outsiders. I have witnessed it repeatedly, and heard the same complaint from many other people. Please go back to the start of this thread and just read from the beginning with an open mind! What you will see is a lot of me trying to be collaborative and open to suggestions. You will also see that I repeatedly tried to discuss technical issues and asked for guidance on the right fix, that these attempts were ignored and dismissed, and that at times they were met with personal attacks like the one quoted above.

I think it would be really useful for all of us to read through the CondaForge code of conduct. To be very blunt, this community does not live up to it. I can easily point you to other threads where you will see lots more of the same sort of behavior, but I don't think that would really be productive. I'm not interesting in who to blame for anything that's past. But can we all commit ourselves to turning this into the kind of open, welcoming, collaborative community that I think we all want it to be?

@h-vetinari
Copy link
Member

you are arguing quite forcefully that you're right and everyone else is wrong.

Could you please quote a single place in this thread where I said everyone else is wrong?

I provided an example right in my post. After all the preceding discussion,

I strongly recommend against that patch. It is not the right solution. It is based on a misunderstanding of the problem.

implies we are all misunderstanding the problem, but you understand it correctly. Similar statements are further upthread (e.g. "I saw it. It just doesn't make any sense.").

I realize that may itself sound aggressive, but I feel I am being unjustly attacked, and your statement was itself extremely aggressive. I have literally being struggling for the past three months just to get anyone to reply.

I apologise you feel attacked. My point was to try to explain to you my (subjective) point of view is why you're not getting more replies. Intentionally or not, implying repeatedly that others do not understand what they are explaining to you is just not a very... effective way of convincing anybody.

I believe, however, that there's a much deeper and much more serious issue: this community is neither welcoming nor respectful toward outsiders. I have witnessed it repeatedly, and heard the same complaint from many other people.

You're painting with a very broad brush here. Everyone of the regular contributors that I know in conda-forge is trying hard to accommodate requests and newcomers alike. That's not a claim to perfection (certainly not for myself), but we're all volunteers, and the more demanding the request, the more those ideals may suffer.

We don't get to choose how we come across online - I'm trying to tell you (and ngam was pointing you in the same direction re:adversarial) that the way you communicate on this topic makes it unappealing to keep engaging. I realise that this is frustrating; similarly on the other side of the fence (and presumably the main reason why Isuru hasn't commented again, though I should clarify that this is only my conjecture).

Again, my message was meant as clarification, not as an attack - I'm sorry if it came across like that. With that I'm going to take my leave from this thread.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Jun 9, 2022
We've some users report that they are getting symbol collisions when linking to blas.

I don't see a need to re-export the blas library symbols.

I figured I would share here for other packagers to be able to benefit too.

xref: conda-forge/pytorch-cpu-feedstock#116
xref: conda-forge/openblas-feedstock#134
Pull Request resolved: #78883
Approved by: https://github.com/ezyang
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this issue Jun 10, 2022
Summary:
We've some users report that they are getting symbol collisions when linking to blas.

I don't see a need to re-export the blas library symbols.

I figured I would share here for other packagers to be able to benefit too.

xref: conda-forge/pytorch-cpu-feedstock#116
xref: conda-forge/openblas-feedstock#134

Pull Request resolved: #78883
Approved by: https://github.com/ezyang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/221755cc7151fa59514ec4c2e52b1ab991fc07a0

Reviewed By: osalpekar

Differential Revision: D37059163

Pulled By: osalpekar

fbshipit-source-id: e1b7bc6972b707ea712b1901efadfc76806c72f9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants