Admin pages frequently come up blank white for custom hubs cloud clients. #4208

Closed
MegaMotion opened this issue Apr 29, 2021 · 30 comments
Assignees
Labels
bug jira-hubs needs triage For bugs that have not yet been assigned a fix priority

Comments

@MegaMotion

MegaMotion commented Apr 29, 2021

Bug can be erratic and does not affect everyone, but those affected see nothing but a blank white screen instead of admin pages, and very few to no messages in the browser logs.

To reproduce: deploy a custom client, and you may or may not have this problem.

When active, the bug affects all browsers equally.

We experienced the issue from the point we installed our current stack several weeks ago all the way up until a couple of days ago, not immediately but a couple of days after installing the new hubs cloud client. At one point it just miraculously fixed itself, and worked for a day or two, but today it has gone back to a blank white screen, and several other users are reporting the problem now as well.

┆Issue is synchronized with this Jira Task

@MegaMotion MegaMotion added bug needs triage For bugs that have not yet been assigned a fix priority labels Apr 29, 2021
@blairmacintyre

Ditto. We just installed a custom client and are seeing the same thing. We had been running a custom client based on the previous client code, and ported our small changes to this one. The admin pages work fine without the custom client, and everything else in the client seems to work (Spoke, rooms, etc.). It's just the admin pages. Signing out and back in doesn't help.

@mattrossman
Contributor

In our case, the admin page is being served an empty JavaScript bundle, even though the expected 3 MB build file appears in dist/ locally. Perhaps something is going wrong during the deploy script's upload process. After an npm run undeploy, the admin page becomes visible; it's only npm run deploy that doesn't work.

@blairmacintyre

FWIW, /hab/svc/hubs/var/dist/pages/admin.html (Is this where it's served from?) is small.

app root@optimistic-troll:/hab/svc/hubs# ls -la var/dist/pages/
total 128
drwxr-xr-x 2 hab hab  4096 Apr 29 16:07 .
drwxr-xr-x 6 hab hab  4096 Apr 29 16:07 ..
-rw-r--r-- 1 hab hab  2274 Apr 29 16:07 admin.html
-rw-r--r-- 1 hab hab  2678 Apr 29 16:07 avatar.html
-rw-r--r-- 1 hab hab  2987 Apr 29 16:07 cloud.html
-rw-r--r-- 1 hab hab  3030 Apr 29 16:07 discord.html
-rw-r--r-- 1 hab hab 61479 Apr 29 16:07 hub.html
-rw-r--r-- 1 hab hab  1578 Apr 29 16:07 hub.service.js
-rw-r--r-- 1 hab hab  3040 Apr 29 16:07 index.html
-rw-r--r-- 1 hab hab  3149 Apr 29 16:07 link.html
-rw-r--r-- 1 hab hab  3388 Apr 29 16:07 scene.html
-rw-r--r-- 1 hab hab 11038 Apr 29 16:07 schema.toml
-rw-r--r-- 1 hab hab  2149 Apr 29 16:07 signin.html
-rw-r--r-- 1 hab hab  2154 Apr 29 16:07 verify.html
-rw-r--r-- 1 hab hab  2309 Apr 29 16:07 whats-new.html

At the bottom of admin.html it's trying to load a script at <script type="text/javascript" src="https://gt-ael-aq-assets.aelatgt-internal.net/hubs/assets/js/admin-f0d3e2008bc2c665d632.js"></script></body>

That script exists, and is fairly large

app root@optimistic-troll:/hab/svc/hubs# ls -l var/dist/assets/js/admin-f0d3e2008bc2c665d632.js
-rw-r--r-- 1 hab hab 3240496 Apr 29 16:07 var/dist/assets/js/admin-f0d3e2008bc2c665d632.js

So it's being uploaded, it's just not "being gotten".

@blairmacintyre

And the response from the server for that file is 200, happy, but it gets back content-length 0.
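
For anyone who wants to check for this symptom from a script rather than the browser dev tools, here is a minimal sketch. The helper names `isEmptySuccess` and `checkAsset` are made up for illustration, and `checkAsset` assumes Node 18+ for the global `fetch`. Some servers omit `content-length` on HEAD responses, in which case this reports nothing.

```javascript
// Sketch: flag a "successful" response that claims an empty body.
// isEmptySuccess and checkAsset are hypothetical helper names.
function isEmptySuccess(status, contentLength) {
  // A missing header tells us nothing, so do not treat it as empty.
  if (contentLength === null || contentLength === undefined || contentLength === "") {
    return false;
  }
  const length = Number(contentLength);
  return status >= 200 && status < 300 && length === 0;
}

// Assumes Node 18+ (global fetch). Pass the script URL taken from admin.html.
async function checkAsset(url) {
  const res = await fetch(url, { method: "HEAD" });
  const contentLength = res.headers.get("content-length");
  return {
    status: res.status,
    contentLength,
    empty: isEmptySuccess(res.status, contentLength)
  };
}
```

Calling `checkAsset` with the admin script URL and logging the result should show `empty: true` when the server answers 200 with content-length 0.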

@MegaMotion
Author

Same thing here: if I view source, I am getting an html page, but the body contains nothing but a ui-root div and a call to that admin js file, which does not load.

@blairmacintyre

Given that I see the JS file on my server at the right place, but if I try to access that URL directly in the browser it returns nothing, is it possibly a server issue or some sort of cache/CDN problem?

@mattrossman
Contributor

@blairmacintyre There are two different .js files at play here. The file you showed that exists on the server is accessible by direct URL here: admin-f0d3e2008bc2c665d632.js. However, the admin page that I see is requesting a different script, which produces the 0-byte response: admin-8ad4ac3d365a78faa8cc.js.

The latter file is the "correct" one which was built locally and supposedly uploaded, but it appears the server is still holding an old script file. Maybe if you try manually cleaning the files on the server and then re-deploying, it will accept the new script?

> At the bottom of admin.html it's trying to load a script at <script type="text/javascript" src="https://gt-ael-aq-assets.aelatgt-internal.net/hubs/assets/js/admin-f0d3e2008bc2c665d632.js"></script>

This is the only part that confuses me, I'm seeing a different script requested by the current admin page. Could you double check this?
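
One way to double check this is to compare the script name that admin.html references against what is actually sitting in the dist folder. A rough sketch, assuming the pages/ and assets/js/ layout shown in the directory listing earlier in this thread; `findAdminScript` and `checkAdminBundle` are hypothetical helper names, not part of the Hubs codebase:

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: find the hashed admin bundle referenced by an HTML page.
function findAdminScript(html) {
  const match = html.match(/<script[^>]*\ssrc="([^"]*\/(admin-[0-9a-f]+\.js))"/);
  return match ? { url: match[1], fileName: match[2] } : null;
}

// Hypothetical helper: report whether that bundle exists under
// distDir/assets/js and how large it is (-1 means the file is missing).
function checkAdminBundle(htmlPath, distDir) {
  const script = findAdminScript(fs.readFileSync(htmlPath, "utf8"));
  if (!script) return { found: false };
  const bundlePath = path.join(distDir, "assets", "js", script.fileName);
  const size = fs.existsSync(bundlePath) ? fs.statSync(bundlePath).size : -1;
  return { found: true, fileName: script.fileName, size };
}
```

Running `checkAdminBundle` against both the local build and the copy on the server would show whether the two admin.html files point at different hashes.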

@MegaMotion
Author

Hm, that is an interesting clue. I can see in my own example that the script being called from my admin.html is zero bytes in size, in my S3 bucket. I can also see that I had a 3M version of this file on January 26th, the last time the admin pages worked, and I have many other versions that are zero bytes, from all the times it has not worked.

So clearly it is the process of creating this file that is dropping the ball.
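
A quick pre-upload sanity check along these lines might catch the zero-byte case before it reaches the bucket. This is only a sketch: `findEmptyFiles` is a hypothetical helper, and it will not catch files that are truncated but non-empty.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: recursively collect every zero-byte file under dir.
function findEmptyFiles(dir) {
  const empty = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      empty.push(...findEmptyFiles(fullPath));
    } else if (entry.isFile() && fs.statSync(fullPath).size === 0) {
      empty.push(fullPath);
    }
  }
  return empty;
}
```

In a deploy script this could run on ./dist just before packaging, aborting if the returned list is not empty.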

@blairmacintyre

@mattrossman hmmm. I'm confused now too. The dates on these files are correct in that directory on the server, based on when we deployed the modified client, but the admin.html file we are getting (that you see, Matt, which I also see, now that I look) is different from the admin.html file on the server.

Am I not looking in the right place? Is there some sort of cache on the server that's stuck?

@MegaMotion
Author

Well, my only current theory is the New Moon tonight, but... my admin pages are working again for the moment. :-) Anybody else see anything different here today?

@blairmacintyre

Mine started working; I rebuilt and re-uploaded. No idea why it failed the first time.

@MegaMotion
Author

Whoops, back to white pages again. :-\

@johnshaughnessy
Contributor

I've seen this happen as a result of having "invalid" data in local storage. We can probably handle this case more gracefully.
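
As a rough illustration of what "more gracefully" could look like, assuming "invalid" means a stored value that fails to parse as JSON (an assumption, the thread does not say): the helper name `readStoredJson` and any key passed to it are hypothetical, and this is not how the Hubs client currently reads its store.

```javascript
// Hypothetical helper: read a JSON value from a Storage-like object.
// If the stored text does not parse, drop the bad entry and use the fallback
// instead of letting the exception blank the page.
function readStoredJson(storage, key, fallback) {
  const raw = storage.getItem(key);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw);
  } catch (err) {
    storage.removeItem(key);
    return fallback;
  }
}
```

In a browser this would be called with `window.localStorage` as the first argument.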

@MegaMotion
Author

Well, for anyone else cursed with this affliction, I just discovered a workaround! Deploying a clean, fresh client does not do anything for me, however "npm run undeploy" DOES restore my admin pages! Obviously, this is not a fix, since it abandons all my custom client work, but it enables making changes to admin and then redeploying the custom client afterwards.

@rawnsley
Contributor

rawnsley commented Jan 6, 2022

I'm also experiencing this "white admin page" issue. The pathology for me is that admin.js is truncated and stops at some random point within the file. This point is different each time.

The root cause for me is here in the deploy script. The contents of admin/dist are copied to the top-level dist folder, but the ncp function invokes its callback multiple times and the first time is often before the copy is complete. These incomplete files are then bundled up into the tar file for distribution. Here is the function with some console logging:

...
  console.log("NCP BEFORE");
  await new Promise(res => {
    ncp("./admin/dist", "./dist", err => {
      console.log("NCP CALLBACK");
      if (err) {
        console.error(err);
        process.exit(1);
      }

      res();
    });
  });
  step.text = "Preparing Deploy.";
  console.log("NCP AFTER");
...

The output is typically something like this:

NCP BEFORE
NCP CALLBACK
NCP AFTER
NCP CALLBACK
NCP CALLBACK

Multiple callbacks are NOT part of the advertised functionality and I can confirm (using fs.stat) that the file copy is not always complete after the first callback.
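
For anyone who wants to repeat the fs.stat check, a sketch along these lines compares every file in the source tree against its copy. `findIncompleteCopies` is a hypothetical helper written for this comment, not the exact check used above, and it only compares sizes, not contents.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: list files whose copy in destDir is missing (actual: -1)
// or differs in size from the original in srcDir.
function findIncompleteCopies(srcDir, destDir) {
  const problems = [];
  for (const entry of fs.readdirSync(srcDir, { withFileTypes: true })) {
    const src = path.join(srcDir, entry.name);
    const dest = path.join(destDir, entry.name);
    if (entry.isDirectory()) {
      problems.push(...findIncompleteCopies(src, dest));
    } else if (entry.isFile()) {
      const expected = fs.statSync(src).size;
      const actual = fs.existsSync(dest) ? fs.statSync(dest).size : -1;
      if (actual !== expected) problems.push({ file: dest, expected, actual });
    }
  }
  return problems;
}
```

Calling it with "./admin/dist" and "./dist" right after the first callback fires should list the truncated files when the problem occurs.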

This fits with the pathology:

  • The problem only impacts admin pages because the client pages are already there
  • admin.js is the largest file in that folder and so most likely to be truncated
  • The problem has got worse as admin.js has grown in size; in fact I triggered it months ago with an accidental include that ballooned the size of the file
  • Perhaps the problem is also getting more common because CPUs are getting faster and outpacing the I/O? This last one is a guess based on the fact that the problem happens 100% of the time on my MacBook Pro and never happened on my previous computer.

Inserting a pause after the copy command makes the problem go away, but obviously that's a band-aid solution. ncp hasn't been touched in a long time and should be replaced with something more modern and less flaky.
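
One possible replacement, sketched here as an assumption rather than a tested patch to the deploy script, is Node's built-in recursive copy. `fs.promises.cp` was added in Node 16.7 and is flagged experimental in older release lines, so this only applies if the deploy environment runs a recent enough Node. `copyAdminDist` is a hypothetical wrapper name.

```javascript
const fs = require("fs");

// Hypothetical wrapper around Node's built-in recursive copy.
// The returned promise settles once, after the whole tree has been copied,
// so there is no callback that can fire early.
async function copyAdminDist(src = "./admin/dist", dest = "./dist") {
  await fs.promises.cp(src, dest, { recursive: true });
}
```

In the deploy script, `await copyAdminDist();` would stand in for the promise wrapped around ncp.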

@daCking15

@rawnsley By "inserting a pause after the copy command", do you mean something like the following:
[image not preserved]

@daCking15

I also tried this, but no luck so far:
[image not preserved]

@daCking15

I can also confirm that this wasn't an issue on my older computer (i3), but has been on newer ones (i5, i7, M1)

@rawnsley
Contributor

@daCking15

Something like this to give the ncp command the time it needs to actually finish, which is probably a fraction of a second:

  ...
  await new Promise(res => {
    ncp("./admin/dist", "./dist", err => {
      if (err) {
        console.error(err);
        process.exit(1);
      }

      res();
    });
  });
  step.text = "Preparing Deploy.";

  // HACK TO WORK AROUND NCP BEHAVIOUR
  await new Promise(res => setTimeout(res, 5000));

  step.text = "Packaging Build.";
  tar.c({ sync: true, gzip: true, C: path.join(__dirname, "..", "dist"), file: "_build.tar.gz" }, ["."]);
  step.text = `Uploading Build ${buildEnv.BUILD_VERSION}.`;
  ...

I haven't had the problem since making this change and the one time I did have a problem was because I had accidentally reverted it.

@markusTraber
Contributor

I've had the same issue since the April Hubs Cloud update. @rawnsley's hack got it working for me now, thank you! :)

@msalafia

msalafia commented Apr 8, 2022

@rawnsley you saved my day

@mattrossman
Contributor

Thanks for identifying that, I was having this issue again with the April release and your workaround fixed it.

Looks like a known issue with ncp:
AvianFlu/ncp#143

> ncp hasn't been touched in a long time and should be replaced with something more modern and less flaky.

Here is another package with similar functionality and popularity that has been updated more recently; perhaps it would be a more reliable option. It does have several dependencies, though.
https://www.npmjs.com/package/cpy

@takahirox
Contributor

takahirox commented Apr 13, 2022

Honestly, I'm not really familiar with the deploy script yet, but it sounds like @rawnsley's workaround should go into core, because we often get reports about this problem and the workaround seems to actually resolve it. The workaround is also very simple.

We may be able to either seek a more proper way to detect the copy completion, try to fix ncp, or try other libs later if possible and needed.

What do you think? @netpro2k @brianpeiris

@rawnsley
Contributor

> Honestly, I'm not really familiar with the deploy script yet, but it sounds like @rawnsley's workaround should go into core, because we often get reports about this problem and the workaround seems to actually resolve it. The workaround is also very simple.
>
> We may be able to either seek a more proper way to detect the copy completion, try to fix ncp, or try other libs later if possible and needed.
>
> What do you think? @netpro2k @brianpeiris

I've submitted a trivial PR in case you want to include it temporarily.

@jbshin-gemiso

@rawnsley

This was also very helpful for me.
Have a nice day!

@brianpeiris brianpeiris self-assigned this Apr 16, 2022
@brianpeiris
Contributor

Thanks for the detailed investigation and workaround @rawnsley, and for the alternative @mattrossman. I'll see if I can fix this more permanently.

@brianpeiris
Contributor

Alright, I think #5365 is a good fix, though I wasn't able to reproduce the issue myself, so that might require verification from the community. But, I think it's solid enough to ship.

Sorry it took so long to get to this. Thanks to @takahirox for bringing attention to it.

@brianpeiris
Contributor

brianpeiris commented May 9, 2022

I'm going to mark this as closed, since #5365 has been merged into our master branch, though it will take a while to be released to Hubs Cloud proper. Keep an eye on the changelog to watch for that. In the meantime, if you are running into this issue, you should be able to cherry-pick the changes from #5365 into your Hubs Cloud fork directly.

@Dayk0

Dayk0 commented May 12, 2022

Yes. The white page still occurs on the "master" branch of the GitHub repository. You have to upload the files directly to the 'hubs-cloud' branch to avoid this problem.

@wswoodruff

This may or may not be related, but I've seen erratic white screens before when making requests to reticulum in rapid succession.

Ret has a rate limiter available as a plug and it looks like lots of files pull it in. I'm pretty sure the rate limiter is what was causing this type of error for me.

Permalink: https://github.com/mozilla/reticulum/blob/239742e27c019f35ac50b939b307144b308fd3a7/lib/ret_web/rate_limit.ex
