
Factory often has one extra glidein job running #397

Open
osg-cat opened this issue Feb 8, 2024 · 3 comments
Labels
BUG For BUGS factory for affected component Low Low priority osg OSG stakeholder

Comments


osg-cat commented Feb 8, 2024

Describe the bug
I have often observed that GlideinWMS exceeds its per-entry glidein maximum by one glidein job. This is especially apparent when we add a new site to the OSPool, because we always start with a cap of 2 glideins. We also set num_factories = 2, because we now have two production factories.

To Reproduce
We set our glidein configuration in a YAML file, which is then converted to regular GlideinWMS configuration. Here is the relevant YAML fragment:

    num_factories: 2
    limits:
      entry:
        glideins: 2
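For context, this YAML gets translated into per-entry limits in each factory's configuration. A sketch of what the converted fragment might look like on each of the two factories (attribute names assumed from the standard factory schema, entry name illustrative; note the cap of 2 is split across the 2 factories):

```xml
<!-- Hypothetical converted fragment on each factory (not the actual output
     of our conversion tool). -->
<entry name="OSG_US_UNR-CC-CE1" enabled="True">
  <config>
    <max_jobs>
      <!-- glideins: 2 split across num_factories: 2 gives 1 per factory -->
      <per_entry glideins="1" held="1" idle="1"/>
    </max_jobs>
  </config>
</entry>
```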

Expected behavior
For a case like above, I expect each factory to run at most 1 glidein job on the entry, for a total of up to 2 glidein jobs across the 2 factories.
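The expected arithmetic, as I understand it, is a simple even split of the entry cap across the factories (a sketch; the division rule is my assumption, but it matches the PerEntryMaxGlideins value quoted later in this thread):

```python
def per_factory_cap(entry_glideins: int, num_factories: int) -> int:
    """Assumed rule: the per-entry glidein cap is split evenly across
    the configured factories."""
    return entry_glideins // num_factories

# With glideins: 2 and num_factories: 2, each factory should run at most 1
# glidein, for a total of at most 2 across both factories.
print(per_factory_cap(2, 2))  # 1
```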

Screenshots
Here is typical output from a Python script I use to check on a site:

PILOTS IN FACTORY ACCESS POINTS
+------------------+---------------------+------+-----+-------+-------+------+-------+-------+
| Schedd Name      | Frontend Name       | Idle | Run | Remov | Compl | Held | TxOut | Suspd |
+------------------+---------------------+------+-----+-------+-------+------+-------+-------+
| [email protected] | OSG_OSPool:frontend |    0 |   2 |     0 |     0 |    0 |     0 |     0 |
| [email protected] | OSG_OSPool:frontend |    0 |   2 |     0 |     0 |    0 |     0 |     0 |
+------------------+---------------------+------+-----+-------+-------+------+-------+-------+

  • Contacted 15 Factory Access Points

This site had exactly the YAML configuration shown above.

Info (please complete the following information):

  • GlideinWMS version: OSG factories as of 8 Feb 2024 (but also, this has been going on for at least 1 year)
  • Python version:
  • OS version:
  • HTCondor version:
  • Priority: Low
  • Stakeholders: OSG, esp. the OSG Campus Coordinator, who has to explain this behavior to admins!
  • Components: Factory

Additional context
Just reach out to me (Tim C.) by email or Slack for any extra details.

@github-actions github-actions bot added BUG For BUGS factory for affected component Low Low priority osg OSG stakeholder labels Feb 8, 2024
mmascher (Contributor) commented Feb 8, 2024

I don't think this is related to the reconfigure; the limits are written correctly in job.descript:

PerEntryMaxGlideins     1
PerEntryMaxIdle         1
PerEntryMaxHeld         1
DefaultPerFrontendMaxGlideins   1
DefaultPerFrontendMaxIdle       1
DefaultPerFrontendMaxHeld       1

mmascher (Contributor) commented Feb 8, 2024

Could it be because the limits are applied per frontend group?

[2024-02-07 08:44:51,249] INFO: Client OSPool.main (secid: OSG_OSPool_frontend) schedd status {1: 0}
[2024-02-07 08:44:51,249] INFO: Using v3+ protocol and credential HYJDWWIN
[2024-02-07 08:44:51,401] INFO: Submitted 1 glideins to [email protected]: [(780578, 0)]
[2024-02-07 08:44:51,401] INFO: Submitted 1 glideins
[2024-02-07 08:44:51,402] INFO: Checking downtime for frontend OSG_OSPool security class: frontend (entry OSG_US_UNR-CC-CE1).
[2024-02-07 08:44:51,405] INFO: frontend_token supplied, writing to /var/lib/gwms-factory/client-proxies/user_feosgospool/glidein_gfactory_instance/credential_OSPool.main-canary_OSG_US_UNR-CC-CE1.idtoken
[2024-02-07 08:44:51,406] INFO: frontend_scitoken supplied, writing to /var/lib/gwms-factory/client-proxies/user_feosgospool/glidein_gfactory_instance/credential_OSPool.main-canary_OSG_US_UNR-CC-CE1.scitoken
[2024-02-07 08:44:51,408] INFO: Client OSPool.main-canary (secid: OSG_OSPool_frontend) requesting 1 glideins, max running 1, idle lifetime 864000, remove excess 'NO', remove_excess_margin 0
[2024-02-07 08:44:51,408] INFO:   Decrypted Param Names: ['SecurityClass', 'ScitokenId', 'SecurityName', 'OSG_US_UNR-CC-CE1.idtoken', 'frontend_scitoken']
[2024-02-07 08:44:51,410] INFO: Client OSPool.main-canary (secid: OSG_OSPool_frontend) schedd status {1: 0}
[2024-02-07 08:44:51,410] INFO: Using v3+ protocol and credential HYJDWWIN
[2024-02-07 08:44:51,594] INFO: Submitted 1 glideins to [email protected]: [(780579, 0)]
[2024-02-07 08:44:51,595] INFO: Submitted 1 glideins

For the factory, each frontend group is in reality a different frontend. In the case above, the factory submitted one glidein for the main group and one for the main-canary group.
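A sketch of that per-client accounting (function and names are illustrative, not actual factory code): if the per-frontend cap is enforced per group rather than per entry, the total scales with the number of groups.

```python
def total_submitted(group_requests: dict[str, int], per_frontend_max: int) -> int:
    """Illustrative accounting: each frontend group is treated as an
    independent client, so the cap applies to each group separately."""
    return sum(min(requested, per_frontend_max)
               for requested in group_requests.values())

# Two groups (main and main-canary) each requesting 1 glidein, with a
# per-frontend cap of 1, yields 2 glideins on this entry from one factory.
print(total_submitted({"OSPool.main": 1, "OSPool.main-canary": 1}, 1))  # 2
```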

mmascher (Contributor) commented Feb 8, 2024

@rynge for my education, what is the difference between main and main-canary?

We need to be careful here. As confusing as this sounds, it might be the correct behavior. Groups can be different VOs submitting 1 test glidein each. So in the end you get 2 glideins...

On the other hand, if we set a limit of 100 in the factory, I would not expect the factory to submit 200 glideins. I need to double-check what the factory does in this case.
