AdvancedCronJob: support showing the number of succeeded jobs from the last run #519

Closed
qiankunli opened this issue Jan 22, 2021 · 7 comments

qiankunli commented Jan 22, 2021

Regular acj output:

➜  ~ k get acj -o wide
NAME        SCHEDULE    TYPE           LASTSCHEDULETIME   AGE
clean-log   0 0 * * *   BroadcastJob   4h4m               3d20h

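We found that some nodes, possibly because of high load, have never run the job (the pods are in the OutOfpods state), and this went unnoticed for a long time.

It would help if the output showed the succeeded count / node count for the most recent run, so that we can tell whether the job ran on every node and act quickly when it did not. We also suggest integrating this data with kube-state-metrics to make monitoring easier.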

@qiankunli qiankunli changed the title [feature request] AdvancedCronJob: support showing the number of succeeded jobs from the last run Jan 22, 2021
@FillZpp FillZpp assigned FillZpp and unassigned jian-he Jan 22, 2021

FillZpp commented Jan 22, 2021

@qiankunli AdvancedCronJob is only responsible for creating BroadcastJobs periodically, so its output contains only schedule information. The result of each BroadcastJob run can be seen with k get bcj -o wide.


qiankunli commented Jan 22, 2021

> @qiankunli AdvancedCronJob is only responsible for creating BroadcastJobs periodically, so its output contains only schedule information. The result of each BroadcastJob run can be seen with k get bcj -o wide.

➜  ~ k get bcj -o wide
NAME                   DESIRED   ACTIVE   SUCCEEDED   FAILED   AGE
clean-log-1611273600   52        0        10          2        5h39m

DESIRED=52, SUCCEEDED=10, FAILED=2: where can I see what accounts for the difference? Are there common causes for it?
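To make the gap explicit, here is a minimal sketch of a helper that reads the table printed by k get bcj and reports, per BroadcastJob, the succeeded/desired ratio and how many pods are neither active, succeeded, nor failed. The function name summarise_bcj is hypothetical, and it assumes the column order shown above (NAME, DESIRED, ACTIVE, SUCCEEDED, FAILED, AGE). It only quantifies the gap; it does not explain why those pods are missing.

```shell
# Hypothetical helper: summarise `kubectl get bcj` table output read from stdin.
# Assumes columns: NAME DESIRED ACTIVE SUCCEEDED FAILED AGE (as in the output above).
summarise_bcj() {
  awk 'NR > 1 {
    # unaccounted = desired - active - succeeded - failed
    printf "%s succeeded=%d/%d failed=%d unaccounted=%d\n",
           $1, $4, $2, $5, $2 - $3 - $4 - $5
  }'
}

# Usage against a cluster (not executed here):
#   kubectl get bcj | summarise_bcj
```

For the run above this would report 10 of 52 succeeded and 40 pods unaccounted for.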


FillZpp commented Jan 22, 2021

@qiankunli YAML please. k get bcj clean-log-1611273600 -o yaml

@qiankunli

k get bcj clean-log-1611273600 -o yaml

apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  annotations:
    apps.kruise.io/scheduled-at: "2021-01-22T00:00:00Z"
  creationTimestamp: "2021-01-22T00:00:00Z"
  generation: 1
  managedFields:
  - apiVersion: apps.kruise.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:apps.kruise.io/scheduled-at: {}
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:completionPolicy:
          .: {}
          f:type: {}
        f:failurePolicy: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:creationTimestamp: {}
            f:labels:
              .: {}
              f:broadcastjob-controller-uid: {}
              f:broadcastjob-name: {}
          f:spec:
            .: {}
            f:containers: {}
            f:dnsPolicy: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext: {}
            f:terminationGracePeriodSeconds: {}
            f:volumes: {}
      f:status:
        .: {}
        f:active: {}
        f:completionTime: {}
        f:conditions: {}
        f:desired: {}
        f:failed: {}
        f:phase: {}
        f:startTime: {}
        f:succeeded: {}
    manager: manager
    operation: Update
    time: "2021-01-22T00:01:17Z"
  name: clean-log-1611273600
  namespace: default
  ownerReferences:
  - apiVersion: apps.kruise.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: AdvancedCronJob
    name: clean-log
    uid: 557e4e1a-e842-488a-bf5b-aa708d7c0c71
  resourceVersion: "215856730"
  selfLink: /apis/apps.kruise.io/v1alpha1/namespaces/default/broadcastjobs/clean-log-1611273600
  uid: 0d7551e0-8ddb-42a5-8266-229d85949cbb
spec:
  completionPolicy:
    type: Always
  failurePolicy:
    type: FailFast
  parallelism: 2147483647
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - /auto-del-1-days-ago.sh
        image: xx/clean-log:20210119
        imagePullPolicy: Always
        name: clean-log
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/app
          name: log
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/log/app
          type: ""
        name: log
status:
  active: 0
  completionTime: "2021-01-22T00:01:17Z"
  conditions:
  - lastProbeTime: "2021-01-22T00:01:17Z"
    lastTransitionTime: "2021-01-22T00:01:17Z"
    message: failure policy is FailurePolicyTypeFailFast and failed pod is found
    reason: Failed
    status: "True"
    type: Failed
  desired: 52
  failed: 2
  phase: failed
  startTime: "2021-01-22T00:00:01Z"
  succeeded: 10

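This acj runs the clean-log shell script bundled in the clean-log image at 8 a.m. every day to delete logs older than xx days from the physical machines. I found that some pods report OutOfpods (probably pod creation failed because the machine was heavily loaded), which marks those pods as failed. Since FailurePolicyType defaults to FailFast, does that cause the whole bj to fail?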


FillZpp commented Jan 22, 2021

@qiankunli Yes. With the default FailFast, the whole BroadcastJob is counted as failed as soon as one job fails. You can explicitly set this policy to Continue in broadcastJobTemplate.
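As a rough sketch of what that could look like for the clean-log AdvancedCronJob: the container, volume, and schedule values below are copied from the BroadcastJob YAML and acj output earlier in this thread, while the surrounding AdvancedCronJob layout (spec.template.broadcastJobTemplate) is assumed rather than taken from the reporter's actual manifest, so please check it against your own object.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
metadata:
  name: clean-log
spec:
  schedule: "0 0 * * *"
  template:
    broadcastJobTemplate:
      spec:
        completionPolicy:
          type: Always
        failurePolicy:
          type: Continue   # set explicitly; the default is FailFast
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: clean-log
              image: xx/clean-log:20210119
              command:
              - /auto-del-1-days-ago.sh
              volumeMounts:
              - mountPath: /var/log/app
                name: log
            volumes:
            - name: log
              hostPath:
                path: /var/log/app
```

With Continue, a failed pod on one node should no longer mark the whole BroadcastJob as failed, so the remaining nodes still get their run.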

@qiankunli

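> @qiankunli Yes. With the default FailFast, the whole BroadcastJob is counted as failed as soon as one job fails. You can explicitly set this policy to Continue in broadcastJobTemplate.

@FillZpp Thanks for the pointer, I'll give it a try.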


FillZpp commented Jan 22, 2021

@qiankunli If it is convenient, please register your usage in #289 so that we can collect feedback later.
