Version: 0.9.0

Troubleshooting

Scheduler logs

Retrieve scheduler logs

Currently, the scheduler writes its logs to stdout/stderr. The Docker container runtime redirects these logs to a local location on the underlying node; you can read more in the document here. These logs can be retrieved with kubectl logs, for example:

# get the scheduler pod
kubectl get pod -l component=yunikorn-scheduler -n yunikorn
NAME                                  READY   STATUS    RESTARTS   AGE
yunikorn-scheduler-766d7d6cdd-44b82   2/2     Running   0          33h
# retrieve logs
kubectl logs yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn

In most cases, this command cannot retrieve all logs because the scheduler rolls its logs very quickly. To retrieve older logs, you will need to set up cluster-level logging. The recommended setup is to use fluentd to collect and persist logs on external storage, e.g. S3.
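
When the pod name is not known in advance, the two steps above can be combined. The following is a minimal sketch, assuming the label, container name, and namespace used in the commands above; the function name scheduler_logs and its argument are illustrative only and not part of YuniKorn. It relies on the standard kubectl options -o jsonpath, -c, and --since to limit the output to a recent time window.

```shell
# Sketch only: scheduler_logs is a hypothetical helper, not part of YuniKorn.
# Assumes the scheduler runs in the "yunikorn" namespace with the label and
# container name used in the commands above.
scheduler_logs() {
  since="${1:-1h}"   # time window passed to kubectl logs --since
  # Look up the name of the first pod matching the scheduler label.
  pod="$(kubectl get pod -l component=yunikorn-scheduler -n yunikorn \
    -o jsonpath='{.items[0].metadata.name}')"
  if [ -z "$pod" ]; then
    echo "no scheduler pod found" >&2
    return 1
  fi
  # Print only the log lines from the requested window.
  kubectl logs "$pod" -c yunikorn-scheduler-k8s -n yunikorn --since="$since"
}
```

For example, scheduler_logs 30m > scheduler.log would save the last 30 minutes of output to a local file. This does not replace cluster-level logging, since only logs still held on the node can be returned.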

Set Logging Level

Note: Changing the logging level requires a restart of the scheduler pod.

Stop the scheduler:

kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0

Edit the deployment config (this opens it in your default editor, such as vim):

kubectl edit deployment yunikorn-scheduler -n yunikorn

Add LOG_LEVEL to the env field of the container template. For example, setting LOG_LEVEL to 0 sets the logging level to INFO.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  ...
spec:
  template:
    ...
    spec:
      containers:
      - env:
        - name: LOG_LEVEL
          value: '0'

Start the scheduler:

kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1

Available logging levels:

Value   Logging Level
-1      DEBUG
0       INFO
1       WARN
2       ERROR
3       DPanic
4       Panic
5       Fatal
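
As an alternative to editing the deployment by hand, the same environment variable can be set with kubectl set env. The following is a minimal sketch, assuming the container name yunikorn-scheduler-k8s taken from the log commands earlier in this page; the helper names log_level_value and set_log_level are hypothetical. Changing the pod template triggers a new rollout, so the scheduler pod is still restarted.

```shell
# Sketch only: both helpers are hypothetical and not shipped with YuniKorn.

# Translate a level name into the numeric LOG_LEVEL value from the table above.
log_level_value() {
  case "$1" in
    DEBUG)  echo -1 ;;
    INFO)   echo 0 ;;
    WARN)   echo 1 ;;
    ERROR)  echo 2 ;;
    DPanic) echo 3 ;;
    Panic)  echo 4 ;;
    Fatal)  echo 5 ;;
    *)      echo "unknown logging level: $1" >&2; return 1 ;;
  esac
}

# Set LOG_LEVEL on the scheduler container (the container name is an assumption).
set_log_level() {
  value="$(log_level_value "$1")" || return 1
  kubectl set env deployment/yunikorn-scheduler -n yunikorn \
    -c yunikorn-scheduler-k8s "LOG_LEVEL=$value"
}
```

For example, set_log_level DEBUG would set LOG_LEVEL to -1 on the scheduler container.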

Pods are stuck in Pending state

If some pods are stuck in the Pending state, the scheduler could not find a node on which to allocate them. There are several possible causes:

1. None of the nodes satisfy the pod placement requirements

A pod can be configured with placement constraints such as a node selector or affinity/anti-affinity rules, or it may lack a toleration for certain node taints. To debug such issues, describe the pod:

kubectl describe pod <pod-name> -n <namespace>

The pod events will contain the predicate failures, which explain why the nodes do not qualify for allocation.

2. The queue is running out of capacity

If the queue is running out of capacity, pods will stay pending until queue resources become available. There are several ways to check whether a queue still has enough capacity for the pending pods:

1) Check the queue usage in the YuniKorn UI

If you do not know how to access the UI, you can refer to the document here. Go to the Queues page and navigate to the queue to which the job was submitted. You will be able to see the capacity left in the queue.

2) Check the pod events

Run kubectl describe pod to get the pod events. If you see an event like: Application <appID> does not fit into <queuePath> queue, the pod could not be allocated because the queue is running out of capacity.

The pod will be allocated once other pods in this queue complete or are removed. If the pod remains pending even though the queue has capacity, it may be waiting for the cluster to scale up.
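
To find affected pods without describing each one by hand, the checks above can be scripted. The following is a minimal sketch built on standard kubectl field selectors; the function names pending_pods and queue_is_full_for are hypothetical, and the grep pattern assumes the event text matches the message quoted above.

```shell
# Sketch only: hypothetical helpers built on standard kubectl field selectors.

# List the names of all pods in a namespace that are in the Pending phase.
pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Succeed if the pod has an event reporting that its application does not fit
# into the queue (message text assumed to match the example above).
queue_is_full_for() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    | grep -q "does not fit into"
}
```

Looping over the output of pending_pods <namespace> and calling queue_is_full_for <namespace> <pod-name> on each name helps separate pods waiting on queue capacity from pods with placement failures.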

Restart the scheduler

YuniKorn can recover its state upon a restart. The YuniKorn scheduler pod is deployed as a Deployment, so the scheduler can be restarted by scaling the replicas down and back up:

kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
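
The two commands above return before the new pod is ready. The following is a minimal sketch that also waits for the restart to finish using kubectl rollout status; the function name restart_scheduler is hypothetical and the 120-second timeout is an arbitrary assumption.

```shell
# Sketch only: restart_scheduler is a hypothetical helper.
# Scales the scheduler down and up as described above, then waits for the
# new pod to become ready (the timeout value is an arbitrary assumption).
restart_scheduler() {
  kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0 || return 1
  kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1 || return 1
  kubectl rollout status deployment/yunikorn-scheduler -n yunikorn --timeout=120s
}
```

The function returns a non-zero status if either scale command fails or the rollout does not complete within the timeout.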

Still got questions?

No problem! The Apache YuniKorn community will be happy to help. You can reach out to the community with the following options:

  1. Post your questions to dev@yunikorn.apache.org.
  2. Join the YuniKorn Slack workspace and post your questions to the #yunikorn-user channel.
  3. Join the community sync-up meetings and talk directly to the community members.