With customers running denser clusters on bigger machines, we have lately been seeing many complaints about CPU contention and high CPU ready time, along with confusion about Hyper-Threading and vNUMA. In this paper I want to address a few of these issues. These are the topics I intend to cover:
- CPU scheduling
- Strict Co-scheduling
- Relaxed Co-scheduling
- CPU skew in virtual machines
- Reducing the number of vCPUs
- Hyper-Threading
- Right sizing your Virtual Machines
- Understanding NUMA
- Understanding Virtual Machine Geometry
- Myths around having an odd number of CPUs
Let us start with how CPU scheduling works for virtual machines. For this example, assume we have 1 CPU socket with 4 physical cores, and ignore Hyper-Threading for now. We also have 2 virtual machines: VM1 with 4 vCPUs and VM2 with 2 vCPUs.
Strict Co-Scheduling :- The scheduler places VMs on the CPU per time slot. For now, consider a slot to be a unit of time during which VMs can be scheduled.
During the first time slot, the scheduler allocates all 4 cores to VM1, which has 4 vCPUs. As a result, VM2 with 2 vCPUs cannot be scheduled and has to go into a wait state. During this wait state the VM is ready to run but has no CPU cores to run on. This is called “CPU Ready”. The same happens to VM1 during the second time slot, while VM2 is running. As a result both VMs spend time waiting, and VM performance takes a hit.
This is a classic example of strict co-scheduling. Notice that during time slots T2 and T4, 2 cores are wasted. This happens because ESXi needs a total of 4 free cores to schedule VM1.
If there were a VM3 with 2 vCPUs, it could have been scheduled on the idle cores during these time slots.
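To make the time slots concrete, here is a small Python sketch of the example above. It is only a toy model and not how ESXi is actually implemented: the `VM` class, the `strict_coschedule` function and the "longest waiting VM goes first" rule are my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpus: int
    ready_slots: int = 0  # slots spent ready to run but without cores


def strict_coschedule(vms, cores, slots):
    """Toy strict co-scheduler: a VM runs only if ALL its vCPUs fit at once."""
    timeline = []
    for _ in range(slots):
        free = cores
        running = []
        # Assumption: VMs that have waited longest pick first.
        # sorted() is stable, so ties keep the original list order.
        for vm in sorted(vms, key=lambda v: -v.ready_slots):
            if vm.vcpus <= free:
                free -= vm.vcpus
                running.append(vm.name)
        # Every VM that did not get cores accumulates "CPU Ready" time.
        for vm in vms:
            if vm.name not in running:
                vm.ready_slots += 1
        timeline.append((running, free))
    return timeline


timeline = strict_coschedule([VM("VM1", 4), VM("VM2", 2)], cores=4, slots=4)
for slot, (running, idle) in enumerate(timeline, start=1):
    print(f"T{slot}: running={running} idle_cores={idle}")
```

With VM1 and VM2 only, the two VMs alternate and slots T2 and T4 each leave 2 cores idle. Adding a hypothetical `VM("VM3", 2)` to the list fills those idle cores, which matches the point made above.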
Relaxed Co-Scheduling :- With the relaxed co-scheduler, a virtual machine claims whatever CPU cores are available. Even if it needs more, it goes ahead and keeps running on what it has. As a result, some of the vCPUs in the VM make progress more slowly than others.
The application is not aware of this, as it expects all cores to run at the same speed. While this approach does not waste time slots or cores, it introduces a new problem, called “CPU Skew”.
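The skew can be pictured with a few lines of Python. This is a simplified sketch under my own assumptions: VM1 gets only 2 of its 4 vCPUs scheduled per slot, and the same two vCPUs always win the cores. The function name `skew_after` is made up for this example.

```python
def skew_after(slots, vcpus=4, cores_granted=2):
    """Return per-vCPU progress and the skew after `slots` time slots."""
    progress = [0] * vcpus
    for _ in range(slots):
        # Assumption: the first `cores_granted` vCPUs always get the cores,
        # so the remaining vCPUs make no progress at all.
        for i in range(cores_granted):
            progress[i] += 1
    # Skew here is the gap between the fastest and the slowest vCPU.
    return progress, max(progress) - min(progress)


progress, skew = skew_after(5)
print(progress, skew)  # [5, 5, 0, 0] 5
```

In this model the gap between the running and the waiting vCPUs grows by one with every slot, which is the skew the guest application never gets to see.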
Co-Stop :- If this continues, vCPUs 3 and 4 keep falling further behind. If the skew goes above a certain limit, ESXi starts holding the VM's vCPUs back until the lagging ones catch up. This is called co-stop, and it also affects CPU performance.
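Here is a rough Python sketch of that idea. The `skew_limit` of 3 slots is a made-up number, not a real ESXi threshold, and the rule that only the vCPUs that ran ahead are held back is an assumption of this toy model.

```python
def run_with_costop(slots, vcpus=4, cores_granted=2, skew_limit=3):
    """Toy model: vCPUs too far ahead of the slowest sibling are co-stopped."""
    progress = [0] * vcpus
    costopped = 0  # counted in vCPU-slots
    for _ in range(slots):
        slowest = min(progress)
        # A vCPU may run only while its lead over the slowest is under the limit.
        eligible = [i for i in range(vcpus) if progress[i] - slowest < skew_limit]
        costopped += vcpus - len(eligible)
        # Assumption: lower-numbered vCPUs win the cores when they are eligible.
        for i in eligible[:cores_granted]:
            progress[i] += 1
    return progress, costopped


progress, costopped = run_with_costop(8)
print(progress, costopped)  # [5, 5, 3, 3] 6
```

Over 8 slots the leading vCPUs are held back in three of them, so they finish with 5 units of progress instead of 8. That lost time is the performance cost of co-stop in this simplified picture.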
Reducing the Number of vCPU :- One way to solve the issue is to reduce the number of vCPUs. Let's say VM1 with 4 vCPUs has an average usage of around 30%. Halving the number of vCPUs roughly doubles the average CPU usage, to around 60%, which is completely acceptable. The VM still performs at the same speed, but in this example it no longer accumulates “CPU ready” time, because VM1 and VM2 now fit on the 4 cores at the same time.
No cores are wasted either.
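The arithmetic behind this is simple enough to write down. The sketch below assumes the workload's total CPU demand stays the same and spreads evenly over the remaining vCPUs, which is an idealization. The function name is my own.

```python
def utilization_after_resize(current_vcpus, avg_utilization_pct, new_vcpus):
    """Estimate average utilization (%) after changing the vCPU count."""
    # Total demand in "vCPU-percent", e.g. 4 vCPUs at 30% = 120.
    demand = current_vcpus * avg_utilization_pct
    # Assumption: the same demand is spread evenly over the new vCPU count.
    return demand / new_vcpus


print(utilization_after_resize(4, 30, 2))  # 60.0
```

If the estimate comes out close to or above 100%, the VM genuinely needs those vCPUs and reducing them would hurt rather than help.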