Monday, February 9, 2015

Multiple undeleted VM snapshots can kill your ESXi (Heap COW cannot expand)

Issue: I received an alert that a few VMs had rebooted and a few others were hung.

Findings: On checking vCenter, I found that all of these VMs were running on the same host, but ESXi itself seemed to be running fine, and some of the other VMs on the same ESXi host were also unaffected.

But when I looked closely at the events and tasks logs of the affected VMs, I noticed that when VADP triggered snapshot creation, all of them got the same error (below), failed to create the snapshot, and then rebooted.

Error: error message from ESXi Reason: 0 ( cannot allocate memory)

 


On checking the ESXi kernel warning logs, they showed that the COW heap was already at its maximum size and could not expand.
So I checked the current free percentage of the COW heap, and it was 4% (which means 96% was in use).
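
The arithmetic above (4% free means 96% used) can be wrapped in a small helper. This is only a sketch: the function names and the warning and critical thresholds are my own assumptions for illustration, not values taken from VMware documentation.

```python
def heap_used_percent(free_percent: float) -> float:
    """Convert a heap 'percent free' reading into percent used."""
    if not 0.0 <= free_percent <= 100.0:
        raise ValueError(f"free_percent out of range: {free_percent}")
    return 100.0 - free_percent


def heap_status(free_percent: float,
                warn_below: float = 20.0,
                critical_below: float = 10.0) -> str:
    """Classify a reading. Both thresholds are hypothetical defaults."""
    heap_used_percent(free_percent)  # reuse the range check
    if free_percent < critical_below:
        return "critical"
    if free_percent < warn_below:
        return "warning"
    return "ok"
```

With the readings from this incident, `heap_used_percent(4.0)` gives 96.0 and `heap_status(4.0)` gives "critical", while the post-fix reading of 95% free gives "ok".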

How to check the current COW heap utilization:






Solution: To resolve this COW heap memory issue, I restarted the ESXi management services and manually consolidated all of the VMs' existing snapshots. After that, the COW heap free percentage went back to 95% (which means only 5% of the COW heap memory is now in use).



I also increased the default size of the COW heap in ESXi to the maximum: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004424

So what is the relationship between the COW heap and multiple snapshots?

VMware's explanation is as below:

If you use snapshots on virtual machines running on an ESX host, each snapshot delta disk is a COW (Copy On Write) disk. For each one in use by running virtual machines, their data structures take up ESX kernel memory. This allocation is known as the COW heap. This memory is used to store cached metadata, pointing to where in a VMDK or in a chain of VMDK files disk data to be accessed resides.


http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1003156&sliceId=2&docTypeID=DT_KB_1_1&dialogID=161478702&stateId=1%200%20161488616

Moreover, to prevent this issue in future, I have configured monitoring for failed consolidations and scripted a poll of the COW heap memory utilization every minute, written to a datastore (it could also be sent to a syslog server).
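
The polling idea can be sketched roughly as follows. This is not the script I actually run, just a minimal illustration: `read_free_percent` is a hypothetical callable standing in for whatever command you use on the host to read the COW heap free percentage, and the log path and line format are assumptions as well.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def format_sample(free_percent: float, now: datetime) -> str:
    """Build one timestamped log line from a 'percent free' reading."""
    used_percent = 100.0 - free_percent
    return (f"{now.isoformat()} COW heap "
            f"free={free_percent:.1f}% used={used_percent:.1f}%")


def poll_heap(read_free_percent: Callable[[], float],
              log_path: str,
              samples: int,
              interval_seconds: float = 60.0) -> None:
    """Append `samples` readings to log_path, one every interval_seconds.

    read_free_percent is a hypothetical hook: supply your own function
    that returns the current COW heap free percentage as a float.
    """
    for i in range(samples):
        line = format_sample(read_free_percent(), datetime.now(timezone.utc))
        # Open in append mode on every sample so each line hits the file
        # even if the script is killed between polls.
        with open(log_path, "a", encoding="utf-8") as log_file:
            log_file.write(line + "\n")
        if i < samples - 1:
            time.sleep(interval_seconds)
```

Pointing `log_path` at a file on a datastore keeps the history on shared storage; the same formatted line could instead be handed to a syslog client if you prefer central logging.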