Jul 21

Large packet loss at the guest OS level on the VMXNET3 vNIC in ESXi 5.x / 4.x

Symptoms

When using the VMXNET3 driver on ESXi 4.x and 5.x, you see significant packet loss during periods of very high traffic bursts.

Cause

Packets are dropped during high traffic bursts due to a lack of receive and transmit buffer space in the guest, or when receive traffic is speed-constrained, for example by a traffic filter.

Resolution

To resolve this issue, ensure that there is no traffic filtering occurring (for example, with a mail filter). After eliminating this possibility, slowly increase the number of buffers in the guest operating system.

To reduce burst traffic drops in Windows Server 2008 R2, adjust the vNIC buffer settings:
  1. Click Start > Control Panel > Device Manager.
  2. Right-click vmxnet3 and click Properties.
  3. Click the Advanced tab.
  4. Click Small Rx Buffers and increase the value. The default value is 512  and the maximum is 8192.
  5. Click Rx Ring #1 Size and increase the value. The default value is 1024 and the maximum is 4096.

Notes:

  • It is important to increase the value of Small Rx Buffers and Rx Ring #1 gradually to avoid drastically increasing the memory overhead on the host and possibly causing performance issues if resources are close to capacity.
  • If this issue occurs on only 2-3 virtual machines, set the value of Small Rx Buffers and Rx Ring #1 to the maximum value. Monitor virtual machine performance to see if this resolves the issue.
  • The Small Rx Buffers and Rx Ring #1 variables affect non-jumbo frame traffic only on the adapter.

Additional Information

This issue can affect any application that has a high number of connections with burst traffic patterns. Using Iperf, VMware has been able to replicate the packet drop behavior with a large number of concurrent clients.
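
As a quick cross-check from the ESXi side, esxtop can show whether packets are being dropped at the virtual switch port rather than inside the guest. This is a sketch of the workflow, not part of the original KB:

~ # esxtop
(press n to switch to the network view; the %DRPRX and %DRPTX columns show the percentage of dropped receive and transmit packets for each port)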

Jul 17

ESXCLI storage commands

Which PSA, SATP & PSP?

First things first, let’s figure out whether the device is managed by VMware’s native multipath plugin, the NMP, or by a third-party plugin such as EMC’s PowerPath. I start with the esxcli storage nmp device list command. This not only confirms that the device is managed by the NMP, but also displays the Storage Array Type Plugin (SATP) used for path failover and the Path Selection Policy (PSP) used for load balancing. Here is an example of this command (I’m using the -d option to run it against a single device to keep the output to a minimum).

~ # esxcli storage nmp device list -d naa.600601603aa029002cedc7f8b356e311
naa.600601603aa029002cedc7f8b356e311
  Device Display Name: DGC Fibre Channel Disk (naa.600601603aa029002cedc7f8b356e311)
  Storage Array Type: VMW_SATP_ALUA_CX
  Storage Array Type Device Config: {navireg=on, ipfilter=on}
   {implicit_support=on;explicit_support=on; explicit_allow=on;
   alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}
  Path Selection Policy: VMW_PSP_RR
  Path Selection Policy Device Config: {policy=rr,iops=1000,
   bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,
   numBytesPending=0}
  Path Selection Policy Device Custom Config:
  Working Paths: vmhba2:C0:T3:L100
  Is Local SAS Device: false
  Is Boot USB Device: false
~ #

We can clearly see both the SATP and the PSP for the device in this output. There is a lot more information here as well, especially since this is an ALUA array; you can read more about what these configuration options mean in this post. This device is using the Round Robin PSP, VMW_PSP_RR. One point still worth noting is that support for the Round Robin PSP varies: some arrays support it and some do not, so it is always worth checking the footnotes of the Storage section of the VMware HCL to see whether a particular array supports Round Robin. Now that we have the NMP, SATP & PSP, let’s look at some other details.

Queue Depth, Adaptive Queuing, Reservations

This next command is very useful for checking a number of things. Primarily, it will tell you what the device queue depth is set to. But it will also tell you whether adaptive queuing has been configured, and whether the device has the perennially reserved setting, something that is used a lot in Microsoft Clustering configurations to avoid slow host boots.

~ # esxcli storage core device list -d naa.600601603aa029002cedc7f8b356e311
naa.600601603aa029002cedc7f8b356e311
   Display Name: DGC Fibre Channel Disk (naa.600601603aa029002cedc7f8b356e311)
   Has Settable Display Name: true
   Size: 25600
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.600601603aa029002cedc7f8b356e311
   Vendor: DGC    
   Model: VRAID          
   Revision: 0532
   SCSI Level: 4
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: unknown
   Attached Filters: VAAI_FILTER
   VAAI Status: supported
   Other UIDs: vml.0200640000600601603aa029002cedc7f8b356e311565241494420
   Is Local SAS Device: false
   Is Boot USB Device: false
   No of outstanding IOs with competing worlds: 32

The last line of output is actually the device queue depth: for this device, 32 I/Os can be queued to the device. The Queue Full Sample Size and Queue Full Threshold values both relate to Adaptive Queuing – it is not configured on this device since both values are 0. If you’d like to know more about Adaptive Queuing, you can read this article here.
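
For completeness, Adaptive Queuing is enabled per device from the same namespace. A sketch only – the threshold and sample-size values below are illustrative and should come from your array vendor’s guidance:

~ # esxcli storage core device set -d naa.600601603aa029002cedc7f8b356e311 --queue-full-threshold=4 --queue-full-sample-size=32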

The perennially reserved flag is an interesting one and a relatively recent addition to device configurations. With applications that place SCSI reservations on devices (such as Microsoft Cluster), ESXi host reboots would be delayed as the host tried to query devices with SCSI reservations on them. Perennially Reserved is a flag that tells the ESXi host not to waste any time trying to query these devices on boot, as there is a likelihood that they are reserved by another host. This speeds up the boot times of ESXi hosts running MSCS VMs.
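
The flag is set per device and per host. A minimal example, using the same device ID as above (it must be run on every host that sees the clustered RDM):

~ # esxcli storage core device setconfig -d naa.600601603aa029002cedc7f8b356e311 --perennially-reserved=true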

For those of you contemplating VSAN, VMware’s new Virtual SAN product, the ability to identify SSD devices and local vs. remote devices is critical. VSAN requires SSDs (or PCIe flash devices) as well as local magnetic disks. This command will help you identify both.
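
As an aside, if a local flash device does not show up with Is SSD: true, it can usually be tagged via an SATP claim rule and then reclaimed. A sketch with a placeholder device ID:

~ # esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d naa.xxx --option=enable_ssd
~ # esxcli storage core claiming reclaim -d naa.xxx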

Apart from some vendor specific information and size information, another interesting item is the VAAI Status. In this case, VAAI (vSphere APIs for Array integration) is shown as supported. But how can I find out more information about which primitives are supported? This next command will help with that.

Which VAAI primitives are supported?

~ # esxcli storage core device vaai status get -d naa.600601603aa029002cedc7f8b356e311
naa.600601603aa029002cedc7f8b356e311
   VAAI Plugin Name: VMW_VAAIP_CX
   ATS Status: supported
   Clone Status: supported
   Zero Status: supported
   Delete Status: unsupported

This device, as we can clearly see, supports 3 out of the 4 VAAI block primitives. ATS, Atomic Test & Set, is the replacement for SCSI reservations. Clone is the ability to offload a clone or migration operation to the array using XCOPY. Zero is the ability to have the array zero out blocks using WRITE_SAME. Delete relates to the UNMAP primitive, the ability to reclaim dead space on thin-provisioned datastores; in this example, that primitive shows up as unsupported.
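
Keep in mind that the array reporting a primitive as supported is only half the story; the corresponding host-side settings must also be enabled (they are by default). A quick way to check, shown here as a sketch:

~ # esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove
~ # esxcli system settings advanced list -o /DataMover/HardwareAcceleratedInit
~ # esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking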

Useful protocol information

For those of you interested in troubleshooting storage issues outside of ESXi, the esxcli storage san namespace has some very useful commands. In the case of Fibre Channel, you can get information about which adapters are used for FC, and display the WWNN (node name) and WWPN (port name), speed and port state, as shown here.

~ # esxcli storage san fc list
   Adapter: vmhba2
   Port ID: 012800
   Node Name: 20:00:00:c0:dd:18:77:d1
   Port Name: 21:00:00:c0:dd:18:77:d1
   Speed: 10 Gbps
   Port Type: NPort
   Port State: ONLINE

   Adapter: vmhba3
   Port ID: 000000
   Node Name: 20:00:00:c0:dd:18:77:d3
   Port Name: 21:00:00:c0:dd:18:77:d3
   Speed: 0 Gbps
   Port Type: Unknown
   Port State: LINK DOWN

So I have one good adapter, and one not so good. I can also display FC event information:

~ # esxcli storage san fc events get
FC Event Log                                                
-------------------------------------------------------------
2013-09-23 12:18:58.085 [vmhba2] LINK UP                    
2013-09-23 13:05:35.952 [vmhba2] RSCN received for PID 012c00
2013-09-23 13:29:24.072 [vmhba2] RSCN received for PID 012c00
2013-09-23 13:33:36.249 [vmhba2] RSCN received for PID 012c00

It should be noted that there are a bunch of other useful commands in this namespace, not just for FC adapters. You can also examine FCoE, iSCSI and SAS devices in this namespace and get equally useful information.
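
For example (a sketch – the sub-namespaces available depend on the ESXi version and the adapters installed):

~ # esxcli storage san fcoe list
~ # esxcli storage san iscsi list
~ # esxcli storage san sas list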

Useful SMART Information

Another very useful command, especially since the introduction of vFRC (vSphere Flash Read Cache) and the soon-to-be-announced VSAN, both of which make use of SSDs, is the ability to examine the SMART attributes of a disk drive.

~ # esxcli storage core device smart get -d naa.xxxxxx 
Parameter                    Value Threshold Worst 
---------------------------- ----- --------- -----
Health Status                 OK     N/A       N/A
Media Wearout Indicator       N/A    N/A       N/A
Write Error Count             N/A    N/A       N/A
Read Error Count              114    6         100
Power-on Hours                90     0         90
Power Cycle Count             100    20        100
Reallocated Sector Count      2      36        2
Raw Read Error Rate           114    6         100
Drive Temperature             33     0         53
Drive Rated Max Temperature   67     45        47
Write Sectors TOT Count       200    0         200
Read Sectors TOT Count        N/A    N/A       N/A 
Initial Bad Block Count       100    99        10

While there are still some drives that return certain fields as N/A, I know there is a concerted effort underway between VMware and its partners to get this working as widely as possible. It is invaluable to be able to see the Media Wearout Indicator on SSDs, as well as the reallocated sector count and drive temperature.

Source: cormachogan.com

Jul 17

vSphere Web Client short cuts

vSphere Web Client short cuts to switch to different views.

  • Ctrl + Alt + 1 = Go to Home View
  • Ctrl + Alt + 2 = Go to vCenter Home View
  • Ctrl + Alt + 3 = Go to the Hosts & Clusters View
  • Ctrl + Alt + 4 = Go to the VM & Templates View
  • Ctrl + Alt + 5 = Go to the Datastores View
  • Ctrl + Alt + 6 = Go to the Networks View
  • Ctrl + Alt + S = Place cursor in the Search field

Jul 17

vSphere 5.1 & EMC Storage

Without delving too deeply into the Pluggable Storage Architecture found in ESXi hosts, VMware uses a Native Multipath Plugin for handling I/O between the host and a storage array. This has two important components – a Storage Array Type Plugin (SATP) for failover and Path Selection Policy (PSP) for load balancing. Each SATP has a default PSP associated with it. An esxcli storage nmp satp list will show you the relationship between SATP and its default PSP.

EMC have taken a very important step with the release of vSphere 5.1. My understanding is that a large portion of EMC storage is now going to use VMware’s Round Robin Path Selection Policy (PSP) by default.  Below is the output taken from a LUN from a VNX-5500 array presented to one of my ESXi 5.1 hosts. As you can clearly see, this is now using Round Robin without having to make any configuration changes.

naa.xxx
 Device Display Name: DGC Fibre Channel Disk (naa.xxx)
 Storage Array Type: VMW_SATP_ALUA_CX
 Storage Array Type Device Config: {navireg=on, ipfilter=on}
 {implicit_support=on; explicit_support=on; explicit_allow=on;
 alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}
 Path Selection Policy: VMW_PSP_RR
 Path Selection Policy Device Config: {policy=rr,iops=1000,
  bytes=10485760,useANO=0;lastPathIndex=2: NumIOsPending=0,
  numBytesPending=0}
 Path Selection Policy Device Custom Config:
 Working Paths: vmhba2:C0:T3:L0, vmhba4:C0:T3:L0
 Is Local SAS Device: false
 Is Boot USB Device: false

In vSphere 5.1, the default PSPs for the Storage Array Type Plugins (SATPs) VMW_SATP_ALUA_CX and VMW_SATP_SYMM have changed from VMW_PSP_FIXED to VMW_PSP_RR:

Using the command esxcli storage nmp satp list, we can see this change:

Name              Default PSP   Description
----------------  ------------  ---------------------------------
VMW_SATP_ALUA_CX  VMW_PSP_RR    Supports EMC CX that use the ALUA..
VMW_SATP_SYMM     VMW_PSP_RR    Placeholder (plugin not loaded)

I think that this is indeed a great move. I believe you’ll get optimal storage performance with the RR PSP. However, if you use Microsoft Cluster Service (MSCS), you are probably aware that you cannot use the Round Robin path selection policy on the back-end storage; without getting into too much detail, handling SCSI reservations across multiple paths is the reason this is not supported. Therefore, if you use EMC storage, you run virtualized MSCS environments, and you plan to upgrade to vSphere 5.1, keep this in mind: you will have to change those devices from VMW_PSP_RR back to the original setting (an example follows below). There is plenty more information around MSCS supportability in vSphere environments in this KB article.
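
A minimal sketch of changing the PSP back on a single device with esxcli (the device ID is a placeholder; repeat per device on each host, or script it):

~ # esxcli storage nmp device set -d naa.xxx --psp VMW_PSP_FIXED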

Source: cormachogan.com

Jul 17

Storage Enhancements – All Paths Down (APD)

All Paths Down (APD) is a situation which occurs when a storage device is removed from the ESXi host in an uncontrolled manner, either due to administrative error or device failure. Over the past several vSphere releases, VMware has made significant improvements to how the APD condition is handled. It is a difficult condition to manage, since we don’t know whether the device is gone forever or might come back, i.e. whether this is a permanent device loss or a transient condition.

The bigger issue around APD is what it can do to hostd. Hostd worker threads will sit waiting indefinitely for I/O to return (for instance, when a rescan of the SAN is initiated from the vSphere UI). However, hostd only has a finite number of worker threads, so if these all get tied up waiting for disk I/O, other hostd tasks will be affected. A common symptom of APD is ESXi hosts disconnecting from vCenter because their hostd daemons have become wedged.

I wrote about the vSphere 5.0 enhancements that were made to APD handling on the vSphere Storage blog here if you want to check back on them. Basically, a new condition known as PDL (Permanent Device Loss) was introduced in vSphere 5.0 for the case where we knew the device was never coming back; we learnt this through SCSI sense codes sent by the target array. Once we had a PDL condition, we could fast-fail I/Os to the missing device and prevent hostd from getting tied up.

Following on from the 5.0 APD handling improvements, what we want to achieve in vSphere 5.1 is as follows:

  • Handle more complex transient APD conditions, and not have hostd getting stuck indefinitely when devices are removed in an uncontrolled manner.
  • Introduce some sort of PDL method for those iSCSI arrays which present only one LUN per target. These arrays were problematic for APD handling, since once the LUN went away, so did the target, and we had no way of getting back any SCSI sense codes.

It should be noted that in vSphere 5.0 U1, we fixed an issue so that vSphere correctly detects PDL and restarts VMs on other hosts in a vSphere HA cluster which may not be in this PDL state. This enhancement is also in 5.1.

Complex APD
As I have already mentioned, All Paths Down affects more than just Virtual Machine I/O. It can also affect hostd worker threads, leading to host disconnects from vCenter in worst case scenarios.  It can also affect vmx I/O when updating Virtual Machine configuration files. On occasion, we have observed scenarios where the .vmx file was affected by an APD condition.

In vSphere 5.1, a new timeout value for APD is being introduced. There is a new global setting for this feature called Misc.APDHandlingEnable. If this value is set to 0, the current (5.0) behavior of retrying failing I/Os forever is used. If Misc.APDHandlingEnable is set to 1 (the default), APD handling follows the new model, using the timeout value Misc.APDTimeout.

Misc.APDTimeout defaults to 140 seconds and is tunable. [The lower limit is 20 seconds, but this is only for testing.] These settings (Misc.APDHandlingEnable & Misc.APDTimeout) are exposed in the vSphere UI. When APD is detected, the timer starts. After 140 seconds, the device is marked as APD Timeout and any further I/Os are fast-failed with a status of NO_CONNECT, the same sense code observed when an FC cable is disconnected from an FC HBA. This fast-failing of I/Os prevents hostd from getting stuck waiting on I/O. If any of the paths to the device recovers, subsequent I/Os to the device are issued normally and the special APD treatment ends.
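
These advanced settings can also be read or changed from the command line. A minimal sketch with esxcli (the value shown is simply the default):

~ # esxcli system settings advanced list -o /Misc/APDHandlingEnable
~ # esxcli system settings advanced list -o /Misc/APDTimeout
~ # esxcli system settings advanced set -o /Misc/APDTimeout -i 140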

Single-Lun, Single-Target
We also wanted to extend the PDL (Permanent Device Loss) detection to those arrays that only have a single LUN per Target. On these arrays, when the LUN disappears, so does the target so we could never get back a SCSI Sense Code as mentioned earlier.

Now in 5.1, the iSCSI initiator attempts to re-login to the target after a dropped session. If the device is not accessible, the storage system rejects our attempt to access it. Depending on the response from the array, we can determine that the device is in PDL, not just unreachable.

I’m very pleased to see these APD enhancements in vSphere 5.1. The more that is done to mitigate the impact of APD, the better.

Source: cormachogan.com

Jul 17

vSphere 5.5 Storage – PDL AutoRemove

In vSphere 5.5 we introduced yet another improvement to this mechanism, namely the automatic removal of devices which have entered the PDL state from the ESXi host.

In a nutshell, PDL AutoRemove automatically removes a device in the PDL state from the ESXi host. Think about it – here we have a device that we know is never coming back, based on the SCSI sense code we have received from the controller/array. Why wouldn’t we want to clean it up? A device in a PDL state cannot accept any more I/Os, but it still needlessly consumes one of the 256 device slots available per ESXi host, so automatically removing it (since it is never coming back) is a welcome benefit.

PDL AutoRemove occurs only if there are no open handles left on the device. The AutoRemove will happen when the last handle on the device closes.

One important point however – due to the nature of stretched/metro clusters, it is recommended that this setting be disabled in those environments. If you wish to read more about the reasons why and how, Michael Webster does a good job of explaining it in this blog article.
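
For reference, here is a minimal sketch of turning the feature off on a host with esxcli, using the Disk.AutoremoveOnPDL advanced option (check the relevant KB for the recommendation that applies to your build):

~ # esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 0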

Source: cormachogan.com

Jul 16

Path Selection Policy with ALUA

It’s important to understand how VMware ESXi servers handle connections to their associated storage arrays.

If we look specifically at Fibre Channel fabrics, we have several multipathing options to consider. There are three path selection policy (PSP) plugins that VMware uses natively to determine the I/O path that data will travel over to the storage device.

  1. Fixed Path
  2. Most Recently Used (MRU)
  3. Round Robin (RR)
Let’s look at some examples of the three PSPs we’ve mentioned and how they behave. The definitions quoted below come from the vSphere 5 Storage Guide.

 

Fixed Path

The host uses the designated preferred path, if it has been configured. Otherwise, it selects the first working path discovered at system boot time. If you want the host to use a particular preferred path, specify it manually. Fixed is the default policy for most active-active storage devices.
NOTE: If the host uses a default preferred path and the path’s status turns to Dead, a new path is selected as preferred. However, if you explicitly designate the preferred path, it will remain preferred even when it becomes inaccessible.

 

So we’ll show three examples. On the left we’ve shown the preferred path (manually set) to the storage system. In the middle picture, we’ve simulated a path failure, and the fixed path has changed. Then on the right, we’ve shown that once the failure has been resolved, the path goes back to the originally configured path.

 

Fixed Path

 

Most Recently Used

The host selects the path that it used most recently. When the path becomes unavailable, the host selects an alternative path. The host does not revert back to the original path when that path becomes available again. There is no preferred path setting with the MRU policy. MRU is the default policy for most active-passive storage devices.

 

Below we’ show 3 more examples.  This time we show the MRU policy.  On the left we see the default path that the MRU has chosen.  The middle picture shows a failure and what happens to the path selection.  So far, the MRU Policy looks the same as the Fixed Path Policy.  In the last picture (on the right) you’ll see the difference from Fixed Path.  When the failure has been resolved, the MRU policy does not go back to the original path.

 

MRU

 

Round Robin

The host uses an automatic path selection algorithm rotating through all active paths when connecting to active-passive arrays, or through all available paths when connecting to active-active arrays. RR is the default for a number of arrays and can be used with both active-active and active-passive arrays to implement load balancing across paths for different LUNs.

 

Below we again have 3 examples.  This time we’re not showing any failures, but we are showing how the round robin policy selects one path, then another, then another and eventually repeats when all the available paths have been used.

 

Round Robin

ALUA

Along with PSPs we also have Storage Array Type Plugins (SATPs), which are specific to the storage vendors. An SATP is responsible for determining which I/O paths are available, monitoring them for changes, and handling failovers; the PSP then chooses which of the available paths to use. SATPs aren’t only relevant during failover events, though: some storage arrays have two active storage processors, but only one of the storage processors owns the LUN that is being accessed. The specific SATP case we’ll be looking at is Asymmetric Logical Unit Access (ALUA).

 

In the example on the left we can see that the storage path is not optimized. We found a perfectly acceptable path to the storage system, but when we got there we found that the other storage processor currently owned the LUN. The I/O can still reach the LUN, but it has to be passed across to the owning storage processor, so the path is not optimized. With ALUA, the array reports which paths are optimized so that the host can prefer them whenever possible.
[Image: ALUA example showing a non-optimized path through the non-owning storage processor]

 

Now let’s look at an example of the Round Robin PSP with the ALUA SATP.
Below we see that there is one storage processor that provides the optimized path, but there are still multiple paths from the host to that storage processor. There are two optimized paths available for the Round Robin PSP to alternate between when accessing the LUN.

 

RR with ALUA

If we look in the ESXi configuration we can see that ALUA is the SATP and Round Robin is the selected PSP. If you look at the storage paths you’ll see Active (I/O) listed for the optimized storage paths and Active for the non-optimized storage paths. Remember that just because a path is not optimized doesn’t mean it couldn’t be used if necessary.

[Image: vSphere client view showing the ALUA SATP, the Round Robin PSP and the path states]
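
The same information is available from the command line. A sketch with a placeholder device ID – the optimized paths report a group state of active, while the non-optimized ones report active unoptimized:

~ # esxcli storage nmp path list -d naa.xxx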

If you want to set the PSP for your devices without selecting every path on the host individually, you can do so with PowerCLI.

[Image: PowerCLI example of setting the path selection policy]
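
The original screenshot showed the PowerCLI approach. As an alternative sketch, esxcli can change the default PSP for an SATP so that every device claimed by that SATP picks up Round Robin; the SATP name here is just an example, so substitute the one your array is claimed by:

~ # esxcli storage nmp satp set --satp VMW_SATP_ALUA_CX --default-psp VMW_PSP_RR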