Tuesday, 2 March 2010

Citrix XenApp Performance Monitoring / General "Server" Performance Monitoring

Hi All,

As part of the current project I have been working on, I was asked to put together a list of performance counters which should be set up to monitor the performance of XenApp in depth. The contract I am on uses EdgeSight heavily, but I wanted to make sure that everything was covered in relation to performance. So in OpsMgr we set up the following, and I thought this may be a useful reference for anyone wanting to monitor XenApp or a server's general performance... enjoy:

Kernel

Memory\Free System Page Table Entries

This shows the number of page table entries not currently used by the system. On a 32-bit Windows Server 2003 system with 4 GB RAM, this counter should not drop below 5000 free system page table entries, as that would indicate we are running low on kernel memory.
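If you want to sanity-check this outside of OpsMgr, the counter can be read directly from the PDH API. Here is a minimal sketch in Python, assuming the pywin32 package is installed on the server; the 5000 figure is the threshold mentioned above:

```python
# Minimal sketch: read Memory\Free System Page Table Entries via the PDH API
# (pywin32 assumed installed) and flag the 5000 threshold mentioned above.
import win32pdh

FREE_PTE_WARNING = 5000  # figure from the guidance above (x86 Server 2003, 4 GB RAM)

query = win32pdh.OpenQuery()
counter = win32pdh.AddCounter(query, r"\Memory\Free System Page Table Entries")
win32pdh.CollectQueryData(query)
_, free_ptes = win32pdh.GetFormattedCounterValue(counter, win32pdh.PDH_FMT_LONG)
win32pdh.CloseQuery(query)

if free_ptes < FREE_PTE_WARNING:
    print(f"WARNING: only {free_ptes} free system PTEs - kernel memory is low")
else:
    print(f"OK: {free_ptes} free system PTEs")
```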

Logical Disk

% Disk Time

Gives an indication of how busy the disks are. The disk can become a bottleneck for a number of reasons:

- The server has too little physical memory and is therefore “thrashing.” If thrashing is occurring, Pages/sec will also be high.

- A single user is running an application or process that makes extensive and rapid use of the disk. This can be investigated further by running Current Process and Current User reports.

- Many users are performing large amounts of disk activity. The speed of the disks may be the server’s bottleneck.

The % Disk Time metric is calculated from a number of factors, and values above 100% are possible. A value of 100% means the disk is in constant use; values greater than 100% may indicate that the disk is too slow for the number of requests.

% Free Space

A low value here means the server is running out of disk space. Several factors can cause this:

- A lack of remaining disk space after installing the operating system and applications.
- A large number of users are logged on (now or in the past) and their configuration data, settings, and files are taking up too much space.
- A rogue process or user is consuming a large amount of disk space.

Current Disk Queue Length and Avg. Disk Queue Length

This counter shows the number of requests outstanding on the disk at the time the performance data is collected. Lower values are better; values above 2 per physical disk may indicate a bottleneck. The counter gives the queue length across all disks behind the logical disk, so if the volume is presented as a LUN backed by 4 physical disks, a queue length of 8 may still be acceptable. Bottlenecks can create a backlog that spreads beyond the server accessing the disk and results in long wait times for users.
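Because the counter is reported per logical disk, it helps to divide it by the number of physical spindles before comparing it with the "2 per disk" rule of thumb. A minimal sketch, where the spindle count is an assumed value you would get from whoever provisioned the storage:

```python
# Minimal sketch: normalise Avg. Disk Queue Length by the number of physical
# spindles behind the logical disk, since the counter covers the whole LUN.
def queue_length_per_spindle(avg_queue_length: float, spindles: int) -> float:
    """Return the average queue depth per physical disk."""
    return avg_queue_length / spindles

# Example from the text: a LUN backed by 4 disks with a queue length of 8
# works out at 2 per spindle - on the threshold rather than over it.
per_disk = queue_length_per_spindle(8.0, 4)
print(f"Queue length per spindle: {per_disk}")
if per_disk > 2:
    print("Possible disk bottleneck")
```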

Logical Disk: Avg. Disk sec/Read and Logical Disk: Avg. Disk sec/Write

These counters show the average time, in seconds, of a read or write operation to the disk. They are typically used to monitor SQL Server performance, but it is useful to see what times are being produced. From a SQL perspective the following is a rough guide to disk performance:

- Avg. Disk sec/Read - a measure of disk latency: the average time, in seconds, of a read of data from the disk. Reads: excellent is under 20 msec (0.020 seconds).

- Avg. Disk sec/Write - a measure of disk latency: the average time, in seconds, of a write of data to the disk. Non-cached writes: excellent is under 20 msec (0.020 seconds). Cached writes: excellent is under 4 msec (0.004 seconds).
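As a quick way of applying the rough guide above, the following sketch classifies read and write latency samples against the 20 msec and 4 msec figures (counter values are in seconds; the simple two-bucket split is my own simplification):

```python
# Minimal sketch: classify Avg. Disk sec/Read and Avg. Disk sec/Write samples
# against the rough thresholds above. Counter values are in seconds.
def classify_read(seconds: float) -> str:
    return "excellent" if seconds < 0.020 else "poor"

def classify_write(seconds: float, cached: bool) -> str:
    limit = 0.004 if cached else 0.020  # 4 ms cached, 20 ms non-cached
    return "excellent" if seconds < limit else "poor"

print(classify_read(0.012))                # excellent
print(classify_write(0.006, cached=True))  # poor - over the 4 ms cached limit
```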

Memory

Available Bytes

A low value here indicates that too much memory is being used. This could be because:
- Too many users are logged on.
- The applications that users are running are too memory-hungry for the amount of memory available on the server.
- Some user or process is using a large amount of memory. Running a perfmon report focusing on the Current Process counters may help you track this down.

Being short on memory could result in “thrashing.”

Pages/sec (Hard Page Faults)

A large amount of paging indicates either:
- The system is low on physical memory and the disk is being used extensively as virtual memory. This can be caused by too many users being logged on, too many processes running, or a rogue process “stealing” virtual memory.
- An active process or processes are making large and frequent memory accesses.

Too much paging degrades the performance of the server for all users logged on. The Available Bytes, Disk, and % Processor Time metrics may also enter warning or critical states when a large amount of paging occurs. Short bursts of heavy paging are normal, but long periods of heavy paging seriously affect server performance.

It is generally agreed that anything over 20 Pages/sec could be deemed a bottleneck.
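To separate short bursts from long periods of heavy paging, something like the following sketch can be run over a series of Pages/sec samples (the sample list, collection interval and consecutive-sample count are assumptions to tune for your own environment):

```python
# Minimal sketch: flag sustained heavy paging rather than short bursts,
# using the 20 Pages/sec figure above.
def sustained_paging(samples: list[float], threshold: float = 20.0,
                     min_consecutive: int = 10) -> bool:
    """True if Pages/sec stayed above the threshold for min_consecutive samples."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

print(sustained_paging([5, 150, 8, 3]))  # False - a short burst is normal
print(sustained_paging([45] * 15))       # True  - a long period of heavy paging
```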

Page File: % Usage

While the page file is less likely to be a bottleneck, it is worth checking. Anything over 70% usage could indicate an issue.

Memory: Pool Non-paged Bytes

A good counter for detecting an application memory leak. When an application is closed, its memory should be returned. If this counter creeps up at a rapid rate during the day, a faulty application may be causing a memory leak.
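A simple way to spot that creep is to look at the growth rate over the day rather than the absolute value. A minimal sketch, where the sample readings and the 1 MB/hour alert figure are illustrative assumptions:

```python
# Minimal sketch: estimate how fast Pool Nonpaged Bytes is growing from
# timestamped samples, to spot a counter that "creeps up" during the day.
def growth_per_hour(samples: list[tuple[float, float]]) -> float:
    """samples: (hours_since_start, pool_nonpaged_bytes). Returns bytes per hour."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

readings = [(0, 80_000_000), (4, 96_000_000), (8, 112_000_000)]  # example data
rate = growth_per_hour(readings)
print(f"Pool Nonpaged growing at {rate / 1_000_000:.1f} MB/hour")
if rate > 1_000_000:  # assumed alert threshold of 1 MB/hour - tune to your estate
    print("Possible memory leak - find the application that is not releasing memory")
```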

Committed Bytes

If Committed Bytes is consistently higher than the amount of physical RAM, the server does not have enough memory.
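A minimal sketch of that comparison, with both values supplied as assumed inputs (the counter from perfmon/OpsMgr, the RAM figure from the server build sheet or WMI):

```python
# Minimal sketch: compare Memory\Committed Bytes against installed physical RAM.
def memory_overcommitted(committed_bytes: int, physical_ram_bytes: int) -> bool:
    return committed_bytes > physical_ram_bytes

committed = 5 * 1024**3  # 5 GB committed (example reading)
ram = 4 * 1024**3        # 4 GB physical RAM
if memory_overcommitted(committed, ram):
    print("Committed Bytes exceeds physical RAM - the server is short of memory")
```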

Network

Bytes Total/Sec

This metric gives a good indication of how much network traffic the server is generating or receiving. Thresholds here depend on a number of variables, such as the network link speed and the NIC hardware.
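One way to make the number meaningful is to express it as a percentage of the link speed. A minimal sketch, where the link speed is an assumed value for your NIC:

```python
# Minimal sketch: turn Network Interface\Bytes Total/sec into a utilisation
# percentage for a given link speed (bits per second).
def nic_utilisation(bytes_total_per_sec: float, link_speed_bps: float) -> float:
    return (bytes_total_per_sec * 8) / link_speed_bps * 100

# e.g. 12 MB/s on a 1 Gbit/s NIC is roughly 10% utilisation
print(f"{nic_utilisation(12_000_000, 1_000_000_000):.1f}% of the link in use")
```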

Processor

% Interrupt Time

A high value means the processor is spending a large amount of time responding to input and output rather than doing user processing. A large value for interrupt time usually indicates a hardware problem or a very busy server.

% Processor Time

A high processor time for a long period indicates that the processor is the server's bottleneck, that too many users are logged on, or that there is a rogue user or process (use the Current Process perfmon counters to investigate). This is pure processor utilisation, which is not always a bad thing at 80-90%; user experience and helpdesk calls would need to correlate with it before it is deemed the issue.

System

Context Switches/Sec

A high value means a large number of threads and/or processes are competing for processor time. A number of variables, such as the number of processors and their clock rate, could be used to determine thresholds. Thresholds I have seen mentioned in the past are around 15,000 per CPU.
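A minimal sketch of scaling the counter by processor count before comparing it with that 15,000 per-CPU figure:

```python
# Minimal sketch: scale System\Context Switches/sec by the number of processors
# before comparing it with the ~15,000 per-CPU figure above.
def context_switch_alert(switches_per_sec: float, cpu_count: int,
                         per_cpu_threshold: float = 15_000) -> bool:
    return (switches_per_sec / cpu_count) > per_cpu_threshold

print(context_switch_alert(70_000, 4))  # True  - 17,500 per CPU
print(context_switch_alert(40_000, 4))  # False - 10,000 per CPU
```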

Terminal Services

Active Sessions

A large number of users are logged on and running applications. The server may begin running out of memory or processor time and performance for users may deteriorate.

Inactive Sessions

A large number of disconnected sessions are consuming virtual memory. Remove some disconnected sessions, or reduce the length of time for which disconnected sessions can persist before they are automatically removed.

Citrix MetaFrame Presentation Server

Data Store Connection Failure

Thresholds here depend on the connection to the data store (WAN/LAN etc.), but this value should be low.
