Performance Monitoring

Measure, record, and track various service performance indicators for IT infrastructure, end-to-end IT services, business processes, and the organization (based on empirical models).

Improvement Planning

Practices-Outcomes-Metrics (POM)

Representative POMs are described for Performance Monitoring at each level of maturity.

1 Initial

Practice
Use monitoring only reactively, to investigate after specific incidents have occurred.
Outcome
Reactive environment with minimal predictability and poor corrective actions, leading to budget overruns and/or over-investment (“sledgehammer to crack a nut”).
Metrics
- Number of outages
- Total downtime
- Variance from planned BAU cost
- Helpdesk calls
- MTTR
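
The Level 1 metrics can be derived from a basic incident log. The following is a minimal sketch, assuming each outage is recorded with a start and a restoration timestamp; the `Incident` record and the `outage_metrics` function are hypothetical names introduced for illustration and are not part of the framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical outage record: when service was lost and when it was restored."""
    started: datetime
    restored: datetime


def outage_metrics(incidents: list[Incident]) -> dict:
    """Return number of outages, total downtime, and MTTR for a set of incidents."""
    total = sum((i.restored - i.started for i in incidents), timedelta())
    count = len(incidents)
    # MTTR is the average time to restore service; zero when there were no outages.
    mttr = total / count if count else timedelta()
    return {"outages": count, "total_downtime": total, "mttr": mttr}


# Illustrative data only.
incidents = [
    Incident(datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 0)),    # 60 minutes
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30)),  # 30 minutes
]
metrics = outage_metrics(incidents)
print(metrics["outages"], metrics["total_downtime"], metrics["mttr"])
```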

2 Basic

Practices
- Establish dynamic normal profiles for each metric to minimise the risk of alarm storms.
- Each profile should examine deviations over multiple time horizons (e.g. hourly, daily, weekly, monthly) to determine whether a deviation is significant.
- Define a series of SLAs for each of the infrastructure components.
- Agree an escalation path for any given level of deviation for each component type.
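
One way to read "dynamic normal profile" is a baseline that varies with time of day rather than a single static threshold. The sketch below assumes that interpretation and uses a simple mean and standard deviation per hour of day; the function names and the sample data are hypothetical, and a real profile would likely also account for day of week and longer horizons.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_profile(samples):
    """Build a per-hour-of-day baseline (mean, std dev) from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {h: (mean(v), pstdev(v)) for h, v in by_hour.items()}


def deviation(profile, hour, value):
    """Return how many standard deviations `value` is from the baseline for `hour`."""
    mu, sigma = profile[hour]
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


# Illustrative CPU utilization history: busy at 09:00, quiet at 03:00.
history = [(9, 70), (9, 74), (9, 72), (9, 76), (3, 10), (3, 12), (3, 14), (3, 12)]
profile = build_profile(history)

# 75% CPU is close to normal at 09:00 but far outside the profile at 03:00,
# a distinction that a single static threshold would not make.
print(round(deviation(profile, 9, 75), 2))
print(round(deviation(profile, 3, 75), 2))
```

Because alerts are raised on deviation from the expected value for that hour, routine daily peaks do not trigger alarms, which is how a profile of this kind helps reduce alarm storms.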
Outcome
Capability to get visibility of response times, availability, capacity, utilization and cost for each infrastructure category.
Metrics
- IT component availability
- IT component response times
- IT component capacity (storage capacity, %CPU utilization, bandwidth capacity, …)
- Cost per discrete component
- % IT infrastructure covered
- MTTR
Practices
- Implement monitoring processes to identify when actual performance deviates from the dynamic normal profile.
- Escalate if the level of deviation breaches the corresponding SLA.
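
An agreed escalation path can be expressed as an ordered table of deviation thresholds. This is a sketch under stated assumptions: the thresholds (in standard deviations from the normal profile) and the escalation targets are invented for illustration, and in practice each component type would have its own table derived from its SLA.

```python
# Hypothetical escalation path for one component type: (minimum deviation in
# standard deviations, who is notified). Ordered from most to least severe.
ESCALATION_PATH = [
    (4.0, "incident manager"),
    (3.0, "on-call engineer"),
    (2.0, "operations dashboard"),
]


def escalate(deviation_sigma: float):
    """Return the escalation target for a deviation, or None if within normal range."""
    for threshold, target in ESCALATION_PATH:
        # Deviations in either direction count, so compare the absolute value.
        if abs(deviation_sigma) >= threshold:
            return target
    return None


for d in (1.5, 2.5, -3.2, 10.0):
    print(d, escalate(d))
```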
Outcome
Drive continuous improvements in infrastructure performance metrics, leading to better overall predictability.
Metrics
- Availability trend by category
- Capacity utilization trend by category
- Response times trend by category
- Number of IT component SLA breaches
- MTTR
- MTBF
- Variance from IT budget
- NOTE: Database server availability must exceed 99.97% during core hours
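
MTTR, MTBF, and availability are related, and an availability target such as the 99.97% figure above can be checked directly from outage durations. The sketch below assumes an observation period measured in hours and a list of outage durations; the `reliability` function and the sample figures are hypothetical. It uses the standard steady-state relation availability = MTBF / (MTBF + MTTR).

```python
def reliability(period_hours: float, downtime_hours: list[float]) -> dict:
    """Compute MTTR, MTBF, and availability over an observation period."""
    failures = len(downtime_hours)
    total_down = sum(downtime_hours)
    uptime = period_hours - total_down
    if failures == 0:
        # No failures observed: MTBF is at least the whole period.
        return {"mttr": 0.0, "mtbf": period_hours, "availability": 1.0}
    mttr = total_down / failures   # mean time to repair
    mtbf = uptime / failures       # mean operating time between failures
    return {"mttr": mttr, "mtbf": mtbf, "availability": mtbf / (mtbf + mttr)}


TARGET = 0.9997  # database server availability target from the note above

# Illustrative: 720 core hours in the period, two outages of 0.1 hours each.
result = reliability(720.0, [0.1, 0.1])
print(f"{result['availability']:.5f}", result["availability"] > TARGET)
```

At this target, a 720-hour period allows roughly 13 minutes of downtime in total, so a single half-hour outage would breach it.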

3 Intermediate

Practice
Establish a dynamic normal profile for each IT service in terms of its performance levels and the behaviour of the underlying infrastructure metrics (e.g. transaction times, available memory, CPU utilization).
Outcome
Capability to get a business-understandable view of the availability, capacity, response times, utilization and cost of each IT service, and how the performance of each infrastructure component contributes to these metrics (e.g. end-to-end service response times).
Metrics
- Service availability (trended)
- Service utilization (trended)
- Service capacity (trended)
- Service latency (trended)
- Service cost (trended)
- % IT Services covered
- Service MTTR
- Service MTBF
- Cost per IT Service vs Industry Benchmark
- Adoption of IT service metrics
Practices
- Compare the behaviour of each IT service against its dynamic normal profile, examining deviations over multiple time horizons (e.g. hourly, daily, weekly, monthly) to determine whether a deviation is significant or falls within normal operating volatility.
- Escalate to relevant service owners and technology groups where deviation is significant so that appropriate remedial action can be undertaken.
- N.B. A change from the normal profile does not necessarily indicate that a problem has occurred; in most instances it is an opportunity to investigate the changed behaviour and resolve any issue before users become aware of it.
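
Examining several time horizons matters because a brief spike and a sustained drift look different depending on the window. The following sketch assumes hourly samples of a service metric and a known baseline mean and standard deviation; the horizon sizes, the threshold of 3, and the function names are illustrative assumptions. It also treats samples as independent, which is a simplification.

```python
from statistics import mean

# Horizons expressed as a number of hourly samples (assumed sampling interval).
HORIZONS = {"hourly": 1, "daily": 24, "weekly": 168}


def horizon_deviations(series, baseline_mean, baseline_std):
    """Return, per horizon, the z-score of the most recent window's average."""
    out = {}
    for name, n in HORIZONS.items():
        if len(series) < n:
            continue  # not enough data for this horizon yet
        window_avg = mean(series[-n:])
        # The average of n samples varies less than a single sample,
        # so the standard deviation is scaled by sqrt(n).
        out[name] = (window_avg - baseline_mean) / (baseline_std / n ** 0.5)
    return out


def is_significant(devs, threshold=3.0):
    """Flag a deviation when any horizon meets or exceeds the threshold."""
    return any(abs(z) >= threshold for z in devs.values())


# Response times in ms against a baseline of 200 ms with std dev 20 ms.
spike = [200.0] * 167 + [230.0]        # one slow hour
drift = [200.0] * 144 + [215.0] * 24   # a full day that is modestly slower

print(is_significant(horizon_deviations(spike, 200.0, 20.0)))
print(is_significant(horizon_deviations(drift, 200.0, 20.0)))
```

In this example the single slow hour stays within normal volatility on every horizon, while the day-long drift is flagged on the daily horizon even though no individual hour looks unusual.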
Outcomes
- Empirical Modelling:
  - Improved service metrics and predictability within the IT environment.
  - Improved user experience (through the potential to mitigate infrastructure threats before users are affected).
- Monitoring:
  - Increased availability.
  - Reduction in the time taken to resolve IT service issues and reduction in the number of recurring issues.
  - Improved user experience.
Metrics
- Empirical Modelling:
  - Service response times (trended)
  - Service availability (trended)
  - Service capacity (trended)
  - Service latency (trended)
  - Service utilization (trended)
  - Number of SLA breaches
  - Theoretical Weighted Business Impact (estimate of hours lost due to service outages/shortages)
- Monitoring:
  - Availability
  - MTTR
  - MTBF
  - Number of issues (by incident type)
  - IT component metrics (actual vs SLA)
  - IT service metrics (actual vs SLA)
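
The Theoretical Weighted Business Impact metric is described only as an estimate of hours lost, so its exact formula is not given here. The sketch below shows one plausible weighting, stated as an assumption: each outage or shortage contributes its duration multiplied by the number of users affected and by a degradation factor. The `ServiceEvent` record, the factor, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceEvent:
    """Hypothetical record of an outage or shortage affecting one IT service."""
    duration_hours: float
    users_affected: int
    degradation: float  # 1.0 = full outage; 0.25 = about a quarter of productivity lost


def weighted_business_impact(events):
    """Estimate user-hours lost, weighting each event by reach and severity."""
    return sum(e.duration_hours * e.users_affected * e.degradation for e in events)


# Illustrative data only.
events = [
    ServiceEvent(duration_hours=2.0, users_affected=100, degradation=1.0),   # full outage
    ServiceEvent(duration_hours=4.0, users_affected=50, degradation=0.25),   # slow service
]
print(weighted_business_impact(events))
```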

4 Advanced

Practice
Establish dynamic normal profile for each business process in terms of its expected performance levels and the behaviour of the underlying IT services.
Outcomes
- Abstracted view of IT services defined in terms of the business processes delivered, with an understanding of how each business process is performing.
- Ability to monitor IT's contribution to business process performance via metrics such as business process availability and response times.
Metrics
- % Business processes mapped to IT services & monitored
- Industry benchmark cost per IT-enabled business process
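
Profiling a business process in terms of its underlying IT services presupposes a mapping between the two. The sketch below assumes a simple dictionary mapping and counts a process as covered only if it is mapped and every service it depends on is monitored; the process names, service names, and that definition of coverage are illustrative assumptions.

```python
# Hypothetical mapping of business processes to the IT services they depend on.
# An empty list means the process has not yet been mapped.
PROCESS_SERVICES = {
    "order-to-cash": ["web-shop", "payments", "erp"],
    "procure-to-pay": ["erp"],
    "hire-to-retire": [],
    "customer-support": ["crm", "telephony"],
}
MONITORED_SERVICES = {"web-shop", "payments", "erp", "crm"}


def coverage_percent(process_services, monitored):
    """Percentage of processes that are mapped and whose services are all monitored."""
    if not process_services:
        return 0.0
    covered = sum(
        1
        for services in process_services.values()
        if services and all(s in monitored for s in services)
    )
    return 100.0 * covered / len(process_services)


print(coverage_percent(PROCESS_SERVICES, MONITORED_SERVICES))
```

Here two of the four processes are covered: one is unmapped and one depends on a service that is not monitored.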
Practices
- Compare the behaviour of each business process against its dynamic normal profile, examining deviations over multiple time horizons to determine the relative importance of each deviation.
- Escalate all significant deviations to relevant business process owners, IT service owners and technology groups so that appropriate remedial action can be undertaken.
Outcomes
- Provides insight into Actual vs Planned SLA performance (and thereby identifies areas where IT is hampering business performance).
- Drives continuous improvement in business process performance, helping to identify where IT investments will be most beneficial.
- NOTE: By understanding how changes in business demand impact on IT services, IT is better placed to divert scarce IT resources from one IT service to another.
Metrics
- IT SLAs breached
- Business process SLAs breached
- Business process hours lost (or number of transactions lost)
- Transactions handled per IT dollar
- Business process availability
- Business process response times
- Business process utilization