Performance Monitoring
Measure, record, and track various service performance indicators for IT infrastructure, end-to-end IT services, business processes, and the organization (based on empirical models).
Improvement Planning
Practices-Outcomes-Metrics (POM)
Representative POMs are described for Performance Monitoring at each level of maturity.
- 1 Initial
- Practice
- Use monitoring only to investigate after specific incidents have occurred.
- Outcome
- Reactive environment with minimal predictability and poor corrective actions, leading to budget overruns and/or over-investment (“sledgehammer to crack a nut”).
- Metrics
- Number of outages
- Total downtime
- Variance from planned BAU cost
- Helpdesk calls
- MTTR
- 2 Basic
- Practices
- Establish dynamic normal profiles for each metric to minimise the risk of alarm storms.
- Each profile should examine deviations over multiple time horizons (e.g. hourly, daily, weekly, monthly) to determine whether a deviation is significant.
- Define a series of SLAs for each of the infrastructure components.
- Agree an escalation path for any given level of deviation for each component type.
- Outcome
- Capability to get visibility of response times, availability, capacity, utilization and cost for each infrastructure category.
- Metrics
- IT component availability
- IT component response times
- IT component capacity (storage capacity, %CPU utilization, bandwidth capacity, …)
- Cost per discrete component
- % IT infrastructure covered
- MTTR
- Practices
- Implement monitoring processes to identify when actual performance deviates from the dynamic normal profile.
- Escalate if the level of deviation breaches the corresponding SLA.
- Outcome
- Drive continuous improvements in infrastructure performance metrics, leading to overall better predictability.
- Metrics
- Availability trend by category
- Capacity utilization trend by category
- Response times trend by category
- Number of IT component SLA breaches
- MTTR
- MTBF
- Variance from IT budget
- NOTE: Database server availability must exceed 99.97% during core hours
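The Level 2 practices above (a dynamic normal profile per metric, examined over multiple time horizons) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration rather than part of the framework: the horizon names and window sizes, the use of a z-score as the measure of deviation, the threshold of 3, and the rule that a deviation counts as significant only when it stands out on every horizon (one possible way to reduce alarm storms).

```python
from statistics import mean, pstdev

# Hypothetical horizons, expressed as a number of hourly samples.
HORIZONS = {"daily": 24, "weekly": 24 * 7}


def z_scores(history, current, horizons=HORIZONS):
    """Return the z-score of `current` against each horizon's recent window."""
    scores = {}
    for name, window in horizons.items():
        samples = history[-window:]
        if len(samples) < 2:
            continue  # not enough data to build a profile for this horizon
        mu, sigma = mean(samples), pstdev(samples)
        if sigma == 0:
            # A flat window has no spread; any change is treated as extreme.
            scores[name] = 0.0 if current == mu else float("inf")
        else:
            scores[name] = (current - mu) / sigma
    return scores


def is_significant(scores, threshold=3.0):
    """Assumed rule: significant only if every horizon exceeds the threshold."""
    return bool(scores) and all(abs(z) > threshold for z in scores.values())
```

In use, `history` would hold recent hourly readings of one metric (for example %CPU utilization) and `current` the latest reading. A reading that looks unusual against the last day but ordinary against the last week would not be escalated under this assumed rule.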
- 3 Intermediate
- Practice
- Establish dynamic normal profile for each IT service in terms of its performance levels and the behaviour of the underlying infrastructure metrics (e.g. transaction times, available memory, CPU Utilization).
- Outcome
- Capability to get business-understandable view of the availability, capacity, response times, utilization and cost of each IT service, and how the performance of each infrastructure component contributes to these metrics (e.g. end-to-end service response times).
- Metrics
- Service availability (trended)
- Service utilization (trended)
- Service capacity (trended)
- Service latency (trended)
- Service cost (trended)
- % IT Services covered
- Service MTTR
- Service MTBF
- Cost per IT Service vs Industry Benchmark
- Adoption of IT service metrics
- Practices
- Compare the behaviour of each IT service against its dynamic normal profile, examining deviations over multiple time horizons (e.g. hourly, daily, weekly, monthly) to determine whether a deviation is significant or falls within normal operating volatility.
- Escalate to relevant service owners and technology groups where deviation is significant so that appropriate remedial action can be undertaken.
- N.B. A change from the normal profile does not necessarily indicate that a problem has occurred. In most instances it is an opportunity to investigate the changed behaviour and resolve it before users become aware of any issue.
- Outcomes
- Empirical Modelling:
- Improved service metrics and predictability within the IT environment.
- Improved user experience (infrastructure threats can potentially be mitigated before users are affected).
- Monitoring:
- Increased availability
- Reduction in the time taken to resolve IT service issues and reduction in the number of recurring issues.
- Improved user experience.
- Metrics
- Empirical Modelling:
- Service response times (trended)
- Service availability (trended)
- Service capacity (trended)
- Service latency (trended)
- Service utilization (trended)
- Number of SLA breaches
- Theoretical Weighted Business Impact (Estimate of hours lost due to service outages/shortages).
- Monitoring:
- Availability
- MTTR
- MTBF
- Number of issues (by incident type)
- IT component metrics (actual vs SLA)
- IT service metrics (actual vs SLA)
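Availability, MTTR and MTBF recur in the metric lists above. As a rough sketch of how they might be derived from a log of outages, the Python below uses the common definitions (MTTR as total downtime divided by the number of outages, MTBF as total uptime divided by the number of outages). The `Outage` record, the `reliability_metrics` function and the hour-based time scale are hypothetical names and choices for this example, not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since the start of the observation period
    end_hour: float


def reliability_metrics(outages, period_hours):
    """Compute availability, MTTR and MTBF for one service or component."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = period_hours - downtime
    n = len(outages)
    return {
        "availability_pct": 100.0 * uptime / period_hours,
        # With no outages there is nothing to repair and no observed failure.
        "mttr_hours": downtime / n if n else 0.0,
        "mtbf_hours": uptime / n if n else float("inf"),
    }
```

The resulting availability figure can then be compared with the agreed SLA target to produce the "actual vs SLA" metrics listed above.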
- 4 Advanced
- Practice
- Establish dynamic normal profile for each business process in terms of its expected performance levels and the behaviour of the underlying IT services.
- Outcomes
- Abstracted view of IT services defined in terms of the Business processes delivered, with an understanding of how each Business process is performing.
- Ability to monitor IT contribution to business process performance via metrics such as business process availability and response times.
- Metrics
- % Business processes mapped to IT services & monitored
- Industry benchmark cost per IT-enabled business process
- Practices
- Compare the behaviour of each business process against its dynamic normal profile, examining deviations over multiple time horizons to determine the relative importance of each deviation.
- Escalate all significant deviations to relevant business process owners, IT service owners and technology groups so that appropriate remedial action can be undertaken.
- Outcomes
- Provides insight into Actual vs Planned SLA performance (and thereby identifies areas where IT is hampering business performance).
- Drives continuous improvement in business process performance, helping to identify where IT investments will be most beneficial.
- NOTE: By understanding how changes in business demand impact IT services, the IT function is better placed to divert scarce IT resources from one IT service to another.
- Metrics
- IT SLAs breached
- Business process SLAs breached
- Business process hours lost (or number of transactions lost)
- Transactions handled per IT dollar
- Business process availability
- Business process response times
- Business process utilization
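The Level 4 practices depend on knowing which IT services underpin each business process, both to report the "% Business processes mapped to IT services & monitored" metric and to decide whom to notify about a significant deviation. The Python sketch below shows one way this could look. The process names, service names, the mapping itself and the owner label format are all invented for illustration, and the coverage rule (a process counts only if it has at least one mapped service and all of its services are monitored) is an assumption.

```python
# Hypothetical mapping of business processes to the IT services they rely on.
PROCESS_TO_SERVICES = {
    "order-to-cash": ["web-shop", "payments"],
    "payroll": ["hr-system"],
    "procurement": [],  # not yet mapped to any IT service
}
# Hypothetical set of IT services that currently have monitoring in place.
MONITORED_SERVICES = {"web-shop", "payments"}


def coverage_pct(mapping, monitored):
    """Share of processes that are mapped and whose services are all monitored."""
    covered = [
        process
        for process, services in mapping.items()
        if services and all(s in monitored for s in services)
    ]
    return 100.0 * len(covered) / len(mapping) if mapping else 0.0


def escalation_targets(process, mapping):
    """Owners to notify about a significant deviation in `process`."""
    targets = [f"process-owner:{process}"]
    targets += [f"service-owner:{s}" for s in mapping.get(process, [])]
    return targets
```

Technology groups, which the practice also names as escalation recipients, are left out here for brevity. They could be added through a second mapping from IT services to the groups that support them.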