
4 critical components of successful IT metrics and reporting with Nikki Nguyen

Let’s do the numbers

In IT, we love to measure and report. We just can't help ourselves. But in our effort to track every statistic possible, we often lose focus. So let's change that. Let's start asking questions like: Who will use the metrics? Why do we need them? Are we setting performance goals that reinforce the goals of our business, or could we even be working against them? Today, we'll look at four very practical guidelines for measuring and reporting on IT performance, and for setting the right goals from the start.

1: Make sure IT performance goals jibe with your business goals

I recently opened a ticket online with a hardware vendor to ask about repair service. They responded quickly, and answered many (but not all) of my questions. Most concerning, though, was the email that I received a few minutes later: “Your ticket has been successfully resolved.”

Had it? Says who? While I appreciated the fast response, my issue had not, in fact, been resolved. Did someone close the ticket just so they could say it had been closed? The front-line support team was clearly being evaluated on time-per-ticket, on the percentage of tickets successfully resolved, or both.

Certainly, time-per-ticket and percentage of tickets resolved are legitimate measurements for IT operations. But what about the underlying problem I reported? If you’re not tracking at the incident and problem level (to look for common, overarching problems and a high volume of incidents associated with them), you’re missing an opportunity to help your business solve problems proactively instead of just reacting to them. More importantly, what about customer satisfaction? I didn’t feel my issue had been resolved. Now, I had to open another ticket and waste more of my own time. I grew frustrated. I gave up on the product.

In their haste to meet operational performance metrics, they lost sight of far more important business goals: keep customers happy, and encourage referrals and repeat business.

To avoid this trap in your own organization, look for ways to set meaningful goals and measurements that encourage behavior in line with company and organization-wide goals. Incentivizing a low-level support team to close or escalate tickets quickly can actually cost the company more, and HDI even has the math to prove it:

[Image: HDI's per-ticket cost breakdown by support level, showing how costs multiply with each escalation. Source: HDI]

So encourage your Level 1 support team to spend a bit longer collecting information before escalating, and give them the training and resources they need to be more effective at resolving tickets, not just triaging them. The savings add up quickly.
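To see why, here's a minimal back-of-the-envelope sketch. The per-ticket costs and resolution rates below are hypothetical placeholders, not HDI's actual figures:

    # Hypothetical per-ticket support costs by tier (not HDI's actual figures)
    COST = {"L1": 20.0, "L2": 75.0, "L3": 200.0}

    def monthly_cost(tickets, l1_resolve_rate, l2_resolve_rate):
        """Estimate monthly support cost from the resolution rate at each tier."""
        l1 = tickets                               # every ticket touches Level 1
        l2 = l1 * (1 - l1_resolve_rate)            # escalated past Level 1
        l3 = l2 * (1 - l2_resolve_rate)            # escalated past Level 2
        return l1 * COST["L1"] + l2 * COST["L2"] + l3 * COST["L3"]

    # Rushing tickets off Level 1 (40% resolved) vs. investing in Level 1 (70%)
    print(monthly_cost(1000, 0.40, 0.80))  # 89000.0
    print(monthly_cost(1000, 0.70, 0.80))  # 54500.0

With numbers like these, raising first-level resolution from 40% to 70% saves tens of thousands per month on the same ticket volume.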

2: Share different metrics with different stakeholders

Have you ever sat through one of those torturous meetings where a parade of managers each deliver ten slides of key accomplishments and metrics for the quarter? The reason those meetings are so painful is simple: the reports lack context, and they aren't relevant to you. There are two primary reasons to tailor your reports to the individual stakeholder you're sharing them with:

  • To give stakeholders the information they need to do their own jobs better.
  • To keep them from meddling.

The first is pretty obvious. Different stakeholders care about different things: a front-line IT manager cares deeply about technical performance data, while a CTO cares much more about the bigger picture. Avoid distributing generic, tell-all reports to large audiences. Instead, meet with your key stakeholders and agree on the right measurements to help them achieve their goals.

The second is less obvious, but equally important. People love to meddle. We all do. I’ve watched a very senior IT executive review a very low-level list of unresolved IT incidents. He didn’t need that data. In fact, he had directors and managers he completely trusted to achieve the goals he had put in place. Once he had the data in front of him, he couldn’t help but ask questions and get involved. Distraction ensued.

The moral? Don’t include data for data’s sake. Yes, you need to be completely transparent about your performance, what you’re doing well, and how you can improve. But you don’t want to give the entire sink to every person who asks for a drink of water.

3: Use visuals to make reports easier to understand

Excel spreadsheets full of raw data aren’t very effective as report-outs to your team members, peers, and leadership, because they require the viewer to interpret the data.

Fortunately, adding context to the data isn’t always so hard if you are already using a strong reporting dashboard. You want to provide clean, crisp, and easily understood reports that provide enough context to quickly communicate how you are doing against your goals, your benchmarks, and your history.

[Image: daily report showing the top 10 issue types over the last 24 hours]

For practitioners and front-line managers, consider using daily reports to show the top 10 issue types over the last 24 hours. They're easy to read and understand, and can help your staff quickly home in on any emerging categories that may be growing in volume.
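If your dashboard can't produce this view yet, it's easy to approximate from a raw ticket export. Here's a minimal Python sketch, assuming a hypothetical CSV export with "created" (ISO 8601) and "issue_type" columns:

    import csv
    from collections import Counter
    from datetime import datetime, timedelta

    # Count the top 10 issue types opened in the last 24 hours,
    # read from a hypothetical ticket export (tickets.csv).
    cutoff = datetime.now() - timedelta(hours=24)
    counts = Counter()

    with open("tickets.csv", newline="") as f:
        for row in csv.DictReader(f):
            if datetime.fromisoformat(row["created"]) >= cutoff:
                counts[row["issue_type"]] += 1

    for issue_type, n in counts.most_common(10):
        print(f"{issue_type}: {n}")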

[Image: trending report of issues created and resolved by month]

Trending reports can be even more helpful, because you can compare your performance over a period of time, and look for any anomalies that might be worth exploring further. If you looked at your time-to-resolution data in a vacuum each month, you would never notice that July and August showed a strong upward climb in the number of issues opened.

What caused that influx of new issues? Was a new software revision released? Did you ship a new product? Why were nearly a third of July's issues unresolved, when in most months the percentage was much lower? It's important to look at the entire picture, and to understand the data you are looking at (and, if possible, what caused it) before you share reports and discuss results.

4: Keep a scorecard

When a store clerk or passerby asks how you are feeling, it's customary to respond briefly with "I'm fine" or "A bit tired today." It's a quick way to summarize how you feel without giving a blow-by-blow account of every event over the last month that has led up to it.

The same principle should apply in IT metrics and reporting. If you’re not using a scorecard as a simple, high-level way to both evaluate and communicate your team’s performance, it’s time to start now. An effective scorecard will include the objective or measurement you are scoring yourself against, and an easy “traffic light” system to indicate your current progress: red (at risk), yellow (caution), or green (good).

The most important thing about a scorecard is to be honest. Nobody performs perfectly at all times, so giving yourself a green smiley across every category at every reporting interval will likely cause more alarm and disbelief than praise. Plus, when something truly does go wrong, you are more likely to get support and understanding if you have been candidly assessing your performance and flagging the areas that are putting you at risk.

A basic scorecard for operational performance might look something like this, and is a great way to quickly update stakeholders without burying them in unnecessary technical data.
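For example (the objectives and numbers here are purely illustrative):

    Objective                            Target        Actual       Status
    Mean time to resolution              < 8 hours     6.2 hours    Green
    Tickets resolved at first contact    > 70%         64%          Yellow
    Customer satisfaction                > 4.5 / 5     4.6 / 5      Green
    Unplanned downtime this month        < 30 min      95 min       Red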


More advanced scorecards, like balanced scorecards, can measure IT’s contribution to larger business goals, and are effective at tracking the performance across entire organizations and companies.

Putting it all to use

The above are just guiding principles to help you zero in on what you want to report, and how. To learn more about implementing SLAs and metrics in JIRA Service Desk, watch Lucas Dussurget's killer presentation from Atlassian Summit 2014. It's full of our own top tricks, examples, and best practices based on tons of customer implementations. And for a deep dive on figuring out what you should be measuring, be sure to check out another excellent presentation from Summit 2014, this one by John Custy.

 

This article was originally published on the Atlassian website.

 

ABOUT THE AUTHOR

Nikki Nguyen

Associate Product Marketing Manager, JIRA Service Desk

Although my life in IT is behind me, it’s not too far away. I’m now a recovering systems administrator evangelizing the way teams work by using JIRA Service Desk. I’ve found a love of combining customer service with technology.

 

Nikki is presenting at Service Management 2015.

 

 

 


Love DevOps? Wait 'til you meet SRE – with guest blogger Patrick Hill


Site Reliability Engineering may be the most important acronym you’ve never heard of – here’s why.

You may have heard of a little company called Google. They invent cool stuff like driverless cars and elevators into outer space. Oh, and they develop massively successful applications like Gmail, Google Docs, and Google Maps. It's safe to say they know a thing or two about successful application development, right?

They’re also the pioneers behind a growing movement called Site Reliability Engineering (SRE). SRE effectively ends the age-old battles between Development and Operations. It encourages product reliability, accountability, and innovation – minus the hallway drama you’ve come to expect in what can feel like Software Development High School.

How? Let’s look at the basics.

What in the world is SRE?

Google’s mastermind behind SRE, Ben Treynor, still hasn’t published a single-sentence definition, but describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.”

The underlying problem goes like this: Dev teams want to release awesome new features to the masses, and see them take off in a big way. Ops teams want to make sure those features don’t break things. Historically, that’s caused a big power struggle, with Ops trying to put the brakes on as many releases as possible, and Dev looking for clever new ways to sneak around the processes that hold them back. (Sounds familiar, I’d wager.)

SRE removes the conjecture and debate over what can be launched and when. It introduces a mathematical formula for green- or red-lighting launches, and dedicates a team of people with Ops skills (appropriately called Site Reliability Engineers, or SREs) to continuously oversee the reliability of the product. As Google's own SRE Andrew Widdowson describes it, "Our work is like being a part of the world's most intense pit crew. We change the tires of a race car as it's going 100 mph."

Doesn’t sound revolutionary yet? Much of the magic is in how it works. Here are some of the core principles – which also happen to be some of the biggest departures from traditional IT operations.

First, new launches are green-lighted based on current product performance.

Most applications don’t achieve 100% uptime. So for each service, the SRE team sets a service-level agreement (SLA) that defines how reliable the system needs to be to end-users. If the team agrees on a 99.9% SLA, that gives them an error budget of 0.1%. An error budget is exactly as it’s named: it’s the maximum allowable threshold for errors and outages.

ProTip: You can easily convert SLAs into “minutes of downtime” with this cool uptime cheat sheet.
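If you don't have the cheat sheet handy, the conversion is simple arithmetic. A minimal sketch:

    # Allowed downtime for a given SLA, over a given period.
    def allowed_downtime_minutes(sla_percent, period_days):
        return period_days * 24 * 60 * (1 - sla_percent / 100)

    print(allowed_downtime_minutes(99.9, 30))   # ~43.2 minutes per month
    print(allowed_downtime_minutes(99.9, 365))  # ~525.6 minutes per year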

Here’s the clincher: The development team can “spend” this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.

The genius? Both the SREs and developers have a strong incentive to work together to minimize the number of errors.
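Conceptually, the gate is a single comparison. Here's a simplified sketch (real SRE teams track the budget per service over a rolling window, and the "error" metric depends on what the SLA actually counts):

    # Simplified launch gate: freeze launches once the error budget is spent.
    def launches_allowed(total_requests, failed_requests, sla_percent):
        error_budget = total_requests * (1 - sla_percent / 100)
        return failed_requests < error_budget

    # 10M requests at a 99.9% SLA -> a budget of 10,000 failed requests
    print(launches_allowed(10_000_000, 4_200, 99.9))   # True: keep shipping
    print(launches_allowed(10_000_000, 11_500, 99.9))  # False: launches frozen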

SREs can code, too

In the old model, you throw people at a reliability problem and keep pushing (sometimes for a year or more) until the problem either goes away or blows up in your face.

Not so in SRE. The development and SRE teams share a single staffing pool, so for every SRE hired, one fewer developer headcount is available (and vice versa). This ends the never-ending headcount battle between Dev and Ops, and creates a self-policing system where developers are rewarded with more teammates for writing better-performing code (i.e., code that needs fewer SREs to support it).


SRE teams are staffed entirely with rock-star developer/sysadmin hybrids who not only know how to find problems, but how to fix them, too. They interface easily with the development team, and as code quality improves, they are often moved to the development team if fewer SREs are needed on a project.

In fact, one of the core principles mandates that SREs spend no more than 50% of their time on Ops work. As much of their remaining time as possible should be spent writing code and building systems that improve performance and operational efficiency.

Developers get their hands dirty, too

At Google, Ben Treynor had to fight for this clause, and he's glad he did. In fact, in his great keynote on SRE at SREcon14, he emphasizes that getting this commitment from your executives before you launch SRE should be considered mandatory.

Basically, the development team handles 5% of all operations workload (handling tickets, providing on-call support, etc.). This allows them to stay closely connected to their product, see how it is performing, and make better coding and release decisions.

In addition, any time the operations load exceeds the capacity of the SRE team, the overflow always gets assigned to the developers. When the system is working well, the developers begin to self-regulate here as well, writing strong code and launching carefully to prevent future issues.

SREs are free agents (and can be pulled, if needed)

To make sure teams stay healthy and happy, Treynor recommends allowing SREs to move to other projects as they desire, or even to a different organization. SRE encourages highly motivated, dedicated, and effective teamwork – so no team member should be held back from pursuing his or her own personal objectives.

If an entire team of SREs and developers simply can't get along and is creating more trouble than reliable code, there's a final, drastic measure you can take: pull the entire SRE team off the project, and assign all of the operations workload directly to the development team. Treynor has only done this a couple of times in his entire career, and the threat is usually enough to bring both teams around to a more positive working relationship.

There's quite a bit more to SRE than I can cover in one article – like how SRE prevents production incidents, how on-call support teams are staffed, and the rules they follow on each shift.

Our take

IT is full of buzzwords and trends, to be sure. One minute it’s cloud, the next it’s DevOps or customer experience or gamification. SRE is in a strong position to become much more than that, particularly since it is far more about the people and process than the technology that underlies them. While technology certainly can (and likely will) adapt to the concept as it matures and more teams adopt it, you don’t need new tools to align your development and operations organizations around the principles of Site Reliability Engineering.

In future articles, we'll look at just that: practical steps for moving toward SRE, and the role technology can play.

 

This article was originally published on the Atlassian website.


Patrick Hill, Site Reliability Engineer, has been with Atlassian a while now, and recently transferred from Sydney to our Austin office. (G'day, y'all!) In his free time, he enjoys taking his beard from "distinguished professor" to "lumberjack" and back again. Find him on Twitter: @topofthehill

Patrick’s colleagues Sam Jebeile and Nick Wright will be discussing SRE in depth at Service Management 2015.