Monthly Archives: July 2015

Service Management in an as-a-service world – Part 1

With guest blogger Ian Krieger

Why moving to the cloud can give you more control, not less.

What are the opportunities and challenges for the IT service management team in a world where more applications are moving into the cloud, offered as subscription services, from a multitude of vendors? Can you keep control and visibility?

Recently I led a discussion at an itSMF Special Interest Group meeting about IT service management in an “as-a-Service” world – a world where the way IT is procured, delivered and consumed has fundamentally changed with the advent of cloud computing. Not that cloud computing is new by any means – particularly in smaller organisations – but it is now becoming more and more prevalent in large enterprises. Or at least it is expected to be…

While there has been a lot of hype around “the cloud”, what became apparent at the meeting is that most information is targeted at the executives in high level overviews, or at techies in great technical detail.

Meanwhile, the IT service management team has been left in the cold. There is little clear direction on “how to” or “where to start” and too much hype versus fact. Yet it is the service management team who often has the responsibility to “make it happen”.

In our discussion, which included IT service management professionals from government, financial services and IT vendors, the concerns/queries about service management in a cloud environment were startlingly consistent across industry sectors:

  • What is the best way to monitor and report service delivery?
  • How have other organisations done it?
  • What is hybrid cloud and how do you manage it?
  • How do you manage service integration across multiple vendors?

The Australian Government defines cloud computing as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Interestingly, the itSMF group viewed cloud as a commercial model for delivering IT, rather than a technology. And the overriding concern is that these services are not in their control.

So how does cloud impact the policies, processes and procedures service management uses to plan, deliver, operate and control IT services offered to end-users?

For me it comes down to recognising that while traditional IT procurement has changed, you can still be in control; defining a clear – but flexible – business map for how the technology, processes and people will support the business; and ensuring transparency across multiple vendors.

New Ways of IT Procurement Don’t Have to Mean You Lose Control

Much of the fear of losing control comes from the feeling that IT departments are relinquishing control to third parties because they no longer own the IT and can’t see, touch or grab it. Yet in many ways they have more control than ever: it is easier to increase or decrease capacity quickly in response to changes in your organisation or the market in which it operates. And if you choose the right vendor, they should provide you with regularly updated, innovative solutions and contracted service levels, rather than locking you into a technology that starts to age as soon as you implement it.

Of course it’s not a simple matter of moving everything into the cloud. Sometimes legislative requirements will dictate where data can be stored or who has access to it, which may force an application to be insourced. Or it may depend on the maturity of an organisation’s approach to IT – an immature organisation may refuse to outsource simply because it is fearful of doing so, whereas a mature organisation is open to pushing risk outside its boundaries.

And not all clouds are the same. A private cloud is used by a single organisation. A community cloud is for the exclusive use of a specific community of consumers with shared concerns (e.g. security requirements or mission). A public cloud is for open use by the general public. And a hybrid cloud comprises multiple distinct cloud infrastructures (private, community or public). Whilst the debate over public vs. private cloud services rages on, each of these models has a place, depending on an organisation’s needs and maturity.

This feeling of a loss of control can be exacerbated by departments choosing their own systems, easily bought and delivered over the Internet. However, this “shadow IT” should not be feared – instead it should be seen as an indicator that the IT department is not delivering what the business needs. This is why business mapping is so important.

Part 2 of this blog will cover why business mapping is critical to ensuring IT and Service Management truly support the business and how to get started.

_________________________________________________________________________

Ian Krieger is the Chief Architect for Unisys Asia Pacific & Japan. He has worked in the IT industry for more than 20 years. He has helped organisations throughout the region understand how best to use services and technology to support their business goals and strategies. Ian is a technologist who prefers to look at the practical applications of technology as opposed to the “shiny”.

July 30th, 2015 | blog, cloud, guest blogger, ITSM, Service Management 2015, shadowIT, UNISYS

4 critical components of successful IT metrics and reporting with Nikki Nguyen

Let’s do the numbers

In IT, we love to measure and report. We just can’t help ourselves. But in our efforts to track every statistic possible, we often lose focus. So let’s change that. Let’s start asking questions like… Who will use the metrics? Why do we need them? Are we setting the right performance goals to reinforce the goals of our business–or could we even be working against them? Today, we’ll look at four very practical guidelines for measuring and reporting on IT performance, and for setting the right goals from the start.

1: Make sure IT performance goals jibe with your business goals

I recently opened a ticket online with a hardware vendor to ask about repair service. They responded quickly, and answered many (but not all) of my questions. Most concerning, though, was the email that I received a few minutes later: “Your ticket has been successfully resolved.”

Had it? Says who? While I appreciated the fast response, my issue had not, in fact, been resolved. Did someone close a ticket just so they could say it had been closed? The front line support team was clearly being evaluated on time-per-ticket, or percentage of tickets successfully resolved, or both.

Certainly, time-per-ticket and percentage of tickets resolved are legitimate measurements for IT operations. But what about the underlying problem I reported? If you’re not tracking at the incident and problem level (to look for common, overarching problems and a high volume of incidents associated with them), you’re missing an opportunity to help your business solve problems proactively instead of just reacting to them. More importantly, what about customer satisfaction? I didn’t feel my issue had been resolved. Now, I had to open another ticket and waste more of my own time. I grew frustrated. I gave up on the product.

In their haste to meet their operational performance metrics, they lost sight of much more important business goals: making customers happy and encouraging referrals and repeat business.

To avoid this trap in your own organization, look for ways to set meaningful goals and measurements that encourage behavior in line with company and organization-wide goals. Incentivizing a low-level support team to close or escalate tickets quickly can actually cost the company more, and HDI even has the math to prove it:

[Image: HDI cost-of-escalation analysis]

Source: HDI

So encourage your Level 1 support team to spend a bit longer collecting more information before escalating, and give them the training and resources they need to be more effective at resolving tickets, not just triaging them. The savings add up quickly.
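To see why, it helps to run the numbers for your own support organization. Below is a minimal Python sketch of the escalation economics; the per-ticket costs and escalation ratios are invented placeholders (not HDI’s published figures), so substitute data from your own operation.

```python
# Hypothetical per-ticket resolution costs by support tier.
# These are illustrative placeholders, not HDI's published figures.
COSTS = {"L1": 20.00, "L2": 90.00, "L3": 200.00}

def monthly_cost(tickets, l1_rate, l3_share=0.2):
    """Estimate monthly support spend.

    tickets  -- total tickets opened per month
    l1_rate  -- fraction resolved at Level 1 without escalation
    l3_share -- fraction of escalated tickets that reach Level 3
    """
    resolved_l1 = tickets * l1_rate
    escalated = tickets - resolved_l1
    resolved_l3 = escalated * l3_share
    resolved_l2 = escalated - resolved_l3
    return (resolved_l1 * COSTS["L1"]
            + resolved_l2 * COSTS["L2"]
            + resolved_l3 * COSTS["L3"])

# Raising first-level resolution from 60% to 75% on 1,000 tickets/month:
print(monthly_cost(1000, 0.60))  # 56800.0
print(monthly_cost(1000, 0.75))  # 43000.0
```

Even with made-up numbers, the shape of the result holds: every ticket that escalates costs several times what it would have cost to resolve at Level 1.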

2: Share different metrics with different stakeholders

Have you ever sat through one of those torturous meetings where one or more managers each deliver ten slides to share their key accomplishments and metrics for the quarter? The reason they are so torturous is simple: the reports lack context, and they aren’t relevant to you. There are two primary reasons you should tailor your reports to the individual stakeholder you are sharing them with:

  • To give stakeholders the information they need to do their own jobs better.
  • To keep them from meddling.

The first is pretty obvious. Different stakeholders care about different things: a front-line IT manager cares deeply about technical performance data, while a CTO cares much more about the bigger picture. Avoid distributing generic, tell-all reports to large audiences altogether, and instead, meet with your key stakeholders and agree on the right measurements to help them achieve their goals.

The second is less obvious, but equally important. People love to meddle. We all do. I’ve watched a very senior IT executive review a very low-level list of unresolved IT incidents. He didn’t need that data. In fact, he had directors and managers he completely trusted to achieve the goals he had put in place. Once he had the data in front of him, he couldn’t help but ask questions and get involved. Distraction ensued.

The moral? Don’t include data for data’s sake. Yes, you need to be completely transparent about your performance, what you’re doing well, and how you can improve. But you don’t want to give the entire sink to every person who asks for a drink of water.

3: Use visuals to make reports easier to understand

Excel spreadsheets full of raw data aren’t very effective as report-outs to your team members, peers, and leadership, because they require the viewer to interpret the data.

Fortunately, adding context to the data isn’t always so hard if you are already using a strong reporting dashboard. You want to provide clean, crisp, and easily understood reports that provide enough context to quickly communicate how you are doing against your goals, your benchmarks, and your history.

[Image: sample dashboard report]

For practitioners and front-line managers, consider using daily reports to show the top 10 issue types over the last 24 hours. They’re easy to read and understand, and can help your staff quickly home in on any emerging categories that may be growing in volume.
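If your service desk tool doesn’t produce this report out of the box, it’s easy to build from raw ticket data. The sketch below is illustrative only; the ticket structure is an assumption, not a JIRA Service Desk API.

```python
from collections import Counter
from datetime import datetime, timedelta

now = datetime.now()

# Hypothetical ticket records -- in practice, pull these from your
# service desk tool's API or a database export.
tickets = [
    {"category": "Password reset", "opened": now - timedelta(hours=2)},
    {"category": "VPN access",     "opened": now - timedelta(hours=5)},
    {"category": "Password reset", "opened": now - timedelta(hours=30)},
]

def top_issue_types(tickets, hours=24, n=10):
    """Count the most common issue categories opened in the last `hours`."""
    cutoff = datetime.now() - timedelta(hours=hours)
    recent = (t["category"] for t in tickets if t["opened"] >= cutoff)
    return Counter(recent).most_common(n)

for category, count in top_issue_types(tickets):
    print(f"{category}: {count}")  # e.g. "Password reset: 1"
```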

[Image: monthly trending report of issues opened and resolved]

Trending reports can be even more helpful, because you can compare your performance over a period of time, and look for any anomalies that might be worth exploring further. If you looked at your time-to-resolution data in a vacuum each month, you would never notice that July and August showed a strong upward climb in the number of issues opened.

What caused that influx of new issues? Was a new software revision released? Did you ship a new product? Why were nearly a third of July’s issues unresolved, when in most months the percentage was much lower? It’s important to look at the entire picture, and to understand the data you are looking at (and if possible, what caused it) before you share reports and discuss results.
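You don’t need sophisticated statistics to surface anomalies like these. As a rough sketch (with made-up monthly counts), simply flagging any month that runs well above its trailing average is enough to prompt the right questions:

```python
# Hypothetical monthly counts of issues opened -- replace with your own data.
monthly_opened = {"Mar": 120, "Apr": 115, "May": 130,
                  "Jun": 125, "Jul": 210, "Aug": 240}

def flag_anomalies(series, window=3, threshold=1.5):
    """Flag months whose volume exceeds `threshold` x the trailing average."""
    months, counts = list(series), list(series.values())
    flagged = []
    for i in range(window, len(counts)):
        trailing_avg = sum(counts[i - window:i]) / window
        if counts[i] > threshold * trailing_avg:
            flagged.append((months[i], counts[i], round(trailing_avg, 1)))
    return flagged

print(flag_anomalies(monthly_opened))
# [('Jul', 210, 123.3), ('Aug', 240, 155.0)]
```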

4: Keep a scorecard

When a store clerk or passerby asks you how you are feeling, it’s customary to respond briefly with “I’m fine” or “A bit tired today.” It’s a quick way to summarize how you are feeling, without giving them a blow-by-blow account of every event over the last month or more that has led up to how you are feeling today.

The same principle should apply in IT metrics and reporting. If you’re not using a scorecard as a simple, high-level way to both evaluate and communicate your team’s performance, it’s time to start now. An effective scorecard will include the objective or measurement you are scoring yourself against, and an easy “traffic light” system to indicate your current progress: red (at risk), yellow (caution), or green (good).

The most important thing about a scorecard is to be honest. Nobody performs perfectly at all times, so giving yourself a green smiley across every category at every reporting interval will likely cause more alarm and disbelief than praise. Plus, when something truly does go wrong, you are more likely to get support and understanding if you have been candidly assessing your performance and flagging the areas that are putting you at risk.

A basic scorecard for operational performance might look something like this, and is a great way to quickly update stakeholders without burying them in unnecessary technical data.

[Image: sample operational scorecard]
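If you’d rather generate the traffic-light statuses than color cells by hand, a few lines of code will do it. The objectives, targets, and thresholds below are invented for illustration:

```python
# Each entry: (objective, target, actual, higher_is_better).
# All figures are invented for illustration.
scorecard = [
    ("Tickets resolved at Level 1 (%)", 75.0, 78.2, True),
    ("Average time to resolution (hrs)", 8.0, 9.1, False),
    ("Customer satisfaction (1-5)",      4.5, 3.6, True),
]

def rag_status(target, actual, higher_is_better, caution_band=0.15):
    """Green if on target, amber if within the caution band, else red."""
    ratio = actual / target if higher_is_better else target / actual
    if ratio >= 1.0:
        return "GREEN"
    if ratio >= 1.0 - caution_band:
        return "AMBER"
    return "RED"

for objective, target, actual, hib in scorecard:
    status = rag_status(target, actual, hib)
    print(f"{status:5}  {objective}: {actual} (target {target})")
```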

More advanced scorecards, like balanced scorecards, can measure IT’s contribution to larger business goals, and are effective at tracking the performance across entire organizations and companies.

Putting it all to use

The above are just guiding principles to help you narrow in on what you want to report, and how. To learn more about implementing SLAs and metrics in JIRA Service Desk, watch Lucas Dussurget’s killer presentation at Atlassian Summit 2014. It’s full of our own top tricks, examples, and best practices based on tons of customer implementations. And for a deep-dive on figuring out what you should be measuring, be sure to check out another excellent presentation from Summit 2014–this one by John Custy.

This article was originally published on the Atlassian website.

ABOUT THE AUTHOR

Nikki Nguyen

Associate Product Marketing Manager, JIRA Service Desk

Although my life in IT is behind me, it’s not too far away. I’m now a recovering systems administrator evangelizing the way teams work by using JIRA Service Desk. I’ve found a love of combining customer service with technology.

Nikki is presenting at Service Management 2015.

July 22nd, 2015 | Atlassian, blog, guest blogger, metrics, reporting, Service Management 2015

Love DevOps? Wait ’till you meet SRE – with guest blogger Patrick Hill

Site Reliability Engineering may be the most important acronym you’ve never heard of – here’s why.

You may have heard of a little company called Google. They invent cool stuff like driverless cars and elevators into outer space. Oh: and they develop massively successful applications like Gmail, Google Docs, and Google Maps. It’s safe to say they know a thing or two about successful application development, right?

They’re also the pioneers behind a growing movement called Site Reliability Engineering (SRE). SRE effectively ends the age-old battles between Development and Operations. It encourages product reliability, accountability, and innovation – minus the hallway drama you’ve come to expect in what can feel like Software Development High School.

How? Let’s look at the basics.

What in the world is SRE?

Google’s mastermind behind SRE, Ben Treynor, still hasn’t published a single-sentence definition, but describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.”

The underlying problem goes like this: Dev teams want to release awesome new features to the masses, and see them take off in a big way. Ops teams want to make sure those features don’t break things. Historically, that’s caused a big power struggle, with Ops trying to put the brakes on as many releases as possible, and Dev looking for clever new ways to sneak around the processes that hold them back. (Sounds familiar, I’d wager.)

SRE removes the conjecture and debate over what can be launched and when. It introduces a mathematical formula for green- or red-lighting launches, and dedicates a team of people with Ops skills (appropriately called Site Reliability Engineers, or SREs) to continuously oversee the reliability of the product. As Google’s own SRE Andrew Widdowson describes it, “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”

Doesn’t sound revolutionary yet? Much of the magic is in how it works. Here are some of the core principles – which also happen to be some of the biggest departures from traditional IT operations.

First, new launches are green-lighted based on current product performance.

Most applications don’t achieve 100% uptime. So for each service, the SRE team sets a service-level agreement (SLA) that defines how reliable the system needs to be to end-users. If the team agrees on a 99.9% SLA, that gives them an error budget of 0.1%. An error budget is exactly as it’s named: it’s the maximum allowable threshold for errors and outages.

ProTip: You can easily convert SLAs into “minutes of downtime” with this cool uptime cheat sheet.
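The arithmetic behind that cheat sheet is simple enough to sketch yourself. A minimal Python helper, assuming a 30-day month:

```python
def error_budget_minutes(sla_percent, days=30):
    """Allowed downtime in minutes for a given SLA over `days`.

    A 99.9% SLA leaves a 0.1% error budget -- about 43 minutes
    of allowable downtime in a 30-day month.
    """
    budget_fraction = 1 - sla_percent / 100
    return days * 24 * 60 * budget_fraction

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% SLA -> {error_budget_minutes(sla):.1f} min/month")
# 99.0% SLA -> 432.0 min/month
# 99.9% SLA -> 43.2 min/month
# 99.99% SLA -> 4.3 min/month
```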

Here’s the clincher: The development team can “spend” this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.

The genius? Both the SREs and developers have a strong incentive to work together to minimize the number of errors.

SREs can code, too

In the old model, you throw people at a reliability problem and keep pushing (sometimes for a year or more) until the problem either goes away or blows up in your face.

Not so in SRE. Both the development and SRE teams share a single staffing pool, so for every SRE that is hired, one less developer headcount is available (and vice versa). This ends the never-ending headcount battle between Dev and Ops, and creates a self-policing system where developers get rewarded with more teammates for writing better performing code (i.e., code that needs less support from fewer SREs).

SRE teams are actually staffed entirely with rock-star developer/sys-admin hybrids who not only know how to find problems, but how to fix them, too. They interface easily with the development team, and as code quality improves, are often moved to the development team if fewer SREs are needed on a project.

In fact, one of the core principles mandates that SREs can spend only 50% of their time on Ops work. As much of their time as possible should be spent writing code and building systems to improve performance and operational efficiency.

Developers get their hands dirty, too

At Google, Ben Treynor had to fight for this clause, and he’s glad he did. In fact, in his great keynote on SRE at SREcon14, he emphasizes that getting this commitment from your executives before you launch SRE should be mandatory.

Basically, the development team handles 5% of all operations workload (handling tickets, providing on-call support, etc.). This allows them to stay closely connected to their product, see how it is performing, and make better coding and release decisions.

In addition, any time the operations load exceeds the capacity of the SRE team, the overflow always gets assigned to the developers. When the system is working well, the developers begin to self-regulate here as well, writing strong code and launching carefully to prevent future issues.

SREs are free agents (and can be pulled, if needed)

To make sure teams stay healthy and happy, Treynor recommends allowing SREs to move to other projects as they desire, or even to a different organization. SRE encourages highly motivated, dedicated, and effective teamwork – so no team member should be held back from pursuing his or her own personal objectives.

If an entire team of SREs and developers simply can’t get along and are creating more trouble than reliable code, there’s a final, drastic measure you can take: pull the entire SRE team off the project, and assign all of the operations workload directly to the development team. Treynor has only done this a couple of times in his career, and the threat is usually enough to bring both teams around to a more positive working relationship.

There’s quite a bit more to SRE than I can cover in one article – like how SRE prevents production incidents, how on-call support teams are staffed, and the rules they follow on each shift.

Our take

IT is full of buzzwords and trends, to be sure. One minute it’s cloud, the next it’s DevOps or customer experience or gamification. SRE is in a strong position to become much more than that, particularly since it is far more about the people and process than the technology that underlies them. While technology certainly can (and likely will) adapt to the concept as it matures and more teams adopt it, you don’t need new tools to align your development and operations organizations around the principles of Site Reliability Engineering.

In future articles, we’ll look at just that: practical steps for moving towards SRE, and the role technology can play.

This article was originally published on the Atlassian website.


Patrick Hill, Site Reliability Engineer, has been with Atlassian a while now, and recently transferred from Sydney to our Austin office. (G’day, y’all!) In my free time, I enjoy taking my beard from “distinguished professor” to “lumberjack” and back again. Find me on Twitter! @topofthehill

Patrick’s colleagues Sam Jebeile and Nick Wright will be discussing SRE in depth at Service Management 2015.

The Missing Ingredient For Successful Problem Management

With guest blogger Michael Hall.

Many problem management implementations fail or have limited success because they miss one key ingredient in their practice: having trained problem managers leading problem investigations using structured methods. By following a few simple guidelines, your problem management function can be successful from day one or rescued from its current low levels of performance.

Typical implementation

A typical problem management process document usually covers roles and responsibilities, how the process works and a little bit about governance.

The roles and responsibilities section usually covers just the resolver groups and the process owner. It is surprising how frequently the problem manager role is not defined at all. The resolver group’s responsibilities usually include ‘investigate root cause’ and ‘update and close problems’. The problem manager is often given responsibilities like ‘assign problems to resolver groups’ and ‘track problem progress’.

The process normally covers the steps but does not say how to go about solving problems. Commonly, the process is simply ‘assign the problem to a resolver group for investigation’. Usually the resolver group also owns closure. This means that there is no way of knowing if the root cause found is correct or if the solution is adequate.

The result is that many implementations do not achieve their expected results. I call this approach ‘passive’ or administrative problem management. The impact on reducing incidents is usually minimal.

If your monthly major incident data looks like this, you may have one of these typical implementations:


Figure 1: Monthly Occurrence of Major Incidents.

The Alternative – ‘Active’ Problem Management

The missing ingredient in a typical implementation is skilled problem managers using a structured approach to solving problems. By structured, I mean a consistent, evidence-based method, either by adopting one of the major problem-solving frameworks such as Kepner and Fourie, or by agreeing your own set of steps (I set out one version in my book). Deciding on a standard method that everyone will use with NO exceptions is the critical success factor for effective problem management.

The benefits are:

    • Speed to root cause – a standard approach yields results more quickly – around 60% quicker, in fact (see Figure 2)
    • Consistency – all your problem managers can be equally successful
    • Certainty that real causes are found – because investigations are based on evidence and not guesswork and theories, you can show that the causes found are correct
    • Collaboration – if you do problem management the same way every time, teams know what to expect, they can see the good results and they get used to working together without confusion

Figure 2: Average time to find root cause in two problem management implementations.

Problem Managers Lead Investigation Sessions

Because it is the problem managers who are highly skilled in problem-solving techniques, they should facilitate problem investigations in conjunction with the technical experts, then work with subject matter experts to determine solutions and track implementation to ensure each problem is entirely fixed. The problem management function should be responsible for reporting root cause, progress on resolution and all the metrics and KPIs related to problem management, but (very important!) it should make sure that the subject matter experts get the credit for solving the problems.

Track and validate solutions

To gain the main benefit you are after – reducing the occurrence of major incidents – problem management also needs to apply a structured approach to finding solutions, getting approval to implement and tracking the implementation to an agreed target due date.

The Results

This is what successful problem management looks like when you have skilled problem managers using a structured approach to finding root cause and finding and implementing permanent solutions. When problems stop causing incidents, the incident rate goes down quite rapidly.


Figure 3: Monthly Occurrence of Major Incidents with active problem management in place.

Michael has over 25 years’ experience in IT, developing and leading teams, managing change programs and implementing Service Management. Now a specialist in Service Operations, he founded Problem Management as a global function at Deutsche Bank and is a Chartered IT Professional (CITP). Michael will be leading a workshop on Implementing Real World Problem Management at Service Management 2015.