For the past several years, my employer has been moving toward an Agile development methodology. There are some challenges when mapping this methodology into operations because it’s not the same thing; but, surprisingly, those are not where I have experienced challenges. The biggest challenge during this transition is some of my coworkers seem to think the methodology is that there are no rules.
A friend of mine, a fairly eccentric history professor, used to say that a little knowledge is a dangerous thing but you’ve got to emphasize LITTLE. And it seems like we’re encountering a situation where Phil’s emphasis holds true: the only thing garnered from from Agile training is that the documentation and process from waterfall projects are no more. But breaking away from the large-scale view of a P.R.O.J.E.C.T. for Agile development is a bit like breaking a monolithic application out into microservices — it still needs to do all of the same ‘stuff’, it just does it differently. And there are still policies and procedures — even a microservice team is going to have a coding standard, a process for handling merges, a way of scheduling time off, and some basic idea of what their application needs to accomplish. Sure, the app’s design will change incrementally over time. But it’s not an emergent property like chaos/complexity theory.
Maybe the “what Agile means to me” mentality comes from failing to clearly map a development methodology into an operations framework. Maybe it’s just a good excuse to avoid components of work that they do not enjoy. To avoid “agile operations” becoming “no boring planning stuff!!!”, I’ve outlined ways in which the Scrum methodologies the company wishes to adopt can be used to streamline operations. It helps that our group is reorganizing into an operational/support group and an architecture/design group — I see a lot of places within the operations team where Scrum approaches make sense.
Backlog — prioritizing the ticket queue like a backlog and having support staff constantly pulling from the top of the list — not only is this an awesome way to avoid the guy who scans the queue for the easy jobs, but it ensures the most important problems are being resolved first. A universal set of stakeholders does not exist for the ticket queue — someone whose ticket is ranked fifteenth on the list may disagree, and they are welcome to add details explaining why the issue is more impacting that it seems on its face. But 90% or more of our tickets are “Sev3” — which basically means both “we want it done ASAP” and “it isn’t a wide-spread high impact outage”. Realistically, dozens of tickets do not have the exact same time constraint and impact. There is extra work for management in converting a ticket bucket into an ordered backlog, but the payoff is that tickets are resolved in an order that correlates to the importance of the issue. In addition to the ticket queue, routine maintenance tasks will be included in the backlog. And prioritized accordingly.
Very short sprints — while developers moving from Waterfall to Agile might start from a month (or two) long sprint and trim weeks as they evolve into the process, operations starts from the other end of the spectrum. Our norm is to grab a ticket, sort it, then look at the queue and grab another one. We are planning for hours, maybe a day or two. This means we might establish application access on Tuesday that isn’t needed until next Monday. Establish a sprint that lasts a week, and use the backlog to get tickets that have lower priority (either because the impact is lower or because resolution is not needed for a week) included in the sprint. Service interruptions, SEV1 and SEV2 tickets, will occur and should be assumed in the sprint planning (i.e. either take enough work that you think it will just get done with no service interruption tickets and accept that some tickets from the sprint will be incomplete or leave some space for service interruption tickets and have staff pull “bonus” tickets from the top of the backlog if they have no work toward the end of the sprint).
Estimation — going through the tickets and classifying each incident as a quick little task, something that will take a few hours, or a significant undertaking facilitates in sprint planning. It’s difficult to know how many tickets I can reasonably expect to include in a sprint if I cannot differentiate between a three minute config change and a three day application rollout.
Multi-tasking — Implementation, support, and ticket resolution tasks are no longer a big bucket of work that individuals attempt to multi-task to complete. There are distinct tasks that are completed in series. Some tickets require information from the user; put the ticket on hold until a response is received and move on to the next unit of work.
Velocity — historic data based on time estimates cannot be generated, but simple number of tickets per week pre- and post- can certainly be compared. And going forward, ticket counts can be weighted by estimation values.
Stand-ups are a bit of a mental sticking point for me. I can conceive the value of spending a few minutes reviewing what you’ve done, what you plan on doing, and ensuring there is a ready forum to discuss any sticking points (maybe someone else has encountered a similar situation and can offer assistance). Stand-ups could include a quick discussion of any priority shifts (escalations, service interruptions) too. *But* my experience with stand-ups has been the attendance test variety — stand-ups that were used to hurt individuals who didn’t make it to the office by 08:00. Or those who weren’t around at 16:50. I don’t think it’s reasonable to ask someone who got into an issue and worked until 7P to show up at 8A the next day. I also don’t think it is reasonable to expect someone who came in at 6A to continue working until 5P. Were a stand-up scheduled in the middle of the day, I might feel differently about them.