Mike is a family man
He loves his little kids and tries to spend as much time with them as his job allows. Mike is part of an IT operations team at a large financial IT service provider, and that means on-call duty. Every four weeks he is on call and has to fix major and critical problems after business hours. His team has set up email group notifications from their monitoring tools, but there are several problems with this setup. First, an email notification easily goes unnoticed in the middle of the night: that short “beep” on your mobile is simply not enough to wake you up. Second, even if you change the ringtone, you get too many alerts at night that keep you up and then turn out to be completely irrelevant, for instance because they belong to a different team or a different area of expertise. When on duty, Mike has to check his mobile constantly for new email alerts. This makes on-call duty quite a pain and collides with his family life. And if he misses an alert, it makes him look unreliable.
Mike did get this one alert
Last week they had this major incident. What a nightmare! Especially as he had stepped in for his colleague Peter, who had his first date in six months! By early evening, hundreds of alert emails had already piled up in his Ops inbox. Somebody had forgotten to switch on a maintenance window in the monitoring tool. His son had a bad fever that night, and his wife was away with his little daughter. So he was jumping back and forth between his son’s bed, the kitchen, and his mobile and PC. At midnight he fell asleep, totally exhausted. Still pretty tired, he woke up at 6:30 and tried to dig through all the new maintenance-related alert emails in his inbox. Everything looked rather OK to him, but he was a little suspicious. When he checked again, he stumbled across one email that pointed to a potentially critical issue. Normally he would have received a text alert for this critical issue too, but apparently none had been sent this time. Their text alerts weren’t really reliable. Also, the issue was clearly related to databases, and of course there was a database team, which always had one person on call too. Mike was sure it was not his team’s cup of tea. Unfortunately, their email notifications offered no tracking or confirmation. To cross-check, he wanted to call the one guy he knew on the database team. But then his son woke up crying, and he had to take care of him first.
Then his boss Ralph called
Man, he was simply boiling. Mike’s ear was almost boiling too. Mike tried to explain the situation: the missing text, his assessment that the issue belonged to the database team. Ralph was not satisfied, shouted at him over and over again and finally hung up. Mike understood him. Ralph was under enormous pressure, and this incident really was a very bad one, with hundreds of customer applications affected. On top of it all, Mike could not really help because his son needed him that morning.
A solution was badly wanted
Mike knew exactly what was missing. Their alerting and notification sucked. So his and his team’s top priorities were:
- 100% reliable alert notifications: not only email and somewhat working text messages, but ideally voice calls too
- Escalations to a backup, so just in case he overlooked an alert it would go to another team member
- Great filtering, so he would only receive or be woken up by truly critical alerts
- Targeted notifications, so he and his team would only receive alerts from systems they are in charge of
- Duty scheduling with ad-hoc stand-ins and automated alert routing, which he could have used when his son was sick
- Maybe some “who’s on call” across the whole IT division to cross-check and communicate with other teams, ideally accessible from a mobile phone
- Acknowledgement, so he could confirm and annotate alerts and his boss would always be in the loop
Mike now has much more great family time, even with on-call duty