3AM call on a Sunday in New York
"Your application is broken! Users in the UK will start signing in soon and need to use the application. Please work with your team to fix it."
It was the middle of winter. I rolled out of my warm bed and started logging into our system. I was barely awake, the cold laptop was unpleasant under my palms, and I was annoyed. The team member responsible for the application was on vacation. My team had not deployed any code that weekend. Why had the application stopped working overnight?
I looked at our infrastructure monitors and, as expected, there was no smoking gun: all the servers looked good. In my 15 years of running applications and services with high SLAs, I had encountered only one hardware failure, and today was not my lucky day. Something else was up. 🤦
I started scanning through our logs and noticed that some of the external services our application depended on could not be reached. Yet health checks indicated that those services were up and running.
Who else depended on those services? Were they seeing the same thing? I didn't know, and it took a long time, and waking a few people up, to get the answer. For some mysterious reason, those services were unreachable!
Even more people were woken up, and in the large group call that ensued we identified the root cause. A team had made a configuration change on Friday that took effect on Sunday morning when the servers rebooted.
Nobody was notified, and with the systems we had in place there was no easy way to find out what had changed, who had changed it, and what the nature of the change was.
As an engineering manager who has worked at a few companies in my career, I have seen similar stories play out uncomfortably often. On top of this, modern teams want to do more, faster. They use microservices. They want continuous delivery. Without the right tooling, that combination can lead to catastrophic consequences.
fluxroll attempts to solve this by answering questions such as:
- Which of my dependencies changed and when?
- How do I access logs/monitoring details for any service/component within the organization?
- What was the nature of the change?
- Who is responsible for the component/service that changed?
- Who is on call now?
- How do I contact that person?
and much more.
Our mission statement is to build the best production-support tool you'll ever use.