System Administration as Science
One goal in my day-to-day work is to quantify events in a systematic way. System administrators are in a unique position to view the network, servers, clients, and software, and the ways they interact. While good software development depends on abstracting away as much as you can, good system administration depends on understanding how the layers interact.
For example, a good developer will abstract away the type of database he is connecting to: a small shim can be adjusted so that the program runs unchanged on Oracle or PostgreSQL. The Java language goes further and abstracts away the entire computer, implementing a virtual machine that behaves consistently across different operating systems and even different CPU architectures. A Java programmer doesn’t care whether he is running on Solaris on SPARC, Linux on MIPS, or Windows on x86, or whether the CPU is big-endian or little-endian.
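A minimal sketch of that kind of shim, using Python’s DB-API as the example; the drivers, hostnames, and credentials here are placeholders I’ve chosen for illustration, not anything from the original scenario:

```python
# A minimal database shim: only this one function knows which backend
# the program is talking to. Connection details are placeholders.

def get_connection(backend, host, user, password, database):
    """The one place that cares which database is behind the application."""
    if backend == "postgresql":
        import psycopg2                              # PostgreSQL DB-API driver
        return psycopg2.connect(host=host, user=user,
                                password=password, dbname=database)
    if backend == "oracle":
        import oracledb                              # Oracle DB-API driver
        return oracledb.connect(user=user, password=password,
                                dsn=f"{host}/{database}")
    raise ValueError(f"unknown backend: {backend}")

# Everything below this point is identical no matter which backend is used.
conn = get_connection("postgresql", "db.example.com", "app", "secret", "appdb")
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())
```

The shim is where the differences between the two drivers get absorbed; the rest of the application never needs to know.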
However, a good system administrator does care, and should know the difference. System administration is about removing layers to solve problems that occur when the abstractions break down. Joel Spolsky refers to this as “The Law of Leaky Abstractions.”
All non-trivial abstractions, to some degree, are leaky.
Some have compared system admins to the plumbers of the IT world. Like plumbing, system administration becomes invisible when everything is working. Only when things start to leak and shit starts to hit the fan (literally or figuratively) does it become noticeable. There seems to be one breed of system administrator that thrives on fixing problems. Imagine the server going down, and the mayor frantically paging the heroic sysadmin with the Bat Signal.
Our hero drops into the storm with his combat boots and trusty Leatherman, typing arcane commands, drinking Mountain Dew and cursing at everyone around him. Suddenly, joyous shouts erupt as the users discover their work can continue. Everyone cheers the SysOp, while he struts back to his Bat Cave, until the next Bat Time, at the same Bat Channel.
How does one measure the performance of the lone rogue troubleshooter against a sysadmin who carefully schedules downtime and whose system “just works”? Is the system with less downtime more reliable because of the administrator’s work, or is the administrator just lucky? How does one compensate the hero who fixes every problem, versus someone who never gets to demonstrate that ability because the system never goes down?
What of the sysadmin who has unreliable hardware or buggy software forced on him by upper management or customer demand? A lot of companies want to measure metrics like uptime, but is it even possible to measure 99.99% uptime properly, and does that number have any correlation with the person running the system?
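For a sense of scale, the downtime budget is easy to work out; assuming a 30-day month:

```python
# How much downtime each uptime percentage actually allows,
# assuming a 30-day month (43,200 minutes).
MINUTES_PER_MONTH = 30 * 24 * 60

for uptime in (99.9, 99.99, 99.999):
    allowed = MINUTES_PER_MONTH * (1 - uptime / 100)
    print(f"{uptime}% uptime -> {allowed:.1f} minutes of downtime per month")
```

That works out to roughly 43 minutes a month at 99.9%, about 4 minutes at 99.99%, and under half a minute at 99.999%.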
99.9% uptime amounts to approximately 43 minutes of downtime in a 30-day month, but many of the tools used to measure availability have a minimum time resolution of one minute. For example, you want to verify that your website is up and reachable by your users, so you write a script that makes an HTTP request and checks the result, and sends you e-mail if it doesn’t get a response. However, the standard UNIX cron utility that schedules such tasks can only run a job once per minute.
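A minimal sketch of that kind of check in Python; the URL, the mail addresses, and the assumption of a local SMTP server are all placeholders for whatever your environment actually uses:

```python
#!/usr/bin/env python3
# Minimal availability check: fetch a page, mail an alert if it fails.
# URL, addresses, and the local SMTP relay are placeholders.
import smtplib
import urllib.request
from email.message import EmailMessage

URL = "https://www.example.com/"      # the page users need to reach
ALERT_TO = "sysadmin@example.com"     # where the alert should go

def site_is_up(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def send_alert(reason):
    msg = EmailMessage()
    msg["Subject"] = "web check failed"
    msg["From"] = "monitor@example.com"
    msg["To"] = ALERT_TO
    msg.set_content(f"HTTP check against {URL} failed: {reason}")
    with smtplib.SMTP("localhost") as smtp:   # assumes a local MTA
        smtp.send_message(msg)

if __name__ == "__main__":
    if not site_is_up(URL):
        send_alert("no successful response")

# And the cron entry -- once per minute is as fine-grained as cron gets:
# * * * * *  /usr/local/bin/check_web.py
```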
With CPUs executing billions of instructions per second and servers typically having multiple processors, one minute is a long time. But if we magically invent a utility that can schedule and execute your script once per second, suddenly the server is overwhelmed by these requests and the monitoring script itself helps bring the system to a halt. And what if a process crashes and restarts itself in less time than your monitoring interval? You wouldn’t consider a server that crashed every 30 seconds reliable, but most monitoring software can’t tell the difference.
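To illustrate that blind spot, here is a toy simulation (the crash pattern and polling interval are made up) of a service that crashes and restarts between polls:

```python
# Toy simulation: a service that crashes briefly every 30 seconds but
# restarts on its own, watched by a monitor that polls once a minute.
CRASH_EVERY = 30       # seconds between crashes
CRASH_LENGTH = 2       # seconds each crash lasts
CRASH_OFFSET = 5       # crashes start 5 s into each cycle
POLL_INTERVAL = 60     # monitoring resolution, like cron
DURATION = 24 * 3600   # simulate one day

def is_down(t):
    """True if the service is mid-crash at second t."""
    return CRASH_OFFSET <= (t % CRASH_EVERY) < CRASH_OFFSET + CRASH_LENGTH

polls = range(0, DURATION, POLL_INTERVAL)
observed_down = sum(1 for t in polls if is_down(t))

print("actual crashes per day:", DURATION // CRASH_EVERY)
print("crashes the monitor saw:", observed_down)
print("true availability: %.1f%%" % (100 * (1 - CRASH_LENGTH / CRASH_EVERY)))
```

The monitor happily reports 100% uptime on a box that really spends close to 7% of the day down.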
Recently, I upgraded our company’s e-mail server because it was crashing under an ever-increasing load of spam. The new software was more efficient and no longer crashed; however, that also meant it was more efficient at delivering spam. I was happy because I wasn’t getting pages to restart the mail server, but the average user actually saw more spam in their inbox. It is difficult to explain to the average person who just wants to read and send e-mail how complex the system is and why upgrading the software was still the right thing to do.
Most people don’t understand that e-mail isn’t guaranteed instant delivery, and that a mail server will keep attempting redelivery when it can’t get through to the destination. In our case, when the server was flooded by spammers, all the legitimate e-mail eventually got through while some spam probably didn’t (spammers typically won’t retry delivery when they can’t connect). Now both spam and ham get through equally quickly. Of course, we are working on ways to reduce the spam, but it is an almost intractable problem when thousands of people around the world are working day and night to devise clever ways to deliver their junk.
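A toy model of that difference in behaviour; the retry schedule below is invented for illustration, not any particular mail server’s:

```python
# A legitimate mail server queues the message and keeps retrying;
# a typical spam cannon fires once and moves on.
RETRY_SCHEDULE = [5, 15, 30, 60, 120]   # minutes between attempts (made up)

def deliver(server_back_after, retries):
    """Return the minute delivery succeeds, or None if the sender gives up."""
    elapsed = 0
    for wait in [0] + retries:
        elapsed += wait
        if elapsed >= server_back_after:   # destination reachable again
            return elapsed
    return None

# Suppose our server was unreachable for 45 minutes during the flood:
print("legit mail arrives at minute:", deliver(45, RETRY_SCHEDULE))  # 50
print("spam arrives at minute:      ", deliver(45, []))              # None
```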
One thing that is important from a sysadmin’s point of view is to document and explain the problem, both upwards to management and downwards to clients and customers. To quantify it, I’m using log analysis tools to graph the spam volume over time. Now that I have hard data, I can start to formalize the problem and test the validity of various hypotheses for solving it.
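As a rough sketch of that kind of analysis, here is the shape of such a script; the log path and the “identified spam” marker are assumptions about one particular filter’s output, not a standard:

```python
# Count spam-flagged messages per day from the mail log, ready to graph.
# The path and the marker string depend entirely on the local MTA/filter.
import re
from collections import Counter

SPAM_MARKER = re.compile(r"identified spam", re.IGNORECASE)
DATE = re.compile(r"^(\w{3}\s+\d+)")        # syslog-style "Mon dd" prefix

per_day = Counter()
with open("/var/log/maillog") as log:       # path varies by system
    for line in log:
        if SPAM_MARKER.search(line):
            match = DATE.match(line)
            if match:
                per_day[match.group(1)] += 1

for day, count in sorted(per_day.items()):
    print(f"{day}\t{count}")
```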
The challenge, as with uptime statistics, is to find numbers that are accurate without introducing a sort of Heisenberg effect from the monitoring itself, and then to present those numbers in a way that lets the people who depend on the sysadmin to get their work done evaluate whether that person is doing a good job. I’m not sure there is any magic bullet, but it is clear to me that applying some science to the art of system administration can aid in communication, diagnosis and, ultimately, problem resolution.
It is an area I will be expending more brain cycles on in the future, and on this blog.