Open source, DIY technical operations

The title seems like a mouthful and doesn't quite capture what I really want to convey. Systems administration is a broad domain and can contain a multitude of skills, but the discipline is the same across all of them: you are an army of one(1), there is no one coming to help you(2), the cavalry is you(3). Yes, this is not ideal, but it is a reality. Would I love to work in a place where none of these could ever be true? Absolutely.

What I'm trying to convey in the title is largely how I believe anyone in this field could operate, not necessarily how they should. And indeed, my own experience has largely allowed me to direct issues to the people who can or should handle them. Too often, though, I have met people very comfortable with passing the buck, and occasionally the blame. But let's keep looking at the engineering side of companies, specifically at the roles other than developer.

Titles and roles like DevOps Engineer (yes, I know, it generally shouldn't be a 'thing', but it is), Site Reliability Engineer, Platform or Production Engineer, and the "old" Systems Administrator are really all the same thing. And yet, somehow, some companies manage to have all of them without ever really defining how they differ. But the terms/names DO convey a sense of "when". And to me, the "when" is sometimes indicative of "what" someone may have experienced. Systems Administration, and Systems Administrator, hark back to a time when a person in that role "wore many hats".

Depending on the size of the company and what it was doing, they could have been provisioning storage on large storage array systems like 3PAR, NetApp, or EMC. They could have been in charge of databases, wearing the DBA or Database Architect hat. They could have also worn the Network Administrator hat, provisioning Cisco routers, Juniper switches, Fortinet firewalls, and F5 load balancers, and creating all the routing ACLs, VLANs, firewall rules, and the load balancer nodes, pools, and rules on which to match. They could even have done corporate Information Technology or helpdesk work.

The thing is, Systems Administrators also dealt with what we might now consider "compute" nodes, or "servers", whether they were "bare metal" or virtual machines. And in doing so, they'd need to pick the OS, the filesystem type, whether or not to use LVM or to put different 'file systems' on dedicated partitions, what services to run (see '/etc/services'), and how to deal with user management. It's possible they didn't... OR they used LDAP (possibly hooked into MSAD) and topped it off with Kerberos. It's possible that, if they were doing networking as mentioned above, they tied those together with RADIUS or TACACS(+).
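
To make the "many hats" point concrete, here is a minimal Python sketch of the kind of inventory a sysadmin would otherwise do by hand: what is mounted where, and which well-known services the OS knows about. It assumes a standard Linux layout (/proc/mounts and /etc/services) and nothing more; it is an illustration, not anyone's official tooling.

```python
#!/usr/bin/env python3
"""Point-in-time inventory of a Linux box: mounted filesystems and the
well-known services listed in /etc/services. Minimal sketch, standard
Linux paths assumed."""


def mounted_filesystems(path="/proc/mounts"):
    """Return (device, mountpoint, fstype) for every mounted filesystem."""
    mounts = []
    with open(path) as fh:
        for line in fh:
            device, mountpoint, fstype, *_ = line.split()
            mounts.append((device, mountpoint, fstype))
    return mounts


def known_services(path="/etc/services"):
    """Return {name: (port, protocol)} parsed from /etc/services."""
    services = {}
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()  # drop comments and blanks
            if not line:
                continue
            name, portproto, *_aliases = line.split()
            port, proto = portproto.split("/")
            services[name] = (int(port), proto)
    return services


if __name__ == "__main__":
    for device, mountpoint, fstype in mounted_filesystems():
        print(f"{mountpoint:<25} {fstype:<10} {device}")
    svc = known_services()
    print(f"{len(svc)} services listed in /etc/services, e.g. ssh -> {svc.get('ssh')}")
```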

And this is before I even get into how they'd have to deal with whatever the developers were creating for the company they all worked for. And in the "bad old days", they were the ones on call for everything, including application code bases in which they had never written a line of code.

They likely also had to deal with release engineering, which meant using tools to coordinate deployments across multiple machines and ensuring that all of them started using the new application version, lest some users get stuck on older ones.
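
To ground that a bit, a bare-bones rolling deploy might look like the sketch below. The hostnames, paths, service name, and version string are all made up for illustration; real release tooling (Capistrano, Fabric, Ansible, and friends) does this same dance with far more care.

```python
#!/usr/bin/env python3
"""A bare-bones rolling deploy: push a release to each host, restart the
service, and verify the version before moving on. Hosts, paths, and the
service name are hypothetical."""

import subprocess
import sys

HOSTS = ["app01.example.com", "app02.example.com", "app03.example.com"]  # hypothetical
RELEASE = "myapp-1.4.2.tar.gz"                                           # hypothetical


def run(host, command):
    """Run a command on a remote host over ssh and return its stdout."""
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def deploy(host):
    # Copy the release artifact, unpack it, and bounce the service.
    subprocess.run(["scp", RELEASE, f"{host}:/tmp/"], check=True)
    run(host, f"tar -C /opt/myapp -xzf /tmp/{RELEASE}")
    run(host, "sudo systemctl restart myapp")
    # Verify the host actually picked up the new version.
    version = run(host, "/opt/myapp/bin/myapp --version")
    if "1.4.2" not in version:
        raise RuntimeError(f"{host} is still reporting {version!r}")


if __name__ == "__main__":
    for host in HOSTS:
        print(f"deploying to {host} ...")
        try:
            deploy(host)
        except (subprocess.CalledProcessError, RuntimeError) as err:
            # Stop the roll-out rather than leave the fleet split across versions.
            sys.exit(f"deploy halted at {host}: {err}")
    print("all hosts on the new version")
```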

Systems, as in operating systems, themselves had only minor debugging tools, and those were generally for point-in-time snapshots or for watching live activity over short periods. Think ps, free, top, vmstat, sar, et cetera. Those more familiar with monitoring network gear typically used some version of SNMP to remotely probe and retrieve statistics, which were then displayed in graphing UIs ranging from MRTG and Cacti to all the NMSes, Network Monitoring (or Management) Systems, and their derivatives.
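
For flavor, here is a tiny Python equivalent of that point-in-time view, reading the load average and memory figures straight out of /proc the way free or top would. It is Linux-only by construction, and if you run it again five minutes later you get a different answer, which is exactly the limitation those tools had.

```python
#!/usr/bin/env python3
"""A point-in-time system snapshot in the spirit of free/vmstat/top: read
the load average and memory counters from /proc and print them once."""


def loadavg(path="/proc/loadavg"):
    """1-, 5-, and 15-minute load averages."""
    with open(path) as fh:
        one, five, fifteen, *_ = fh.read().split()
    return float(one), float(five), float(fifteen)


def meminfo(path="/proc/meminfo"):
    """Memory counters in kB, keyed by field name (MemTotal, MemFree, ...)."""
    fields = {}
    with open(path) as fh:
        for line in fh:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # value looks like "NNN kB"
    return fields


if __name__ == "__main__":
    one, five, fifteen = loadavg()
    mem = meminfo()
    print(f"load average: {one} {five} {fifteen}")
    print(f"mem: {mem['MemTotal'] // 1024} MiB total, "
          f"{mem.get('MemAvailable', mem['MemFree']) // 1024} MiB available")
```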

Which brings me to another favorite topic: monitoring. The de facto standard for [polling (pull) based] monitoring was, and probably still is, Nagios. But so many people have fought with it and hated the experience. I can honestly say that for blackbox monitoring using simple Unix service-based polling, there is probably nothing better. And that includes all the crappy rip-offs that use Nagios as their base but have a different UI or use a DB for backend management of all the services and hosts to be checked.

"Observability" is the name of the game these days, and I completely appreciate how far "monitoring" has come. Being able to add an agent to a system, be it statsd, a Prometheus exporter, or something else, so that it can gather and ship [push based] system statistics to something like the TIG or TICK stack, or some other time-series system, is a great step in the evolution from SNMP. But it's not earthshaking. And dropping in an agent and slurping up all the same info that SNMP would grab just feels wrong. And sloppy.
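
To show both sides of that pull-versus-push coin, below is a sketch of a Nagios-style check plugin (the server runs it and reads the exit code) next to a statsd push of the same measurement (the host fires the number at a collector over UDP). The disk path, thresholds, and statsd address are hypothetical; the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the statsd line format are the real ones.

```python
#!/usr/bin/env python3
"""The same disk-usage measurement, exposed two ways: a Nagios-style check
(pull) and a statsd gauge (push). Thresholds and addresses are placeholders."""

import os
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def disk_used_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    stat = os.statvfs(path)
    total = stat.f_blocks * stat.f_frsize
    free = stat.f_bavail * stat.f_frsize
    return 100.0 * (total - free) / total


def nagios_check(path="/", warn=80.0, crit=90.0):
    """Classic pull-style check: print one status line, return the exit code."""
    try:
        used = disk_used_percent(path)
    except OSError as err:
        print(f"DISK UNKNOWN - {err}")
        return UNKNOWN
    status, code = "OK", OK
    if used >= crit:
        status, code = "CRITICAL", CRITICAL
    elif used >= warn:
        status, code = "WARNING", WARNING
    # Text before the pipe is for humans; after it is perfdata for graphing.
    print(f"DISK {status} - {used:.1f}% used on {path} | used_pct={used:.1f}%")
    return code


def statsd_push(used, host="127.0.0.1", port=8125):
    """Push-style equivalent: send a gauge to a statsd collector over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(f"disk.root.used_pct:{used:.1f}|g".encode(), (host, port))
    sock.close()


if __name__ == "__main__":
    code = nagios_check("/")
    statsd_push(disk_used_percent("/"))
    sys.exit(code)
```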

And this is to say nothing of pushing any of these services off to whatever 'as-a-service' company to handle the management of said service. I have worked at different companies that have made an executive decision to pay for some kind of 'as-a-service' service as a way of freeing teams up. However, I do believe that knowing how to run these things yourself is important. As is, obviously, looking at what the trade-off between paying for the service and running AND MAINTAINING IT yourself looks like.

Now, in my humble opinion, "devops" is kind of lost, as it was supposed to be a philosophy of harmony, combining dev work and ops work, a way of cross-training both sets of folks. But the silos still persist. Almost as if to taunt the idea. SRE and Production Engineering, stemming from Google and Facebook respectively, can feel like a continuation of everything above, plus metrics-based reviews and OKR-type planning. It sometimes doesn't feel like an evolution as much as just "learning more stuff". You want scale? You want cloud? You want containers? You want microservices running inside containers on your hyper-scale cloud-based infrastructure? Welp...

"You gon learn today" is what I feel but you don't have to.  And if you don't, just please don't pretend like you do understand. Own up to it.  Because what is weird is, while the progression of system administration to devops to SRE/platform/production whatever really just adds more "dev" to the mix, the dev side of the house never seems to add anything of what I've written above.  Unless of course, if you are at the point where you are learning about something written above, you might as well go back to points 1, 2, and three in the first paragraph... cause it seems like, "you gon learn today".


(1) It can feel like this even on a team, whether that team is made up of other sysadmins, or there are devs but they aren't on call, or they are on call but the app is not one they work on.

(2) I have seen this change, but in some cases, like in an outage, it can be especially true unless you can escalate.

(3) Because (1) and (2) above can be true, which is very much a management issue, it really does mean that, even after escalating (and hopefully getting air cover from management), you're still fixing, if not coordinating, the fixes needed to recover from whatever mess you're in.