The original post: /r/linux by /u/Laserspeeddemon on 2024-12-18 23:00:48.
I was hired on as a DBA back in July for a government contractor. I am not a DBA. They hired me for my Linux experience. The DBA role isn't really a DBA position; it's more Linux work than DBA work, and the last few DBAs they hired didn't last because they refused to work outside of databases. I, on the other hand, have plenty of data analysis/management experience from my 20+ years as a Linux Admin/Engineer. From the get-go, I found that the Linux team really could use my help and started fixing small things at the OS level. When I came on board, I only had 20 database servers to manage.
The contract was up for recompete/re-bid and my company lost the contract. For reasons I'm not privy to, the new company did NOT bring back the Linux team lead or the project manager, but they did bring me back. The transition was messy; for 2 weeks literally no one was in the office. The temporary on-site project manager was just a network engineer. The government client was hit with an ACAS scan, which found that patches hadn't been applied for 2 months. There were also multiple issues in multiple in-house developed applications.
The customer/temp PM initially went to the Linux team and asked them to address the updates/patching. The team members told him that "that was David's job" (David was the old team lead who didn't return). The PM learned that I had over 20 years of experience in Linux/Unix and asked if I could manage their Red Hat Satellite server and manage the patching. I told him that I have very limited exposure to Satellite, but that it was something I was really excited about learning. In fact, David was part of the group for my tech interview, and when he mentioned Satellite, I said I was really excited about learning it and he was excited to hear me say that.
The next thing I know, I am being handed ALL of David's responsibilities. I had to change the admin password on the command line for Satellite, just to get in. Aaaaand that's when I was hit with the tsunami. There are 1500 hosts registered in Satellite. I thought at best there would've been like 100-150 servers.
I fixed a lot of the issues, got Satellite patched, synced the repos and started to sift through the mountain of registered hosts. Most of them are offline, but I don't know if they were just powered off or are no longer in use. The more I dug, the worse it got. I am literally going into this completely blind.
I asked to see the environment architecture. They have none. I asked for documentation. They had none. I started to look through what files they had on the file share server and it's just aaaall over the place. None of our processes have been documented. What little documentation I found that may have been useful was no longer being adhered to. For example, there is no discernible naming convention. Servers were named whatever whoever built them wanted at the moment. Without having an idea of what is Production vs. Pre-Prod vs. Dev vs. Test, I'm very hesitant to power anything off or apply anything more than minor releases, because some of their servers are actively used worldwide. Some servers have ZERO activity and are literally just bare-bones installs. One had only two logins in the past year. I asked the government and they literally said they have no idea. They also have NO documentation and never enforced the contractual requirement for it. I was literally shrugged at when I asked them what a set of servers does.
I'm now being tasked with patching all servers, but I literally have no idea what these servers do or if they're in regular or only periodic use. During one scheduled outage for an application suite (Tomcat server, web server and DB server), I was halfway through running just your standard update/patching when two very frustrated government customers and a contractor came running up and asked me what the heck was going on, as FOUR applications had gone down. Despite my having scheduled the outage in advance and informed my counterparts (Network Engineer, Application POC and the senior DBA), literally not a single one of them told me that the database server I scheduled to go offline also hosted the databases for 4 other applications.
I'm being pressured to apply critical patches to these 1500 servers within 24 hours, but the government also requires a minimum of 48 hours' notice before any production servers are taken offline. I'm told that we can't work during off hours (everyone must be there in the office during core hours) and that I need to patch these during the day, but also that production servers are not allowed to be taken down during the day.....
How am I supposed to proceed?
Oh, and the Satellite server isn't being used for anything more than an over-glorified file server. It's not configured to automate anything. It literally does nothing beyond holding RPMs...it's just a very expensive, VERY large repository... All patches on all servers are manually applied on each server....one at a time.
And to make matters worse, all but one other Linux Admin quit, so there's two of us. One won't touch David's old work/responsibilities, and the other is me, who has only been here for 3-4 months.... Just the two of us...for 1500 servers.