The biggest bane of my current job is stupidity, whether it be stupid laws like HIPAA or Obamacare, or “We’re so smart, we’re stupid” Indians, or corporate stupidity, brought about by the people running said corporation who have no idea how things actually work, but they have a college degree and theories on how things “should” work. All of this stupidity is quite, quite frustrating.
The account I’ve been working for the past few months is a major health insurance company, which I’ll call HCIC. Because of Obamacare, HIPAA, and regulatory requirements (to say nothing of the countless other laws out there), HCIC has been forced to do a lot of stupid things, including making those of us who actually work the account beg for permission to do our jobs every 72 hours (improved from every 24 hours). That way, they don’t run afoul of any laws or regulations. To accommodate this, The Company (the place I’m contracted to) initially had to hire tons of people for their IT Access Administration department. However, since The Company is only in business these days to cut their own costs, this was a major problem. Their solution was to place our “privileged” IDs into a tool called the Vault, where they’d be treated like shared IDs and “checked out” every 72 hours. HCIC agreed to this and that was that.
In order to process HCIC’s nightly batch, which covers everything from insurance claims and billing to pharmacy, data warehousing, and medical providers, and to make sure the onlines are up by the Service Level Agreement (SLA) time, I have to use my privileged IDs. A specific script has to be kicked off to start the ball rolling, part of which has me remoting into other machines to set their environments for processing.
Fast forward to Friday. When I came on shift, I had active privileged IDs. However, a few hours later, when it was time to start the night batch process, I found my password wasn’t working. I did a quick check and discovered that my IDs had been revoked, manually. So, I contacted the oncall Service Delivery Manager, who naturally is in India, thus making him a pain in the arse to work with. He started a chat conference with me, a member of the Access Admin team, and my Dispatcher. It took nearly an hour, but the AA person was able to determine that 86 IDs had been revoked because the AA team had decided that we all needed to be revalidated in order to use the Vault. Since we were no longer authorized for the Vault, our IDs were revoked even if they were still in the 72-hour activation window.
So, nightly batch processing for HCIC has been delayed for an hour. My team lead and manager are paged by the Dispatcher and apparently, my manager goes nuts, wondering why this wasn’t detected earlier. Yeah, ’cause having my IDs pulled on me after I come on shift, which does not force me out of the system if I’m already logged in, is something I’d somehow know about earlier. *_* So, I’m forced to scramble to provide all sorts of information, including proof that I’ve taken the legally required, yearly HIPAA training and certification. I’m told to cut a SEV 2 problem ticket for this issue so that my access request may be expedited. I do so, then I send everything in to my team lead, including the HCIC HIPAA training certificate. I then get notified that I can’t use HCIC’s HIPAA certification, but have to use The Company’s HIPAA certification. Yeah, thanks to the stupid HIPAA laws and regulations, I have to take redundant certification training (three times at the moment, since there’s another account that I have access to, but am not currently working on, which also falls under HIPAA).
After finding and sending in the proper HIPAA certificate, my team lead fills out all the paperwork, then my manager approves everything and sends it on. However, by the time it gets to the people at Access Admin who are responsible for adding their approval, they decide to reject it, citing that the specific verbiage had not been used. Even though my manager put, “I approve AstroNerdBoy’s access for the next 72 hours, starting with today’s date”, this was not good enough. It HAD to say, “I approve AstroNerdBoy access for today’s date-time through the next 72 hours.” So, my team lead had to be paged out again to fill out new paperwork, then my manager had to be paged out to approve it with the exact verbiage, and then it went back to Access Admin.
By now, it has been three hours since batch was supposed to start, and if I don’t start it soon, the dozen or so jobs that must complete before the onlines come up won’t be done in time, meaning we’ll have an SLA breach. Access Admin then has to cut two Service Requests to get my IDs back. After the requests get to the SDM an hour or so later, he wakes a bigwig at HCIC to approve my request for access.
With all of the approvals in place, the machinery now has to go through the process of actually restoring my IDs. Removing the disabled flags on my IDs would have taken me about 60 seconds, and that’s for two clustered environments. It took Access Admin nearly two hours to accomplish this, partially because of procedural requirements, and partially because AA uses 3rd party software to accomplish in 10-15 minutes what I could do on the command line in a moment.
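(For the curious: on VMS, “removing the disabled flag” is pretty much a one-liner per node in AUTHORIZE, assuming the IDs were revoked the usual way, by setting the DISUSER flag. The username below is made up.)

$! REENABLE_ID.COM -- minimal sketch; assumes the ID was revoked by setting
$! the DISUSER flag, and that OPSUSER1 (a made-up name) is the account.
$ SET DEFAULT SYS$SYSTEM
$ MCR AUTHORIZE MODIFY OPSUSER1 /FLAGS=NODISUSER   ! clear the disable flag
$ MCR AUTHORIZE SHOW OPSUSER1 /BRIEF               ! confirm it took
$! Repeat on the other cluster (or push it out with SYSMAN DO).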
It has been nearly six hours, but I finally have working IDs, so I can start the night batch script, do the things that I’m required to do, and then actually launch night batch. The Indian SDM was desperate for The Company (and, as a result, himself) to not look bad, so he insisted that I not cut a SEV 1 problem ticket and just use my SEV 2 ticket to account for the fact that the onlines missed their SLA. I protested, but my manager told me to let him make the call and take the hit. However, I wanted that SEV 1 because the only way The Company wakes up to problems is when they have to pay financial penalties for the stupidity that goes on.
Thankfully, an HCIC employee in the U.S., who just happened to be Indian, stepped in and overrode the SDM, citing the fact that the onlines not being up impacts hundreds of users in India, thus a SEV 2 problem ticket is not adequate to the issue at hand. Since the customer demands it, the SDM has to bow and agree. I got up and did an evil, gleeful dance.
With the SEV 1 now in place, a new problem raised its head: the entire next shift had had their IDs revoked as well. I laughed and laughed, because it meant that once I went home, no one could work the production environments for HCIC, as security policy dictates that no one may use my IDs when I’m not there. And because the next shift would have to submit all of the same stuff I had done earlier, the paperwork to get their access restored couldn’t even be started until after they came on shift. I laughed again.
Once the next shift came on, their Dispatcher told me that I was now authorized to stay as long as I wanted and collect O/T. On one hand, I was tired since I’d not gotten as much sleep as I should have and I wanted to go home. On the other hand, I’d lost a week’s pay when I caught that nasty flu-sinus infection and had a doctor’s order to not go to work. So, I opted to milk the overtime and hang out with the next shift.
By this time, I’d already worked twelve hours. I stayed another five and a half hours before I just ran out of gas and went home. When I left, my co-workers on the next shift still had not received their access back, and it looked like their access request had hit a snag. I laughed wearily and walked out the door.
Hopefully, Access Admin will get hit with the SLA penalty, which will teach them a lesson: when revalidations are required, advance notification must go out instead of IDs just being revoked on a whim.
Oh, and this isn’t an April Fools joke either. *lol*
Can I ask why all this stuff isn’t automated to begin with? Why do there have to be tons of people pushing electronic paper around? No offense to you, but you’re included in the question based on what you describe your job to be. Manually kicking off standard jobs every day seems awfully redundant to me.
Good question. Once I start the job stream for the nightly cycle, it is 100% automated. However, the reason they don’t have the night batch process start automatically is that HCIC may need me to execute service requests prior to the start of night batch. These are often one-shot jobs run outside of the scheduler, which update certain files to make sure that billing, claims, and whatever else have the most current rates and such. They may also have me execute a change to modify an application job or jobs to correct some error that has turned up.
HCIC doesn’t want the environments set automatically since they sometimes need their claims loader jobs to run longer, thus requiring the onlines to stay up longer. If the environment on the clusters is set for night batch while these loaders are running, it causes problems.
So, there are reasons to have the start of the nightly batch cycle done manually. Normally, it only takes about 5 minutes to do, and after that, it is all automated.
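To give you an idea, the kind of kickoff procedure I’m describing might look roughly like the sketch below. This isn’t HCIC’s actual script; the logical name, queue, and file names are all invented for illustration.

$! NIGHT_BATCH_START.COM -- hypothetical sketch only; the logical name,
$! queue, and file names (OPS$WORK, HCIC_BATCH, etc.) are invented.
$!
$! 1. Don't proceed if the application teams still have service requests
$!    or claims loaders that must finish before the onlines come down.
$ IF F$SEARCH("OPS$WORK:HOLD_NIGHT_BATCH.FLAG") .NES. ""
$ THEN
$     WRITE SYS$OUTPUT "Hold flag present - night batch not started."
$     EXIT
$ ENDIF
$!
$! 2. Flip the cluster environment from "onlines" to "night batch".
$ DEFINE/SYSTEM/EXECUTIVE_MODE HCIC_ENVIRONMENT "NIGHT_BATCH"
$!
$! 3. Release the first job of the stream; the scheduler takes it from here.
$ SUBMIT/QUEUE=HCIC_BATCH OPS$WORK:NIGHT_STREAM_FIRSTJOB.COM
$ EXIT

The point is that a human has to decide *when* to run it; once it runs, everything downstream is hands-off.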
Ok, but then why do they need an extra person like you to run their jobs, why couldn’t they do it themselves? I don’t get the need for there to be two or more people for this when you and the person telling you what to do could be the same person. It very much sounds like the person giving you your tasks would have to be familiar with the system the same way you are. Either eliminate that person or you from the process chain if they’re looking to cut costs and all that. I get that it would suck for you if it was you but I hope you get my overall point. A lot of what you described in your post seems to be caused by too many people being involved.
>Ok, but then why do they need an extra person like you to run their jobs, why couldn’t they do it themselves?
Another good question. The short answer is lines of demarcation.
The long answer is this. In the IT/IS field, ALL companies of any decent size will have some sort of computer system to process their work. These platforms might be distributed (servers running Unix, Linux, or Windows), where multiple servers spread the workload amongst themselves. Most websites are on such platforms. Some companies, especially smaller ones, will do their processing (whether payroll, billing, etc.) on such platforms.
The next kind of system is the midrange platform. These are computers that are much larger than servers and specifically designed with larger businesses in mind. Examples are the AS/400, OpenVMS (DEC), and Tandem.
The final platform is the mainframe. These machines are “old school,” if you will, though their hardware, O/S, and applications are kept fairly current, and they were designed with large businesses in mind. They are much larger than midrange machines, but smaller than supercomputers. Mainframes aren’t nearly as common as they once were. IBM and Unisys still make them, but I know that Honeywell got out of the mainframe business ages ago.
No matter what kind of platform, an operations team is needed to run them. Ops monitors for problems, often with the aid of monitoring tools. If a problem is noted, ops investigates to see if it is real; if it is, they try to fix it themselves, if possible; if not, they send the issue to the proper support team, physically contacting them if required.
For midrange and mainframe systems, more so than distributed, ops also has to perform production support work. This is where ops runs jobs for the customer. Large businesses, like HCIC, do NOT want their various application teams running stuff on a whim. Further, they want a single, controlling force (ops) to play the role of air traffic controller, if you will, rather than have the chaos of users all doing their own things.
In my tale, there are at least a dozen application teams that would all have to coordinate when night batch started if they were allowed to run their own jobs. That’s because many jobs are dependent on each other, and all of them require that the onlines be down initially. So, which support team brings down the onlines to start this process? This is even more important since the Claims team might need the onlines to stay up a while longer than normal to finish loading claims into the system, and the Billing team might not want night batch started until new billing rates are put in place.
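As a toy sketch of the problem, here’s roughly what it looks like when a single procedure enforces the ordering; every queue, job, and file name below is invented for illustration.

$! NIGHT_STREAM.COM -- toy sketch of the dependency chain; every queue,
$! job, and file name here is invented for illustration.
$ @OPS$WORK:SHUTDOWN_ONLINES.COM                   ! onlines come down first
$!
$! Claims must finish before billing, billing before the warehouse load.
$ SUBMIT/QUEUE=HCIC_BATCH/NAME=CLAIMS_EXTRACT OPS$WORK:CLAIMS_EXTRACT.COM
$ SYNCHRONIZE/QUEUE=HCIC_BATCH CLAIMS_EXTRACT      ! wait for completion
$ SUBMIT/QUEUE=HCIC_BATCH/NAME=BILLING_UPDATE OPS$WORK:BILLING_UPDATE.COM
$ SYNCHRONIZE/QUEUE=HCIC_BATCH BILLING_UPDATE
$ SUBMIT/QUEUE=HCIC_BATCH/NAME=WAREHOUSE_LOAD OPS$WORK:WAREHOUSE_LOAD.COM
$ SYNCHRONIZE/QUEUE=HCIC_BATCH WAREHOUSE_LOAD
$ EXIT

Now imagine each of those SUBMITs owned by a different team that doesn’t know what the others are doing, and you can see why somebody has to sit in the control tower.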
So, ops handles all of that responsibility, allowing the application teams to just focus on their applications. Ops handles the ad hoc service requests and change requests as well, eliminating the need for these different teams to constantly coordinate their activities with everyone else.
Hopefully, I’ve explained why ops needs to be here to make sure the systems stay up and run properly, as well as why we do production support work and run customer jobs.
As for cutting costs, these different customers, like HCIC, outsource this work to places like The Company in order to cut their own costs. When HCIC did this work themselves, they had 2.5 people assigned to the midrange OpenVMS platform alone (the .5 person floated between the VMS and mainframe platforms, depending on how busy they were). I do that work solo most of the time, though when I’m really overrun with requests from India (where HCIC outsourced 90% of their application support work), I’ll get one of my colleagues to help, if they aren’t too busy.
Still, because The Company is all about cutting their own costs, most of the ops/production work is done in places like India, Argentina, and Brazil, and now China is getting some work. Well, “slave labor” that works for pennies on the dollar and is happy with it is a win for The Company. The only reason The Company hasn’t outsourced the VMS work is that they don’t have enough of it to justify creating a VMS ops team in India (or wherever), unlike mainframe or AS/400, where India and other places support hundreds and hundreds of such platforms for The Company’s customers. The only AS/400 work we have in our team is left in the US because of legal requirements or customer demands. Ditto mainframe.
That was rather verbose, but I hope that cleared up things for ya. ^_^