The root cause was right there. How could I have missed it? Both servers had been hardened, and they were rebooted on different dates because of Windows Update. I got hold of the details of the hardening and realized that, because of permission inheritance, the registry permissions had been hardened on the "service" branch as a whole instead of on the individual services. Several critical roles such as "Power Users" had actually been removed as part of the hardening process.
Now, take a step back. So what? I still had an admin account, and things were not working even when I logged in as admin. The reason is that even if you are admin, it does not mean all the services run as admin. Some of them still run under other roles, and one of those roles, unfortunately, happened to be among the removed ones!
My gut feeling was that this was the root cause. I compared against another working server / workstation and restored the deleted roles on one of the Server Rabbit servers and ta-da... Everything was alive and kicking again. Actually, as soon as the RPC service came up, I pretty much knew I had hit it.
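The comparison I did by hand can be sketched roughly like this. It is only a minimal illustration in Python, assuming the permission entries for the same registry branch have already been exported from both machines into plain dictionaries. The key path and the sample listings below are made-up placeholders, not the actual values from the incident.

```python
def missing_principals(reference, suspect):
    """Return, per key, the principals that exist on the reference
    machine but are absent on the suspect machine."""
    missing = {}
    for key, ref_principals in reference.items():
        gone = set(ref_principals) - set(suspect.get(key, ()))
        if gone:
            missing[key] = sorted(gone)
    return missing


# Hypothetical exported permission listings (placeholder key path).
working_server = {
    r"EXAMPLE\service": {"Administrators", "SYSTEM", "Power Users"},
}
hardened_server = {
    r"EXAMPLE\service": {"Administrators", "SYSTEM"},
}

for key, principals in missing_principals(working_server, hardened_server).items():
    print(f"{key}: missing {', '.join(principals)}")
```

Anything that shows up as missing on the hardened machine is a candidate for restoring, one branch at a time, so you can tell which change actually brings the services back.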
Well, here is the aftermath of the whole incident.
1. Malware was found. Incidentally, this means the malware protection is not enough. Even if the malware did not take the servers down this time, sometime in the future it will.
2. There was no backup. Having a good snapshot / backup would have saved the situation and minimized the downtime needed to recover.
3. There was not enough documentation. This applies to logging as well. If logs and documents had been properly put in place, the whole incident could have been easily tracked simply by looking at the documentation.
4. There was insufficient training. While it's a good idea to have some of the admins learn how to harden or improve certain functions, they should always be given sufficient training to do so. With proper training / certification, the admin would have avoided the incident.
5. Access control was not properly implemented. The sharing of accounts made the whole investigation a nightmare. If individual accounts had been used, combined with proper logging, they would have shown who made which change at what time. This could have greatly helped the investigation.
6. Logging is the last thing that I felt was missing, in terms of the firewall, IDS etc. If proper logs had been generated, I could have concluded from the very start whether this was a security incident or not, based on who had accessed the server etc. Of course, let's not forget the golden rule that if a machine is compromised, its logs are as good as none. Therefore logs from hardware boxes (which are probably not the target of the attack) are a great help in determining whether a real intrusion has taken place or whether it was simply a misconfiguration or software corruption.
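To make points 5 and 6 concrete, here is a rough sketch in Python of the question I wished I could have asked: "who changed what in this time window?" It assumes every change is logged with a timestamp and an individual account name. The records, account names and dates below are invented for illustration only, since no such log existed in this incident.

```python
from datetime import datetime


def changes_in_window(records, start, end):
    """Return the (time, user, action) records with start <= time <= end,
    sorted oldest first."""
    return sorted(r for r in records if start <= r[0] <= end)


# Hypothetical audit records: (timestamp, individual account, action).
audit_log = [
    (datetime(2000, 1, 3, 9, 15), "alice", "installed updates"),
    (datetime(2000, 1, 5, 14, 30), "bob", "changed registry permissions"),
    (datetime(2000, 1, 9, 8, 0), "alice", "rebooted server"),
]

suspects = changes_in_window(
    audit_log,
    start=datetime(2000, 1, 4),
    end=datetime(2000, 1, 6),
)
for when, user, action in suspects:
    print(f"{when:%Y-%m-%d %H:%M} {user}: {action}")
```

With a shared account, the user column would say the same thing on every line, which is exactly why the investigation was such a pain.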
I am glad this is not my day-in, day-out job. :) Well, if it were, I am sure as hell not getting paid enough for this. :)