Hi all. The details were posted to our clients via our control panel, but the bottom line is that denial of service attacks happen in many ways. There is no “single cause” and no single answer. We do have edge-of-network denial of service protection systems in place, but in this case it was a combination of factors. Here is a breakdown of events, without too much detail, as we are still working on the investigation.
Initially an external attack was thought to be the issue, possibly related to a number of amplification-type attacks: send traffic at one system and a larger amount gets pushed back out at a different target system.
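To give a rough sense of the amplification idea, here is a back-of-the-envelope sketch in Python. The numbers are purely illustrative and are not measurements from this incident.

```python
# Illustrative only -- rough arithmetic for a reflection/amplification attack,
# using made-up sizes, not figures from this incident.
request_bytes = 60        # small spoofed query sent to a reflector
response_bytes = 3000     # much larger reply the reflector sends at the victim
requests_per_second = 10_000

amplification_factor = response_bytes / request_bytes          # 50x in this example
victim_traffic_bps = response_bytes * requests_per_second * 8  # bits per second at the victim

print(f"Amplification factor: {amplification_factor:.0f}x")
print(f"Traffic arriving at the victim: {victim_traffic_bps / 1e6:.0f} Mbit/s")
```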
In the end, though, here is what actually happened:

a) A client's site was broken into because of an outdated script (not EE). The attackers then placed a denial of service script within the client's account, along with a remote control script that let them trigger an attack whenever they wanted, likely using a number of systems with the same script in place.
b) The script was triggered at approximately 2:35am central time against a remote target. Our border systems throttled the outgoing traffic back because the sheer amount of it was outside of “norms” (there is a rough sketch of that kind of threshold check after this list). At the same time, on-duty and senior staff (including myself) were woken up by internal and external monitoring systems, both because the border system had tripped the threshold to throttle the “abnormal” traffic and because of performance issues or full timeouts on clients' sites as seen by our monitoring systems. Investigation into the issue started at that time.
c) The script was possibly triggered by more than one person, and beyond the outgoing bandwidth aimed at the remote target, it was also aimed at two internal clients on other clusters. This meant the attack traffic was travelling through our load balancers at full Gigabit Ethernet speed, as traffic between servers within the clusters is all routed through the load balancers. Because of how traffic is routed, along with the sheer amount of it (at this point looking to be both incoming and outgoing), each issue needed to be investigated in order.
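As referenced in b) above, the “outside of norms” decision is conceptually just a comparison against a recent baseline. Below is a hypothetical, heavily simplified sketch of that idea in Python; it is not our actual border system logic, and the window and multiplier values are made up.

```python
# Hypothetical sketch of threshold-based outbound traffic monitoring.
# This is a simplification of the idea only, not the real border system logic.
from collections import deque

class OutboundMonitor:
    def __init__(self, window_seconds=60, threshold_multiplier=5.0):
        # Rolling window of recent per-second byte counts (values are made up).
        self.samples = deque(maxlen=window_seconds)
        self.threshold_multiplier = threshold_multiplier

    def record(self, bytes_this_second):
        """Record a sample; return True if it is far enough outside the
        recent baseline that throttling/alerting should kick in."""
        abnormal = False
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            abnormal = bytes_this_second > baseline * self.threshold_multiplier
        self.samples.append(bytes_this_second)
        return abnormal

# Example: steady ~10 MB/s, then a sudden jump to 120 MB/s trips the check.
monitor = OutboundMonitor()
for _ in range(60):
    monitor.record(10_000_000)
print(monitor.record(120_000_000))  # True
```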
So in the end, once the DoS script was identified in a specific client's account, access to that client's site was disabled, and the sites on those systems were moved to separate clusters while we fully investigate the infected systems. Reviewing the audit logs, only that client's account was modified, and all files were still under that client's UID and GID, which were the only privileges available to them on the systems.
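For the curious, the kind of ownership check behind that statement is easy to sketch. This is a hypothetical example only; the path and UID/GID below are made up, not the actual account.

```python
# Hypothetical sketch: walk an account's directory tree and flag any file
# whose owner/group differs from the account's UID/GID. Path and IDs are
# placeholders for illustration.
import os

def find_foreign_files(root, expected_uid, expected_gid):
    """Yield paths not owned by the expected UID/GID."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat so symlinks are checked, not followed
            if st.st_uid != expected_uid or st.st_gid != expected_gid:
                yield path, st.st_uid, st.st_gid

for path, uid, gid in find_foreign_files("/home/exampleclient", 1001, 1001):
    print(f"Unexpected ownership: {path} (uid={uid}, gid={gid})")
```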
Steps were taken on our border IPS systems to not allow access to that or similar scripts again, the traffic throttling was removed from the affected cluster's outbound traffic, and with the internally targeted DoS attack traffic removed, all other systems returned to normal instantly as well.
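Without giving away the actual rules, that kind of IPS blocking boils down to pattern-matching request payloads against known-bad signatures. Here is a hypothetical Python sketch of the idea; the patterns are placeholders, not the real signatures.

```python
# Hypothetical sketch of signature-style filtering, similar in spirit to what
# an edge IPS does. The patterns below are placeholders, not real signatures.
import re

BLOCKED_SIGNATURES = [
    re.compile(rb"example_dos_tool_v\d+"),      # placeholder pattern
    re.compile(rb"remote_trigger\.php\?cmd="),  # placeholder pattern
]

def should_block(payload: bytes) -> bool:
    """Return True if the payload matches any known-bad signature."""
    return any(sig.search(payload) for sig in BLOCKED_SIGNATURES)

# Example: a request carrying the trigger string would be dropped at the edge.
print(should_block(b"GET /remote_trigger.php?cmd=start HTTP/1.1"))  # True
```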
Because of the multiple layers of attack in this case, it was much harder to track than if the script had simply been aimed at a remote site and nothing else.
Please keep in mind that downtime issues can affect anyone. On Tuesday, Craigslist, Technorati and Yelp, to name a few, along with all of Six Apart's sites (livejournal.com, typepad.com, sixapart.com, etc.) and others, were all offline. This was because of a power outage in San Francisco where it seems the backup generators did not come online, and the UPSs did not work either. In Six Apart's case their systems were not shut down safely before the battery systems ran out of power; in the case of Livejournal.com, they were down for 7 hours today, give or take a bit.
Edit: a better power outage link
So needless to say, Tuesday July 24th was not a good day across the Net for big and small sites alike.
PS: Yes I posted this at almost 4am central US time. As you can imagine there has been little sleep in the past 48 hours around here. 😊