OK so I’m stumped. We run EE on a bunch of sites and while we run into the odd performance issue, basically we’ve found the system works well.
Except for one site. On this installation, our MySQL database keeps going crazy, to the point where our ISP tags us as ‘abusive’ and shuts us down. I’ve been trying to figure out why.
The server:
This site is hosted on a ViaVerio Linux VPS Pro Plus account (a virtual server) that is advertised as having:
Dual Intel Xeon CPUs
4 GB DDR ECC memory
3 10,000 RPM Seagate HDDs:
Dual 300 GB user volume with RAID 1 mirroring
300 GB 24-hour user volume backup
2 “Ultra SCSI 2” disk I/O channels
MySQL server version is 5.1.47
PHP version is 5.2.14
EE Stuff
————
ExpressionEngine 1.7 Build: 20110509 (upgrading was the first thing I tried)
MSM 1.1
Modules:
Freeform 3.0.6
Tag 2.6.6
Plugins:
Simple Pagination 1.1
Word Limiter 1.0
Find and Replace 1.2
Randomizer 1.0
Strip 1.0.1
XML Encode 1.2
Character Limiter 1.0
PHP Text Format 1.0
HTML Stripper 1.0.0
The only thing unusual about this installation of EE (compared to our others) is that it uses MSM.
Here’s partial output from top right now. The site is behaving ‘ok’ at the moment:
top - 11:19:17 up 1 day, 2:29, 2 users, load average: 1.58, 1.72, 1.87
Tasks: 54 total, 1 running, 53 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.8% us, 0.2% sy, 0.0% ni, 92.4% id, 3.6% wa, 0.0% hi, 0.0% si
Mem: 1572864k total, 365784k used, 1207080k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 0k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20312 apache 16 0 38736 20m 3272 D 11 1.3 0:01.80 httpd
1371 mysql 15 0 42788 16m 5152 S 4 1.1 4:33.59 mysqld
10130 apache 16 0 33032 15m 3288 S 2 1.0 0:08.39 httpd
21929 apache 18 0 31780 14m 3268 S 2 0.9 0:00.50 httpd
21959 apache 18 0 31700 14m 3268 S 2 0.9 0:01.72 httpd
20395 apache 18 0 33604 15m 3280 S 0 1.0 0:02.02 httpd
21958 apache 15 0 33052 15m 3268 S 0 1.0 0:00.74 httpd
32026 apache 15 0 31408 13m 3252 S 0 0.9 0:00.06 httpd
1 root 15 0 1820 628 532 S 0 0.0 0:02.38 init
32264 root 18 0 1724 572 472 S 0 0.0 0:00.37 syslogd
And a netstat shows 116 connections to :80
We see seemingly random load spikes that push the top load average well over 50 (I saw it hit 140 once!). The last one happened shortly before 3 am this morning but generally they happen during business hours.
Traffic on these sites, cumulative, is about 30K page views for this week (Mon-Fri). We’ve got a core site that just holds the control panel stuff, and 3 outward facing sites that serve customers.
The ViaVerio support people say it’s the database that is burying the server. When it gets out of line they helpfully shut it down without notifying us.
I have set up a slow query log but am having trouble identifying the actual culprit. It seems like when the database starts to thrash, *everything* becomes a slow query:
# Time: 110520 10:04:05
# User@Host: [redacted] @ localhost []
# Query_time: 45.598073 Lock_time: 0.000022 Rows_sent: 0 Rows_examined: 0
SET timestamp=1305900245;
UPDATE exp_tag_tags SET clicks = (clicks + 1) WHERE BINARY tag_name IN ('3PL','Third Party Logistics');
Clearly there’s nothing wrong with that query, and certainly nothing there that should cause it to take 45 seconds! I do see a LOT of queries relating to Tags in the slow query log, though.
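To put a number on that "LOT", one thing I could do is tally which tables show up in the slow entries. Here is a rough Python sketch of that idea. It assumes every entry follows the format shown above and that all tables use the exp_ prefix; tally_tables and SAMPLE_LOG are just names I made up, and the sample is the single entry pasted above, not the full log.

```python
import re
from collections import Counter

# The one entry quoted above, with plain ASCII quotes.
SAMPLE_LOG = """\
# Time: 110520 10:04:05
# User@Host: [redacted] @ localhost []
# Query_time: 45.598073 Lock_time: 0.000022 Rows_sent: 0 Rows_examined: 0
SET timestamp=1305900245;
UPDATE exp_tag_tags SET clicks = (clicks + 1) WHERE BINARY tag_name IN ('3PL','Third Party Logistics');
"""

QUERY_TIME = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")
TABLE_NAME = re.compile(r"\bexp_\w+")  # assumption: every table uses the exp_ prefix


def tally_tables(log_text):
    """Return (entries per table, total Query_time seconds per table)."""
    counts, seconds = Counter(), Counter()
    query_time, tables = None, set()

    def flush():
        # Credit the finished entry once to each table it mentioned.
        for table in tables:
            counts[table] += 1
            seconds[table] += query_time

    for line in log_text.splitlines():
        match = QUERY_TIME.match(line)
        if match:
            flush()  # a new header line closes the previous entry
            query_time, tables = float(match.group(1)), set()
        elif query_time is not None and not line.startswith("#"):
            tables.update(TABLE_NAME.findall(line))
    flush()  # close the last entry in the file
    return counts, seconds


if __name__ == "__main__":
    counts, seconds = tally_tables(SAMPLE_LOG)
    for table, n in counts.most_common():
        print(f"{table}: {n} slow entries, {seconds[table]:.1f}s total")
```

Since everything turns slow once the database is thrashing, this would only show which tables are involved most often, not which query starts the pile-up.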
I’m no DBA but I’m wondering about a few settings
Open_tables is always at 64—should that be increased somehow?
Max_used_connections = 26—that seems reasonable (tho the database has only been up a few hours)
Queries 315629
Uptime 6362
Does that seem like a reasonable number of queries for that amount of uptime?
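For scale, here is the back-of-the-envelope arithmetic on those two counters, as a small Python sketch. The queries and uptime figures are the ones quoted above and the 30K page views is the Mon-Fri total I mentioned earlier. The two traffic spreads (around the clock, or squeezed into 8 business hours a day) are assumptions, so treat the per-page numbers as rough bounds rather than measurements.

```python
queries = 315629              # "Queries" status value quoted above
uptime_seconds = 6362         # "Uptime" status value quoted above
page_views_per_week = 30000   # Mon-Fri total across all sites

queries_per_second = queries / uptime_seconds

# Assumption A: page views spread evenly over five 24-hour days.
views_per_second_even = page_views_per_week / (5 * 24 * 3600)
per_view_even = queries_per_second / views_per_second_even

# Assumption B: all page views land in five 8-hour business days.
views_per_second_business = page_views_per_week / (5 * 8 * 3600)
per_view_business = queries_per_second / views_per_second_business

print(f"queries per second:           {queries_per_second:.1f}")
print(f"queries per page view (24h):  {per_view_even:.0f}")
print(f"queries per page view (8h):   {per_view_business:.0f}")
```

That works out to roughly 50 queries per second, and under either assumption it comes to hundreds of queries per page view. Not every query comes from a page view (bots, feeds and the control panel all count too), but it still looks high to me.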
We’ve really tried to keep these sites very ‘clean’ with very limited PHP in templates and nothing really fancy going on.
We have a second, identical server with ViaVerio, and are running 3 separate installations of EE on it, plus another 15 or so custom sites, with no performance issues. The only thing on this problem server is EE/MSM, serving 4 sites and 30K pageviews/week.
Surely something is wrong but I’m not sure what my next steps should be. I’ve been bashing my head against the problem for a few days now.
EDIT: Oh, 1 more detail. While this installation has had its problem moments in the past, the issue has just become acute this week; no new development has been happening on any of the sites recently. Just 1 more mystery…content has been added, of course, and maybe we got to some critical number of -somethings-.
Also, at one point our database became corrupt. I repaired it, but I don’t know if that’s a clue or not.
Any and all suggestions would be very, very welcome.