For SiteDash I built a worker queue, based on MySQL, to handle processing tasks asynchronously. There's a central database and separate worker servers inside the same private network that poll for new tasks to execute. These worker servers run PHP, using xPDO 3, to perform the tasks the application server has scheduled.

One problem that would occasionally pop up is that the worker servers would lose connection with the database. The database is on a different server in the network, so that could come from rebooting the database server, a deploy or backup causing high load, network glitch, or just.. gremlins.

Obviously, the worker servers need to talk to the database to be useful, so I started looking at a way to 1) detect the connection was dropped and 2) automatically reconnect if that happens. It turns out to be fairly straightforward (once you know how!).

First, I implemented a check to see if the connection is alive. It does that by checking if a prepared statement (query) could be prepared.

<?php
while (true) {
    $q = 'query that is irrelevant here';
    $stmt = $xpdo->query($q);
    if ($stmt) {
        $stmt->execute();
    }
    else {
        reconnect();
    }
    // Execute task, if any
    sleep(1);
}

function reconnect() {
    global $xpdo;
    $xpdo->connection->pdo = null;
    return $xpdo->connect(null, array(xPDO::OPT_CONN_MUTABLE => true));
}

The workers run in an infinite loop, one loop per second, so this check happens every second. When the statement can't be prepared it's treated as a dropped connection, and we call the reconnect method to restore the connection.

The reconnect happens by unsetting the PDO instance on the xPDOConnection instance. Without that, xPDO thinks it still has a connection, and will continue to fail. Because we don't unset the xPDOConnection instance, we can just call $xpdo->connect() without providing the database connection details again.

With this check in place, the loop can still get stuck in a useless state if there's a reason it can't reconnect. That can have some unintended side effects and makes it harder to detect a problem that needs manual interference, so I also implemented another check.

Every 10 loops, another query is sent to the database with a specific expected response; a simple SELECT <string>. The idea is the same as the check above, see if the statement can't be prepared or doesn't return the expected result, and if so, do something.

Here's what that roughly looks like:

<?php
$wid = 'Worker1';
$loops = 0;
while (true) {
    $loops++;
    
    $q = 'query that is irrelevant here';
    $stmt = $xpdo->query($q);
    if ($stmt) {
        $stmt->execute();
    }
    else {
        reconnect();
    }
    // Execute task, if any
    sleep(1);
    
    // Every 10 loops, check if the connection is alive
    if (($loops % 10) === 0) {
        $alive = $xpdo->query('SELECT ' . $xpdo->quote($wid));
        if (!$alive || $wid !== $alive->fetchColumn()) {
            break;
        }
    }
}

function reconnect() {
    global $xpdo;
    $xpdo->connection->pdo = null;
    return $xpdo->connect(null, array(xPDO::OPT_CONN_MUTABLE => true));
}

In this case, we're not calling the reconnect() method. Instead, we're breaking out of the loop. This way the PHP process can end gracefully, instead of pretending to be churning along properly. When the process ends, supervisord is used to automatically restart it. When a new process is unable of connecting, the logs and monitoring get a lot louder than when a worker silently keeps running, so this system is working nicely.

Now, this obviously isn't the entire worker code for SiteDash. Over time it has grown into 300 lines (not counting the tasks themselves) of worker logging, automatic restarting when a deployment happened, analytics, ability to gracefully kill a process, and dealing with unexpected scenarios like a database connection getting dropped.

Overall this system has managed to keep the processes running quite nicely. There were some issues where certain tasks would cause a worker to get stuck, which have now been resolved, and currently the biggest limiting factor for the worker uptime is deployments. The workers need to restart after a deployment to make sure there is no old code in memory, and I have been fairly busy with adding features to SiteDash (like remote MODX upgrades last week!).

It's also been fun and interesting to try to get a good insight into these background processes and tweaking the monitoring to notify about unexpected events, without triggering too many false negatives. A challenge I'd like to work on in the future is automatically scaling the number of workers if the queue goes over a certain threshold, but for now I can manually launch a couple of extra processes quite quickly if things take too long.

Some fun numbers:

  • Overall, since launching workers as indefinitely running processes, the average worker process was alive for 12,5 hours
  • Since fixing the last known glitch where a process could get stuck executing certain tasks, on October 29th, the average worker stayed online for 2 days, 12 hours and 52 minutes.
  • The longest running workers started on November 1st and stayed up for 12 days, 5 hours and 40 minutes before being restarted due to a deployment.