Revisiting Redshift


Redshift is a little async programming library I wrote in PHP over 9 years ago. My goal back then was really just to get a better understanding of asynchronous code execution and channel-based cross-routine communication. As cool as it sounds, PHP imposes some significant limitations on the paradigm, which is why the library never made it past the proof-of-concept stage. I wanted to take a few minutes to analyze what those limitations are, and whether a more viable approach is possible now, considering everything that's changed over the years.

How it works

PHP is inherently single-threaded, so the best analogy we can make is with JavaScript and its event loop, particularly async functions. Code executes completely synchronously until it completes or blocks, at which point we switch to the next stack in the queue. As we keep going through the queue, we eventually loop back around to the original stack and attempt to resume execution where we left off.
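To make that concrete, here's a minimal sketch of that kind of cooperative round-robin loop, built on plain PHP generators. This is purely an illustration of the idea, not Redshift's actual implementation; the task bodies and the `$log` array are made up for the example.

```php
<?php

// Each "task" is a Generator; yield marks the point where the task
// gives up control and the loop moves on to the next one in the queue.
$log = [];
$queue = new SplQueue();
$started = new SplObjectStorage();

$spawn = function (callable $fn) use ($queue): void {
    // Calling a generator function doesn't run it; it returns a Generator.
    $queue->enqueue($fn());
};

$spawn(function () use (&$log) {
    $log[] = 'A1';
    yield; // pause here, let other tasks run
    $log[] = 'A2';
});

$spawn(function () use (&$log) {
    $log[] = 'B1';
    yield;
    $log[] = 'B2';
});

// The "event loop": run each task up to its next yield, requeue it,
// and keep cycling until every task has finished.
while (!$queue->isEmpty()) {
    $task = $queue->dequeue();

    if (!$started->contains($task)) {
        $started->attach($task);
        $task->current(); // starts the generator, runs to the first yield
    } else {
        $task->next();    // resumes execution where we left off
    }

    if ($task->valid()) {
        $queue->enqueue($task);
    }
}
```

The two tasks end up interleaved (`A1, B1, A2, B2`) even though each one reads top-to-bottom, which is the whole trick: blocking points become `yield` statements, and the loop decides who runs next.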

Limitations

Unlike Go, Clojure or even JavaScript and its async functions, coroutines are not first-class citizens in PHP. Instead, I had to rely on generators, which are definitely fit for purpose, but also lead to overly verbose code that's not necessarily easy to understand. For example, every time you synchronously write to or read from a channel, you need to use the yield keyword:

$message = yield $channel->read();

In the context of Redshift, it serves the same purpose as JavaScript's await. It's not the end of the world, but I think we can all agree the latter reads much better. It gets worse.

You can't just yield anywhere. Calling a generator function doesn't execute it; it returns a Generator object that resumes the function's execution whenever next() is called. To handle this correctly, Redshift needs access to that Generator object. This is done by wrapping your asynchronous code with async(), like so:

async(function ($channel) {
    echo 'Hello World!' . PHP_EOL;

    yield $channel->write('Quit');
}, $channel);

async() is the equivalent of Go's go statement. Similarly, it ensures that parameters are evaluated in the current scope and that the callable is called asynchronously. Compared to JavaScript, this approach gives us more flexibility, as the same function can be executed synchronously or asynchronously depending on the needs. The downside? For all of it to work, you need to yield directly inside an async block.

This is a big one, which is why it's basically a non-starter for any serious project. Unless you commit to this architecture from the very beginning, and tightly couple all your code to it, you're bound to run into issues. Consider this:

$user = $db->users->findOneById($userId);
$post = $db->posts->findOneById($postId);

A classic example: we're checking if a $user is allowed to perform an action on a $post, and we need to fetch both records from the database.

Normally, these requests would be made completely synchronously, meaning we first request the user, wait for the result, then request the post, wait again, and only then can we start processing. That's a lot of waiting time, especially as relation complexity increases. Ideally, we'd fetch the data in parallel, maybe do some pre-processing while we wait, and resume execution once we have everything.

Unfortunately, there's no easy way to do this. While we totally could execute the calls to these methods asynchronously, the internal implementation is still fully synchronous. So unless the library we're trying to use already supports asynchronous execution in some way, which would allow us to only worry about passing back a value, it's a very tough sell.

Fibers to the rescue

PHP 8.1 introduced a new feature called "Fibers". This alleviates some problems that stem from relying solely on generators, and gives us a native way to pause and resume execution at any point within the stack. Instead of relying on yield statements inside the closure that's passed to async(), it would allow for calling $channel->write(...) or $channel->read() as is, wherever, and let the magic work itself.

That said, while it would allow for better syntax and make complex implementation easier, it still doesn't do anything for task management or the order of execution. We'd still need async() and the underlying event loop for that.
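To illustrate the difference, here's plain PHP 8.1 with no library at all. A fiber can suspend from an ordinary function call deep in the stack, something a generator can't do without a yield at every level. The readValue() function and the string values are invented for the example:

```php
<?php

// A regular function: no yield, no Generator return type. Inside a
// fiber, it can still suspend the entire call stack it's running in.
function readValue(): string
{
    return Fiber::suspend('waiting');
}

$fiber = new Fiber(function (): string {
    // readValue() suspends transparently; this closure doesn't know or care.
    $value = readValue();

    return "got {$value}";
});

$state = $fiber->start(); // runs until the suspend inside readValue()
// $state is now 'waiting', the value passed to Fiber::suspend()

$fiber->resume('pong');   // readValue() returns 'pong', the fiber finishes

echo $fiber->getReturn() . PHP_EOL; // got pong
```

That's exactly the shape a channel's read() needs: pause the caller wherever it happens to be, and hand a value back in when one becomes available.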

Channel communication

So far, we've only been focusing on asynchronous execution, but the library offers more than that. In fact, I'd go as far as saying channels are the main feature.

What channels allow you to do is write what looks like completely synchronous code, and let the library worry about executing things in the right order. Let's take the example above, and assume our database access layer supports channels:

$userChannel = $db->users->findOneById($userId);
$postChannel = $db->posts->findOneById($postId);

$user = $userChannel->read();
$post = $postChannel->read();

// do stuff

In this case, findOneById() is responsible for starting the request, and returns a channel. $channel->read() is a blocking operation – meaning our code will pause there until a value becomes available.
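For illustration, here's a toy, deliberately simplified version of that shape. The Channel and findOneById() names mirror the hypothetical API above; a real implementation would fire the query asynchronously and suspend the caller inside read() until the response lands, whereas this stand-in just buffers a value up front.

```php
<?php

// Toy stand-in for a channel: an internal buffer instead of real
// cross-task blocking semantics.
final class Channel
{
    private array $buffer = [];

    public function write(mixed $value): void
    {
        $this->buffer[] = $value;
    }

    public function read(): mixed
    {
        // A real channel would block here until a value arrives;
        // the toy version assumes one is already buffered.
        return array_shift($this->buffer);
    }
}

// Hypothetical driver method: kick off the query and return the
// channel immediately, writing the result into it later.
function findOneById(int $id): Channel
{
    $channel = new Channel();
    $channel->write(['id' => $id]); // stands in for the async DB response

    return $channel;
}

$userChannel = findOneById(1);
$postChannel = findOneById(2);

// Both "queries" are already in flight; now we wait on the results.
$user = $userChannel->read();
$post = $postChannel->read();
```

The caller's code is identical whether the value arrives instantly or a second later; only the waiting behavior inside read() changes.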

The power of this doesn't really become visible until you start dealing with complex event pipelines, or handling multiple streams that have dependencies on each other. For example, imagine you could do something like this:

// A channel set up to receive pings from elsewhere.
$signals = new Channel();

while (true) {
    // Timeout is a special read-only channel
    // that will unblock after the given duration
    $timeout = new Timeout(60);

    // Blocks until either channel resolves a value
    $channel = Channel::any($signals, $timeout);

    if ($channel === $timeout) {
        // We haven't received a ping in 60 seconds.
        break;
    }
}

This bit of code listens for pings on $signals, and breaks the loop as soon as 60 seconds have passed since the last ping. I could see this being a plausible use case for a monitoring agent of sorts. Now let's say you tried to do the same using regular callbacks. I'm going to use JavaScript, because I don't even know where I'd start with PHP – you'd have to reimplement a good chunk of what Redshift does internally just to get a timeout working without needlessly spinning the CPU.

let timer;

export const ping = () => {
    if (timer) {
        clearTimeout(timer);
    }

    timer = setTimeout(() => {
        // We haven't received a ping in 60 seconds.
    }, 60000);
};

Here we already had to resort to maintaining state for the timer, so we can prevent the callback from triggering when a ping is received. It can also result in tight coupling between ping() and every place it's used throughout the application. This becomes even more complex once you start passing data to it. What if you need to support independent timers for multiple sources?

This is still perfectly good code; I'm just trying to highlight that there are different ways of doing things. And I don't feel like I'm doing a particularly good job of it either, which brings me to the next point.

Is all of this really worth it?

TL;DR: No.

As fun as it is hypothesizing, PHP was developed with different goals in mind. And while the language keeps evolving, if you're dealing with these kinds of concurrency problems, you're probably much better off using Node.js, Go, or even Java. Not only do they already come with async support built in, but their ecosystems were built around it too – meaning you're much more likely to find the tools you need, ready to go. After all, the library was literally trying to replicate Go's goroutines.

That doesn't take anything away from it being a fun experiment, though!
