Back when I did the first Mongrel I realized that the perfect data structure for routing requests by URL was the ternary search tree. This let Mongrel route large numbers of URLs very quickly by their prefix without having to parse the path at all. The downside though was you couldn't do anything more than prefix matching, which was fine in Mongrel. In Mongrel2 I wanted to go with the TST again, but I also wanted to try out a new idea, where I added the Lua pattern matching code so you get both fast prefix matches followed by patterns similar to a regular expression.
As of commit 079c177756 I have that working, and it took me just one day. Best of all, using the two data structures, the entire piece of code that meshes them together is:
int RouteMap_collect_match(void *value, const char *key, size_t len)
{
assert(value && "NULL value from TST.");
Route *route = (Route *)value;
return pattern_match(key, len, route->pattern) != NULL;
}
list_t *RouteMap_match(RouteMap *routes, const char *path,
size_t len)
{
return tst_collect(routes->routes, path, len,
RouteMap_collect_match);
}
There's some initial setup code, but the meat of the actual algorithm is just this. I use the tst_collect function to find all the values that have a given prefix, and then use the pattern_match function to weed out any that don't fit the pattern.
The pattern code was taken from Lua's code, and then I stripped out the captures feature since they aren't necessary yet. The TST code is from an old Dr. Dobb's journal and I've been using it for years and tweaking it so it looks nothing like the original. Combined the entire routing systems is less than 500 lines of code:
mongrel2 $ wc -l src/routing.c src/pattern.c src/adt/tst.c 63 src/routing.c 229 src/pattern.c 139 src/adt/tst.c 431 total mongrel2 $
Of course this will probably get bigger as features are needed, but not by much. I anticipate other parts of the system will simply use this for matching hosts and paths against patterns for routing purposes, and this code will only slightly increase as needed.
I find the average programmer's lack of understanding of the TST data structure and its variants maddening. Right up there with not knowing how to write a parser. Even worse is that since they don't understand them they usually scoff at using them in any applications outside of what they're used to. Put a parser in your web server and people think you're mad, even if it blocks many security attacks.
I imagine using a TST for the routing will get the same reaction. Programmers will read about it, not try to understand the code at all, and the scoff because they would use a hashmap instead. Yes, and then you'd write something way more complex than what I've written that does the same thing for one very simple reason: You can't hash an arbitrary prefix.
You see, with a TST I pay a penalty initially in RAM and insert speed, but after that I can quickly find all matching nodes for a given prefix by just scanning the input string once. With a hashmap you have to parse out the path components, hash each component, walk a nested set of hashes, and then collect up all the possible hits, and then try to weed out the ones that don't match. In a way, using hashes is like doing a more complicated TST for a path based routing system.
A TST is much simpler for this need. They beat Red-Black trees (which are nasty and require too many string comparisons), hashmaps, and pretty much every data structure for any application requiring prefix matching. No obviously you could probably sit down and craft a scheme that was faster than the code I got, but it would also not be 139 lines of clean C.
Next is the Lua pattern matching language. Lua uses a tiny little bit of C code to implement a very basic kind of "mini-regex" system. By tossing out features that very few people use in a regex they were able to give a nice little alternative that's easy to implement. I've used it in a few things that needed basic string pattern matching, and it's a great bit of code. A little hacky with its "goto as tail recursion" but otherwise nice and understandable.
What I've always wondered is, could you combine the two. At first what I wanted to do was have the Lua pattern matching code walk the TST and collect the matches. This would sort of turn it into an N-way pattern match algorithm. When I attempted that it was way too hard to get right and made the code more complex. Once I scrapped that and went with the simple "find the prefixes, run matching on those" it was a simple yet powerful combination.
How I'm doing this is when you insert your pattern the routing system looks for the first '(' in your pattern like '([a-z]+)'. If it finds one it registers everything up to that point as the "route", and then uses the whole string as the pattern. If you don't have that then it's a simple prefix match.
Later when the routing system is queried the TST does all the work of finding all the common prefix nodes, and then calls my little pattern match function on them. The patterns are nice and tightly processed by the Lua code and only those that match are left. If the pattern is matched completely in the TST then it's simply returned as-is, assuming you found a whole match and don't need to use a pattern.
You can take a look at the unit test to see how it's used. Keep in mind there's a slight logic bug I'm hunting down that makes it hard to match multiple patterns on the same prefix, but otherwise it works great.
I kept the C code simple as I could and hopefully other people could steal this design. I also anticipate it'll be easy to add specific routing patterns to the code, and maybe bring back captures for a complete routing system.
However, a warning is in order....
Nothing brings out the Bikeshedders like templating, unit testing, and routing. Never mind that in a web application those two parts of the system are the least of your performance woes. People will still try to get their routing system to go from 2ms to .2ms while their ORM loads whole tables into a view and only memcached can save them. Routing and templates are something most programmers can understand in a naive way. Parsing HTTP, handling socket I/O, creating an ORM, and writing a server aren't, so they don't have any opinion on those.
Added to this is that it's easy to test the speed of a routing system. Make a ton of string, try to route them, do it in a loop. That's all you need and you can start complaining about the speed, so that's what everyone does. The purpose of the Mongrel2 routing system is a little different from other web frameworks, so comparing speed isn't really fair in either direction. Mongrel2 needs to just match URL and hosts, probably tons of them, and then route them quickly to a backend. It doesn't need captures, parameters, integer conversion, or many other features.
This code is a massive bikeshed magnet because it has the following properties:
Right away, the bikeshedders are gonna start benchmarking this 1 day old code against their current system. Of course their very well baked multi-year project might be faster, it might not. Next they'll start comparing features, and complain that their favorite routing system can map arbitrary arguments to different functions and then check that they are integers or strings. After that they'll start seeing how they can add their favorite features, or tweak the speed with ugly hacks.
Who cares. The purpose of the Mongrel2 routing system at this stage is simply the following:
That's right, I want this code to get the full bikeshedding treatment. I'm almost anticipating that it will be the first thing every hacker wants to improve and I'm all for it. If you want to get in and fix this, improve it, make it do new stuff, then go for it. I'm all for the patches.
However, any contributions have to keep these three design requirements in place. Yes, even the 3rd one.
Today I'm gonna go out to a coffee shop in SF and try to get the final data structure design done. It's almost there, and once it's ready I'll be working on how you actually configure and deploy a Mongrel2 service.
Stay tuned for more.