By Zed A. Shaw

Confounding, External/Internal Validity

When I wrote about epoll vs. poll I sort of knew people would have a hard time with it, but I didn't anticipate the huge response (mostly from a few folks). I mean, I understand that the majority of the history of programming has been more marketing, FUD, and meme spreading than actual analysis. I just figured epoll vs. poll is so banal a topic, and it's only about my Mongrel2 project, so really, what is up with all this insanity over something so little?

A Little Note About Trolls

I think the more important fact is that what I've learned since my Ruby on Rails days is you cannot ignore the trolls. Trolls have become more clever and are very motivated to shape public opinion. If you ignore them then huge swathes of people will simply blindly believe anything the trolls say no matter how wrong or weird it is. I guess it's a guy thing where they'll just believe whoever's tallest, and "tall" on a forum is who's most obnoxious and writes the most.

Before I would just ignore them, thinking that the general public would finally come to its senses. Later I realized the public never does. A simple comment like, "Never hire Zed Shaw because he will destroy your company" can have a major impact on your life if you're not careful.

Sadly, trolling won't end, nor spreading FUD, nor talking tough when you really look like this and probably couldn't hurt a fly. Frankly, I rather enjoy super-trolling. It's fun to get in there and rip on people whose online persona is Mike Tyson when they're really not. I mean, these guys aren't going to listen to me anyway, so might as well have fun, right?

WTF Is Confounding?

One annoyance I have is that, after years of trying to tell people about the simple concept of confounding, programmers still don't get it. In the HN thread the constant theme was that I wasn't testing "real world" applications. Well, obviously; I was performing one specific test that had been performed many times before. I was confirming all the previous poll vs. epoll arguments and data. Just simple, straightforward experimental replication to confirm what I already knew.

Yet, guys who look like this still say things like this (in their best Jim Rome voice):

Yes Zed, where the fuck is it? You're claiming SCIENCE! based on your worst-case synthetic localhost benchmarks, and then turning around and wildly guessing as to real-world performance characteristics with internet latencies.

Apart from the obviously fake tough-guy voice, this statement simply proves a lack of understanding of confounding. I feel it's my duty now to come up with the simplest explanation of confounding that a coder can understand. Because until everyone gets the concept, they'll continue to fail in their evaluation of research and fall victim to bogus information, whether that bogus information is mine or anyone else's.

A Coder's Explanation Of Confounding

The idea of confounding is that, when you're testing something, you want to narrow the test as much as possible to remove any additional influence you can't control. Any remaining influences are then handled with statistical controls. This avoids common problems like mistaking correlation for causation, since, if there are only two variables, there's a higher chance that the correlation between them is valid.

The best explanation of confounding for a programmer is that it's similar to a giant function that should be four functions. You've seen them before: huge, massive functions with convoluted logic and repeated code all over the place. If you're experienced, you know the first thing you want to do is bust that thing up and make a few little functions that are orthogonal. Each one does its own thing.

This giant monster function is confounded because you've mixed in designs and functionality from too many different sources and tied them together in a complicated way. If you're reading from a socket in a function, you don't also parse the results in the same function, because separating the errors of the two activities becomes too hard.
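As a sketch of what that refactoring looks like (the function names here are hypothetical, just to illustrate the idea), compare a confounded function with its split-apart version:

```python
import io

# Confounded: one function reads the stream AND parses the header, so a
# read failure and a parse failure are tangled together in one place.
def read_and_parse(stream):
    line = stream.readline().strip()
    name, _, value = line.partition(": ")
    return name, value

# Split apart: each function does one thing, so each one can fail, and
# be tested, on its own.
def read_header_line(stream):
    return stream.readline().strip()

def parse_header(line):
    name, sep, value = line.partition(": ")
    if not sep:
        raise ValueError("malformed header: %r" % line)
    return name, value
```

Notice that `parse_header` can now be exercised with a plain string; no socket, no I/O errors mixed into the analysis.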

And this is really the crux of confounding: if you have confounding, it's much, much harder to figure out what your errors are. In addition to having a complex analysis that's difficult to draw conclusions from, you also have a harder time knowing if you made errors, or even what your error levels are.

External And Internal Validity

Two of the most common errors you can get from an experiment are a lack of external and/or internal validity. Internal validity is fairly easy to understand, since it's simply whether you made an error in some part of your experiment. Did you get the math wrong, design the variables wrong, gather the data wrong, etc.? External validity is harder, but basically it's whether your experiment holds up when repeated, or can be generalized to other situations and researchers.

In terms of code, this maps easily: internal validity is whether your function has errors, and external validity is whether it can be used outside your software. Your function isn't internally valid if it's got a bug or crashes the process. That's easy to understand.

What's tricky is external validity and functions, because your function could be fine, but then someone on your project uses it wrong and BOOM! It can also mean that the function is so heavily confounded in your system that you can't take it out and share it.
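One way to picture that mapping in code (a toy example, nothing from a real project): internal validity is the function doing what it claims, and external validity is whether it can be lifted out and reused without dragging your system along with it.

```python
# Internally valid: does exactly what it claims, easy to check directly,
# and trivially portable to any other program.
def mean(xs):
    return sum(xs) / len(xs)

# Internally fine, externally weak: it leans on module-level state, so
# pulling it into another program drags this hidden dependency along.
_samples = []

def record_and_mean(x):
    _samples.append(x)
    return sum(_samples) / len(_samples)
```

Both functions compute a correct mean, but only the first one survives being shared.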

Removing confounding increases your internal validity because there are fewer ways you can get the experiment wrong. In the same way, if you make your functions simpler there are fewer chances for there to be bugs.

However, removing confounding tends to sort of help your external validity, but sort of not. Confounding helps the external validity of your experiment (or function) because it's modeling potentially more of the real world. More people can look at it and see that you covered more variables and start to agree with it. Confounding hurts your external validity though in the same way it makes your functions buggy. If your experiment is too confusing, too hard to replicate, depends on too many factors specific to your environment, or just has bugs (internal validity) then it won't be externally valid.

Another way to say this is, if you make your function simple and easy to use, then you can share it with others. But if it's too simple then nobody will want to use it because it doesn't do what they need. That's the relationship between confounding and external validity.

How That Matters For epoll vs. poll

My epoll vs. poll experiment has high internal validity because I kept it simple and I redid the same experiment everyone has used for years. I changed only one variable (the number of active file descriptors) and that's it. Since the experiment was an acceptable way to compare epoll vs. poll performance when dealing with different levels of file descriptor activity, it's very internally valid.

Yet, even then I make sure to not believe it and ask people to compare the results. The fact that the test has been done by thousands of other people over the years, and is widely accepted, and is simple to run and analyze means that it has good external validity, but only if you are clear that it's testing one specific thing:

epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

That's it. It's not testing the "real world" performance of a full-on server. It's not testing how sockets work inside poll or epoll. It's not testing disk performance or disk latency. It's not testing phone lines, serving porn, making donuts, nothing else. It's specifically testing that.
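To get a feel for what a test like that actually measures, here's a rough sketch of the idea (not the actual harness): register N pipe file descriptors, make only K of them active, and time one poll() sweep. K/N is the active/total ratio; a real comparison would run the identical sweep against epoll (Linux-only, via `select.epoll`) across a range of ratios.

```python
import os
import select
import time

def time_poll_sweep(total_fds, active_fds):
    """Time one poll() sweep over total_fds pipes, of which only
    active_fds have data waiting.  active_fds / total_fds is the ATR."""
    pipes = [os.pipe() for _ in range(total_fds)]
    poller = select.poll()
    for r, _w in pipes:
        poller.register(r, select.POLLIN)
    # Make only the first active_fds descriptors readable.
    for _r, w in pipes[:active_fds]:
        os.write(w, b"x")
    start = time.perf_counter()
    ready = poller.poll(0)          # one sweep, non-blocking
    elapsed = time.perf_counter() - start
    for r, w in pipes:
        os.close(r)
        os.close(w)
    return len(ready), elapsed
```

For example, `time_poll_sweep(1000, 600)` measures a sweep at ATR = 0.6, the crossover point claimed above; nothing else about a real server is being exercised, which is exactly the point.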

Now, if enough people rerun my experiment and confirm my results then I'll be able to use that research in Mongrel2 to see if I can make it scale real well. This is where I'll get additional external validity.

You see, I'll try out the ideas in Mongrel2 and then test its speed in various ways. However, this test will do some additional things that the little epoll vs. poll experiment has shown could be important, like tons of idle connections and lots of fast-moving data; basically, different levels of active/total ratio (ATR).

If that proves to be a win, or at least have the same performance, then I'll have another experiment to report on that has more external validity. Other people can test Mongrel2's implementation and even try their own to see if it helps them. The cycle will repeat and hopefully we'll see an improvement that everyone can agree is a good move.

If I put this design idea in Mongrel2 and it's slow as shit, well I'll report that too. As I get more and more externally valid (aka "real world") I'll learn more and be able to tell others what I tried. Maybe someone will have a better way to do it, maybe it'll be a dead end.

Either way though, the process I'm following is simply keeping the internal validity high and balancing the external validity with usefulness by controlling confounding.

The Elephant In The Room

The real question though is, given that my test is internally valid, why wasn't this trade-off between epoll and poll understood earlier? I'm much like everyone else in that I thought epoll and kqueue were way faster than poll. Yet, using the exact same test epoll advocates used, I easily found that it's not. And it's not by a fairly significant margin.

What this boils down to is: if someone says my test is invalid, then the whole basis for epoll being faster is invalid too, since it rests on the same test. And if my test is valid, then my experiment shows that the claim that epoll is simply faster is invalid anyway.

Basically, if people try to say that this test is bogus, as in having no external validity, then where the hell were they when everyone was using it to confirm that epoll was faster?

I'd really like for there to be a large suite of these tests then, potentially very externally valid and useful for testing kqueue as well. There are quite a few BSD-style OSes that use kqueue, so tests of this kind could help make everything faster.

But more importantly, if there are tests that are not confounded, easily performed, and externally valid then we can start to see real improvements. Maybe then the Linux kernel can just optimize epoll so that it's as fast as poll. That'd solve the problem very neatly.

Finally, don't take anything I'm saying to mean you're an idiot. This is just how things work when you're working in a field where knowledge and information are important. You get new information and then you're wrong. Oh well. It's not that big of a deal.