Thursday, May 31, 2007

Random Software

We take a break from our regular conference schedule to think about software.

Random Numbers and π

During a talk given at the conference by a friend, determining the digits of π came up as an example of computational complexity. This got me musing about something I did several years ago, when I demonstrated the Monte Carlo technique for someone by showing how it could be used to calculate the value of π.

The idea is pretty simple. Take a square with a side of length 2, and draw a circle of radius 1 inside, so it touches all 4 sides. The area of the square is therefore 4, and the area of the circle is π. This means that the ratio of the total area (the square) to the circle's area is 4:π, or 1:π/4.

The Monte Carlo technique is to select random points and determine the ratio of points inside the circle to the total number of points. This gives an approximation of the ratio of the areas, and the more points you have, the better the approximation. In the limit, with an infinite number of points, the ratio approaches π/4.

Picking a random point is trivial, as you just need values for x and y which are between -1 and 1. This gives a square which is centered on the origin. The point is then within the circle if:
  x² + y² < 1

Actually, we can simplify a little further if we just pick the top right quadrant. This means that x and y are now between 0 and 1. The area of the quadrant of the square is 1, and the area of the quarter circle is now π/4, which is the same ratio.

So I knocked up a short program to compare these ratios and see how well it converges. I did it in just a couple of minutes, so rather than test for convergence, I printed out every 10th iteration so I could eyeball how it was going. This manual step was interesting, as it helped me to see what the data was doing.

An important aspect of doing this is the "randomness" of the random number generator (RNG). For an approximate answer you don't need a particularly good one, but if you want the result to keep improving over time then the generator needs some important properties. First, you want a flat distribution over the space, meaning that every possible output has the same probability. Second, given any two observations of the same sequence of n outputs from the RNG, the probability that the (n+1)th output is also the same in both should be negligible; this is another way of saying that you don't want cycles. There are plenty of other properties, but these are the first two that come to mind.
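
As a quick illustration of the first property, here's a rough sketch (mine, not from any formal test suite) that buckets the output of random() and reports any bucket whose count strays more than an arbitrary 1% from the expected value. For 10 million samples, a decent generator should stay well inside that threshold:

#include <stdio.h>
#include <stdlib.h>

#define BUCKETS 100
#define SAMPLES 10000000L

int main(void) {
    long counts[BUCKETS] = {0};
    long i;
    double expected = (double)SAMPLES / BUCKETS;

    /* random() % BUCKETS has a tiny modulo bias, but it's negligible
       next to the deviations we're looking for */
    for (i = 0; i < SAMPLES; i++)
        counts[random() % BUCKETS]++;

    for (i = 0; i < BUCKETS; i++) {
        double err = (counts[i] - expected) / expected;
        if (err < 0)
            err = -err;
        if (err > 0.01)   /* arbitrary 1% threshold */
            printf("bucket %ld off by %.2f%%\n", i, err * 100);
    }
    return 0;
}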

Unfortunately, the random number generators in POSIX aren't that great. They are in fact pseudo-random number generators (PRNGs). The rand(3) function is actually terrible, so it was supplemented with the drand48(3) family, and eventually random(3) (and the associated srandom(3)). However, these are all based on pseudo-random functions. This means that they are cyclical functions, with only some of the bits of the internal state returned as output (otherwise the output would encode the entire state of the function, meaning you could calculate the next number in the series, making it even less random). While they look sort of random, the distribution often leaves a lot to be desired.
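
To make the "cyclical function" point concrete, here is a minimal sketch of a linear congruential generator of the kind behind many historical rand(3) implementations (the constants are the classic ones; the sketch itself is mine, not a quote of any particular libc). The whole state is 32 bits, so the sequence must repeat within 2³² steps, and the low-order bits are far worse: the lowest bit of the raw state simply alternates:

#include <stdio.h>

static unsigned long state = 1;

/* advance the generator: the next state is a fixed function of the
   current one, so the sequence is completely determined by the seed */
static unsigned long lcg_next(void) {
    state = (state * 1103515245UL + 12345UL) & 0xFFFFFFFFUL;
    return state;
}

int main(void) {
    int i;
    for (i = 0; i < 8; i++)
        printf("%lu ", lcg_next() & 1);   /* prints 0 1 0 1 0 1 0 1 */
    printf("\n");
    /* which is why implementations return only a slice of the state,
       e.g. (state >> 16) & 0x7FFF, rather than the raw value */
    return 0;
}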

The Linux /dev/random driver does its best to accumulate entropy in the system, from the timing and content of keyboard/mouse actions, hard drive delays, and network packet timing, but that is usually an easily exhaustible pool of data, and so it is typically just used to "seed" one of the PRNGs. (I once used an email encryption program that just used the raw data from this device. Every so often it would lock up while waiting on "random" data. I could restart it again by wiggling the mouse, hence providing /dev/random with some more data.)
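
The usual compromise looks something like this (a sketch of mine, reading from /dev/urandom, the non-blocking counterpart that is the typical choice for seeding):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned int seed;
    FILE *f = fopen("/dev/urandom", "rb");  /* won't block if entropy is low */

    if (f == NULL) {
        perror("/dev/urandom");
        return 1;
    }
    if (fread(&seed, sizeof seed, 1, f) != 1) {
        fclose(f);
        fprintf(stderr, "short read from /dev/urandom\n");
        return 1;
    }
    fclose(f);

    srandom(seed);   /* kernel entropy seeds the fast PRNG */
    printf("seeded with %u, first output: %ld\n", seed, random());
    return 0;
}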

A better alternative is an external device that generates "true" random data by measuring thermal noise, background radiation, or something similarly obscure. However, to my knowledge these are generally only used in high end security applications.

The immediate thing that I noticed was that the ratio (×4) approached 3.14 very quickly, but then it started diverging again. No matter how many points were added, it would fluctuate as far out as 3.09 and 3.2.

My first thought was that the accuracy of my intermediate values may have truncated some bits. So I moved the integer calculations of the random number up to 64 bits (making sure I didn't cut off any high order bits by accident), and moved the floating point calculations out to 128 bits. This is far more than is actually required, but at least I could now guarantee that I wasn't truncating anything.

The resulting code isn't very good, but in my defense I was trying to pay attention to the talk that I mentioned at the beginning:
#include <stdio.h>
#include <stdlib.h>     /* for random() */

#define FREQ 10                 /* print every FREQ-th estimate */
#define MAX_RND 0x7FFFFFFFULL   /* largest value random() can return */

int main(void) {
    long all;                   /* total points generated so far */
    long in = 0;                /* points inside the quarter circle */
    long double x, y;
    long double pi;
    unsigned long long maxRndSq = MAX_RND * MAX_RND;

    /* note: random() is unseeded, so every run gives the same sequence */
    for (all = 1; ; all++) {
        x = random();
        y = random();
        /* inside the quarter circle if x^2 + y^2 < MAX_RND^2 */
        if (((x * x + y * y) / maxRndSq) < 1)
            in++;
        if (!(all % FREQ)) {
            pi = (long double)(4 * in) / all;
            printf("%Lf\n", pi);    /* %Lf, since pi is a long double */
        }
    }
    return 0;
}
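(If you want to try it, it should build on any POSIX system with something like cc -o pi pi.c, and since the loop is infinite you just interrupt it with Ctrl-C once you've seen enough output.)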
Despite all this, the numbers didn't converge any better than they did when I was using fewer bits of accuracy. There may be other reasons, but I believe that it's just because the PRNG I was using is not really random enough.

Thinking along these lines, it occurred to me that this might be a simple and easy test to check if a PRNG (or even a RNG) is any good. Of course, you'd automate the test to check for:
a) Convergence
b) Convergent value

Convergence is a little awkward, since the value fluctuates around a bit before settling down. But if you check the deltas over time it wouldn't be hard.
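
Here's a sketch of what that automated check might look like (my own, with arbitrary choices for the sampling interval, window, and tolerance; link with -lm for fabsl). It declares convergence once the estimate stops moving between samples, then reports the settled value against the known π:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAX_RND 0x7FFFFFFFULL
#define CHECK   100000L               /* sample the estimate this often */
#define WINDOW  10                    /* consecutive small deltas required */
#define EPSILON 1e-4L                 /* arbitrary convergence tolerance */
#define PI      3.14159265358979323846L

int main(void) {
    unsigned long long maxRndSq = MAX_RND * MAX_RND;
    long double x, y, est, prev = 0.0L;
    long all, in = 0;
    int stable = 0;

    for (all = 1; ; all++) {
        x = random();
        y = random();
        if (((x * x + y * y) / maxRndSq) < 1)
            in++;
        if (!(all % CHECK)) {
            est = (long double)(4 * in) / all;
            /* (a) convergence: count consecutive samples where the
               estimate barely moved */
            stable = (fabsl(est - prev) < EPSILON) ? stable + 1 : 0;
            prev = est;
            if (stable >= WINDOW) {
                /* (b) convergent value: compare against the known pi */
                printf("settled at %.6Lf (error %.6Lf) after %ld points\n",
                       est, fabsl(est - PI), all);
                return 0;
            }
        }
    }
}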

Testing for the convergent value is easy, since we already know the value for π. Speaking of which, I counted this the other day and realized that I only know 27 decimal places. It occurred to me that if I can get up to 35, then I will know the same number of places as my age. Then I only need to remember one digit every year to keep up. By the time I'm an old man it might even be kind of impressive (not 67,000 digits impressive, but impressive nonetheless). David calls me a geek.

(Thanks go to Bruce Schneier for the books that taught me about PRNGs all those years ago.)

Interviews

A little while ago I had an idea for some audio software that might clean up some of the digital distortion that radio broadcasters have to put up with when interviewing people remotely. I'd forgotten about it until I listened to the recent Talis interview with David.

This interview was done over Skype, and recorded with Call Recorder for Skype, from Ecamm Network. Setting aside some unfortunate ambient sound at the Talis end, the recording is still of poor quality, specifically because of the audio degradation that occasionally occurs on the sound coming from David. This degradation is an essential compromise for a real-time conversation, but anyone downloading the podcast is not getting the data in real time, so it is not a compromise they should be forced to make.

The solution that came to me is to have a Skype plugin at both ends (this should be easy to arrange for the interview). The plugin then does local recordings at both ends, and incorporates out-of-band timestamps into the audio data stream. Once the interview is over, the plugin then negotiates a file transfer from the interviewee to the interviewer. The process at the interviewer's end can now merge the two local recordings, and voilà, you have a recording with neither participant sounding like they are speaking across a digital chasm.
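
As a very rough sketch of that merge step (everything here, from the frame format to the field names, is invented for illustration, and a real implementation would align and mix actual audio samples rather than just interleaving frames):

#include <stdio.h>

/* hypothetical format: each end tags its locally recorded audio with a
   position on a shared timeline negotiated at call setup */
typedef struct {
    unsigned long long timestamp_us;  /* microseconds on the shared timeline */
    unsigned int       length;        /* payload bytes that follow */
} frame_header;

static int read_header(FILE *f, frame_header *h) {
    return fread(h, sizeof *h, 1, f) == 1;
}

static void copy_frame(FILE *src, FILE *dst, const frame_header *h) {
    char buf[4096];
    unsigned int left = h->length;
    fwrite(h, sizeof *h, 1, dst);
    while (left > 0) {
        size_t n = fread(buf, 1, left < sizeof buf ? left : sizeof buf, src);
        if (n == 0)
            break;                    /* truncated input */
        fwrite(buf, 1, n, dst);
        left -= n;
    }
}

/* interleave the two local recordings by timestamp; each side's audio
   comes from its own end, so neither passes through the lossy link */
static void merge(FILE *a, FILE *b, FILE *out) {
    frame_header ha, hb;
    int more_a = read_header(a, &ha);
    int more_b = read_header(b, &hb);

    while (more_a || more_b) {
        if (more_a && (!more_b || ha.timestamp_us <= hb.timestamp_us)) {
            copy_frame(a, out, &ha);
            more_a = read_header(a, &ha);
        } else {
            copy_frame(b, out, &hb);
            more_b = read_header(b, &hb);
        }
    }
}

int main(int argc, char **argv) {
    FILE *a, *b, *out;
    if (argc != 4) {
        fprintf(stderr, "usage: %s side_a side_b merged\n", argv[0]);
        return 1;
    }
    a = fopen(argv[1], "rb");
    b = fopen(argv[2], "rb");
    out = fopen(argv[3], "wb");
    if (!a || !b || !out) {
        perror("fopen");
        return 1;
    }
    merge(a, b, out);
    return 0;
}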

Anyone want to build this? I can think of a dozen radio stations that need it. :-)
