Hardware random number generators

Robert Davies

Statistics Research Associates Limited

http://statsresearch.co.nz

http://www.robertnz.net

14 October, 2000

This is a paper presented to the 15th Australian Statistics Conference in July, 2000 and the 51st conference of the NZ Statistical Association in September, 2000.

The paper is about the testing and the applications of hardware random number generators.

Part of the work arises from a confidential project so at the present time my software is not available. Sorry.

A hardware random number generator is an electronic device that plugs into a computer and produces genuine random numbers. This is in contrast to the pseudo-random numbers produced by a random number computer program. Several hardware random number generators are available from commercial sources. See my 1997 web page. They are used for generating keys for encryption, winning numbers for lotteries, selecting experimental designs and occasionally for statistical simulations.

Testing a hardware random generator differs from testing a pseudo-random number generator. In particular, if one knows the design of the generator one can tailor the tests to be appropriate for that design. The intended application may also be important when selecting the tests. On the other hand, some of the tests used for pseudo-random number generators aren't very useful when applied to hardware random number generators.

Your random number generator might pass the test but this does not necessarily mean that the generator is good.

Or your generator might fail the test but this does not necessarily mean that the generator is bad.

This is from an engineer I was working with. Testing a random number generator is different from testing, for example, an arithmetic unit. While the first paragraph might apply to the arithmetic unit the second wouldn’t. For those used to testing deterministic devices, testing hardware random number generators involves new concepts and challenges.

Hardware random number generator: a physical device that connects to a computer and produces genuine random numbers: A hardware random number generator may use (for example)

decay times from a radio-active material
electrical noise from a resistor or semiconductor

A hardware random number generator uses a physical phenomenon such as electrical noise from a resistor or semiconductor diode or the decay of a radioactive material for the initial source of randomness. The electronic circuitry of the generator converts this noise to bits and then assembles these into bytes or words for use by the computer.

There may be philosophical question as to whether processes used to generate the random noise are really random – as quantum theory suggests they are. Maybe we could calculate the movements of the electrons in the resistor and so model the noise as a deterministic process. This, of course, would impossible, because of huge number of electrons that would be involved even if the processes were really deterministic. So even if God really doesn’t play dice, we can still regard the process as genuinely random, for our purposes.

All the commercial hardware random number generators that I have examined use resistor or semiconductor noise as the source of randomness. There is an Internet site, http://www.fourmilab.ch/hotbits, from which you can download bits generated by radioactive decay. I don’t know whether the generator is available commercially. In this paper I will look at just generators using resistor or semiconductor noise. However, some of the principles discussed here will also apply to radioactive generators.

Here is a picture of one of the commercial hardware random number generators (from Protego in Sweden). This one plugs into the serial port of a computer. You can see the diode at the right hand end, which I presume is the noise generator. I presume that the rest of the circuitry is an amplifier and limiter. This generator uses the computer’s serial port circuitry to digitise the random noise.

Used for

Key word generation in encryption
Generating winning numbers for lotteries
Experimental design
Statistical simulations

The major application is for encryption. One wants to generate a keyword that is hard to guess and using a hardware random number generator is probably the most reliable way of doing this.

Hardware random number generators are also used for selecting winning numbers in lotteries. In particular the New Zealand Lotteries Commission uses one for generating its Keno numbers.

Hardware random number generators are also used for designing for experiments on extra sensory perception (ESP). One of the commercial generators was developed explicitly for this purpose.

The present day hardware random number generators are really too slow for routine use in statistical simulations although it might be very sensible to use them for cross-checking simulations using pseudo-random number generators.

A hardware random number generator generates a series of bits. I will say that they are uniform random if they have expectation 0.5 and are independent.

I use the term uniform random since a sequence of n such bits interpreted as a binary number will be uniform on the integers 0 to 2ⁿ-1. My tests are oriented towards seeing how closely the output from the generator is uniform random.

For most purposes, we would like the bits (statisticians call them Bernoulli variables – variables taking on the values 0 or 1) produced by the generator to be statistically independent and having expectation (long run average value) 0.5. I refer to a stream of such bits as being uniform random since a sequence of n such bits interpreted as a binary number would have a uniform distribution on the integers 0 to 2ⁿ-1. Some people would describe the bits as being perfectly random or completely random. But I regard this as misleading since, for example, I wouldn’t regard a Markov process, from a genuinely random process, for example, as being imperfectly or incompletely random.

Most of this paper is about testing the hypothesis that the output from a hardware random number generator is indeed, uniform random, or, if it isn’t, describing how it deviates.

A hardware random number generator is different from a pseudo-random number generator, which uses a formula for generating the numbers.

So we need a different approach to testing.

It is a physical device so analyse it like a physical device.

Then reasonably satisfactory tests are possible.

But we need to know about its internal structure.

I regard the problem of testing of hardware random number generator as different from that of testing a pseudo-random number generator. In particular, I consider Marsaglia’s Diehard tests to be generally inappropriate for this purpose. Various reasons for this will become apparent throughout this paper. However, the most important is that the Diehard tests are designed for 32 bit words, whereas hardware random number generators tend to generate their bits one by one and don’t have a 32 bit structure.

The philosophy here is that a hardware random number generator is a physical device, so we will test it like a physical device. That is, look at its structure, decide where the problems are most likely to arise and concentrate on these. But, of course, don’t ignore the possibility of a problem in an area you hadn’t predicted.

Following this approach, I think a reasonably satisfactory testing regime is possible. I think this is in contrast with testing pseudo-random number generators where the testing process is never really satisfactory.

As a footnote, if we are going to test it as a physical device, we need to know about its internal structure or we won’t be able to carry out the most effective testing. This means that the manufacturer has to tell us what is inside (or we have to take one to bits).

George Marsaglia

Random numbers fall mainly in the planes
Diehard tests
Random number CD-ROM
http://stat.fsu.edu/~geo/diehard.html

George Marsaglia is a well-known random number guru. His paper random numbers fall mainly in the planes showed that if you plotted the results of many pseudo-random number generators in several dimensions [e.g. plot (X₁,X₂,X₃), (X₂,X₃,X₄), (X₃,X₄,X₅), … as points in 3 dimensional space where X₁,X_2,... are the 32 bit words produced by the pseudo-random number generator] then the points tended to lie on a lattice rather than being uniformly distributed. This resulted a rethink of some of the traditional pseudo-RNGs and the development of new ones.

Marsaglia has developed a series of statistical tests known as the Diehard tests principally for testing pseudo-RNGs. In a certain sense, these do not constitute a formal test since there are so many tests you must get some significant deviations. My approach is to run the series of tests and look by eye for groups of significant deviations or for single extremely significant deviations.

George Marsaglia released a CD-ROM of random numbers which included series from 3 hardware generators. He claimed that none passed his Diehard tests. One of the hardware generators was the Tundra/Newbridge generator that I have tested extensively and the reason it failed was a programming or data handling error. More about this later. But I think this may have tended to discredit hardware generators.

Here is a flow diagram of a hypothetical hardware random number generator including both the hardware and software components.

The analogue noise source is the noise resistor or semiconductor and the digitiser converts the output of the noise generator into a sequence of digital 0s and 1s. These form the raw output from the generator. These will usually be biased, that is, their expected (long run average) value will be different from 0.5. So a corrector is necessary, to bring the expected value closer to 0.5. An example of a simple corrector takes two input bits and outputs the exclusive-or of these two bits.

I recommend taking the exclusive-or of the corrector output and the output of a pseudo-random number generator. This serves two purposes; first to mop up in the remaining deviation from uniform randomness and secondly to guard against bad spots in the hardware generator output or failure of the generator in the middle of, for example, the lottery draw.

The letters in the diagram indicate test points. It will not usually be possible for the user to check point A, but the manufacturer may provide some information. See, for example, the Protego web site, http://www.protego.se, for some oscilloscope traces at point A of their generator.

I think it is important to be able to test the uncorrected output at point B. If we are limited to point C then testing is significantly less effective. It is most unsatisfactory if we are limited to testing at just points E or F. Testing at D, E and F is just to check that all the processes are working correctly. They are important, but not so much a topic of this paper.

For a more complex generator the tests considered here will need to be amended. For example, the Tundra/Newbridge generator has 8 generator units each supplying one bit in a byte. These need to be tested separately using the methods described in this paper. In addition, one also needs to check the independence of the separate streams.

As another example, some of the commercial generators have only one generator unit but return their output in blocks of 4, 8 or 32 bits, for example. At least some of the tests will need to be adapted to take account of this structure.

XOR corrector

Take successive pairs of bits in input stream and take exclusive-or (XOR) of each pair. Use this for the output stream. If the bytes in the input stream are independent & have bias e then the bias in the output stream is -2e ² .

Von Neumann corrector

Take pairs as before. If the two bits are different use the first one; if they are the same throw them away. If the bytes in the input stream are stationary the output stream is unbiased.

Here are descriptions of two types of correctors used to reduce bias in a stream of random bits.

The XOR corrector just takes the exclusive-or of pairs of bits and dramatically reduces bias if the original bits are independent.

The Von Neumann corrector also looks at pairs of bits but uses the first one if the bits are different and otherwise throws them away. If the input stream is stationary the output is unbiased. However it may still be auto-correlated if the original stream is auto-correlated. Also note that the Von Neumann corrector will produce a biased output if there is a cycle with period 2 on the input stream – and if the corrector is implemented in hardware it may well interfere with the generator and produce just such a cycle.

Raw

101100101001000110100101110010001010010111

XOR

1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 1 1 0

Von Neumann

1     1 1 0   0 1 1 0 0     1   1 1 0 0

Here is an example of the processing by the two correctors. Alternate pairs of points in the raw series are coloured blue to help you line up the columns.

Bias
Drift
Short term auto-correlation
Other short term dependencies
Discrete frequencies

1/f noise
Other non-whiteness
Bad spots
Back door

Here is a list of deviations from uniform randomness that might occur in the raw bits. The generator is likely to be biased, which is why we need the corrector stage. The bias may drift over time. While this will largely be corrected by the corrector, if there is drift, eventually the generator may drift outside the range that can be handled by the corrector or by the amplifiers in the generator. So we need to check for this.

If we sample the analogue output too fast, adjacent bits will be correlated. So we need to check for short term auto-correlation. It is possible to have short-term dependencies apart from correlation so it is a good idea to check for other types of short term dependencies. [Throw a fair coin twice and then simulate a third throw by pretending we got an H if the previous two throws were the same and T if they were different. Then the three throws are uncorrelated but they are not independent since given any two we can deduce the third.]

It is possible for the generator to pick up external electrical interference. The most obvious effect of this will be discrete frequencies from the electrical mains and the various oscillators present in a computer.

Semiconductor noise tends to have an excess of low frequency – for very low frequencies the spectrum is proportional to 1/f to some power near one. Even if the generator is using a resistor for generating the noise there will be some semi-conductor noise from the amplifier so we should check for excess low frequency noise.

There may be other distortions to the frequency. For example, some circuits for random number generators on the Internet show a capacitor between the noise device and the amplifier. This would chop out the low frequency.

Bad spots are short periods where the generator ceases to work, possibly due to electrical interference or to extreme excursions of the generator overloading part of the circuitry.

Back door refers to deviations from uniform randomness deliberately introduced by the manufacturer to enable those in the know to guess keywords or win lotteries. Suppose instead of being a hardware random number generator the device was really a high quality pseudo-random number generator with a 40 bit seed. It would be hard to detect this from the output but it would be computationally feasible for some-one in the know to search through all the keywords it might generate.

Defect	Test	Relevance
bias	mean	both
drift	chi-sq on block mean	raw
short-term autocorrelation	autocorrelations 1 to 7	both
short-term dependence	4,8,16 bit chi-square & monkey tests	both
frequency components	periodogram & spectral analysis	raw
1/f noise	autocorrelation?	raw
non-whiteness	spectral analysis	raw

Now look at the kind of tests we might use to look for these defects – deviations from random uniformness. We should look at both the raw series of bits (point B in the flow diagram) and corrected series (point C). For most of the tests I have used 10 megabyte samples but on occasion go up to 1 gigabyte.

For bias, obviously, we look at the mean. For the raw series we want an estimate. For the corrected series, depending on how good the corrector is, we may want an estimate or we may want a test.

We normally expect there to be bias in the raw series. So the tests which are applied to the raw series, apart for the test for bias, must be insensitive to bias. So our random number test suite much provide versions of these tests which are insensitive to bias.

For drift, I suggest dividing the series into a small number of blocks and testing that the mean is the same in each block. A number of other tests for drift are also possible.

For short term auto-correlation, obviously work out the auto-correlations. I look at the first auto-correlation by itself and the auto-correlations 2 to 7 as a combined test.

For other short term dependencies I use the chi-square goodness of fit test for the distribution of words consisting each of 4, 8 and 16 bits after partitioning the sample into 4, 8 or 16 bit words. The name monkey test was introduced by Marsaglia and is like the chi-squared test but uses all 4, 8 or 16 bit sequences of bits rather than just disjoint ones. For most alternatives it is rather more sensitive than the chi-squared tests. In both cases, the 16 bit version does not have a lot of sensitivity and is probably not a lot of use.

I think I need to do more investigation on the detection of 1/f noise. The one example I have of semi-conductor noise that really showed this effect had a power of around 0.75. i.e. I was seeing (1/f )^0.75 noise. For this type of noise just looking at the first auto-correlation does a reasonable job. See my 1987 paper on Hurst effect.

Usually, when one looks at the results of the tests, these results are not very exciting. The generator passes the test and there is nothing more to report. Here is an exception that shows the spectrum of a noise generator badly contaminated by a frequency component. This is from the German generator on Marsaglia’s cd-rom. Of course, Marsaglia’s Diehard tests said the generator was hopeless but didn’t say what was wrong.

This is the spectrum from the Canadian generator on the cd-rom and shows what a spectrum should look like. When I was first doing these analyses, I would just look at a spectrum like this and say it doesn’t go over the bounds very often so it is OK. I didn’t get away with this for long and have had to provide an explicit significance test.

Defect	Test	Relevance
bad spots	chi-squared on blocks	both
back door	???	both
unknown and other	bit-within-word means	both
	variance of run length	corrected
	CR/LF test	both
	Maurer	corrected
	sparse monkey	corrected

I don’t think there is a good test for bad spots. I divide the series into blocks of around 1000 bytes and do an 8 bit chi-squared goodness of fit on each block then sum the results. This is probably more sensitive than the simple chi-squared test for detecting bad spots that might upset just a few of the blocks. But I doubt whether it is very sensitive.

I doubt whether there are any good tests for back-doors although I would be very surprised if one existed in a commercial product. It is unlikely that a generator with a back-door would exhibit bias or simple a auto-correlation structure. With the generators used by the Lotteries Commission I simply increased the sampling speed and watched the auto-correlation increase. This, together with the observed bias, suggested very strongly that it was a genuine hardware random number generator and not a pseudo-random number generator in a fancy box.

I also include a number of general purpose tests for attempting to detect defects we hadn’t thought of or are relevant only to specific generators.

If the generator assembles the bits into 8 bit bytes or 32 bit words it is a good idea to check the means of the individual bits to check that, for example, the bits at the ends of the byte or word aren't being lost. I originally included this test because the information I needed for it arose out of another test. But it does seem worth including because the electronics can lose bits at the beginning or end of a word.

A run is a sequence of just 0s or a sequence of just 1s. There is not much point in looking at average run length since this provides no new information beyond that already available in the mean and lag 1 auto-correlations. But the variance of the run length does contain new information, so one can look at this as general purpose test with no specific alternative in mind. A run length test, but not this one, is used in the FIPS 140-1 (section 4.11.1) test so I wanted a runs test in my set of tests.

I have included the Maurer test because it is often recommended. It is not at all sensitive and, in my experience, when it is beeping the other tests are screaming.

The sparse monkey test is an adaptation of Marsaglia's DNA test, but looking at long sequences (eg 27 bits) of single bits and seeing if there are any sequences that occur much too frequently. I doubt that it is much use, but it might trap a poorly hidden back door.

The only test I am going to talk about in detail is the carriage return/line feed one.

ASCII copy from UNIX to PC

Binary representation of 10 => two bytes 13,10

(e.g. Marsaglia’s Canadian and German files)

ASCII copy from PC to UNIX

Byte pair 13, 10 => single byte 10

Same effect if you do an ASCII write with a binary file on a PC

If you copy a file from a UNIX machine to a windows PC in ASCII mode the binary representations of 10 get converted to two bytes, a 13 followed by a 10. There is a corresponding error if you copy in the opposite direction. This error happens so often in real life that I now include a test for this error in my set of tests. It happened to Marsaglia’s samples for the Canadian and German random numbers and it is why he so strongly rejected the Canadian one. There is a slight bias in the Canadian one so it does fail one of the Diehard tests after you fix the copying error. But the failure is not spectacular.

There is a related problem with the convention used by the computer for the order in which the bits or bytes are stored in a word. PCs follow the little-endian convention and some mainframes follow the big-endian convention. Copying between machines that follow different conventions might rearrange the bytes and so reduce the effectiveness of the correlation and monkey tests, in particular. There can also be a problem if you are using a program written for one convention on a system designed for the other convention - e.g. Marsaglia's Diehard tests used on a PC.

Must always be prepared to be surprised.

But – there is no obvious mechanism in the RNG for producing long-term dependence apart from simple trends and cycles.

Only a few general purpose tests for this kind of effect.

Emphasis is on testing for and measuring the kind of defects that one would expect to see.

Defects are likely to be more obvious and simpler before the corrector rather than after so emphasis is on testing the raw bits whenever possible.

In the testing procedures we have to be prepared for the unexpected, which is why I have included some general tests.

However, it is hard to see a mechanism for producing complex long-term dependence patterns apart from trends and cycles. So the testing concentrates on bias and short-term dependence. Long-term dependence is really hard to detect because there are so many possibilities and tests that look for a wide range of possibilities will not be at all sensitive.

Defects are likely to be more obvious and simpler before the corrector than after, so if at all possible, we want to do the majority of the testing on the raw sample.

Classify tests as

very important – report 1% significance

important – report 0.1% significance

less important – report 0.01% significance

In general, print out -log₁₀(significance probability)

The test suite involves a large number of tests, especially as there are several versions of some of them. If we do the complete set we are very likely to get some significances even from a perfect generator. I have classified the tests into three groups. For the very important I report any that are significant at the 1% level. The tests classified as very important depend on whether one is looking at the raw or corrected dataset. However they include the bias, autocorrelation, the 4 and 8 bit monkey tests. The less important ones include the general purpose tests.

With this classification the number of false alarms is reasonably manageable. If a test does report a significant result one can run another sample and do the test again.

application	accuracy	true or pseudo	speed
encryption	medium	true or mixed	low for use, medium for test
encryption - OTP	medium	true	medium to high
lottery	high	true	low for use, medium for test
experimental design	medium to high	?	low for use, medium for test
statistical simulation	compromise	?	high

I want to look briefly at the accuracy and speed requirements for the various applications. For different applications one has different accuracy requirements so the final application is also relevant to the testing process.

For keyword generation in encryption one can estimate how close to uniform random the generator needs to be and the answer is that minor variations from uniform random don’t greatly reduce the security – this is not surprising since one uses only around 100 bits in keyword (or maybe a few 1000 in some applications). Similarly, speed is not a problem for generating the keys since only a small number of bits are involved. Generators that run at around, say, 1000 bytes per second or more will be sufficiently fast. But for testing we do want moderately high speed since we probably want to test on a 10 megabyte set of numbers and don’t want to wait all day for them. Around 10,000 bytes per second is fine for this. Ideally you need a hardware random number generator, but there are schemes involving a pseudo-random number generator plus some random input that seem reasonably secure.

There is a particular type of encryption known as one-time-pad. For this, speed may be important, since one will want to generate quite a number of megabytes. This has to be in a secure situation and so we won't want the generation to take too long. I don't know if one-time-pad is used in practice these days. For one-time-pad only moderate accuracy is required but a hardware RNG must be used.

For lotteries, accuracy is important. For Keno we might require that the probability of each possible draw is within 10% of the theoretical value. One can use this to work out the required accuracy of the hardware generator. Speed is not important for the actual draw, but it is important for testing, not only the generator itself but also for testing the processing software between points E and F in the flow diagram.

I suspect the requirements for experimental design are similar, although it is peculiar circumstances that require the use of a hardware random number generator.

For statistical simulations the big requirement is speed. I include simulation of statistical tests, simulations of complex models, Markov chain Monte Carlo etc in this section. Unless you need a lot of numbers you may as well use a pseudo-random number generator. So if you are going to use genuine random numbers you will need a lot of them. I suggest compromising on accuracy and leave out the corrector if you can, run the generator faster than recommended and rely on the combining with a pseudo-random number generator to remove the bias and auto-correlation. But there is no need to generate the numbers as you need them. At 10,000 bytes per second you can get around 1 gigabyte per day which can be stored on disk and accessed as required. I think a reasonable approach would be to use a good pseudo-random number generator for simulation most of the time but occasionally cross-check results with a simulation based on a hardware generator.