Hypothesis testing for statistical model checking

This web page is a companion to our papers "On Hypothesis Testing for Statistical Model Checking", by Daniël Reijsbergen, Pieter-Tjerk de Boer, Werner Scheinhardt, and Boudewijn Haverkort, in the STTT journal (August 2015), and "Interactive comparison of hypothesis tests for statistical model checking", by Pieter-Tjerk de Boer, Daniël Reijsbergen, and Werner Scheinhardt, at the Valuetools conference (December 2015).
It is a "living document" that gets enhanced with new features and/or tests as they are developed.

Below the figure and its controls, we explain what the figure shows.
For more details, we refer to the papers; the STTT paper gives all the mathematical details, while the Valuetools paper gives a gentle introduction and describes this website.



[Interactive figure (HTML5 Canvas; requires a recent browser). It plots hypothesis tests as lines in the (N, Z)-plane. Controls: allowed error probability α=β; indifference level δ=ζ; guess for p−p₀: γ; decision threshold p₀; N-range; Z-range; checkboxes to show class-I tests (risking a wrong conclusion if p is close to p₀), class-II tests (risking no conclusion if p is close to p₀), class-III tests (risking a long calculation if p is close to p₀), and the "expected" lines; and a simulation panel with the real p (for simulation) and the calculated parameter values.]

Statistical model checking is a procedure to check whether some (model of a) system displays certain behaviour (e.g., a train runs on time) with a probability that is above or below some specified threshold (e.g., a requirement set by the government). We assume that we have access to a simulation model of the system, which can generate random samples of the system's behaviour; for each sample, it returns 1 if the behaviour of interest occurred and 0 if it did not. We then run this model a large number of times, count how often it returns 1, compare that count to the threshold, and try to draw a conclusion.
[Note: the simulation we talk about here is the simulation of the system being modeled or studied; the simulation buttons in the above tool are different: they simulate the hypothesis test procedure itself, with the system simulation replaced by a coin flip.]
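
As a minimal sketch of this sampling procedure, in Python: the function sample_system() below is a hypothetical stand-in for one run of the system simulator, here simply a biased coin flip.

    import random

    def sample_system():
        # Hypothetical stand-in for one run of the system simulator:
        # returns 1 if the behaviour of interest occurred, 0 otherwise.
        # Here it is simply a biased coin flip with success probability 0.8.
        return 1 if random.random() < 0.8 else 0

    N = 1000
    successes = sum(sample_system() for _ in range(N))
    print(f"{successes} out of {N} experiments returned 1 "
          f"(estimated probability {successes / N:.3f})")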

Example: Suppose our threshold is 0.6 (i.e., 60%). We run our simulator 1000 times, and suppose it returns 1 in 800 of those experiments. Intuitively, that suggests the probability of interest is about 80%, which is well above our threshold of 60%. But how sure can we be of this?
That is what this webpage and our papers are about: techniques for drawing statistically sound conclusions, like the following: "Based on our experiments, we can be 95% sure that the probability of this behaviour is above the threshold of 60%".
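
For a rough feeling of why 800 out of 1000 is convincing, here is a back-of-the-envelope check using a fixed-sample-size normal approximation to the binomial. This is not one of the sequential tests from the papers, just an illustration:

    from math import sqrt
    from statistics import NormalDist

    N, successes, p0 = 1000, 800, 0.6
    p_hat = successes / N
    # Standard error of the estimate p_hat (normal approximation).
    se = sqrt(p_hat * (1 - p_hat) / N)
    # How many standard errors p_hat lies above the threshold p0.
    z = (p_hat - p0) / se
    # One-sided confidence that the true p exceeds p0.
    print(f"p_hat = {p_hat}, z = {z:.1f}, "
          f"confidence = {NormalDist().cdf(z):.6f}")

With z ≈ 15.8, the conclusion p > 0.6 is far beyond the 95% level here; the interesting (and harder) cases are those where the true p is only slightly above or below the threshold.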

More precisely: we do N experiments, and call their outcomes (0 or 1) Xᵢ for i = 1, ..., N. We call the threshold p₀, and define Z as the sum of all the Xᵢ minus N·p₀. Clearly, Z > 0 hints that the true probability p is above p₀, and vice versa.
The question then is how large to choose N, and for which values of Z we can conclude with sufficient confidence that p > p₀ or p < p₀. A set of rules for drawing such conclusions from N and Z is called a "statistical hypothesis test", and our work compares several such tests for this class of problems.
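
The statistic Z is easy to track while sampling; a sketch (reusing a biased coin flip with true p = 0.8 as the "system"):

    import random

    def z_path(samples, p0):
        # Successive values of Z = X1 + ... + Xn - n*p0, one per sample.
        z, path = 0.0, []
        for x in samples:
            z += x - p0
            path.append(z)
        return path

    samples = [1 if random.random() < 0.8 else 0 for _ in range(10)]
    print(z_path(samples, p0=0.6))

Each sample moves Z up by 1 − p₀ (on a 1) or down by p₀ (on a 0), so when the true p is above p₀ the path drifts upward on average.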

Each of the coloured lines in the figure represents one such test. An experiment consists of repeatedly drawing samples and recording the results, thus following some path in the (N, Z)-plane that the figure represents. As soon as this path crosses the coloured line of the chosen test, a conclusion may be drawn. In principle, each test comes with a guarantee that its probability of drawing a wrong conclusion is below some user-chosen value, called α or β in the figure.
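
As one concrete example of such a test, here is a sketch of Wald's classical sequential probability ratio test (SPRT), which distinguishes p ≥ p₀+δ from p ≤ p₀−δ with error probabilities at most α and β. The parameter names mirror the controls above, but this is a simplified illustration, not the exact implementation behind the figure:

    from math import log
    import random

    def sprt(sample, p0, delta, alpha, beta, max_n=100000):
        # Wald's SPRT for H1: p >= p0 + delta versus H0: p <= p0 - delta.
        # `sample` is a function returning one observation (0 or 1).
        p_high, p_low = p0 + delta, p0 - delta
        lower = log(beta / (1 - alpha))   # conclude p < p0 at or below this
        upper = log((1 - beta) / alpha)   # conclude p > p0 at or above this
        llr = 0.0                         # log-likelihood ratio so far
        for n in range(1, max_n + 1):
            x = sample()
            llr += log(p_high / p_low) if x else log((1 - p_high) / (1 - p_low))
            if llr >= upper:
                return "p > p0", n
            if llr <= lower:
                return "p < p0", n
        return "no decision", max_n

    coin = lambda: 1 if random.random() < 0.8 else 0
    print(sprt(coin, p0=0.6, delta=0.01, alpha=0.05, beta=0.05))

In the (N, Z)-plane the SPRT's two boundaries are straight lines. If the true p lies strictly between p₀−δ and p₀+δ, the test still terminates but may return either conclusion; this is, loosely, the class-I behaviour described below.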

Intuitively, drawing a conclusion is harder when the true p is closer to the threshold p₀: more samples are needed in such cases to still draw a conclusion. On the other hand, if the real p is indeed very close to the threshold, then perhaps the user does not really care whether the conclusion is correct. Thus, tests can be sorted into three classes, based on their behaviour when p gets close to p₀:

class-I tests risk drawing a wrong conclusion if p is close to p₀;
class-II tests risk drawing no conclusion at all if p is close to p₀;
class-III tests risk a very long calculation if p is close to p₀.

Which of these tests is most appropriate depends on the problem at hand, and in particular on the correctness requirements.
Comments about this page can be sent to Pieter-Tjerk de Boer, p.t.deboer@utwente.nl.
If you find this material useful for your own scientific publication, we request that you cite this work through one of our papers on it, listed at the top of this page.