Hypothesis testing for statistical model checking

This web page is a companion to our papers "On Hypothesis Testing for Statistical Model Checking", by Daniël Reijsbergen, Pieter-Tjerk de Boer, Werner Scheinhardt, and Boudewijn Haverkort, in the STTT journal (August 2015), and "Interactive comparison of hypothesis tests for statistical model checking", by Pieter-Tjerk de Boer, Daniël Reijsbergen, and Werner Scheinhardt, at the Valuetools conference (December 2015).
It is a "living document" that gets enhanced with new features and/or tests as they are developed.

Below the figure and its controls, we explain what the figure shows.
For more details, we refer to the papers; the STTT paper gives all the mathematical details, while the Valuetools paper gives a gentle introduction and describes this website.



[Interactive figure (HTML5 Canvas; requires a recent browser). It plots hypothesis tests as lines in the (N, Z)-plane. Controls: allowed error probability α=β; indifference level δ=ζ; guess for p−p₀: γ; decision threshold p₀; N-range; Z-range; checkboxes to show class-I tests (risking a wrong conclusion if p is close to p₀), class-II tests (risking no conclusion if p is close to p₀), class-III tests (risking a long calculation if p is close to p₀), and the "expected" lines; and a simulation panel with the real p (for simulation) and the calculated parameter values.]

Statistical model checking is a procedure to check whether some (model of a) system displays certain behaviour (e.g., a train runs on time) with a probability that is above or below some specified threshold (e.g., a requirement set by the government). We assume that we have access to a simulation model of the system, which can generate random samples of the system's behaviour; for each sample, it returns 1 if the behaviour of interest occurred and 0 if it did not. We then run this model a large number of times, count how often it returns 1, compare that count to the threshold, and try to draw a conclusion.
[Note: the simulation we talk about here is the simulation of the system being modeled or studied; the simulation buttons in the above tool are different: they simulate the hypothesis test procedure itself, with the system simulation replaced by a coin flip.]
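
As a minimal sketch of this sampling procedure, in Python: the function sample_system() below is a hypothetical stand-in for one run of the system simulator, here simply a biased coin flip.

    import random

    def sample_system():
        # Hypothetical stand-in for one run of the system simulator:
        # returns 1 if the behaviour of interest occurred, 0 otherwise.
        # Here it is simply a biased coin flip with success probability 0.8.
        return 1 if random.random() < 0.8 else 0

    N = 1000
    successes = sum(sample_system() for _ in range(N))
    print(f"{successes} out of {N} experiments returned 1 "
          f"(estimated probability {successes / N:.3f})")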

Example: Suppose our threshold is 0.6 (i.e., 60%). We run our simulator 1000 times, and suppose it returns 1 in 800 of those experiments. Intuitively, that suggests the probability of interest is about 80%, which is well above our threshold of 60%. But how sure can we be of this?
That is what this webpage and our papers are about: techniques for drawing statistically sound conclusions, like the following: "Based on our experiments, we can be 95% sure that the probability of this behaviour is above the threshold of 60%".
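
For a rough feeling of why 800 out of 1000 is convincing, here is a back-of-the-envelope check using a fixed-sample-size normal approximation to the binomial. This is not one of the sequential tests from the papers, just an illustration:

    from math import sqrt
    from statistics import NormalDist

    N, successes, p0 = 1000, 800, 0.6
    p_hat = successes / N
    # Standard error of the estimate p_hat (normal approximation).
    se = sqrt(p_hat * (1 - p_hat) / N)
    # How many standard errors p_hat lies above the threshold p0.
    z = (p_hat - p0) / se
    # One-sided confidence that the true p exceeds p0.
    print(f"p_hat = {p_hat}, z = {z:.1f}, "
          f"confidence = {NormalDist().cdf(z):.6f}")

With z ≈ 15.8, the conclusion p > 0.6 is far beyond the 95% level here; the interesting (and harder) cases are those where the true p is only slightly above or below the threshold.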

More precisely: we do N experiments, and call their outcomes (0 or 1) Xᵢ for i = 1, ..., N. We call the threshold p₀, and define Z as the sum of all the Xᵢ minus N·p₀. Clearly, Z > 0 hints that the true probability p is above p₀, and vice versa.
The question then is how large to choose N, and for which values of Z we can conclude with sufficient confidence that p > p₀ or p < p₀. A set of rules for drawing such conclusions from N and Z is called a "statistical hypothesis test", and our work compares several such tests for this class of problems.
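
The statistic Z is easy to track while sampling; a sketch (reusing a biased coin flip with true p = 0.8 as the "system"):

    import random

    def z_path(samples, p0):
        # Successive values of Z = X1 + ... + Xn - n*p0, one per sample.
        z, path = 0.0, []
        for x in samples:
            z += x - p0
            path.append(z)
        return path

    samples = [1 if random.random() < 0.8 else 0 for _ in range(10)]
    print(z_path(samples, p0=0.6))

Each sample moves Z up by 1 − p₀ (on a 1) or down by p₀ (on a 0), so when the true p is above p₀ the path drifts upward on average.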

Each of the coloured lines in the figure represents one such test. An experiment consists of repeatedly drawing samples and recording the results, thus following some path in the (N, Z)-plane that the figure represents. As soon as this path crosses the coloured line of the chosen test, a conclusion may be drawn. In principle, each test comes with a guarantee that its probability of drawing a wrong conclusion is below some user-chosen value, called α or β in the figure.
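
As one concrete example of such a test, here is a sketch of Wald's classical sequential probability ratio test (SPRT), which distinguishes p ≥ p₀+δ from p ≤ p₀−δ with error probabilities at most α and β. The parameter names mirror the controls above, but this is a simplified illustration, not the exact implementation behind the figure:

    from math import log
    import random

    def sprt(sample, p0, delta, alpha, beta, max_n=100000):
        # Wald's SPRT for H1: p >= p0 + delta versus H0: p <= p0 - delta.
        # `sample` is a function returning one observation (0 or 1).
        p_high, p_low = p0 + delta, p0 - delta
        lower = log(beta / (1 - alpha))   # conclude p < p0 at or below this
        upper = log((1 - beta) / alpha)   # conclude p > p0 at or above this
        llr = 0.0                         # log-likelihood ratio so far
        for n in range(1, max_n + 1):
            x = sample()
            llr += log(p_high / p_low) if x else log((1 - p_high) / (1 - p_low))
            if llr >= upper:
                return "p > p0", n
            if llr <= lower:
                return "p < p0", n
        return "no decision", max_n

    coin = lambda: 1 if random.random() < 0.8 else 0
    print(sprt(coin, p0=0.6, delta=0.01, alpha=0.05, beta=0.05))

In the (N, Z)-plane the SPRT's two boundaries are straight lines. If the true p lies strictly between p₀−δ and p₀+δ, the test still terminates but may return either conclusion; this is, loosely, the class-I behaviour described below.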

Intuitively, drawing a conclusion is harder when the true p is closer to the threshold p₀: more samples are needed in such cases to still draw a conclusion. On the other hand, if the real p is indeed very close to the threshold, then perhaps the user does not really care whether the conclusion is correct. Thus, tests can be sorted into three classes, based on their behaviour when p gets close to p₀:

class-I tests risk drawing a wrong conclusion if p is close to p₀;
class-II tests risk drawing no conclusion at all if p is close to p₀;
class-III tests risk a very long calculation if p is close to p₀.

Which of these tests is most appropriate depends on the problem at hand, and in particular on the correctness requirements.
Comments about this page can be sent to Pieter-Tjerk de Boer, p.t.deboer@utwente.nl.
If you find this material useful for your own scientific publication, we request that you cite this work through one of our papers on it, listed at the top of this page.