In this motivational post, we will see what issues software testing faces in practice. Much can be done to improve the results.

Let’s first turn our attention to the limits of software testing.1

1 Denotational Semantics

It turns out that when we speak of “functions” in the context of computer programming languages we implicitly refer to something quite different from what we consider “functions” in mathematics. Our computer functions are just models of actual functions that they are deemed to represent. Even in Haskell. Even in total functional languages. The reason is almost trivial: a reduction to the set of computable functions is not enough. Even though the set \(\mathbb{N}\) of natural numbers is countably infinite, with the first infinite cardinal number \(\aleph_0\) as its cardinality, the set of all indicator (characteristic or Boolean) functions \(2^{\mathbb{N}}\) is already uncountable, i.e., \(\# 2^{\mathbb{N}} = 2^{\aleph_0} = \mathfrak{c} = \# \mathbb{R}\), where \(\#\) denotes the cardinality function and \(\mathfrak{c}\) the cardinality of the continuum . At the same time, the set of all programs that the computer can execute is finite. And this set of all indicator functions contains one that solves the Halting problem.

The same problem arises when we use the floating-point arithmetic to model our actual model of arithmetic. The set of all floating-point numbers is finite. But even \(\mathbb{Q} \simeq \mathbb{N}^2\) is (countably) infinite. This (extremely crude!) approximation however has a well developed error theory in numerical analysis, relying on the immense work that laid ground for the IEEE 745 standard of floating-point arithmetic. Even though it is very hard to track down all numerical errors individually, we have good error estimates that provide the bounds of error. The tragic Apollo catastrophe stirred up the drive for standardization. Nevertheless, up until recently Microsoft Excel had been exposing its users to a long-known bug in its use of the floating-point arithmetic.

But when it comes to computer programs as approximations to our actual models, we are still not there yet. All we can deal with are the functions continuous in the Scott topology. This topology is much weaker (i.e, coarser) than the canonical topology of the real line.

Let \(P\) and \(Q\) denote a pair of posets and \(f : P \to Q\) a function between them. Then \(f\) is called Scott–continuous2 if, for every directed subset \(D \subseteq P\) that has a supremum \(d^*\) in \(P\), its image \(fD\) has a supremum \(q^*\) in \(Q\), such that \(q^*=f(d^*)\), i.e., if it preserves directed suprema; in symbols, if the following implication holds, \[\begin{array}{Rcl} \forall{D \subseteq P} &:& \big( (\forall{a,b \in D}\; \exists{c \in D} :\; a,b \le c) ,\; \exists{d^* \in P}:\; d^* = \sup{D} \big) \\ &\Rightarrow& \exists{q^* \in Q}:\; q^* = \sup{fD},\; q^* = f(d^*). \end{array}\]

The Scott topology is coarser than the Alexandrov topology, defined as the collection of all upper subsets of a poset. The Sierpinski topology is the Alexandrov topology \(\mathfrak{A} := \{\, \emptyset,\; \{1\},\; \{0,1\}\,\}\) on \(2\). Each function into \(2\) is an indicator of its domain. So \(2^{\mathbb{N}}\) consists of all the indicators of the subsets of \(\mathbb{N}\). An indicator \(1_A\) is continuous iff \(A\) is open in \(\mathbb{\mathbb{N}}\), which induces a bijection \(\mathfrak{T}(\mathbb{N}) \simeq \mathrm{C}(\mathbb{N}, 2) \subset 2^{\mathbb{N}} \simeq \mathbb{R}\), where \(\mathfrak{T}\) assigns the topology to topological spaces, and \(\mathrm{C}(\mathbb{N}, 2)\) denotes the set (in fact, the algebra) of continuous functions \(\mathbb{N} \to 2\). By monotonicity of Kuratowski operators, the strict subset relation is preserved for the Scott topology and any base of the canonical topology of the real line.

So whenever we try to draw conclusions from this crude model of computer programs to the actual specification, we are inducing another model error, one we can’t estimate.

Domain theory and denotational semantics try to estimate the gaps and provide rules for rigorous reductions and discrete approximations.

2 Software Testing

2.1 Caveat Emptor

Software testing as a mathematical problem contains the Halting problem, and is therefore undecidable.

All and any software testing is but an extremely crude approximation to checking whether the intended representation is correct. There is only a finite set of patterns that we can test. And even \(\aleph_0\) is so much larger than the cardinality of that set. And if it is used to approximate our actual function with domain of cardinality \(\mathfrak{c}\), things get much, much more complicated. So a good approach would be to discretize our actual problem and to check how good an implementation at hand we have as compared to our discretized problem. All we can do otherwise is hope that we discover some pattern that we might have missed. But even then, failure to discover a perhaps uncountable set of invalidating patterns will remain opaque to us.

2.2 Uncertainty Remains

What I mean by this is simple: most students of mathematics are introduced to rudiments of measure theory; therein, the set of real numbers \(\mathbb{R}\) is furnished with a natural measure \(\lambda\), known as the Lebesgue measure on the real line. Now, for this measure, all countable subsets of \(\mathbb{R}\), such as \(\mathbb{Q}\) and \(\mathbb{N}\) have measure zero, so when integrating a function with domain \(\mathbb{R}\), we can neglect its values on the countable union of all countable subsets of its domain, so as to neglect potential pathologies. This leads to the integral of zero for the Dirichlet function \(1_{\mathbb{Q}}\), for example. But it does take on a value. And by extension, under this measure, all computable functions have integral zero. So to deal with such discrete domains we consider discrete measures, like in probability theory. Now consider such a discrete measure defined on the real line, say the Bernoulli measure \(\mu\{0\} = \mu{1} = 1/2\) but on \(\mathbb{R}\). Now under this measure, all larger sets are null sets. Consider a simple Boolean function \(\operatorname{id} : 2 \to 2\), where \(2 = \{0,\,1\}\). Its integral with respect to \(\mu\) is \(\mu\, \operatorname{id} = \int_2 \operatorname{id} \,d \mu = 0 \cdot \mu\{0\} + 1 \cdot \mu\{1\} = 1\), and the other bijection \(f(0) = 1\) and \(f(1) = 0\) has \(\mu\, f = 0\).

And this is effectively what software testing is. Consider the popular metric of test coverage as such a discrete measure. What it effectively claims is that out of a finite set of test samples, everything seems fine, for the specified (finite!) number of test cases and functions in the code. But this says nothing at all about the uncountable set of possible test patterns that we cannot even think of here, but which can come up in reality.

This is similar to virus antibody tests taking all around the world right now. They are subject to false positive (Type I error) and false negative (Type II error) outcomes.

2.3 A Silver Lining

Now, despite this gloom and doom picture painted by formal methods, in reality our computer program will face only inputs that come from another program. All such inputs must be computer-representable numbers. Such numbers are the integers and the floating-point numbers, for most of modern-day computers. So to ensure maximal practical correctness we can safely restrict our attention to this actual domain in which our computer functions will operate. And this is crucial.

Still, a practical software test will rarely cover all possible inputs! This is where a type system comes to help. Haskell’s for instance is highly advanced in its expressivity and power. In contrast, most OOP languages lack much of that expressivity.

Oh, by the way, how do we test higher-order functions? Such functions take other functions as arguments, e.g., an f :: (a -> b) -> c take a function g :: a -> b as an argument. How do we test f? We must somehow generate a function as a test value! Later we shall see how ingeniously QuickCheck solves this problem. But languages like C++ on the other hand don’t have functions as first-class objects, and a template type like Function<S, T> does not even remotely describe the structure of a function; this shortcoming is a bit alleviated with the somewhat confusingly named concept of “concepts” introduced with C++20, however, compilers are just not there yet. C# is better in this respect, with testable Func<S,T>, even though fairly limited. As far as I can speak, the developers of Swift seem to have made the right choices, and Kotlin would be my second choice. Rust is good in this regard too.

A type system is still imperfect, as it is also but an approximation and as an implementation underlies its own bugs, like any compiler does. Its guarantees are limited by its correctness with respect to its specification.

3 Nested Hierarchies by Testing Priority

The aspects highlighted so far are all of technical nature. But a software project as a whole must first and foremost meet its budget constraint. We must estimate how much of our financing we can dedicate to software testing. This depends on the desired quality of our product, which depends on the domain of the problem and the demand of the end-user. Many modern retail-oriented software products are produced in an agile fashion, with bugs fixed (and introduced!) and features rolled out steadily through consecutive updates. While in business-oriented software product, in B2B software, its quality is expected to be at a whole lot higher level, simply because business users occupy an application domain with a very different risk exposure to the results of the software programs. Already depending on this concerns, a product manager must allot the limited resources efficiently. There are many other stakeholders in a software product, still, whose risks and interests must also be taken into account when planning a software product. But it is this resulting financial constraint that determines the ultimate upper bound on the rigor of software testing practices.

Nevertheless, we need to establish a certain viable degree of rigor when testing software and may not forget that simple fact from above: failure to find an error does not mean that there is no error! Reputation is at stake. In this sense, a type system in concert with a rigorous test suite establish a minimal viable correctness assertion that we can make in practice. But whereas we can vest our trust into Haskell and GHC, which, unless we participate in its development, we can’t control, the test suite is all in our hands, and we must exercise pertinent rigor in developing it.

In other words, there is a hierarchy of “mission criticality,” wherein more mission-critical software dedicated to application domains such as medicine, finance, or space and aero-engineering, requires a more comprehensive error analysis. Whereas less mission-critical software such as a text editor needs less rigor, e.g., if we only can guarantee that the document state across saves and edits is maintained and resilient to text document corruption, all other tests become of secondary concern, since we then would have managed to guarantee the essential property of a text editor product. Such internal product-dependent hierarchies are in fact a great way to manage costs and expectations.

4 Attitude Toward Software Testing

So we must modify our perspective at software testing. Software testing is not a means to formal verification which would establish formal correctness, but solely an heuristic to guide us in our assertions about the properties of our program, a sort of a hint like “it seems okay so far” whenever all tests are passed. But this assertion also depends on how we devise our test suite and the correctness of the testing framework!

Instead, we should reason about software testing as a verification of our drastically reduced model of our computational domain problem. The problem with which we begin need not be computation in nature, but we derive a computational solution to (at least a part of) it.

Here the problem must first be well-posed mathematically, and then be made amenable to an algorithmic solution, and then by means of implementation of the algorithm amenable to computation.

Our conceptual mathematical model is one level, its discretization another, its constructive algorithmic approximation another, and its computer implementation, in the scope of which we perform our testing, is a fourth level of modeling. In numerical analysis and science in general, this gradual relationship is well-known, and software testing corresponds essentially to simulation. The mental change from level to level of modeling occurs often subconsciously. But if we want to estimate the error introduced thereby, we will have to make it explicit.

5 Estimation of Untested Error

Since many companies today are founded around a software project, the estimation of the error interspersed between each of these levels of modeling becomes crucial to prediction of the viability of the entire business.

We must know how well we can control this risk of error. This lies at the heart of test-driven software development, derived from the roots of extreme programming, where test cases are formulated upfront and code is written with iterative refinement so as to satisfy the test cases as a spec. With Haskell, we have our type checker as our partner and the type signatures as our spec. We augment it with test cases to try to catch unexpected behavior that slipped through the type checker, so in concord with the type checker we increase the quality of our software. Most other languages lack this help of a highly advanced type checker, whereas some languages like Coq (Gallina), Idris, and Agda have a far more advanced type system, with thereby a stricter type checker, and are best suited for as formal proof assistants, as of writing, rather than general-purpose industry-grade software development platforms. Haskell managed to strike the sweet spot.

5.1 Psychological Aspect

Regardless, speaking of software as such, and since it is such a porous and subjective approach to quality assurance, we should add in a psychological aspect to it: it is crucial to be aware of the lack of the reciprocity in the failure to find an error, i.e., if we didn’t find any error, it doesn’t entail that there is no error indeed. This is just conventional wisdom. But the deadline stress lures many developers into believing that they can justify their sloppiness, perhaps due to accrued fatigue, by delegating the responsibility to the test suite, which not even them personally need have written.

As you see, there are many aspects to software testing, some of which I tried to highlight.

6 Outlook

This article is work in progress. As time goes by, I will rewrite some passages, add insight, and perhaps revise some things. Much remains unspoken here about software testing and quality assurance, which is a discipline in its own right.

Now having stated some major caveats of software testing, it’s finally time we examined some examples. I’m preparing the next installment of software testing awesomeness, where I’m going to compare unit testing in Haskell using HUnit, with Java using its originator JUnit, and in C++ using CppUnit or Boost.Test, and perhaps in C# using NUnit. By means of that comparison we will recognize how much or how little of our program we can actually test in the scope of actually the same conceptual unit testing framework (“xUnit”), depending on the expressive power of a language.

Another pivot to software testing lies in quality assurance of database queries. In fact, nested and otherwise complex SQL queries are prone to error. A minor typo in a query brings the potential of inflicting a hefty hit to the budget with just a single query. We will see how we can employ Haskell’s type system to guarantee that such mishaps are impossible. This will both enable advanced query optimization so we can balance data processing costs and let us sleep at night, resting assured that all database transactions work as expected. Stay tuned!

  1. By the way, if you read German, I highly recommend Dirk W. Hoffmann’s book “Die Grenzen der Mathematik,” in which this computer scientist dives into the roots of computational and logical limitations of our modern-day mathematical problems. It’s an easy read for the motivated.↩︎

  2. And a subset \(A \subseteq P\) is said to be Scott–open if it is an upper set such that, all directed sets \(D \subseteq P\) with supremum in \(A\) intersect \(A\), i.e., \(D \cap A \ne \emptyset\). The Scott–topology on a poset \(P\) is then just the collection of all Scott–open sets of \(P\).
    Canonically, a function between posets is Scott–continuous iff it is continuous with respect to the corresponding Scott–topologies.↩︎