1 Testing

Software testing is aimed at discovering and assuring properties of units, components, or the entire system. In CI/CD, testing can be organized as a distinct stage that constitutes a pipeline in its own right, with testing stages arranged in sequence or in parallel so as to facilitate the development, integration, deployment, and delivery of software functionality to the end user's execution or run-time environment. Each test comprises at least one test case, which specifies the actions to be taken, or the inputs and outputs to be processed and matched. Tests should cover functional and qualitative properties of the program, including correctness of the implementation, absence of errors, performance bounds, etc.

Tests are run in a testing environment, which should ideally match the end-user’s deployment environment.

If it’s a pet project or a PoC for internal use, testing can be of secondary concern. But if your project is mission-critical, it must be tested. All financial, medical, physical, automotive, or aero-engineering applications, to name a few, must meet comprehensive quality and reliability requirements.

Haskell and GHC together provide strong type-level guarantees, which compare very favorably with other mainstream industry-grade production-ready languages. There have been attempts at increasing Haskell's guarantees further, e.g., with Liquid Haskell or Dependent Haskell, but that's a different story. GHC, like every compiler, has its own bugs, so the guarantees are somewhat tainted. There are many facets to quality assurance in software engineering. We delve deeper into the type-theoretic guarantees in other articles on my blog; here we'll dive into the software testing part of it.

In this post we’re going to see how we can leverage Haskell’s type-system guarantees to even further improve the quality assurance for our software by harnessing Haskell testing facilities.

In fact, we can use Haskell’s testing facilities anytime with any non-Haskell project. Say we develop a C++ program: we can embed that C++ program in a Haskell testing environment and run the tests there; all we need to do is map the types.
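A minimal sketch of the FFI bridge that makes this possible. Here libm’s `sin` stands in for a function exported from our hypothetical C++ code (compiled with C linkage); once imported, it can be driven by Haskell test inputs like any native function:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Foreign.C.Types (CDouble)

-- Map the foreign type (double) to its Haskell counterpart (CDouble).
-- In a real project this import would point at our own C/C++ symbol.
foreign import ccall unsafe "math.h sin"
  c_sin :: CDouble -> CDouble

main :: IO ()
main =
  -- The foreign function is now testable from Haskell.
  putStrLn (if abs (c_sin 0) < 1e-12
              then "foreign function behaves as specified at 0"
              else "FAIL")
```

For a real C++ component, the exported functions would be wrapped in `extern "C"` and the argument/result types mapped via `Foreign.C.Types` in the same way.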

We’ll focus on dynamic testing, with test data generated automatically or provided by hand.

It is important to make clear that software testing is not a formal verification method; it only enables heuristic validation. The testing problem is combinatorial in nature (covering all feasible permutations of input and output pairs), and as a decision problem it subsumes the halting problem, which is undecidable.

1.1 Black-, White-, and Gray-Box Tests

Some libraries provide white-box (structural) testing, while others are specialized in black-box (functional, property-based, invariant) testing.

There is also a variety of software testing approaches and models. Fairly common is a hierarchy of test “levels”: essentially four steps distinguished by their degree of abstraction or specificity.

1.2 Unit tests

At the lowest level of abstraction (and the highest level of specificity) in this hierarchy, the smallest coherent pieces of code, such as functions, are called “units.” The notion need not be that restrictive: essentially any piece of code can be considered a unit. Unit tests are written by the developer in a white-box fashion.

In unit testing, we want to isolate parts of our software and make sure that they individually work as intended. How they interact and whether they work well together is a different question which we are not concerned with at this level; it is answered at the integration level.
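A dependency-free sketch of the idea: we isolate a single function (the “unit”, here an illustrative `clamp`) and check it against hand-picked input/output pairs. With HUnit, each `check` below would become a `TestCase` with an `assertEqual`:

```haskell
module Main where

-- The unit under test: clamp a value into a closed interval.
clamp :: Ord a => a -> a -> a -> a
clamp lo hi = max lo . min hi

-- A tiny assertion helper standing in for a real test runner.
check :: (Eq a, Show a) => String -> a -> a -> IO ()
check name expected actual
  | expected == actual = putStrLn ("PASS: " ++ name)
  | otherwise          = putStrLn ("FAIL: " ++ name
                           ++ " expected " ++ show expected
                           ++ ", got " ++ show actual)

main :: IO ()
main = do
  check "inside interval"   5  (clamp 0 10 5)
  check "below lower bound" 0  (clamp 0 10 (-3))
  check "above upper bound" 10 (clamp 0 10 42)
```

Note that the test cases exercise `clamp` in isolation, with no other component involved; that is exactly the unit-level scope.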

1.3 Integration tests

When writing software, we compose functions or objects. Our entire modern-day mathematics is a construction derived from a basic set of axioms in a particular logic. All system operations amount to mutations of CPU registers. Composition is everywhere, whether sequential or parallel.

An integration test aims to heuristically ensure the compliance of one component’s API with other components’ APIs. Each component or module exposes a set of functions, variables, or classes to everyone else. Suppose our program composes components A and B. An integration test must ensure that both components’ interfaces are consistent with each other. But, in contrast to a unit test, we do not look inside!
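A sketch of this with two illustrative “components” meeting at an interface: component A renders a key-value config to text, component B parses it back. The integration test exercises only the composed round trip through the public API, without peeking at either component’s internals (all names here are made up for the example):

```haskell
module Main where

import Data.List (intercalate)

-- Component A: rendering (its public API is `render`).
render :: [(String, String)] -> String
render kvs = intercalate "\n" [k ++ "=" ++ v | (k, v) <- kvs]

-- Component B: parsing (its public API is `parse`).
parse :: String -> [(String, String)]
parse s = [ (k, drop 1 v)
          | line <- lines s
          , let (k, v) = break (== '=') line ]

main :: IO ()
main = do
  let cfg = [("host", "localhost"), ("port", "8080")]
  -- The integration check: A's output must be consumable by B.
  putStrLn (if parse (render cfg) == cfg
              then "interfaces are consistent"
              else "interface mismatch")
```

If either component changes its notion of the interchange format, this test fails, even though each component’s own unit tests may still pass.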

1.4 System tests

At a higher level, when all our components are integrated and all integration tests pass, we can check whether the entire system works as expected on our particular test cases. The desired input and output pairs must be matched by the actual output of our system.

1.5 Acceptance tests

User-acceptance (beta) test cases are best developed in collaboration with the end users or domain experts, playing through real-world scenarios for the program. This may include any regulatory and contractual constraints that the end user must comply with.

The distinction between alpha and beta testing is usually simple: alpha testing is acceptance testing on the developers’ premises or environment, while beta testing is acceptance testing on the end users’ premises or environment. With infrastructure as code, provisioning and accessing environments can be included in the developers’ test pipeline, in which case “beta” testing amounts to the involvement of domain experts on the buy side.

From a cost-efficiency perspective it should be apparent that the sequence of tests should follow this hierarchy: unit -> integration -> system -> user acceptance. It is good practice to devise a set of acceptance tests prior to beginning the project, to serve as the ultimate assurance and guideline from which the developers should not deviate. They can then be refined in the process of continuous delivery.

Intertwined with each of these four stages are functional and nonfunctional tests, and performance analysis.

2 Systems Design and Architecture, Requirements, and Tests to Assure Behaviors and Qualities

A system may exist ephemerally, but when we need to deal with it, different actors will have different perspectives on it, and different views (functional, informational, organizational, infrastructural) will represent the different concerns of the stakeholders.

In systems design, we specify the domain and the data models along with the architecture of our program. In fact, our program is just a method applied to the domain and acting upon the data. So our architecture specifies how the constituent pieces compose the entire system by interaction.

By devising a systems architecture, we provide a schematic view for the design of the system that describes the domain models. One part of the process is the functional modeling of our desired system: a typical top-down software development process begins with an eventual functional relationship and tries to decompose it into ever smaller constituent functions that (preferably exactly) compose to the original top-level function. The flip side is the architectural modeling of the system.

Functional requirements specify the desired behavior of a function in terms of a functional relationship between stated inputs and outputs. Recall that in mathematics, a relation \(R \subseteq A \times B\) is called a function if it is left-total and right-unique (aka “functional”). Left totality means that for each element \(a\) in the input set \(A\) there exists an element \(b\) in the output set \(B\) such that \(aRb\). Right uniqueness means that any element of \(A\) is mapped to at most one element of \(B\): for every choice of elements \(a \in A\) and \(b, c \in B\), if \(aRb\) and \(aRc\), then \(b = c\) (in words: if two elements of \(B\) are related to the same element of \(A\), they must coincide).
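These definitions can be rendered executably for finite relations, which makes the notions concrete (the function names below are ours, not standard library ones):

```haskell
module Main where

-- A finite relation R ⊆ A × B, given as a list of pairs.

-- Left totality over a finite carrier set A: every a ∈ A is related to
-- at least one b.
leftTotal :: Eq a => [a] -> [(a, b)] -> Bool
leftTotal as r = all (\a -> any ((== a) . fst) r) as

-- Right uniqueness: whenever aRb and aRc, then b = c.
rightUnique :: (Eq a, Eq b) => [(a, b)] -> Bool
rightUnique r = and [ b == c | (a, b) <- r, (a', c) <- r, a == a' ]

isFunction :: (Eq a, Eq b) => [a] -> [(a, b)] -> Bool
isFunction as r = leftTotal as r && rightUnique r

main :: IO ()
main = do
  print (isFunction [1, 2] [(1, 'x'), (2, 'y')])           -- a function
  print (isFunction [1, 2] [(1, 'x'), (1, 'y'), (2, 'z')]) -- not right-unique
  print (isFunction [1, 2] [(1, 'x')])                     -- not left-total
```

A functional requirement, in this light, is exactly a prescribed set of such input/output pairs that the implementation must realize as a function.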

There are many approaches to a concrete realization of these notions. But generally speaking, a domain model includes the model of the business processes and functions, whereas the system model includes the program logic and the requirements.

2.1 Views and Viewpoints on the System

If we partition our system by stakeholder concerns, we arrive at what is known as viewpoints, amounting to a separation of concerns. The viewpoints determine the views, i.e., interfaces to (or representations of) the system limited to the respective viewpoint: only those aspects are rendered that are of interest to the particular class of stakeholders. This includes the various structural and architectural, functional, behavioral, and procedural perspectives on the system.

2.2 Requirements

A requirement is the prescription of a condition, capability, or state of the system that must be met or achieved. For our purposes we only need to distinguish between the functional and nonfunctional requirements, which are often conflated with other types of requirements.

Functional requirements prescribe individual behaviors, procedures, effects and activities the system must exhibit, as given by a set of input and output pairs. The implementation of functional requirements is laid out in the corresponding system design intertwined with the app architecture. Behavioral business use cases go hand in hand with functional requirements.

Nonfunctional requirements prescribe structural, operational, or architectural aspects or characteristics that the system must possess and adhere to, i.e., property constraints; they specify the states, qualities, properties, constraints, and anything technical that is unrelated to behavioral aspects of the system, e.g., costs, reliability, reproducibility, etc., but also performance. The implementation of nonfunctional requirements is laid out in the corresponding (technical and organizational) system architecture.

Nonfunctional requirements can be categorized into

  • qualities of the execution process (run-time): availability, reliability, stability, performance (like latency and throughput), security, correctness (adherence to the algorithms and the spec, numerical stability of the implementation, robustness), maintainability, testability, usability (UX)
  • evolutionary qualities: interaction with the ambient environment, maintainability, portability, flexibility (standards compliance), scalability, satisfaction of constraints (cloud cost limits, runtime limits, but also deadlines).
  • nontechnical aspects such as (legal) compliance with laws (privacy) and licensing, or meeting the budget constraints on a cloud service.

All these aspects are noted individually, but in reality they all belong together. They become distinguished when we consider different viewpoints, as described above.

I prefer to consider performance requirements separately; they are validated by benchmarking. For example, in high-frequency trading1 we always strive for low-latency and high-throughput systems, with important pieces embedded in tiny-memory network interface cards, whereas in not-so-high-frequency automated trading we focus more on system correctness and may want to invest more in formal verification. Contrast this with discretionary trading, where the UX tends to outweigh all other concerns — in this situation an average end user usually values the comfort of the GUI over top-grade performance, whereas in the HFT domain the GUI is of value only to analysts and managers, who make discretionary decisions, while the performance concern lies at the heart of the HFT business model.

2.3 Testing

Testing is hard, and testing is expensive. There is a hierarchy of testing approaches based on the detail and depth of the tests. By first testing shallow cases that cover the most fundamental functionality or properties of the unit, integration, or system, we can avoid the cost of passing faulty code down the pipeline, where it would falsify or at least taint any further test results. In CI/CD, testing can essentially be seen as a “sub-pipeline.”

2.3.1 Functional Tests

Functional testing heuristics are aimed at ensuring behavioral properties of the system. They may refer to any level in the hierarchy: unit, integration, system, or acceptance testing.

2.3.1.1 Shallow testing: build, smoke, and sanity tests

  • Build and smoke tests: must cover the most basic functionality of the unit or the system, and are aimed at ensuring that further, more detailed testing down the CI pipeline is feasible, as different teams often run different tests, or different functionality is tested in different environments.

  • Sanity tests: also shallow tests, which amount to weeding out corner cases and obviously false input-output pairs.
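A smoke test in miniature (the `pipeline` function is an illustrative stand-in for the real program): before running the detailed suites, verify that the most basic end-to-end behavior works at all.

```haskell
module Main where

import Data.Char (toUpper)

-- Stand-in for the real program pipeline.
pipeline :: String -> String
pipeline = map toUpper

main :: IO ()
main =
  -- One trivial input; if even this fails, deeper testing is pointless
  -- and the pipeline should stop here.
  putStrLn (if pipeline "ok" == "OK"
              then "smoke test passed"
              else "smoke test FAILED")
```

The point is not coverage but cheap early failure: a red smoke test stops the pipeline before the expensive suites run.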

2.3.1.2 Regression Tests

Every update or bug fix may have unintended consequences that break a feature that worked correctly prior to the change (feature regression), or may affect performance detrimentally (performance regression). The term shares only etymological roots with regression models in statistics (the name stems from “regression to the mean” in the analysis of the average height of men in a study). A regression can be localized to a single unit or module, or it can affect the product at the integration level. Another typical case is a regression due to the program taking a previously unexercised, incorrect execution path. Recall that for a program to be correct, each of its execution paths must be correct.

To mitigate such issues, regression tests are deployed.
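The essence of a regression test, in a small hypothetical example: once a bug is fixed, the triggering input is pinned as a permanent test case so the fault cannot silently reappear. Suppose `mean` once crashed (division by zero) on the empty list; the fix returns `Nothing`, and the regression test pins exactly that input:

```haskell
module Main where

mean :: [Double] -> Maybe Double
mean [] = Nothing   -- the fix: empty input is no longer a crash
mean xs = Just (sum xs / fromIntegral (length xs))

main :: IO ()
main = do
  -- The regression case, kept in the suite forever:
  putStrLn (if mean [] == Nothing
              then "regression test passed"
              else "REGRESSION: mean [] misbehaves")
  -- An ordinary case, guarding against over-fixing:
  putStrLn (if mean [1, 2, 3] == Just 2
              then "ordinary case ok"
              else "FAIL")
```

A healthy suite accumulates such pinned cases over the project’s lifetime; each one documents a bug that actually occurred.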

2.3.1.3 Usability testing

2.3.2 Nonfunctional Tests

Nonfunctional testing heuristics are aimed at ensuring overall qualities of the system. Similarly, they may appear at each level of the hierarchy.

  • docs testing
  • load and stress testing
  • performance testing
  • security testing
  • usability testing: qualities pertaining to the product.

Much can be said about each of these types of tests, and about many more which will remain beyond the scope of this blog post.

3 The Setting

Haskell is renowned for its testing facilities.

There is an ambiguity in the usage of the term “framework,” related to different contexts and scopes of the notion. For our purposes, we will call libraries such as HUnit and QuickCheck, but also Hspec, that provide an idiosyncratic formalism for expressing tests, “testing libraries,” whereas test-framework, HTF, and tasty, but also Hspec, which introduce a unifying interface to more than one testing library, “testing frameworks.” Apparently, Hspec belongs to both categories, for it comprises other testing libraries and at the same time provides its own formalism for expressing tests.

Some major popular testing libraries (some also call themselves “frameworks,” in which case the next group can be termed “meta-frameworks”):

  • Doctest: an adaptation of Python’s doctest package.
  • Hspec (cf. RSpec): a test format in its own right, but also a framework unifying HUnit, QuickCheck, and SmallCheck
  • HUnit: modeled after JUnit
  • QuickCheck: property-based black-box testing
  • SmallCheck: exhaustive property testing over small values, in contrast to QuickCheck’s random generation
  • Hedgehog: property-based testing with integrated shrinking of counterexamples

Some major testing frameworks that provide a unifying layer or interface atop of the testing libraries:

  • Hspec:
    • hspec-hedgehog
    • hspec-laws:
    • hspec-smallcheck
    • hspec-leancheck
    • hspec-slow
    • hspec-test-framework (and hspec-test-framework-th)
    • hspec-multicheck
    • hspec-wai and -wai-json
    • hspec-megaparsec, -parsec, -attoparsec
  • test-framework: comprises HUnit tests and QuickCheck properties in a single interface.
    • relatively actively maintained as of this writing.
    • test-generator (aka test-framework-th) was last updated in 2012.
  • HTF (Haskell Testing Framework): automatically collects individual unit tests (hspec-discover only finds modules)
    • unit tests (HUnit)
    • QuickCheck properties
    • black-box tests
    • custom preprocessor gathering test definitions automatically; failure with exact file name and line number.
  • tasty
    • tasty-laws
    • tasty-lens
    • tasty-th
    • tasty-tmux
    • tasty-wai
    • tasty-travis
    • tasty-stats
    • tasty-hunit
    • tasty-hspec
    • tasty-hedgehog and -hedgehog-coverage

Essentially, Hspec and Tasty are the two major frameworks we will be concerned with here.

Benchmarking:

  • Criterion

We will first investigate unit testing with HUnit and property testing with QuickCheck. After that, we will cover the frameworks that comprise both approaches.
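To convey the flavor of property testing before we reach QuickCheck itself, here is a hand-rolled version using only base: QuickCheck would generate the inputs randomly and shrink counterexamples, while this sketch simply enumerates all small cases, SmallCheck-style (the enumerator `smallLists` is our own illustrative helper):

```haskell
module Main where

-- The property under test: reverse is an involution.
prop_reverseInvolution :: [Int] -> Bool
prop_reverseInvolution xs = reverse (reverse xs) == xs

-- Enumerate all lists up to a given length, with elements in [-bound, bound].
smallLists :: Int -> Int -> [[Int]]
smallLists 0 _     = [[]]
smallLists n bound =
  [] : [ x : xs | x <- [-bound .. bound], xs <- smallLists (n - 1) bound ]

main :: IO ()
main =
  let failures = filter (not . prop_reverseInvolution) (smallLists 3 2)
  in putStrLn (if null failures
                 then "+++ OK, property holds on all small cases"
                 else "Falsified by: " ++ show (head failures))
```

The key shift from unit testing is that we state a universal law rather than individual input/output pairs; the library’s job is to hunt for a counterexample.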

4 Testing and Quality Assurance

4.1 Fundamental Notions

  • Unit test (HUnit, tasty-hunit): see above.
  • Golden test (tasty-golden): unit tests whose results are stored in files; the test passes if the output file matches a reference (“golden”) file.
  • Property test (QuickCheck, SmallCheck, Hedgehog, LeanCheck): specify a property a function must adhere to, and test whether this is the case, in a black-box fashion.
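A dependency-free sketch of the golden-test mechanism that tasty-golden automates (the file name `report.golden` and the `report` output are illustrative): the unit’s output is compared against a stored reference file, and on first run the reference is created for inspection.

```haskell
module Main where

import System.Directory (doesFileExist)

-- The unit under test: render a report.
report :: String
report = unlines ["header", "value: 42"]

main :: IO ()
main = do
  let golden = "report.golden"   -- illustrative reference file name
  exists <- doesFileExist golden
  if not exists
    then do
      -- First run: record the current output as the golden reference.
      writeFile golden report
      putStrLn "golden file created; inspect and commit it"
    else do
      reference <- readFile golden
      putStrLn (if reference == report
                  then "golden test passed"
                  else "output differs from golden file")
```

The golden file is committed to version control, so any change in output shows up as a reviewable diff rather than a silent behavioral drift.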

A framework is aimed at gathering these individual groups of tests into a single uniform test expression that the compiler will evaluate for us, to inform us whether everything’s fine. This is crucial especially when introducing changes to pristine code. Anything could go wrong. Recall that Haskell’s advanced type system makes most bugs travel back in time and emerge at compile time rather than at run time. This saves our users from unintended experiences with our software. We want to make them happy, solve their problems, and make their lives easier. Any runtime bug is an encumbrance on the user, so it’s critical to reduce runtime bugs. From the project budgeting perspective, then, if sufficient funds are available for testing, it is very reasonable to give high priority to testing the runtime experience.

4.2 Typical Workflow

So we have a two-step approach to ensuring this high quality of our products and services:

  1. use Haskell to impose a high level of mandatory consistency, soundness, and correctness of our programs — nothing will compile that doesn’t meet that requirement (this imposes syntactic and logical correctness as expressed by types); and then additionally
  2. test our programs thoroughly to ensure that we translated our ideas correctly into code — the compiler, GHC, will alert us whenever we specify that we want something different than what actually gets computed as a result (this imposes semantic and translational correctness as expressed by our test specifications).
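The two steps in miniature, with illustrative currency types (a common newtype idiom, not a library API): step 1 lets the type system reject meaningless programs outright, and step 2 tests the semantics the types cannot see.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
module Main where

newtype EUR = EUR Double deriving (Eq, Ord, Show, Num)
newtype USD = USD Double deriving (Eq, Ord, Show, Num)

-- Step 1 (compile time): mixing currencies is a type error;
-- `addEur (EUR 1) (USD 1)` simply does not compile.
addEur :: EUR -> EUR -> EUR
addEur = (+)

main :: IO ()
main =
  -- Step 2 (run time): the types permit any Double arithmetic, so we
  -- still test that the computed value matches the specification.
  putStrLn (if addEur (EUR 1.5) (EUR 2.25) == EUR 3.75
              then "semantics match the spec"
              else "FAIL")
```

The division of labor is clear: the compiler rules out a whole class of ill-formed programs for free, and the test suite covers only what remains.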

The Haskell ecosystem provides top-notch world-class quality assurance and testing facilities. It comes closest to formal verification, without the excessive burden and cost. It is a pragmatic gold standard of great quality of software. And should we indeed need to formally verify critical pieces of code, we’ve set the stage for it already. With Template Haskell, we can even attain automated test generation, including all the boilerplate potentially necessary. Moreover, we can use Haskell’s testing frameworks for other languages and platforms. We can also embed other languages in Haskell code and use it as our glue.

4.3 Examples


  1. Since Michael Lewis’ book “Flash Boys” at the latest, the term “high-frequency trading” has found its way into mainstream vocabulary, albeit with a very slanted meaning, affected by a host of logical fallacies, the most prominent of which is the hasty generalization from the misconduct of a few actors to the entire industry, while at the same time promoting the then-incepted IEX (“Investors’ Exchange”). This smells like too much marketing for a new competitor. But that has long been a regular practice. Still, even an average buy-side retail trader should learn about HFT, but from other sources. If interested, ask me for a couple of suggestions.↩︎