# 1 Differential Equations

A general note on differential equations. There are three classical approaches to differential equations:

- geometric or qualitative for the study of long-term behavior of a system modeled by such an equation (mathematical physics),
- analytic or quantitative for the solution of the equation and estimation of the solutions (functional analysis),
- numerical for evaluation, approximation, and interpolation of solutions (numerical analysis).

In practice, only a handful of differential equations admit an analytic solution, i.e., an exact closed-form solution expressible in terms of elementary functions. Thus, for actual computations with differential equations we need numerical methods. For the study of the long-term behavior of the solutions, differential geometry provides us with an extraordinarily rich set of tools, often used to reduce the dimensionality of the problem. All three approaches are intertwined. Partial differential equations (PDE) are much harder. We will restrict ourselves first to deterministic ordinary differential equations (ODE), as opposed to stochastic differential equations (SDE, SPDE). More on this later.

We first consider only the simplest case of continuous real-valued solutions \(x : \mathbb{T} \to \mathbb{R}\) defined on compact intervals \(\mathbb{T} := [a,b]\) of the real line \(\mathbb{R}\).

An exact solution is one computed without rounding errors; otherwise the solution is called approximate. Each initial value gives rise to a different solution. The vector field \(f\) defines the flow in the phase space of the problem.

# 2 Initial-value problems

As a side note, an IVP is just an ODE with an initial-value condition such as \(x(a) = x_0\) for a number \(x_0 \in \mathbb{R}\). The reason for considering such a constraint is that for the problem to be well-posed in the sense of Hadamard, it must admit a unique solution that depends continuously on the data. And the “data” is everything that is “given” as the input to the problem. In its simplest general form, an IVP is stated as \[\dot{x}(t) = f\big(t, x(t)\big), \quad x(a) = x_0\] for \(t \in \mathbb{T}\) and some \(x_0 \in X\), where \(\mathbb{T} := [a,b]\) with \(a < b\) and \(a,b \in \mathbb{R}\), often \(a=0\) and \(b=T\) for some \(T \in \mathbb{R}\), and where \(f : \mathbb{T} \times X \to \mathbb{R}\) is “*suitably smooth*” and \(X \subseteq \mathbb{R}^\mathbb{T}\) is a vector space of functions \(x : \mathbb{T} \to \mathbb{R}\). The notion of “suitable smoothness” refers here to the existence and uniqueness theorems guaranteeing continuous dependence of the solution on the data, e.g., the Picard–Lindelöf theorem. We simply assume that \(f\) is *uniformly Lipschitz*, which means that there exists a number \(L > 0\) such that for all \(t \in \mathbb{T}\) and all \(x,y \in X\), \[\left| f(t,x) - f(t,y) \right| \le L \lVert x - y\rVert,\] where the norm on the right-hand side is understood.

There are also weaker and stronger conditions. We don’t need to be concerned with these details right now, though.
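Since closed-form solutions are rare, it is worth seeing the simplest numerical scheme for such an IVP early on. The following is a minimal Python sketch of the forward Euler method; the test equation \(\dot{x} = -x\) and the step count are illustrative choices, not part of the discussion above.

```python
import math

def euler(f, a, b, x0, n):
    """Forward Euler: the simplest numerical scheme for the IVP
    x'(t) = f(t, x(t)), x(a) = x0, on the interval [a, b]."""
    h = (b - a) / n          # constant step size
    t, x = a, x0
    for _ in range(n):
        x += h * f(t, x)     # follow the vector field for one step
        t += h
    return x

# Example IVP: x'(t) = -x(t), x(0) = 1, with exact solution x(t) = exp(-t).
approx = euler(lambda t, x: -x, 0.0, 1.0, 1.0, 1000)
print(abs(approx - math.exp(-1.0)))  # small discretization error
```

Forward Euler has global error of order \(h\); it is only a starting point, but more sophisticated schemes (Runge–Kutta, multistep) refine the same idea of following the vector field step by step.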

# 3 Boundary-value problems

The distinctive characteristic of an initial-value condition is that it is taken at some point of the domain of the solution that is considered “initial”, i.e., such that all other points in the domain are ordered in sequence after it. A directed set would suffice to capture this notion, and each interval of \(\mathbb{R}\) is a directed set.

A boundary-value condition differs in that the values constraining the problem, so as to enable the existence of a unique solution, are prescribed on the boundary of the domain of the solutions, which in our simplified case consists of the two boundary points \(a\) and \(b\) of an interval \([a,b]\), with real \(a < b\).

This is a more general case. Indeed, we can specify various boundary-value conditions.

In the case of PDEs the conditions are specified as directional derivatives with respect to a normal vector. In fact, in \(\mathbb{R}^d\), a normal space at a point \(p\) is the orthogonal complement of a tangent space at \(p\), which leads to a minimization problem. More on this later.

# 4 Integral Equations

Each IVP \(\dot{x}(t) = f\big(t,x(t)\big)\) with \(x(a) = x_0\) can be written in the form of an integral equation \[x(t) = x_0 + \int_a^t f\big(s,x(s)\big)\, ds.\]

Both equations are equivalent, i.e., they admit the same solution.
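This equivalence is also the engine behind the classical existence proof: Picard iteration applies the integral operator \((Px)(t) = x_0 + \int_a^t f\big(s, x(s)\big)\,ds\) repeatedly and, under the Lipschitz assumption, contracts to the unique solution. A minimal sketch in Python, assuming a uniform grid and the trapezoidal rule for the integral; the grid size, iteration count, and the test equation \(\dot{x} = x\) are illustrative.

```python
import math

def picard(f, a, b, x0, n_grid=200, n_iter=30):
    """Picard iteration: repeatedly apply the integral operator
    (Px)(t) = x0 + integral_a^t f(s, x(s)) ds, discretized on a
    uniform grid with the trapezoidal rule. Under the Lipschitz
    assumption this iteration contracts to the unique IVP solution."""
    h = (b - a) / n_grid
    ts = [a + i * h for i in range(n_grid + 1)]
    xs = [x0] * (n_grid + 1)          # start from the constant function x0
    for _ in range(n_iter):
        fx = [f(t, x) for t, x in zip(ts, xs)]
        new, acc = [x0], x0
        for i in range(n_grid):
            acc += 0.5 * h * (fx[i] + fx[i + 1])   # cumulative trapezoid
            new.append(acc)
        xs = new
    return ts, xs

# Example: x'(t) = x(t), x(0) = 1; the iterates converge to exp(t).
ts, xs = picard(lambda t, x: x, 0.0, 1.0, 1.0)
print(abs(xs[-1] - math.e))  # small residual error at t = 1
```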

# 5 Stochastic Differential Equations

This flavor of differential equations involves a stochastic component. Indeed, a differential equation as stated above is determined completely by the vector field \(f\). The vector field \(f\) gives rise to a differential-geometric quantity known as the vector flow of the equation, which describes the behavior of the solution in dependence on the initial data \(x_0\).

Any component of this constellation can be randomized. The equation can be given a stochastic additive term, the vector flow can be made stochastic, and so on. In other words, make any of the components of the IVP depend on an \(\omega \in \Omega\), where \((\Omega, \mathfrak{A}, \mathsf{P})\) is a probability space, and the differential equation becomes essentially a stochastic differential equation.

So far this is only a heuristic manipulation of syntax. To make sense of such equations, there are different theories in which they can be given semantics. The two major compatible approaches are the Itô and the Stratonovich calculi; the former is more widespread in mathematical finance, the latter in (mathematical) physics.

We will touch on this topic sometime later. This requires an introduction to stochastic calculus. A wonderful blog-style exposition is given by … Gowers TODO. Have a look if you’re curious.

SDEs are stated semantically correctly in the integral form but are often written heuristically in the differential form. Consider, for instance, the Itô diffusion process that models … TODO geometric Brownian motion, used often in some popular formulations of theories in mathematical finance.
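To make the differential shorthand concrete, here is a minimal Euler–Maruyama sketch for geometric Brownian motion \(dX_t = \mu X_t\, dt + \sigma X_t\, dW_t\) (Itô interpretation); the drift \(\mu\), volatility \(\sigma\), and sample sizes below are illustrative assumptions, not values from the text.

```python
import math
import random

def euler_maruyama_gbm(x0, mu, sigma, T, n_steps, rng):
    """Euler-Maruyama scheme for geometric Brownian motion,
    dX_t = mu * X_t dt + sigma * X_t dW_t (Ito interpretation)."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment ~ N(0, dt)
        x += mu * x * dt + sigma * x * dw
    return x

rng = random.Random(0)
# Monte Carlo estimate of E[X_T]; GBM admits the exact mean x0 * exp(mu * T).
paths = [euler_maruyama_gbm(1.0, 0.05, 0.2, 1.0, 100, rng)
         for _ in range(10_000)]
mc_mean = sum(paths) / len(paths)
print(mc_mean, math.exp(0.05))   # estimate vs. exact mean
```

The fluctuations are averaged out exactly as described below: each path is one realization of the noise, and the expectation is approximated by the sample mean over many simulated paths.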

SDEs are used to model real-world phenomena where the influence of certain factors remains unknown, cannot be captured in detail, and is assumed to be “random”, whatever this word may mean. The idea is that the random influence is considered to be “noise” and due to lack of information on the precise evolution of such behavior, we simply average out the fluctuations, i.e., we take them into account but, depending on the admissible probability model, we average them out (by integrating against the model measure). This is a good approach in general, justified whenever we have no access to more detailed information or the details observable would make the problem too complex to solve in reasonable time (intractable). Much more on this later.

# 6 Applications of Differential Equations

The theory of differential equations is widely applied throughout all fields of science. They are used to model real-world phenomena in terms of the rates of change of the quantities of interest in a given scientific problem. Such quantities are modeled as the so-called “features” in machine learning.^{1}

Given a set of data points from a series of observations, a scientist tries to establish equational relations between the studied quantities that are deemed to explain their rates of change, where the quantities may depend on each other and even reflexively on themselves, maybe at different points in time.

In a next step, the researcher needs to verify the model. Does it conform to the data, both in-sample and out-of-sample? In machine learning, a family of models is given, and the task of choosing the model is reduced to automatically fitting the data. In conventional frequentist statistics, a model is given as a family of probability distributions, or perhaps some more general measures, depending on a parameter \(\vartheta\), and the task is to determine the specific distribution, i.e., the specific value or range of values of the parameter, or a statistic, such that the result best fits the data. In Bayesian statistics, we are given a prior distribution and … TODO, the task is then to optimize such that … TODO. In fact, the maximum likelihood estimator amounts to an optimization problem by definition.
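The last remark can be made concrete: maximizing the log-likelihood is literally a numerical optimization problem. Here is a minimal sketch for an exponential model, where the closed-form MLE \(\hat{\lambda} = 1/\bar{x}\) lets us check the numeric search; the synthetic data, the true rate, and the ternary search (valid because this log-likelihood is concave in \(\lambda\)) are illustrative choices.

```python
import math
import random

def loglik(lam, n, s):
    """Log-likelihood of an i.i.d. Exp(lam) sample with size n and sum s."""
    return n * math.log(lam) - lam * s

rng = random.Random(2)
data = [rng.expovariate(3.0) for _ in range(5_000)]   # synthetic, true rate 3.0
n, s = len(data), sum(data)

# Maximize the (concave) log-likelihood by ternary search on [0.1, 10].
lo, hi = 0.1, 10.0
for _ in range(200):
    m1 = lo + (hi - lo) / 3
    m2 = hi - (hi - lo) / 3
    if loglik(m1, n, s) < loglik(m2, n, s):
        lo = m1
    else:
        hi = m2
mle = 0.5 * (lo + hi)

closed_form = n / s    # analytic MLE: 1 / sample mean
print(mle, closed_form)
```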

In many cases of Bayesian statistics, for instance, the computation of the integral in the denominator that is used to marginalize the joint density is a hard problem. One therefore uses alternative methods, such as Monte Carlo sampling methods, which amount to simulations of the distribution. And even though we can compute such integrals efficiently numerically, the problem becomes intractable as soon as we have to do this for too large a set of points! So simulation may in fact be our only hope in such a case.
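The sampling idea can be sketched in a few lines: a plain Monte Carlo estimate of a one-dimensional integral, with an illustrative integrand. The point is that the same averaging works unchanged in high dimension, where grid quadrature becomes intractable.

```python
import math
import random

def mc_integral(g, n, rng):
    """Plain Monte Carlo estimate of the integral of g over [0, 1]:
    average g over uniform samples instead of evaluating on a grid.
    The cost is independent of dimension, which is why sampling stays
    feasible where grid quadrature breaks down."""
    return sum(g(rng.random()) for _ in range(n)) / n

rng = random.Random(1)
est = mc_integral(math.sin, 100_000, rng)
exact = 1.0 - math.cos(1.0)   # integral of sin(x) over [0, 1]
print(est, exact)
```

The error decays like \(1/\sqrt{n}\) regardless of dimension, which is slow but dimension-free, precisely the trade-off alluded to above.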

In general, simulations are a very significant tool, serving as an indicator of the consistency of the assumptions. If the simulated behavior fits the observed behavior well, the model may be right. We will never really know, unless the system at hand is contrived. But within the scope of our findings and our human mind, this may be just enough. Certain implicit assumptions should be made explicit: the resulting model, and all predictions based upon it, depend on the observed behavior, while the phenomenon of interest may exhibit different behavior under different ambient conditions, on which it intrinsically depends. In other words, there is only so much that we can observe, assert, and analyze; and there is usually no continuity to be expected, i.e., tiny shifts may make the results contrary to the prediction. A good expert will know how to deal with such limitations of predictions. This is why we need numerical, analytical, and stochastic estimates.

This is studied in terms of so-called hidden or latent variables in the model. A well-known model is the hidden Markov model. TODO

Now, my plan is to link this all together in an exposition aimed at elucidating the applications of pure mathematics, by means of applied mathematics, to real-world problems, such as

- the modeling of real-estate markets,
- the modeling of financial securities,
- the modeling of biological systems.

We will even write a few trading algorithms. Algotrading ahoy!

As I’m also interested in natural language processing (NLP), we will develop chatbots with language models based on differential equations.

But first of all, this should be fun!

The term “feature” stems from the literature on pattern recognition in images: features of an object pictured in an image.↩︎