Math for Machine Learning
My notes from learning about math for machine learning

In 2019, I didn't want to work in AI. I started working on security, and ended up in AI accidentally because my M.Tech. supervisor wanted me to work on analysing security attacks using AI methods. So, naturally, my knowledge of artificial intelligence was... artificial, so to speak. When I joined my Ph.D. in AI, I didn't want to repeat the same mistake, and decided to thoroughly cover the foundational mathematical concepts necessary to understand machine learning.

I found out that I needed to build my foundations in Axiomatic Probability Theory, Measure Theory, Topology, Linear Algebra, Optimization, and Calculus, to name a few. I would later learn that I needed a basic foundation in Logic as well to be a good researcher. I quickly found great YouTube playlists on these topics from esteemed institutes like Stanford and MIT, and went through them as fast as I could. But they made no sense to me, and they didn't make the math behind ML any easier for me. What I've come to appreciate now is that ML is complicated, and not all of it makes sense, but it has the capability to produce amazing results.

After three years, it finally came to me that maybe I should read what the pioneers of the respective fields wrote about them in their seminal works and monographs. Turns out, I should have done that all along, because after struggling to understand what each of these theories brings to the table, giving up, and trying again several times, I finally understood (perhaps superficially) what these theories were about! My own learnings from these are what I summarise here. More precisely, the following are notes extracted from a few of these areas that I find relevant to an AI researcher, so that they can read ML papers without getting stuck in notation. Also, I feel that this is mostly about probability theory and its pre-requisites. You must understand that there are a few other areas (like optimization and statistics) whose understanding is also a pre-requisite to effectively reading (and writing) an ML paper.

1   Kolmogorov's Axiomatic Theory of Probability

A. N. Kolmogorov created a beautiful axiomatic theory of probability which defines which computations are allowed in the context of 'probabilities' and which objects take part in those computations. I naïvely expected that I'd suddenly understand everything about families of probability distributions, conditional distributions, multi-dimensional probability distributions, moment generating functions, Bayes' theorem and its twisted applications, among others. However, I now realise that Kolmogorov's theory is not about the applications of probability, but rather about defining what can be a probability. What I think this helps with is reading the complicated notation one finds in papers, and finding logical flaws, if any, by understanding the notation and what it stands for.

Kolmogorov defined 6 axioms of probability:

  1. \(E\) is a set of all possible elementary events and \(\mathcal{F}\) is a field of sets.
  2. \(\mathcal{F}\) contains the set \(E\).
  3. \(P\) is a function that assigns a non-negative number to each set \(A\) in \(\mathcal{F}\). This is called the probability of \(A\).
  4. \(P(E)=1\).
  5. For \(A,B\in\mathcal{F}\), \(A\cap B=\varnothing\implies P(A\cup B)=P(A)+P(B)\) (additivity).

For infinite probability fields, Kolmogorov defined one more axiom:

  6. For a decreasing sequence of events \(A_1\supset A_2 \supset A_3\supset\dots\) in \(\mathcal{F}\) for which $$\bigcap_n A_n=\varnothing,$$ the following holds: $$\lim_{n\rightarrow\infty} P(A_n)=0.$$

That's it. Anything with the above structure can be considered a probability space and can be used to perform probabilistic computations. However, please note that the axioms of probability do not assign any meaning to the members of the set \(E\), or to the function \(P\). It is an abstraction of probability.
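To make this concrete, here is a minimal sketch in Python (my own toy example, not something from Kolmogorov) of a finite probability space for a fair die, with the power set of \(E\) playing the role of \(\mathcal{F}\), and the axioms checked directly.

```python
from fractions import Fraction
from itertools import chain, combinations

# Elementary events for a fair six-sided die.
E = frozenset({1, 2, 3, 4, 5, 6})

# Take F to be the power set of E: the largest possible field of sets over E.
F = [frozenset(s) for s in
     chain.from_iterable(combinations(E, r) for r in range(len(E) + 1))]

def P(A):
    # Probability of an event A: the fraction of elementary events it contains.
    return Fraction(len(A), len(E))

assert frozenset(E) in F                      # axiom 2: F contains E
assert all(P(A) >= 0 for A in F)              # axiom 3: non-negativity
assert P(E) == 1                              # axiom 4: P(E) = 1
A, B = frozenset({1, 2}), frozenset({5, 6})   # two disjoint events
assert A & B == frozenset() and P(A | B) == P(A) + P(B)   # axiom 5: additivity
print("This toy structure satisfies Kolmogorov's axioms.")
```

Nothing in the check depends on the sets representing die outcomes; any structure that passes these assertions is a probability space.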

Kolmogorov also defines what a random variable is. A random variable is a real single-valued function \(x:E\rightarrow\mathbb{R}\) such that the pre-image of \(\{r\in\mathbb{R}\mid r\lt a\}\) under \(x\), for any possible \(a\), is in \(\mathcal{F}\).
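Continuing the toy die example (again my own illustration), the pre-image condition can be verified mechanically; with \(\mathcal{F}\) taken to be the power set, every real-valued function on \(E\) passes it.

```python
from itertools import chain, combinations

# The same toy die space: E and its power set F.
E = frozenset({1, 2, 3, 4, 5, 6})
F = {frozenset(s) for s in
     chain.from_iterable(combinations(E, r) for r in range(len(E) + 1))}

def x(e):
    # A candidate random variable: the square of the die outcome.
    return float(e * e)

# The pre-image of {r in R | r < a} under x must lie in F for every a.
for a in [0.5, 2.0, 10.0, 40.0]:
    pre_image = frozenset(e for e in E if x(e) < a)
    assert pre_image in F
print("x satisfies Kolmogorov's random-variable condition on (E, F).")
```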

I had been so confused by the more abstract axioms presented elsewhere (like on Wikipedia) precisely because they are so abstract. But it helps to understand the real-world concepts that they are an abstraction of. For example, why does the probability function need to satisfy the axiom of additivity? I'll talk about this in the section about measure theory.

What I want to talk about right now is one of the things that really confused me: why do we have a "field of sets", a "set-algebra", or a "\(\sigma\)-algebra"? This became clear to me when I realised that, in abstract algebra, a "field" is a set equipped with sufficient structure to perform algebraic operations such as addition, multiplication, division, and subtraction, all within the set. Similarly, a "set-algebra" or a "field of sets" is a set of sets that allows you to do algebraic operations on sets, specifically the operations of union, intersection, and complementation, without leaving the set. Moreover, the \(\sigma\) in \(\sigma\)-algebra is a relic of the notation used by Hausdorff to mean "countably infinite summation (or set union)". A \(\sigma\)-algebra is a set-algebra in which you can also perform countably infinite set unions without leaving the set-algebra. This becomes useful when talking about infinite probability spaces.
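As a sketch of what "without leaving the set" means, the snippet below (my own example; the generate_algebra helper is hypothetical, not a library function) closes a small collection of subsets of a finite universe under complement and pairwise union until nothing new appears. On a finite universe there are no genuinely infinite unions to take, so the resulting set-algebra is automatically a \(\sigma\)-algebra.

```python
# Generate the smallest set-algebra containing the given subsets of a finite
# universe by repeatedly closing under complement and pairwise union.
def generate_algebra(universe, generators):
    algebra = {frozenset(), frozenset(universe)} | {frozenset(g) for g in generators}
    while True:
        new = set(algebra)
        new |= {frozenset(universe) - A for A in algebra}   # complements
        new |= {A | B for A in algebra for B in algebra}    # pairwise unions
        if new == algebra:
            return algebra
        algebra = new

universe = {1, 2, 3, 4}
sigma = generate_algebra(universe, [{1}, {2}])
print(sorted(sorted(A) for A in sigma))
# -> [[], [1], [1, 2], [1, 2, 3, 4], [1, 3, 4], [2], [2, 3, 4], [3, 4]]
```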

2   Measure Theory

While measure theory is vast and its applications varied, I am interested in what it adds to the theory of probability. Kolmogorov off-loaded most of the work to measure theory. In fact, the function \(P\) from axiom 3 is actually a special case of the central idea of measure theory, known as a measure. H. Lebesgue created the concept of a measure to facilitate the generalisation of the Riemann integral. Lebesgue was inspired by the idea of measurements in the physical world, and wanted to abstract out this idea of "measurement". Taking insights from how measurements like length, area, volume, and mass behave in the real world, Lebesgue determined that for a function to be a measure on a field of sets \(\mathcal{F}\), it must

  1. Assign a non-negative number to every set in \(\mathcal{F}\), and in particular, assign \(0\) to the empty set.
  2. Assign to a set that is the union of two non-intersecting sets the sum of the measures of those two sets. For the intuition, imagine dividing an object of mass \(M\) into two non-overlapping parts with masses \(m_1\) and \(m_2\); then we know that \(M=m_1+m_2\).

Notice that the 2nd requirement is the same as the 5th axiom of probability theory.
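Here is a minimal sketch (my own toy example) of a measure other than a probability: "length" on finite unions of pairwise-disjoint intervals of the real line, with both requirements checked.

```python
# "Length" measure on finite unions of pairwise-disjoint intervals,
# represented as lists of (left, right) pairs.
def length(intervals):
    # Non-negative, and the empty set (empty list) gets measure 0.
    return sum(right - left for (left, right) in intervals)

A = [(0.0, 1.0), (2.0, 2.5)]     # a set of total length 1.5
B = [(3.0, 4.0)]                 # a disjoint set of length 1.0
assert length([]) == 0.0
assert length(A + B) == length(A) + length(B)   # additivity for disjoint sets
print(length(A + B))             # -> 2.5
```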

Now, in a measure space, Lebesgue defined the integral of a function (which agrees with the Riemann integral for many functions seen in the day-to-day life of an ML engineer). The Lebesgue integral of a measurable function is denoted as \(\int f(x)\,d\mu(x)\), where \(\mu\) is the measure. (Roughly, it is the limit of sums of the form \(\sum_i\mu(\{x\mid f(x)\in (y_i,y_{i+1})\})\,y_i\), where the range of \(f\) is divided into ever finer partitions \(\{y_i\}_{i\in I}\).)
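The following rough numerical sketch (my own illustration) contrasts the two views for \(f(x)=x^2\) on \([0,1]\) with the ordinary length measure: the Lebesgue-style sum partitions the range of \(f\), the Riemann sum partitions the domain, and both approach \(1/3\).

```python
import numpy as np

# Domain grid for approximating the length measure on [0, 1].
xs = np.linspace(0.0, 1.0, 100_001)
dx = xs[1] - xs[0]
fx = xs ** 2                               # f(x) = x**2, exact integral 1/3

# Lebesgue-style: partition the *range* of f and approximate
# mu({x | f(x) in [y_i, y_{i+1})}) by (grid points in the bin) * dx.
ys = np.linspace(0.0, 1.0, 1001)
counts, edges = np.histogram(fx, bins=ys)
lebesgue_sum = np.sum(counts * dx * edges[:-1])

# Riemann-style: partition the *domain* instead.
riemann_sum = np.sum(fx[:-1]) * dx

print(lebesgue_sum, riemann_sum)           # both approach 1/3 as the grids refine
```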

If \((X,\mathcal{F},\mu)\) is a measure space, then a function \(f:X\rightarrow[-\infty,\infty]\) is said to be measurable if the set $$f^{-1}((a,\infty])=\{x\in X\mid f(x)\gt a\}$$ is measurable for each \(a\in\mathbb{R}\), that is, \(f^{-1}((a,\infty])\in\mathcal{F}\). Notice the analogy between a measurable function and a random variable in probability theory.

What I have observed across multiple ML papers is that the authors prefer using the Lebesgue integral syntax to refer to common integrals in probability. One of the most common integrals in probability is the expectation of a random variable over a continuous and infinite probability field. (For discrete probability spaces the same notation still works: the Lebesgue integral simply reduces to a sum over the points carrying probability mass.) Thus, \(\mathbb{E}(x)=\int_E x\,dP\), for example, is shorthand for the limit of sums of the form \(\sum_i P(\{e\mid x(e)\in(a_i,a_{i+1})\})\times a_i\) over ever finer partitions \(\{a_i\}\) of the range of \(x\).
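As a tiny sketch of the discrete case just mentioned (my own example), the expectation of a random variable on a fair die is exactly such a weighted sum over the atoms of the space.

```python
from fractions import Fraction

# E[x] for a fair die with x(e) = e**2: the Lebesgue integral over this
# discrete space is just a weighted sum over the atoms carrying mass.
E = [1, 2, 3, 4, 5, 6]
P_atom = {e: Fraction(1, 6) for e in E}
x = lambda e: e ** 2

expectation = sum(x(e) * P_atom[e] for e in E)
print(expectation)   # -> 91/6
```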

3   Topology

Topology generalises the notions of neighborhoods, closeness, and convergence. Topology doesn't really add a lot to the discussion of probability, except that it provides a \(\sigma\)-algebra on the real numbers: the Borel \(\sigma\)-algebra, generated by the open sets of the real line, whose members are called Borel sets. That is all I want to say about topology.

4   Bringing it all together in the Universal Approximation Theorem

The Universal Approximation Theorem states that a feedforward network with a linear output layer and at least one hidden layer with any squashing activation function (like sigmoid or tanh) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units. Here, a Borel measurable function is one that is measurable with respect to the Borel \(\sigma\)-algebra from the previous section, which covers practically any function an ML engineer would care about. What this means is that feedforward networks are, in principle, capable of representing essentially any probability distribution we might want them to learn.
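To see the flavour of the theorem (though not a proof, and not the construction used in the original papers), here is a small sketch I wrote: one hidden layer of sigmoid units with fixed random weights, a linear output layer fitted by least squares, and the approximation error on a 1-D target function shrinking as hidden units are added.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 500).reshape(-1, 1)
target = np.sin(2.0 * x) + 0.5 * x          # the function to approximate

# One hidden layer of sigmoid ("squashing") units with fixed random weights;
# only the linear output layer is fitted, here by least squares.
max_hidden = 128
W = rng.normal(scale=2.0, size=(1, max_hidden))
b = rng.normal(scale=2.0, size=(1, max_hidden))
H_full = 1.0 / (1.0 + np.exp(-(x @ W + b)))

for n_hidden in [2, 8, 32, 128]:
    H = np.hstack([H_full[:, :n_hidden], np.ones_like(x)])   # + output bias
    beta, *_ = np.linalg.lstsq(H, target, rcond=None)
    rms_error = np.sqrt(np.mean((H @ beta - target) ** 2))
    print(n_hidden, float(rms_error))       # error shrinks as units are added
```

Note that this only illustrates representational capacity on a grid of points; the theorem says nothing about whether gradient-based training will actually find such weights.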

Conclusion

The above list of topics that I've picked is neither complete nor exhaustive; it's just my personal notes on areas that now make sense to me. The broader takeaway is that in ML papers everyone is faking it, and the notation is much more complicated than it needs to be. Then why all this? Mathematicians smarter than me created the rigorous frameworks, some of which I've unfaithfully summarised above, to reason about these concepts without falling into paradoxes. And while this rigour is good, I believe that it is overkill to use such rigorous notation in ML papers that are mostly application oriented.