
Exploring Minimax-type Optimization Problems

Posted on July 5, 2020

Keywords: estimator, minimax, statistical machine learning, sequential decision-making

Suppose we have a continuous random variable \(X\) that is distributed according to the probability density function \(p(X)\), not to be confused with the probability mass function \(P(X)\), which is generally used for discrete random variables. We are interested in knowing the density, which will often be a closed-form expression in terms of some parameters; let us denote the list of all those parameters by \(\theta\). Let us take a leap of faith and assume that the closed-form expression we have in mind, when evaluated at \(\theta=\theta_0\), exactly describes the desired density. So \(\theta_0\) is the ground-truth value for the parameter of interest. An estimator is a procedure that computes an appropriate value for the ground-truth, say \(\hat{\theta}\). One may have at their disposal a variety of estimators \(\hat{e}_1,\cdots,\hat{e}_n\) that produce different estimates of the desired parameter, viz. \(\hat{\theta}_1,\cdots,\hat{\theta}_n\). Which estimator should we choose? This problem can be modeled as a minimax problem.
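As a concrete instance (just one illustrative choice of closed form, not something the rest of the discussion depends on), \(X\) might be Gaussian, in which case

\[p_{\theta}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad\theta=(\mu,\sigma^2)\]

and \(\theta_0=(\mu_0,\sigma_0^2)\) would be the particular pair of values for which this expression reproduces the true density.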

Your Loss, My Risk

We will borrow some definitions from the machine learning literature.1 A straightforward way to pick an estimator would be to evaluate a loss function on the different parameter estimates and choose the estimator that minimizes it. So suppose a loss function is defined as follows

\[L(\theta_0,\hat{\theta})=(\theta_0-\hat{\theta})^2\]

And then the optimal estimator determination problem reduces to the following minimization problem

\[\min\limits_{i=1\cdots n}\quad L(\theta_0,\hat{\theta}_i)\]

But this naively assumes that the ground-truth is known. So researchers have developed a more sophisticated and practical approach: compute the expected value of the loss function and call it the risk \(R(\theta_0,\hat{\theta})\). Recall that the expected value is defined as follows

\[\mathop{\mathbb{E}}\limits_{x\sim p_X}[X]=\int\limits_xx\,p_X(x)\,dx\]

Note that if someone gave you all the ingredients on the L.H.S., you could blindly write down the R.H.S. by placing \(x\), \(p\), and \(X\) in their proper places as shown above. Also, \(x\) refers to the real-number values that \(X\) can take; thus \(X\) can also be viewed as a function in the measure-theoretic sense. In most texts, the risk function is generally defined as follows

\[R(\theta_0,\hat{\theta})=\mathop{\mathbb{E}}\limits_{\theta_0}(L(\theta_0,\hat{\theta}))\]

And we want to write this expectation in integral form. If we were to simply follow the formal equation for the expectation operator given above, we would erroneously write down

\[\mathop{\mathbb{E}}\limits_{\theta_0}(L(\theta_0,\hat{\theta}))=\int\limits_{\theta_0} L\, \fbox{?}\,d{\theta_0}\]

It is obvious that the above equation makes no sense: \(\theta_0\) is a constant, so integrating over it is meaningless, and the standard definition of the risk function does not even contain a density function to put in the box. That is the problem with shoddy notation. The definition of the risk function is a shorthand, and it has to be unpacked as follows. The only density function we can use is that of \(X\), namely \(p\). We use \(\theta_0\) to determine that density, hence \(p_X(x|\theta_0)\), which is abbreviated to \(p_{\theta_0}(x)\). Thus, the risk function can be written out more fully as follows

\[R(\theta_0,\hat{\theta})=\mathop{\mathbb{E}}\limits_{x\sim p_{\theta_0} }[L(\hat{\theta}(X),\theta_0)]=\int\limits_xL(\hat{\theta}(x),\theta_0)\,p_{\theta_0}(x)\,dx\]

Note that \(L\), as defined above, is explicitly a function of the random variable, i.e. of the form \(f(X)\). We promote \(\hat{\theta}\) from a constant to a function of the random variable because the expectation operator \(\mathop{\mathbb{E}}\) can only be evaluated on a random variable or a function of a random variable. Essentially, we are combining the estimator \(\hat{e}\) and the estimate \(\hat{\theta}\) into one entity denoted by \(\hat{\theta}(X)\). If we instead restricted ourselves to viewing \(\hat{\theta}\) as having the same data-type as the constant \(\theta\), then \(L(\theta,\hat{\theta})\) would not be a function of any random variable. So \(\hat{\theta}\) should no longer be seen as an estimate of the ground-truth but as a procedure performed on the data sampled from \(X\); after all, \(X\) can be viewed as a black box which is triggered multiple times to obtain data.
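To make this concrete, here is a minimal Monte Carlo sketch of the risk computation. It assumes a Gaussian density with a hypothetical ground-truth mean and compares two illustrative estimators (the sample mean and the sample median) under the squared-error loss defined earlier; the model, sample size, and estimator choices are stand-ins rather than anything prescribed above.

```python
import numpy as np

# Monte Carlo approximation of the risk
#   R(theta_0, theta_hat) = E_{x ~ p_{theta_0}}[ L(theta_hat(X), theta_0) ]
# for two candidate estimators under squared-error loss.

rng = np.random.default_rng(0)
theta_0 = 2.0        # assumed ground-truth mean of a Gaussian p_{theta_0}
n_samples = 25       # size of each data set drawn from p_{theta_0}
n_trials = 10_000    # repetitions used to approximate the expectation

def squared_loss(theta_hat, theta_0):
    return (theta_hat - theta_0) ** 2

estimators = {
    "sample mean": np.mean,      # theta_hat_1(X)
    "sample median": np.median,  # theta_hat_2(X)
}

for name, estimator in estimators.items():
    # Each trial: draw data from p_{theta_0}, apply the estimator, evaluate the loss.
    losses = [
        squared_loss(estimator(rng.normal(theta_0, 1.0, n_samples)), theta_0)
        for _ in range(n_trials)
    ]
    print(f"{name}: estimated risk ~ {np.mean(losses):.4f}")
```

For a Gaussian the sample mean should come out with the lower estimated risk, which is exactly the kind of comparison the risk function is meant to support.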

Elaborating the Expectation Operator

As a simple example, if we take the loss function to be \(\left[\hat{\theta}(X)-\theta_0\right]\), we obtain the following definition of a popular term in estimator literature.

\[Bias(\theta_0,\hat{\theta})=\mathop{\mathbb{E}}\limits_{x\sim p_{\theta_0} }[\hat{\theta}(X)]-\theta_0\]

We generally minimize the square of the bias or its absolute value. If we take the loss function defined right at the beginning, the corresponding risk function is called the mean squared error (MSE). Variance is another operator like the expectation operator, and it too requires either a random variable or a function of a random variable as input. The above equation serves as a self-explanatory starting point for deriving the bias-variance relationship, a standard elementary concept introduced in machine learning.
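For completeness, here is that standard decomposition written out, with all expectations taken with respect to \(x\sim p_{\theta_0}\) and \(Var(\hat{\theta})\) denoting the variance of \(\hat{\theta}(X)\) under that density; the cross term drops out because \(\mathop{\mathbb{E}}[\hat{\theta}(X)-\mathop{\mathbb{E}}[\hat{\theta}(X)]]=0\).

\[MSE(\theta_0,\hat{\theta})=\mathop{\mathbb{E}}\left[\left(\hat{\theta}(X)-\theta_0\right)^2\right]=\underbrace{\mathop{\mathbb{E}}\left[\left(\hat{\theta}(X)-\mathop{\mathbb{E}}[\hat{\theta}(X)]\right)^2\right]}_{Var(\hat{\theta})}+\underbrace{\left(\mathop{\mathbb{E}}[\hat{\theta}(X)]-\theta_0\right)^2}_{Bias(\theta_0,\hat{\theta})^2}\]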

We could have avoided the trouble of clarifying the conflicting notations used for the expectation operator by sweeping the mess under definitions. But here are two more examples showing why using detailed and consistent notation can be helpful. The law of iterated expectation is generally written as

\[\mathop{\mathbb{E}}(X)=\mathop{\mathbb{E}}(\mathop{\mathbb{E}}(X|Y))\]

However, it is unclear whether the three expectation operators are the same or not. In fact, all three are different. Although each operator can assume only one possible role, which is easy to assign by logical deduction, it is still good practice to state them explicitly as follows

\[\mathop{\mathbb{E}}\limits_{x\sim p_X}[X]=\mathop{\mathbb{E}}\limits_{y\sim p_Y}[\mathop{\mathbb{E}}\limits_{x\sim p_{X|Y}}[X]]\]

In fact, this form allows the reader to easily prove the statement, since every detail is explicitly laid out; the derivation is sketched below.
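Writing each expectation as an integral over its density and exchanging the order of integration (nothing beyond the definitions given earlier is needed):

\[\mathop{\mathbb{E}}\limits_{y\sim p_Y}\left[\mathop{\mathbb{E}}\limits_{x\sim p_{X|Y}}[X]\right]=\int\limits_y\left(\int\limits_xx\,p_{X|Y}(x|y)\,dx\right)p_Y(y)\,dy=\int\limits_xx\left(\int\limits_yp_{X|Y}(x|y)\,p_Y(y)\,dy\right)dx=\int\limits_xx\,p_X(x)\,dx=\mathop{\mathbb{E}}\limits_{x\sim p_X}[X]\]

Now consider the definition of the value function in reinforcement learning literature 2.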

\[v_{\pi}(s)=\mathop{\mathbb{E}}_{\pi}[G_t|S_t=s]\]

Now a cursory glance would suggest that there is one expectation operator whose random variable is not explicitly defined. That is not a big deal because we can always infer the random variable from the density \(\pi\). However, if we fully expand the above formula as shown below, it is clear that there are actually two expectation operators at work3.

\[v_{\pi}(s)=\mathop{\mathbb{E}}\limits_{a\sim\pi}\mathop{\mathbb{E}}\limits_{s'\sim p}[G_t|S_t=s]\]
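Going one step further, and adopting the footnote's assumption that the reward is a deterministic function \(r(s,a,s')\) of the transition (with \(\gamma\) denoting the usual discount factor, a symbol not used elsewhere in this post), the two expectations unroll into nested sums over the two underlying distributions:

\[v_{\pi}(s)=\sum\limits_{a}\pi(a|s)\sum\limits_{s'}p(s'|s,a)\left[r(s,a,s')+\gamma\,v_{\pi}(s')\right]\]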

The Minimax Estimator

Having considered the mechanics of the risk function, we can tweak the formula a little to read \(R(\theta,\hat{\theta})\). This form is much more useful than the previous definition because we generally do not know the ground-truth. The function still takes two inputs. The first is \(\theta\), which now ranges over all possible values of the parameter; we no longer care what the ground-truth is, as we should not, because we seldom have a way of knowing it. For any single evaluation of \(R\), \(\theta\) is simply a fixed constant, and it may be a scalar or a vector. The second input is, as before, the estimator \(\hat{\theta}\) viewed as a function of the data.

Having defined the risk function, how do we choose the best estimator among many? This is the optimal estimator selection problem. We simply solve the following optimization problem

\[\min\limits_{\hat{\theta}}\max\limits_\theta\,R(\theta,\hat{\theta})\]

In words: for each candidate estimator we first find the parameter value that induces its worst performance, and we then pick the estimator \(\hat{\theta}\) that performs best in this worst case. A small numerical sketch of this selection rule is given below.
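The sketch below is a hedged illustration rather than a recipe from the text: it assumes a Gaussian model with unit variance, a finite grid standing in for the range of \(\theta\), and two made-up candidate estimators (the sample mean and a mean deliberately shrunk towards zero), and it approximates each risk by Monte Carlo before applying the min-max rule.

```python
import numpy as np

# min over estimators of the worst-case (max over theta) risk:
#   min_{theta_hat} max_{theta} R(theta, theta_hat)

rng = np.random.default_rng(1)
theta_grid = np.linspace(-3.0, 3.0, 13)   # the worst case is taken over this grid
n_samples, n_trials = 25, 4_000

def shrink_towards_zero(x, c=0.5):
    # A deliberately biased estimator: shrink the sample mean towards zero.
    return c * np.mean(x)

estimators = {"sample mean": np.mean, "shrunken mean": shrink_towards_zero}

def monte_carlo_risk(estimator, theta):
    data = rng.normal(theta, 1.0, size=(n_trials, n_samples))
    estimates = np.apply_along_axis(estimator, 1, data)
    return np.mean((estimates - theta) ** 2)   # squared-error risk at this theta

worst_case = {
    name: max(monte_carlo_risk(est, theta) for theta in theta_grid)
    for name, est in estimators.items()
}
print(worst_case)
print("minimax choice:", min(worst_case, key=worst_case.get))
```

The shrunken mean wins when \(\theta\) happens to be near zero, but its worst-case risk over the grid is much larger, so the minimax rule selects the plain sample mean.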

Decision-Making Example

Let us shift gears from the above statistical learning example to an example in sequential decision-making4.

Consider an agent operating against nondeterministic nature, meaning that how nature behaves cannot be quantified with probabilities, so we will not be able to make use of the expectation operator in this case. The agent moves in a state space given by the graph \(G=(V,E)\). The agent has a total of \(K\) moves to get from an initial node \(v_1\) to a goal node \(v_G\).

In the classical scenario, if the agent takes an action \(a(v_i)=a_i\) from node \(v_i\), then it incurs a cost \(C(v_i,a_i)\). But nature may simultaneously play its own move \(\theta_i\) against the agent's move, so the actual cost incurred by the agent will be \(C(v_i,a_i,\theta_i)\). After \(K\) moves, the agent will finally land at node \(v_{K+1}\), and this itself incurs a terminal cost \(C_G\) if \(v_{K+1}\neq v_G\). So the net cost incurred by the agent for executing any \(K\)-hop plan from \(v_1\) will be given by

\[Val_1(v_1)=\sum_{i=1}^{K} C\left(v_{i}, a_{i}, \theta_{i}\right)+C_{G}\]

Now assume that the agent follows an optimal plan, one for which the total cost incurred, \(Val_1^{\ast}(v_1)\), is minimum. There is also a principle of \(\textit{optimality}\), which states that if a plan is optimal, then its sub-plans are also optimal. This means that if the optimal plan is

\[v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_G\]

Then the \((K-1)\)-hop path from \(v_2\) to \(v_G\) indicated above is also optimal. Suppose the agent is about to make its last move. Then the optimal cost still to be incurred (the cost-to-go) will simply be given by

\[Val_{K}^{*}\left(v_{K}\right)=\min\limits_{a_{K}} \max\limits_{\theta_{K}}\left\{C\left(v_{K}, a_{K}, \theta_{K}\right)\right\}+C_G=\min\limits_{a_{K}} \max\limits_{\theta_{K}}\left\{C\left(v_{K}, a_{K}, \theta_{K}\right)+C_G\right\}\]

The logic behind the above equation is the same as in the case of the minimax estimator: \(\textit{the agent wants to play the best move against nature's worst move}\). We shall iterate backwards from this step in subsequent iterations, and we can thus write the equation for any \(k\) as follows

\[Val_{k}^{*}\left(v_{k}\right)=\min\limits_{a_{k}\cdots a_{K}} \max\limits_{\theta_{k}\cdots\theta_{K}}\left\{\sum\limits_{i=k}^{K} C\left(v_{i}, a_{i}, \theta_{i}\right)\right\}+C_G\]

\[Val_{k}^{*}\left(v_{k}\right)=\min\limits_{a_{k}} \max\limits_{\theta_{k}} \min\limits_{a_{k+1}} \max\limits_{\theta_{k+1}} \cdots \min\limits_{a_{K}} \max\limits_{\theta_{K}}\left\{\sum\limits_{i=k}^{K} C\left(v_{i}, a_{i}, \theta_{i}\right)+C_{G}\right\}\]

This can be rewritten as follows

\[Val_{k}^{*}\left(v_{k}\right)=\min _{a_{k}} \max _{\theta_{k}}\left\{\min _{a_{k+1}} \max _{\theta_{k+1}} \cdots \min _{a_{K}} \max _{\theta_{K}}\left\{C\left(v_{k}, a_{k}, \theta_{k}\right)+\sum_{i=k+1}^{K} C\left(v_{i}, a_{i}, \theta_{i}\right)+C_{G}\right\}\right\}\]

\[ Val_{k}^{*}\left(v_{k}\right)=\min _{a_{k}} \max _{\theta_{k}}\left\{C\left(v_{k}, a_{k}, \theta_{k}\right)+\min _{a_{k+1}} \max _{\theta_{k+1}} \cdots \min _{a_{K}} \max _{\theta_{K}}\left\{\sum_{i=k+1}^{K} C\left(v_{i}, a_{i}, \theta_{i}\right)+C_{G}\right\}\right\}\]

And finally we have

\[Val_{k}^{*}\left(v_{k}\right)=\min\limits_{a_{k}}\left\{\max\limits_{\theta_{k}}\left\{C\left(v_{k}, a_{k}, \theta_{k}\right)+Val_{k+1}^{*}\left(v_{k+1}\right)\right\}\right\}\]

This is the so-called Bellman recursion for backward value iteration, and it pivots around a minimax problem. This simple equation can be turned into an algorithmic recipe that computes, from any location in the state space and for any allowed number of hops, the optimal worst-case cost to reach the goal and the actions that achieve it; a toy version is sketched below.
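The following sketch runs the backward recursion on a tiny hand-made example. The graph, the cost numbers, nature's two moves, the horizon \(K\), and the terminal cost are all invented for illustration; transitions are kept deterministic so that nature only perturbs the costs.

```python
# Backward minimax value iteration on a toy graph, following the recursion above.
# All problem data (graph, costs, nature's moves, K, C_G) are illustrative assumptions.

V = ["v1", "v2", "v3", "vG"]
actions = {"v1": ["a", "b"], "v2": ["a"], "v3": ["a"], "vG": ["stay"]}
nature = ["calm", "adversarial"]   # nature's possible moves theta
K = 3                              # number of hops allowed
C_G = 10.0                         # terminal penalty if the goal is missed

def cost(v, a, theta):
    # C(v_i, a_i, theta_i): made-up numbers; adversarial nature doubles the cost.
    base = {("v1", "a"): 1.0, ("v1", "b"): 2.0, ("v2", "a"): 1.5,
            ("v3", "a"): 1.0, ("vG", "stay"): 0.0}[(v, a)]
    return base * (2.0 if theta == "adversarial" else 1.0)

def successor(v, a, theta):
    # Deterministic transitions for simplicity; nature only affects the cost here.
    return {("v1", "a"): "v2", ("v1", "b"): "v3", ("v2", "a"): "vG",
            ("v3", "a"): "vG", ("vG", "stay"): "vG"}[(v, a)]

# Terminal stage K+1, then sweep backwards: Val_k(v) = min_a max_theta {C + Val_{k+1}(v')}.
Val = {K + 1: {v: (0.0 if v == "vG" else C_G) for v in V}}
for k in range(K, 0, -1):
    Val[k] = {}
    for v in V:
        Val[k][v] = min(
            max(cost(v, a, th) + Val[k + 1][successor(v, a, th)] for th in nature)
            for a in actions[v]
        )

print(Val[1]["v1"])   # worst-case optimal cost from v1 with K hops remaining
```

Keeping track of the minimizing action at each \((k,v)\) pair would additionally recover the optimal worst-case plan, not just its cost.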

In summary, we discussed two \(\textit{minmax}\) problems that find extensive engineering applications. Now consider the concept of duality in mathematical programming, which attracts considerably more academic interest than plain optimization problems. In simplistic terms, duality can be viewed as a pair of optimization problems, one of which performs maximization and the other minimization. Minimax goes a step further than duality because it is a single optimization problem combining both minimization and maximization.

The motivation behind this article is to collect other interesting manifestations of the \(\textit{minmax}\) rule in nature. John von Neumann's minimax theorem5 for zero-sum games and its graph-theoretical applications come to mind. Upon proving his theorem, von Neumann had this to say: "As far as I can see, there could be no theory of games [...] without that theorem [...] I thought there was nothing worth publishing until the Minimax Theorem was proved." The Minimax Theorem is, however, closely related to linear programming duality. On one occasion, von Neumann amazed Dantzig, the pioneer of linear programming, by remarking: "I don't want you to think that I am pulling all this out of my sleeve like a magician. I have recently completed a book with Morgenstern on the theory of games. What I am doing is conjecturing that the two problems are equivalent. The theory that I am outlining is an analogue to the one we have developed for games." So feel free to share your insights and point out any typos. Hopefully, future articles will contain a healthy mix of algorithmic research and nondeterministic philosophy because my interests span both.

References

  1. Kevin P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

  2. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

  3. Though the two underlying probability distributions, viz. the policy \([\pi(a|s)]\) and the state transitions \([p(s^{\prime}|s,a)]\), become amply evident from the context in which they are used, it is still a good idea to clarify them notationally. Here we tacitly assumed that if the next state \((s^{\prime})\), current state \((s)\), and action taken \((a)\) are given, then the reward received is deterministically determined; otherwise a third probability distribution \([r(R|s^{\prime},s,a)]\) over the rewards also kicks in.

  4. Steven M. LaValle, Planning Algorithms. Cambridge University Press, 2006.

  5. John von Neumann, Zur Theorie der Gesellschaftsspiele (On the Theory of Games of Strategy). Mathematische Annalen, 1928.