Lagrange and KKT Multipliers
Introduction
All optimization problems are related to minimizing a function (usually termed loss, cost or error function) or maximizing a function (such as the likelihood) with respect to some variable $\mathbf{x}$. If there are constraints in the possible values of $\mathbf{x}$, the method of Lagrange Multipliers can restrict the search of solutions in the feasible set of values of $\mathbf{x}$. Let us limit the constraints to equalities, for now. The problem can be formulated as: $$ \mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin}} f(\mathbf{x}) $$ $$\textrm{subject to } h_i(\mathbf{x}) = 0, \forall i=1,..,m$$ In words, find the solution that minimizes $f(\mathbf{x})$, as long as all equalities $h_i(\mathbf{x}) = 0$ hold. It is easy to see that any equality constraint can be defined, so long as all terms are in the left side of the equation. The method of Lagrange Multipliers works as follows: Put the cost function as well as the constraints in a single minimization problem, but multiply each constraint by a factor $\lambda_i$ (the lagrange multipliers). In our example, we would have $m$ lagrange multipliers. Hence the expression for the optimization problem becomes: $$ \mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin }} L(\mathbf{x}, \mathbf{\lambda}) = \underset{\mathbf{x}}{\textrm{argmin }} f(\mathbf{x}) + \underset{i=1}{\overset{m}{\sum}} \lambda_i h_i(\mathbf{x}), $$ where $L(\mathbf{x}, \mathbf{\lambda})$ is the Lagrangian and depends also on $\mathbf{\lambda}$, which is a vector of all multipliers.
As usual, we find the roots of the gradient of the loss function with respect to $\mathbf{x}$ to find the minimum (in case of a convex function) or the maximum (in case of a concave function). That will give us the optimality condition: $$ \nabla_{\mathbf{x}}L(\mathbf{x}, \lambda) = \nabla_{\mathbf{x}}f(\mathbf{x}) + \sum_i\lambda_i \nabla_{\mathbf{x}}h_i(\mathbf{x}) = 0. $$ We have number of variables equal to the elements in $\mathbf{x}$ (say $k$) plus the number of lagrange multipliers ($m$), and, as of now, we only have $k$ equations coming from the gradient with respect to $\mathbf{x}$. If we derive the loss function with respect to each lagrange multiplier, we have $m$ more equations and each new equation is: $$ \frac{\partial L(\mathbf{x}, \lambda)}{\partial \lambda_i} = h_i(\mathbf{x}) = 0, $$ which means that by forcing their derivates to zero we are also limiting the possible solutions to the ones that meet the constraints. Having the same number of equations and variables makes the problem determined and can be solved. Since in the solution, the terms multiplied by the multipliers are zero, they have no contribution to the value of the loss function.
A graphical explanation
There is also an intuitive graphical explanation for the method. Referring to the next figure, the extremum points of the function subject to the constraint (two different examples in the figure) are points at which the gradient of function $f(x,y)$ points in the same direction as the gradient of function $h(x,y)$ (apart from a multiplication factor).
Starting from the extremum without constraints, we can keep increasing $f(\mathbf{x})$ and enlarge their contours until the contour reach our constraint. For all sets of values in that contour, only the one where $g(\mathbf{x}) = 0$ is valid. Furthermore, at that same point, the counter line tangent is the same as the tangent of the constraint. That point is said to be stationary. Given a certain stationary point $\mathbf{x}$, any direction $\mathbf{v}$ that will not change the value of $f(\mathbf{x})$ is $$f(\mathbf{x} + \mathbf{v}) = f(\mathbf{x})$$ $$f(\mathbf{x}) + \nabla_\mathbf{x} f(\mathbf{x}) \mathbf{v} = f(\mathbf{x})$$ $$\nabla_\mathbf{x} f(\mathbf{x}) \mathbf{v} = 0$$ The directions that keep the point stationary are perpendicular to $\nabla_\mathbf{x} f(\mathbf{x})$. On the other hand, the same rule applies to the keep the constraints valid: $$g(\mathbf{x} + \mathbf{v}) = g(\mathbf{x}) = 0$$ $$g(\mathbf{x}) + \nabla_\mathbf{x} g(\mathbf{x}) \mathbf{v} = g(\mathbf{x})$$ $$\nabla_\mathbf{x} g(\mathbf{x}) \mathbf{v} = 0$$ Which means the directions that keep the point within the constraints are perpendicular to $\nabla_\mathbf{x} g(\mathbf{x})$. Then, at stationary points, if the possible directions must be perpendicular to both $\nabla_\mathbf{x} f(\mathbf{x})$ and $\nabla_\mathbf{x} g(\mathbf{x})$ and there tangents are the same, then they must point in the same direction (apart from a multiplicative factor).
Then, we can impose for the solution: $$\nabla_{\mathbf{x}}f(\mathbf{x}) = \lambda \nabla_{\mathbf{x}}h(\mathbf{x})$$ $$\nabla_{\mathbf{x}}f(\mathbf{x}) + \lambda \nabla_{\mathbf{x}}h(\mathbf{x}) = 0$$ This is the same result as taking the gradient of the Lagrange function with respect to $\mathbf{x}$.
Graphical explanation for the method of lagrange multipliers
However, there are still an infinite number of points where the equation above may be true. We still have to find the multiplication factor $\lambda$ in order to solve the problem. As before, we impose the constraints (equivalent to taking the gradient of the Lagrangian with respect to $\lambda$) to get more functions and have a determined system.
A second order sufficient condition
We assume that $f: \mathbb{R}^2 \rightarrow \mathbb{R}$ and the constraint function $h: \mathbb{R}^2 \rightarrow \mathbb{R}$ are of class $C^2$. We denote by $x_0,y_0,\lambda_0$ one solution of the minimization problem: $$ (x_0,y_0) = \underset{(x,y)}{\textrm{argmin}} f(x,y) $$ $$\textrm{subject to } h(x,y) = 0 $$ We set the following Lagrangian function: $$ L_0(x,y)=f(x,y)\lambda_0 h(x,y) $$ One way to know if a solution is a local minimum solution we use the following result.
$\bf{Proposition}$ (Sufficient condition of second order). Let $(u,v)\neq (0,0)$ such that $au+bv=0$ with $a=\dfrac{\partial h}{\partial x}(x_0,y_0)$ and $b=\dfrac{\partial h}{\partial y}(x_0,y_0)$
Apply this result on $\min_{(x,y)} f(x,y)=xy$ with $h(x,y)=x^2+y^21=0$.
The PseudoInverse of a Matrix
Let's say we want to minimize the $\ell_2$norm of a variable, so long as it obeys to a constraint: $$ \mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin}} \mathbf{x}^2 $$ $$ \textrm{subject to } \mathbf{y}  A\mathbf{x} = 0. $$ Assuming that the number of observations in $\mathbf{y}$ are less than the number of variables in $\mathbf{x}$, the problem as an infinite number of solutions, and we are interested in finding the one with least $\ell_2$ norm. Setting the problem in Lagrangian form: $$ \mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin }} L(\mathbf{x}, \mathbf{\lambda}) = \underset{\mathbf{x}}{\textrm{argmin }} \mathbf{x}^2 + \mathbf{\lambda}^T (\mathbf{y}  A\mathbf{x}), $$ we have a number of lagrange multipliers as the number of elements in $\mathbf{y}$. The gradient with respect to $\mathbf{x}$ is: $$ \begin{equation}2\mathbf{x}  A^T \mathbf{\lambda} = 0.\label{eq:1}\end{equation} $$ You can see $\mathbf{x}$ depending on $\mathbf{\lambda}$. From the gradient of the function with respect to $\mathbf{\lambda}$ we get: $$ \mathbf{y} = A\mathbf{x} $$ which in this case is pretty obvious. Premultiply it by $A$ to get: $$2A\mathbf{x} = A A^T \mathbf{\lambda}$$ $$2\mathbf{y} = A A^T \mathbf{\lambda}$$ $$\mathbf{\lambda} = 2(A A^T)^{1}\mathbf{y} $$ We now replace the lagrange multipliers back in $\eqref{eq:1}$ to get: $$2\mathbf{x} = A^T 2(A A^T)^{1}\mathbf{y}$$ $$\mathbf{x} = A^T (A A^T)^{1}\mathbf{y}$$ It turns out that the expression $A^T (A A^T)^{1}$ is called the pseudoinverse of A and is a closedform solution to this problem.
Exercise [Pseudoinverse example]
We want to find the minimum value of the following function:
$$
J\left(x_{1},x_{2}\right)=\left(x_{1}^{2}4x_{1}+4\right)\left(x_{1}^{2}+4x_{1}+2\right)
$$
 Calculate the minimum values of this function.

Draw the level sets of the function. For this, you can use the following Python code:
import numpy as np import matplotlib.pyplot as plt """ Level sets""" def g(x1,x2): res = (x1**24*x1+4)*(#TO DO) return res # level set drawing of the function X1,X2 = np.meshgrid(np.linspace(4,4,401),np.linspace(3,2.5,401)) Z = g(X1,X2) graphe = plt.contour(X1,X2,Z,120) plt.show()
Regularity
Let's look at this regularity issue with an example. Imagine the following problem: $$\min f(x_1, x_2) = x_1 + x_2$$ $$\mbox{subject to }h_1(x_1,x_2) = (x_11)^2 + x_2^2 1 = 0$$ $$\mbox{subject to }h_1(x_1,x_2) = (x_12)^2 + x_2^2 4 = 0$$ The constraints are circles with different radius, both crossing the origin $(x_1,x_2) = (0,0)$, which is the optimal solution. The optimality condition is: $$\nabla_{\mathbf{x}}f(\mathbf{x}) + \lambda_1 \nabla_{\mathbf{x}}h_1(\mathbf{x}) + \lambda_2 \nabla_{\mathbf{x}}h_2(\mathbf{x}) = 0$$ However, the gradients at the origin are $\nabla_{\mathbf{x}}f(0,0) = (1,1)$, $\nabla_{\mathbf{x}}h_1(0,0) = (2,0)$, $\nabla_{\mathbf{x}}h_2(0,0) = (4,0)$. Since the $x_2$ component is zero in both gradients of the constraints and not in the gradient of the function, the latter can never be formed as a linear combination of the formers. Therefore, there are no Lagrange multipliers that enforce the optimality condition.
But what if the function to be minimized is: $$\min f(x_1,x_2) = x_1$$ In this case, the optimal solution still is $(0,0)$ and the gradient of $f(x_1,x_2)$ at the optimal point is $\nabla_{\mathbf{x}}f(0,0) = (1,0)$. Then, the gradient of the function can be formed as a linear combination of the gradients of the constraints and there are Lagrange multipliers that enforce the optimality condition.
As a recap:
 A feasible solution is regular? The Lagrange multipliers exist and they are unique.
 A feasible solution is not regular? The Lagrange multipliers may or may not exist, depending if the gradient of the function can be represented as a linear combination of the gradients of the constraints.
And what if we have inequalities as constraints, instead of equalities? The KarushKuhnTucker (KKT) conditions extend the method of Lagrange Multipliers to allow inequalities and the KKT conditions are the necessary conditions for optimality.
References
 D. P. Bertsekas (1999), Nonlinear Programming.
 J. Barzilai, J.M. Borwein (1988), TwoPoint Step Size Gradient Methods, IMA Journal of Numerical Analysis (8), pp. 141148.