import cvxpy as cp
import rsw
import numpy as np
Some notes while reading Barratt, Angeris, and Boyd (2021).
Equality Loss
Constrain \(x = f_{\text{des}}\) so let \(l(x, f_{des}) = +\infty\) for \(x \neq f_{des}\) and \(0\) when \(x = f_{des}\). Here \(l\) represents our loss function.
Convex Optimization Version
\[\min \frac{1}{\lambda}\sum_{i=1}^{m}( \hat{f}_{i}-f^{i})^{2} = \frac{1}{\lambda}||\hat{f}-f||^{2}_{2}\] \[\text{subject to:}\quad \hat{f}_{i} = f^{i}_{\text{des}},\quad i=1,\ldots,m\]
Proximal Version
\(\mathbf{prox}_{\lambda l}(f) = argmin_{x} \left( l(x, f_{\text{des}}) + \frac{1}{2\lambda}||x-f||^{2}_{2} \right)\)
So we have to set \(x_{i} = f^{i}_{\text{des}}\) for each \(i\) in order to avoid the loss of \(\infty\) which we incur for any other choice.
Thus, \(\mathbf{prox}_{\lambda l}(f)_{i} = f^{i}_{\text{des}}\) is the \(i\)-th component of the minimizer \(x\).
# Number of "marginals" we want to match.
m = 10

# Expected values of these functions under the induced distribution.
# There is one for each marginal that we want to match.
f = np.random.randn(m)

# The desired or target values we want to match by applying weights
# to f.
fdes = np.random.randn(m)

lam = 1
The following demonstrates that the \(\hat{f}\) obtained via Convex optimization is the same as the minimizer \(x\) obtained using the Proximal route.
fhat = cp.Variable(m)
cp.Problem(cp.Minimize(1 / lam * cp.sum_squares(fhat - f)),
           [fhat == fdes]).solve()

equality = rsw.EqualityLoss(fdes)
np.testing.assert_allclose(fhat.value, equality.prox(f, lam))
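If the derivation above is right, the prox simply hands back \(f_{\text{des}}\) and ignores \(f\) entirely. A one-line check of that reading, reusing the objects defined above:
# The prox of the equality loss should return fdes, independent of f.
np.testing.assert_allclose(equality.prox(f, lam), fdes)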
Inequality Loss
Constrain \(x\) to a range around \(f_{\text{des}}\) so let \(l(x, f_{des}) = 0\) for \(x \in [f_{des}+\text{lower}, f_{des}+\text{upper}]\) and \(+\infty\) otherwise. Here the lower and upper refer to some acceptable range around each entry in \(f_{des}.\)
Convex Optimization Version
\[\min \frac{1}{\lambda}||\hat{f}-f||^{2}_{2}\] \[\text{subject to:}\quad f^{i}_{\text{des}} + \text{lower}^{i} \leq \hat{f}_{i} \leq f^{i}_{\text{des}} + \text{upper}^{i},\quad i=1,\ldots,m\]
So ideally we want to set each \(\hat{f}_{i}\) to \(f_{i}\) to minimize the objective function. But what if \(f_{i} > f^{i}_{\text{des}} + \text{upper}^{i}\)? In this case, accepting such an \(\hat{f}_{i}\) leads to infeasibility and to avoid this we will clip \(\hat{f}_{i}\) to \(f^{i}_{\text{des}} + \text{upper}^{i}\). We cannot clip any lower because then we would be incurring additional loss (over and above that resulting from setting to \(f^{i}_{\text{des}} + \text{upper}^{i}\)) in the objective function unnecessarily.
Similarly, we clip \(\hat{f}_{i}\) to \(f^{i}_{\text{des}} + \text{lower}^{i}\) if we find that \(f_{i} < f^{i}_{\text{des}} + \text{lower}^{i}\).
Proximal Version
\(\mathbf{prox}_{\lambda l}(f) = argmin_{x} \left( l(x, f_{\text{des}}) + \frac{1}{2\lambda}||x-f||^{2}_{2} \right)\)
Thus, \(\mathbf{prox}_{\lambda l}(f)_{i} = \begin{cases} f^{i}_{\text{des}} + \text{lower}^{i}, \text{if } f_{i} \leq f^{i}_{\text{des}} + \text{lower}^{i}\\ f^{i}_{\text{des}} + \text{upper}^{i}, \text{if } f_{i} \geq f^{i}_{\text{des}} + \text{upper}^{i}\\ f_{i}, \text{ otherwise}\end{cases}\)
So the proximal version, applied to the input (vector) \(f\), and the convex optimization version end up with the same minimizer.
lower = np.array([-.3])
upper = np.array([.3])

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(1 / lam * cp.sum_squares(fhat - f)),
           [lower <= fhat - fdes, fhat - fdes <= upper]).solve()

inequality = rsw.InequalityLoss(fdes, lower, upper)
np.testing.assert_allclose(fhat.value, inequality.prox(f, lam))
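The closed form above is just an elementwise clip of \(f\) onto the band \([f_{\text{des}}+\text{lower}, f_{\text{des}}+\text{upper}]\). A minimal NumPy check of that reading against the CVXPY solution, reusing f, fdes, lower, upper, and fhat from above:
# Clip f into the band [fdes + lower, fdes + upper]; lower/upper broadcast over the m entries.
clipped = np.clip(f, fdes + lower, fdes + upper)
np.testing.assert_allclose(clipped, fhat.value, atol=1e-6)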
Boolean Loss
\(l(x) = \begin{cases}0 \quad x \in \{0, 1/k\}^{n} \\ \infty \quad \text{Otherwise}\end{cases}\)
We want \(x_{i} = 1/k\) for \(k < n\) samples (and \(0\) for the others). Choosing any other value results in an infinite loss (think infeasibility).
\[\mathbf{prox}_{\lambda l}(f) = argmin_{x}( l(x) + \frac{1}{2\lambda}||x - f||^{2}_{2} )\]
The proximal operator of \(l\) is the projection of \(f\) onto the (nonconvex) set \(\{x \in \{0, 1/k\}^{n} \mid \mathbf{1}^{T}x=1\}\). If each component of \(x\) is either \(0\) or \(1/k\) then the constraint \(\mathbf{1}^{T}x=1\) means that exactly \(k\) of them are set to \(1/k\) and the remaining are set to \(0\) (as was desired).
The solution here is to set the largest \(k\) entries of \(f\) to \(1/k\) and the remaining to \(0\).
boolean = rsw.BooleanRegularizer(3)
np.testing.assert_allclose(np.array([1/3, 0, 0, 1/3, 1/3]),
                           boolean.prox(np.array([5, 1, 0, 2, 4]), lam))
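A direct NumPy version of the "largest \(k\) entries" rule, checked against rsw's prox on the same example. top_k_projection is my own helper here, not part of rsw:
def top_k_projection(w, k):
    # Set the k largest entries of w to 1/k and the rest to 0.
    out = np.zeros(len(w))
    out[np.argsort(w)[-k:]] = 1 / k
    return out

np.testing.assert_allclose(top_k_projection(np.array([5, 1, 0, 2, 4]), 3),
                           boolean.prox(np.array([5, 1, 0, 2, 4]), lam))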
(Weighted) Least Squares Loss
\[\min \frac{1}{2}\sum_{i=1}^{m}d^{2}_{i}(\hat{f}_{i} - f^{i}_{des})^2 + \frac{1}{2\lambda}||\hat{f} - f||^{2}_{2}\]
Taking the derivative with respect to \(\hat{f}_{i}\) and setting equal to zero we get:
\[d^{2}_{i}(\hat{f}_{i} - f^{i}_{des}) + \frac{1}{\lambda}(\hat{f}_{i} - f_{i}) = 0\]
\[\hat{f}_{i}( d^{2}_{i} + \frac{1}{\lambda} ) = \frac{f_{i}}{\lambda} + d^{2}_{i}f^{i}_{des}\]
Finally we get that \(\hat{f}_{i} = \frac{ d^{2}_{i}f^{i}_{des} + \frac{f_{i}}{\lambda} }{ ( d^{2}_{i} + \frac{1}{\lambda} ) }\)
The \(d\)’s are called the diagonals in the code and weight each component \(\hat{f}_{i} - f^{i}_{des}\), i.e., each marginal we are trying to match. Imagine a matrix with zeros everywhere but for the diagonal entries \(d_{i}\).
d = np.random.uniform(0, 1, size=m)

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(1 / 2 * cp.sum_squares(cp.multiply(d, fhat - fdes)) +
                       1 / (2 * lam) * cp.sum_squares(fhat - f))).solve()

lstsq = rsw.LeastSquaresLoss(fdes, d)
np.testing.assert_allclose(fhat.value, lstsq.prox(f, lam))
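The closed-form minimizer derived above, written directly in NumPy and checked against the CVXPY solution (reusing d, f, fdes, lam, and fhat from the block above):
# Closed-form prox of the weighted least squares loss, per the derivation above.
fhat_closed_form = (d**2 * fdes + f / lam) / (d**2 + 1 / lam)
np.testing.assert_allclose(fhat_closed_form, fhat.value, atol=1e-6)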
Entropy Loss
\[\min \sum_{i=1}^{m}\hat{f}_{i}\ln\hat{f}_{i} + \frac{1}{2\lambda}||\hat{f} - f||^{2}_{2}\]
Taking the derivative with respect to \(\hat{f}_{i}\) and setting equal to zero we get:
\[1 + \ln\hat{f}_{i} + \frac{1}{\lambda}(\hat{f}_{i} - f_{i}) = 0\]
Rearranging terms: \[ \ln\hat{f}_{i} + \frac{\hat{f}_{i}}{\lambda} = \frac{f_{i}}{\lambda} - 1 \]
Exponentiating both sides and then dividing both sides by \(\lambda\): \[\frac{\hat{f}_{i}}{\lambda}e^{\frac{\hat{f}_{i}}{\lambda}} = \frac{1}{\lambda}e^{\frac{f_{i}}{\lambda} - 1 }\]
Applying the Lambert W function on both sides: \[W(\frac{\hat{f}_{i}}{\lambda}e^{\frac{\hat{f}_{i}}{\lambda}}) = W(\frac{1}{\lambda}e^{\frac{f_{i}}{\lambda} - 1 })\]
Finally, \(\hat{f}_{i} = \lambda W(\frac{1}{\lambda}e^{\frac{f_{i}}{\lambda} - 1 })\).
Just for fun, suppose that each instance of \(\lambda\) and \(f_{i}\) in the above result were replaced by \(0.5\lambda\) and \(f_{i} + 0.5\lambda \ln f^{i}_{des}\) respectively then:
\(\hat{f}_{i} = 0.5\lambda W(\frac{1}{0.5\lambda}e^{\frac{f_{i} + 0.5\lambda \ln f^{i}_{des}}{0.5\lambda} - 1 }) = 0.5\lambda W(\frac{f^{i}_{des}}{0.5\lambda}e^{\frac{f_{i}}{0.5\lambda} - 1 })\)
f = np.random.uniform(0, 1, size=m)
f /= f.sum()

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(cp.sum(-cp.entr(fhat)) +
                       1 / (2 * lam) * cp.sum_squares(fhat - f))).solve()

np.testing.assert_allclose(fhat.value, rsw.losses._entropy_prox(f, lam), atol=1e-5)
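The Lambert W expression for \(\hat{f}_{i}\) can also be evaluated directly with scipy.special.lambertw and compared against the CVXPY solution; a small sketch assuming the derivation above, reusing f, lam, and fhat:
from scipy.special import lambertw

# Closed form from above: fhat_i = lam * W(exp(f_i / lam - 1) / lam).
fhat_closed_form = lam * np.real(lambertw(np.exp(f / lam - 1) / lam))
np.testing.assert_allclose(fhat_closed_form, fhat.value, atol=1e-5)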
Entropy Regularizer
Minimize \(r(w)\) where
\[r(w) = \begin{cases} \sum_{i=1}^{n} w_{i}\ln w_{i}, \quad (1/(\kappa n))\mathbf{1} \leq w \leq (\kappa/n )\mathbf{1} \\ \infty \quad \text{Otherwise}\end{cases}\]
Here \(\kappa > 1\) is a hyper-parameter (also called limit in the code). Observe that the constraint forces each \(w_{i}\) to lie within \([1/(\kappa n), \kappa/n]\); otherwise the loss is infinite (i.e., we have an infeasible solution). If we interpret \(w_{i}\) as a weight then, relative to the uniform weight \(1/n\), no item can be down-weighted or up-weighted by more than a factor of \(\kappa\).
The final solution is \(\mathbf{prox}_{\lambda r}(w_{\text{des}})_{i} = \begin{cases} 1/(\kappa n), \text{if } \hat{w}_{i} \leq 1/(\kappa n)\\ \kappa/n, \text{if } \hat{w}_{i} \geq \kappa/n\\ \hat{w}_{i}, \text{ otherwise}\end{cases}\) where \(\hat{w}_{i} = \lambda W(\frac{1}{\lambda}e^{\frac{w^{des}_{i}}{\lambda} - 1 })\) is the entropy prox from the Entropy Loss section (\(w^{des}\) is some desired weight vector, provided as input, which we want \(\hat{w}\) to be close to).
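A minimal NumPy sketch of this clipped prox, assuming the Lambert W closed form from the Entropy Loss section and reusing lam from earlier; kappa, n, and wdes are made-up inputs for illustration:
from scipy.special import lambertw

kappa, n = 2.0, 10                      # hypothetical limit and number of samples
wdes = np.random.uniform(0, 1, size=n)
wdes /= wdes.sum()

# Unconstrained entropy prox of wdes (the Lambert W expression from the Entropy Loss section),
what = lam * np.real(lambertw(np.exp(wdes / lam - 1) / lam))
# then clip each entry into the feasible band [1/(kappa n), kappa/n].
w_prox = np.clip(what, 1 / (kappa * n), kappa / n)
assert np.all((w_prox >= 1 / (kappa * n)) & (w_prox <= kappa / n))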
KL Loss
\[\min \frac{1}{2}\left[ \sum_{i=1}^{m}\hat{f}_{i}\ln\hat{f}_{i} - \sum_{i=1}^{m}\hat{f}_{i}\ln f^{i}_{des} \right] + \frac{1}{2\lambda}||\hat{f} - f||^{2}_{2}\]
Taking the derivative with respect to \(\hat{f}_{i}\) and setting equal to zero we get:
\[\frac{1}{2}\left[ 1 + \ln\hat{f}_{i} - \ln f^{i}_{des} \right] + \frac{1}{\lambda}(\hat{f}_{i} - f_{i}) = 0\]
Multiplying both sides by \(2\) and then rearrange terms to obtain: \[\ln\hat{f}_{i} + \frac{2\hat{f}_{i}}{\lambda} = \frac{2f_{i}}{\lambda} -1 + \ln f^{i}_{des}\]
Next, exponentiate both sides to get: \[\hat{f}_{i}e^{\frac{\hat{f}_{i}}{0.5\lambda}} = f^{i}_{des}e^{\frac{f_{i}}{0.5\lambda} -1}\]
Divide both sides by \(0.5\lambda\) and then apply the Lambert W function to get:
\[W(\frac{\hat{f}_{i}}{0.5\lambda}e^{\frac{\hat{f}_{i}}{0.5\lambda}}) = W(\frac{f^{i}_{des}}{0.5\lambda}e^{\frac{f_{i}}{0.5\lambda} -1})\]
Thus, \(\hat{f}_{i} = 0.5\lambda W(\frac{f^{i}_{des}}{0.5\lambda}e^{\frac{f_{i}}{0.5\lambda} -1}).\)
Hopefully, this reminds you of the last expression in the Entropy Loss section.
f = np.random.uniform(0, 1, size=m)
f /= f.sum()

fdes = np.random.uniform(0, 1, size=m)
fdes /= fdes.sum()

kl = rsw.KLLoss(fdes, scale=.5)

fhat = cp.Variable(m, nonneg=True)
cp.Problem(cp.Minimize(.5 * (cp.sum(-cp.entr(fhat) - cp.multiply(fhat, np.log(fdes)))) +
                       1 / (2 * lam) * cp.sum_squares(fhat - f))).solve()
np.testing.assert_allclose(fhat.value, kl.prox(f, lam), atol=1e-5)
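Finally, the KL closed form from the derivation can be evaluated directly and checked against the CVXPY solution above (a sketch reusing f, fdes, lam, and fhat; scipy.special.lambertw plays the role of \(W\)):
from scipy.special import lambertw

# Closed form from above, with the 0.5 scale matching KLLoss(fdes, scale=.5).
half_lam = 0.5 * lam
fhat_closed_form = half_lam * np.real(lambertw(fdes / half_lam * np.exp(f / half_lam - 1)))
np.testing.assert_allclose(fhat_closed_form, fhat.value, atol=1e-5)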