Losses

Published

May 26, 2022

Some notes while reading Barratt, Angeris, and Boyd (2021).

import cvxpy as cp
import rsw
import numpy as np

Equality Loss

Constrain \(x = f_{\text{des}}\) so let \(l(x, f_{des}) = +\infty\) for \(x \neq f_{des}\) and \(0\) when \(x = f_{des}\). Here \(l\) represents our loss function.

Convex Optimization Version

\[\min \frac{1}{\lambda}\sum_{i=1}^{m}( \hat{f}_{i}-f_{i})^{2} = \frac{1}{\lambda}||\hat{f}-f||^{2}_{2}\] \[\text{subject to:}\quad \hat{f}_{i} = f^{i}_{\text{des}},\quad i=1,\ldots,m\]

Proximal Version

\(\mathbf{prox}_{\lambda l}(f) = argmin_{x} \left( l(x, f_{des}) + \frac{1}{2\lambda}||x-f||^{2}_{2} \right)\). Note that the prox is evaluated at \(f\), while the loss compares \(x\) to \(f_{\text{des}}\).

So we have to set \(x_{i} = f^{i}_{\text{des}}\) for each \(i\) in order to avoid the loss of \(\infty\) that we would incur for any other choice.

Thus, \(\mathbf{prox}_{\lambda l}(f)_{i} = f^{i}_{\text{des}}\) is the \(i\)-th component of the minimizer \(x\).

# Number of "marginals" we want to match.
m = 10

# Expected values of these functions under the induced distribution.
# There is one for each marginal that we want to match.
f = np.random.randn(m) 

# The desired or target values we want to match by applying weights
#  to f.
fdes = np.random.randn(m)

lam = 1

The following demonstrates that the \(\hat{f}\) obtained via the convex optimization version is the same as the minimizer \(x\) obtained via the proximal version.

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(1 / lam * cp.sum_squares(fhat - f)),
            [fhat == fdes]).solve()

equality = rsw.EqualityLoss(fdes)   
np.testing.assert_allclose(fhat.value, equality.prox(f, lam))
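
Because the equality loss forces every coordinate to its target, the prox should return \(f_{\text{des}}\) itself no matter what \(f\) is passed in. A quick sanity check, reusing the variables above:

# The equality prox ignores f (up to numerical tolerance) and returns the targets.
np.testing.assert_allclose(equality.prox(f, lam), fdes)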

Inequality Loss

Constrain \(x\) to a range around \(f_{\text{des}}\), so let \(l(x, f_{des}) = 0\) for \(x \in [f_{des}+\text{lower}, f_{des}+\text{upper}]\) and \(+\infty\) otherwise. Here lower and upper define an acceptable range around each entry of \(f_{des}\).

Convex Optimization Version

\[\min \frac{1}{\lambda}||\hat{f}-f||^{2}_{2}\] \[\text{subject to:}\quad f^{i}_{\text{des}} + \text{lower}^{i} \leq \hat{f}_{i} \leq f^{i}_{\text{des}} + \text{upper}^{i},\quad i=1,\ldots,m\]

Ideally we would set each \(\hat{f}_{i}\) to \(f_{i}\) to minimize the objective. But what if \(f_{i} > f^{i}_{\text{des}} + \text{upper}^{i}\)? Accepting such an \(\hat{f}_{i}\) would be infeasible, so we clip \(\hat{f}_{i}\) to \(f^{i}_{\text{des}} + \text{upper}^{i}\). We cannot clip any lower, because that would incur additional objective value unnecessarily (over and above what results from setting \(\hat{f}_{i} = f^{i}_{\text{des}} + \text{upper}^{i}\)).

Similarly, we clip \(\hat{f}_{i}\) to \(f^{i}_{\text{des}} + \text{lower}^{i}\) if we find that \(f_{i} < f^{i}_{\text{des}} + \text{lower}^{i}\).

Proximal Version

\(\mathbf{prox}_{\lambda l}(f) = argmin_{x} \left( l(x, f_{des}) + \frac{1}{2\lambda}||x-f||^{2}_{2} \right)\)

Thus, \(\mathbf{prox}_{\lambda l}(f)_{i} = \begin{cases} f^{i}_{\text{des}} + \text{lower}^{i}, \text{if } f_{i} \leq f^{i}_{\text{des}} + \text{lower}^{i}\\ f^{i}_{\text{des}} + \text{upper}^{i}, \text{if } f_{i} \geq f^{i}_{\text{des}} + \text{upper}^{i}\\ f_{i}, \text{ otherwise}\end{cases}\)

So the proximal version, when passed the input vector \(f\), and the convex optimization version both end up with the same minimizer.

lower = np.array([-.3])
upper = np.array([.3])

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(1 / lam * cp.sum_squares(fhat - f)),
            [lower <= fhat - fdes, fhat - fdes <= upper]).solve()

inequality = rsw.InequalityLoss(fdes, lower, upper)           
np.testing.assert_allclose(fhat.value, inequality.prox(f, lam))
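
Equivalently, the closed form above is just a componentwise clip of \(f\) onto the band \([f_{\text{des}} + \text{lower}, f_{\text{des}} + \text{upper}]\). A direct check with np.clip, reusing the variables above (the length-1 lower and upper broadcast across all \(m\) entries):

# The inequality prox clips f componentwise onto the allowed band around fdes.
np.testing.assert_allclose(inequality.prox(f, lam),
                           np.clip(f, fdes + lower, fdes + upper))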

Boolean Loss

\(l(x) = \begin{cases}0 \quad x \in \{0, 1/k\}^{n} \\ \infty \quad \text{Otherwise}\end{cases}\)

We want \(x_{i} = 1/k\) for exactly \(k\) of the \(n\) entries, with \(k < n\), and \(0\) for the others. Choosing any other value results in an infinite loss (think infeasibility).

\[\mathbf{prox}_{\lambda l}(f_{\text{des}}) = argmin_{x}( l(x) + \frac{1}{2\lambda}||x - f_{des}||^{2}_{2} )\]

The proximal operator of \(l\) is the projection of \(f_{des}\) onto the (nonconvex) set \(\{x \in \{0, 1/k\}^{n} | \mathbf{1}^{T}x=1\}\). If each component of \(x\) is either \(0\) or \(1/k\) then the constraint \(\mathbf{1}^{T}x=1\) means that exactly \(k\) of them are set to \(1/k\) and the remaining are set to \(0\) (as was desired).

The solution is to set the entries of \(x\) corresponding to the largest \(k\) entries of \(f_{des}\) to \(1/k\) and the remaining entries to \(0\).

boolean = rsw.BooleanRegularizer(3)
np.testing.assert_allclose(np.array([1/3, 0, 0, 1/3, 1/3]), boolean.prox(np.array([5, 1, 0, 2, 4]), lam))
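
The same projection can be written directly with argsort: the entries corresponding to the \(k\) largest values get \(1/k\), the rest get \(0\). A minimal sketch (boolean_prox_sketch is my own illustrative helper, not part of rsw):

def boolean_prox_sketch(v, k):
    # Project v onto {0, 1/k}^n with 1^T x = 1: the k largest entries of v
    # get 1/k, the remaining entries get 0.
    x = np.zeros(len(v))
    x[np.argsort(v)[-k:]] = 1 / k
    return x

np.testing.assert_allclose(boolean_prox_sketch(np.array([5, 1, 0, 2, 4]), 3),
                           np.array([1/3, 0, 0, 1/3, 1/3]))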

(Weighted) Least Squares Loss

\[\min \frac{1}{2}\sum_{i=1}^{m}d^{2}_{i}(\hat{f}_{i} - f^{i}_{des})^2 + \frac{1}{2\lambda}||\hat{f} - f||^{2}_{2}\]

Taking the derivative with respect to \(\hat{f}_{i}\) and setting it equal to zero we get:

\[d^{2}_{i}(\hat{f}_{i} - f^{i}_{des}) + \frac{1}{\lambda}(\hat{f}_{i} - f_{i}) = 0\]

\[\hat{f}_{i}( d^{2}_{i} + \frac{1}{\lambda} ) = d^{2}_{i}f^{i}_{des} + \frac{f_{i}}{\lambda}\]

Finally we get that \(\hat{f}_{i} = \frac{ d^{2}_{i}f^{i}_{des} + \frac{f_{i}}{\lambda} }{ d^{2}_{i} + \frac{1}{\lambda} }\)

The \(d\)’s are called the diagonals in the code and are used to weight each sample (row) in the data. Imagine a diagonal matrix with the \(d_{i}\) on its diagonal and zeros everywhere else.

d = np.random.uniform(0, 1, size=m)

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(1 / 2 * cp.sum_squares(cp.multiply(d, fhat - fdes)) +
                        1 / (2 * lam) * cp.sum_squares(fhat - f))).solve()

lstsq = rsw.LeastSquaresLoss(fdes, d)                        
np.testing.assert_allclose(fhat.value, lstsq.prox(f, lam))
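
If the algebra above is right, the closed form can also be checked directly against the prox (same variables as above):

# Closed-form minimizer: an elementwise weighted combination of fdes and f.
np.testing.assert_allclose(lstsq.prox(f, lam),
                           (d**2 * fdes + f / lam) / (d**2 + 1 / lam))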

Entropy Loss

\[\min \sum_{i=1}^{m}\hat{f}_{i}\ln\hat{f}_{i} + \frac{1}{2\lambda}||\hat{f} - f||^{2}_{2}\]

Taking the derivative with respect to \(\hat{f}_{i}\) and setting equal to zero we get:

\[1 + \ln\hat{f}_{i} + \frac{1}{\lambda}(\hat{f}_{i} - f_{i}) = 0\]

Rearranging terms: \[ \ln\hat{f}_{i} + \frac{\hat{f}_{i}}{\lambda} = \frac{f_{i}}{\lambda} - 1 \]

Exponentiating both sides and then dividing both sides by \(\lambda\): \[\frac{\hat{f}_{i}}{\lambda}e^{\frac{\hat{f}_{i}}{\lambda}} = \frac{1}{\lambda}e^{\frac{f_{i}}{\lambda} - 1 }\]

Applying the Lambert W function on both sides: \[W(\frac{\hat{f}_{i}}{\lambda}e^{\frac{\hat{f}_{i}}{\lambda}}) = W(\frac{1}{\lambda}e^{\frac{f_{i}}{\lambda} - 1 })\]

Finally, \(\hat{f}_{i} = \lambda W(\frac{1}{\lambda}e^{\frac{f_{i}}{\lambda} - 1 })\).

Just for fun, suppose each instance of \(\lambda\) and \(f_{i}\) in the above result were replaced by \(0.5\lambda\) and \(f_{i} + 0.5\lambda \ln f^{i}_{des}\), respectively. Then:

\(\hat{f}_{i} = 0.5\lambda W(\frac{1}{0.5\lambda}e^{\frac{f_{i} + 0.5\lambda \ln f^{i}_{des}}{0.5\lambda} - 1 }) = 0.5\lambda W(\frac{f^{i}_{des}}{0.5\lambda}e^{\frac{f_{i}}{0.5\lambda} - 1 })\)

f = np.random.uniform(0, 1, size=m)
f /= f.sum()

fhat = cp.Variable(m)
cp.Problem(cp.Minimize(cp.sum(-cp.entr(fhat)) +
                        1 / (2 * lam) * cp.sum_squares(fhat - f))).solve()
np.testing.assert_allclose(
    fhat.value, rsw.losses._entropy_prox(f, lam), atol=1e-5)
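
The closed form can also be evaluated directly with scipy.special.lambertw (the principal branch is real for these inputs, so the zero imaginary part is dropped); it should agree with _entropy_prox up to numerical tolerance:

from scipy.special import lambertw

# fhat_i = lam * W( (1/lam) * exp(f_i / lam - 1) ), from the derivation above.
fhat_lw = lam * np.real(lambertw(np.exp(f / lam - 1) / lam))
np.testing.assert_allclose(fhat_lw, rsw.losses._entropy_prox(f, lam), atol=1e-5)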

Entropy Regularizer

Minimize \(r(w)\) where

\[r(w) = \begin{cases} \sum_{i=1}^{n} w_{i}\ln w_{i}, \quad (1/(\kappa n))\mathbf{1} \leq w \leq (\kappa/n )\mathbf{1} \\ \infty \quad \text{Otherwise}\end{cases}\]

Here \(\kappa > 1\) is a hyper-parameter (also called limit in the code). Observe that the constraint on \(w\) says its entries must lie within \([1/(\kappa n), \kappa/n]\); otherwise the loss is infinite (i.e., we have an infeasible solution). If we interpret \(w_i\) as a weight, then relative to the uniform weight \(1/n\) no item can be down-weighted by more than a factor of \(\kappa\) or up-weighted by more than a factor of \(\kappa\).

The final solution is \(\mathbf{prox}_{\lambda r}(w_{\text{des}})_{i} = \begin{cases} 1/(\kappa n), \text{if } \hat{w}_{i} \leq 1/(\kappa n)\\ \kappa/n, \text{if } \hat{w}_{i} \geq \kappa/n\\ \hat{w}_{i}, \text{ otherwise}\end{cases}\) where \(\hat{w}_{i} = \lambda W(\frac{1}{\lambda}e^{\frac{w^{des}_{i}}{\lambda} - 1})\) is the entropy prox from the Entropy Loss section applied to \(w^{des}\) (some desired weight vector, provided as input, which we want \(\hat{w}\) to be close to).
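
There is no code block for this one above, so here is a minimal numpy sketch of the rule just described: apply the entropy prox and then clip into \([1/(\kappa n), \kappa/n]\). The helper name and the example vector are my own; only the clipping rule comes from the text above, and it is not meant to mirror the rsw API.

def entropy_reg_prox_sketch(wdes, lam, limit):
    # Entropy prox from the previous section (lambertw imported above),
    # followed by a clip into [1/(kappa*n), kappa/n] with kappa = limit.
    n = len(wdes)
    what = lam * np.real(lambertw(np.exp(wdes / lam - 1) / lam))
    return np.clip(what, 1 / (limit * n), limit / n)

wdes = np.random.uniform(0, 1, size=m)
wdes /= wdes.sum()
w_reg = entropy_reg_prox_sketch(wdes, lam, limit=2)
assert np.all(w_reg >= 1 / (2 * m)) and np.all(w_reg <= 2 / m)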

KL Loss

\[\min \frac{1}{2}\left[ \sum_{i=1}^{m}\hat{f}_{i}\ln\hat{f}_{i} - \sum_{i=1}^{m}\hat{f}_{i}\ln f^{i}_{des} \right] + \frac{1}{2\lambda}||\hat{f} - f||^{2}_{2}\]

Taking the derivative with respect to \(\hat{f}_{i}\) and setting equal to zero we get:

\[\frac{1}{2}\left[ 1 + \ln\hat{f}_{i} - \ln f^{i}_{des} \right] + \frac{1}{\lambda}(\hat{f}_{i} - f_{i}) = 0\]

Multiplying both sides by \(2\) and rearranging terms, we obtain: \[\ln\hat{f}_{i} + \frac{2\hat{f}_{i}}{\lambda} = \frac{2f_{i}}{\lambda} -1 + \ln f^{i}_{des}\]

Next, exponentiate both sides to get: \[\hat{f}_{i}e^{\frac{\hat{f}_{i}}{0.5\lambda}} = f^{i}_{des}e^{\frac{f_{i}}{0.5\lambda} -1}\]

Divide both sides by \(0.5\lambda\) and then apply the Lambert W function to get:

\[W(\frac{\hat{f}_{i}}{0.5\lambda}e^{\frac{\hat{f}_{i}}{0.5\lambda}}) = W(\frac{f^{i}_{des}}{0.5\lambda}e^{\frac{f_{i}}{0.5\lambda} -1})\]

Thus, \(\hat{f}_{i} = 0.5\lambda W(\frac{f^{i}_{des}}{0.5\lambda}e^{\frac{f_{i}}{0.5\lambda} -1}).\)

Hopefully, this reminds you of the last expression in the Entropy Loss section.

f = np.random.uniform(0, 1, size=m)
f /= f.sum()

fdes = np.random.uniform(0, 1, size=m)
fdes /= fdes.sum()

kl = rsw.KLLoss(fdes, scale=.5)
fhat = cp.Variable(m, nonneg=True)
cp.Problem(cp.Minimize(.5 * (cp.sum(-cp.entr(fhat) - cp.multiply(fhat, np.log(fdes)))) +
                        1 / (2 * lam) * cp.sum_squares(fhat - f))).solve()
np.testing.assert_allclose(fhat.value, kl.prox(f, lam), atol=1e-5)
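
As with the entropy loss, the Lambert W closed form can be checked numerically (lambertw as imported earlier; the factor \(0.5\lambda\) reflects the \(1/2\) scale in the objective):

# fhat_i = 0.5*lam * W( fdes_i / (0.5*lam) * exp(f_i / (0.5*lam) - 1) )
s = 0.5 * lam
fhat_lw = s * np.real(lambertw(fdes / s * np.exp(f / s - 1)))
np.testing.assert_allclose(fhat_lw, kl.prox(f, lam), atol=1e-5)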

References

Barratt, Shane, Guillermo Angeris, and Stephen Boyd. 2021. “Optimal Representative Sample Weighting.” Statistics and Computing 31 (2): 1–14.