from fastai.imports import *
from fastai.torch_imports import *
from fastai.torch_core import *
Motivation
Someone walks up to you and hands over a simple data set of x and y coordinates. You eyeball it and think, “hmm seems like we can fit a quadratic function to this”. This post walks through how this can be done using gradient descent (and follows the treatment in Chapter 4 of [1]).
= 20
num_points = torch.arange(0,num_points).float()
x x
tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19.])
= 0.75
d = 9.5
e = 1 + torch.randn(num_points)*3
noise = d*(x-e)**2 + noise
y -2*d*e, d*(e**2) y,
(tensor([68.6996, 53.2421, 43.1072, 35.0240, 18.7474, 20.2123, 4.7285, 7.6421,
0.6463, -0.4163, -3.0223, 5.8641, 4.2702, 12.0058, 16.8083, 20.2149,
32.4927, 41.8890, 59.8105, 63.5625]),
-14.25,
67.6875)
Since \(d(x-e)^{2} = d*(x^{2} -2ex + e^{2}) = dx^{2} - 2dex + d*e^{2}\) we expect to see the parameters \(d\), \(-2*d*e\) and \(\approx 1+d*e^{2}\) being uncovered. The last term is because we are adding a \(1\) in the noise.
The data then looks like the following
plt.scatter(x,y)
Recipe
Initialize the parameters
We need three parameters to describe a quadratic \(ax^{2}+bx +c\)
= torch.randn(3).requires_grad_() params
Calculate the predictions
We create a function where we can plug in our input features (\(x\) coordinate values in this case) and get a prediction for the \(y\). The function below will do just that.
def f(x, params):
=params
a,b,creturn a*(x**2) + b*x + c
The predictions, yhat, can then be obtained using
= f(x, params)
yhat yhat
tensor([-1.0763e-01, -9.4094e-01, -4.0345e+00, -9.3884e+00, -1.7002e+01,
-2.6877e+01, -3.9011e+01, -5.3406e+01, -7.0062e+01, -8.8977e+01,
-1.1015e+02, -1.3359e+02, -1.5928e+02, -1.8724e+02, -2.1746e+02,
-2.4994e+02, -2.8467e+02, -3.2167e+02, -3.6093e+02, -4.0245e+02],
grad_fn=<AddBackward0>)
Compare predictions to the data
plt.scatter(x, y)= 'red') plt.scatter(x, to_np(yhat), color
Calculate the loss
We will use mean squared error for this.
def mse(yhat, y):
return ((yhat-y)**2).mean()
= mse(yhat, y)
loss loss
tensor(44301.6484, grad_fn=<MeanBackward0>)
Calculate the gradients
loss.backward()
params.grad
tensor([-70097.9297, -4488.5391, -324.2715])
Take a gradient descent step
Set a learning rate, lr, and then do a step like so
= 1e-5
lr -= lr*params.grad.data
params.data = None params.grad
Get a new prediction and plot again.
plt.scatter(x, y)= 'red') plt.scatter(x, to_np(f(x,params)), color
All together now
Not much changed when we took a single gradient step so we will want to do this for multiple iterations. Before doing that we make a simple function, called step
, that will apply all of the ingredients of the recipe.
def step(params, lr=1e-5, prn=True):
= f(x, params)
yhat = mse(yhat, y)
loss
loss.backward()
-= lr*params.grad.data
params.data = None
params.grad
if prn: print(loss.item(), params)
return yhat
And then another function that will call the step
function as many times as we desire.
def learner(num_iterations, record_interval, lr):
= []
preds_list = []
params_list for i in range(num_iterations):
= i % record_interval == 0
prn = step(params, lr, prn)
preds if prn:
preds_list.append(preds)
params_list.append(params.detach().clone())return preds_list, params_list
We will run \(1,000,000\) iterations and record results every \(100,000\) iterations. The first item is the loss and the second are the parameters \(a,b\) and \(c\) that define the quadratic we want to uncover/recover.
= learner(1000000, 100000, 1e-5) preds_list, params_list
8900.236328125 tensor([-0.1242, 0.3612, -0.1029], requires_grad=True)
382.91522216796875 tensor([ 0.2762, -3.2625, 15.3551], requires_grad=True)
231.8618621826172 tensor([ 0.3839, -5.7625, 27.3517], requires_grad=True)
141.83514404296875 tensor([ 0.4671, -7.6927, 36.6108], requires_grad=True)
88.12383270263672 tensor([ 0.5312, -9.1819, 43.7649], requires_grad=True)
56.144981384277344 tensor([ 0.5809, -10.3335, 49.2795], requires_grad=True)
37.080955505371094 tensor([ 0.6190, -11.2180, 53.5353], requires_grad=True)
25.659448623657227 tensor([ 0.6487, -11.9067, 56.8358], requires_grad=True)
18.909128189086914 tensor([ 0.6714, -12.4334, 59.3597], requires_grad=True)
14.805198669433594 tensor([ 0.6891, -12.8450, 61.3470], requires_grad=True)
Next we plot what our best estimate of the quadratic function looked like at every \(100,000\) iterations.
= len(preds_list)
num_preds = plt.subplots(1, num_preds, figsize=(24,3))
_, axs for j in range(num_preds):
= params_list[j]
a,b,c = a
dhat = -1*b/(2*a)
ehat
axs[j].scatter(x, y)='red')
axs[j].scatter(x, to_np(preds_list[j]), color
f'd={dhat:3.3f} e={ehat:3.3f}')
axs[j].set_title(
plt.tight_layout()
The captions on each plot provide an estimate of \(d\) and \(e\). At the end of one million iterations we have \(\hat{d} = 0.689\) and \(\hat{e}=9.32\).
Recall that we generated the input data using \(d=0.75\) and \(e=9.5\).