Function Approximation using Neural Networks
Approximating Non-linear Functions
Imagine a tool so versatile that it can learn to recognize patterns, make decisions, and even mimic human-like reasoning. This tool isn’t a product of science fiction, but a reality in the world of computer science known as a neural network.
At its core, a neural network is inspired by the intricate web of neurons in our brains. But instead of processing thoughts and memories, it processes data and learns patterns.
What makes neural networks so powerful? They are universal function approximators, meaning they can learn to approximate virtually any continuous function. They achieve this by combining linear transformations (simple operations like scaling and shifting) with non-linear activation functions (which introduce curves and bends). This combination allows them to model intricate patterns and relationships in data, making them a cornerstone of deep learning.
But why is this capability so significant? In the vast realm of data-driven tasks, from voice recognition to predicting weather patterns, the underlying relationships are often complex and non-linear. Traditional linear models fall short in capturing these intricacies. Neural networks, with their layered architecture and non-linear activation functions, rise to the challenge, offering a flexible and powerful approach to model these relationships.
To make this concrete, let’s walk through a hands-on example. We’ll try to approximate a simple function and see firsthand why non-linearity matters.
Suppose we want to approximate the function $f(x)=x^{2}$ using a neural network. This is a simple non-linear function. If we use only linear layers, our network won’t be able to approximate this function well. But by introducing non-linearity, we can achieve a good approximation.
1. Using only Linear Layers
Let’s first try to approximate $f(x)=x^{2}$ using only linear layers.
Creating the data:
import torch
import torch.nn as nn

# seed for reproducibility
torch.manual_seed(42)

# create data
x = torch.unsqueeze(torch.linspace(-2, 2, 1000), dim=1)
y = x.pow(2)
Visualizing the data for the function $f(x)=x^{2}$:
from implicitnet.plotting import plot_function

plot_function(x, y)

Defining the linear neural network:
# linear model with one hidden layer
class LinearModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LinearModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x
Training the neural network:
# model training function
def train_model(
    x,
    y,
    model,
    criterion,
    optimizer,
    epochs,
    epoch_print_freq: int = 10,
):
    losses = []
    preds = [torch.zeros_like(y).numpy()]
    for epoch in range(epochs + 1):
        outputs = model(x)
        loss = criterion(outputs, y)
        optimizer.zero_grad()  # clear old gradients
        loss.backward()  # compute new gradients
        optimizer.step()  # update weights
        losses.append(loss.item())  # store the scalar loss, not the graph-attached tensor
        preds.append(model(x).detach().numpy())
        if epoch % epoch_print_freq == 0:
            print(f"Epoch [{epoch}/{epochs}], Loss: {loss.item():.4f}")
    return losses, preds
# instantiate model and params
linear_model = LinearModel(input_dim=1, hidden_dim=20, output_dim=1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(linear_model.parameters(), lr=0.01)
epochs = 100

# train the model
losses, preds = train_model(x, y, linear_model, criterion, optimizer, epochs)

# model predictions
with torch.no_grad():
    final_pred = linear_model(x).numpy()
Epoch [0/100], Loss: 3.2222
Epoch [10/100], Loss: 1.4375
Epoch [20/100], Loss: 1.5482
Epoch [30/100], Loss: 1.4290
Epoch [40/100], Loss: 1.4431
Epoch [50/100], Loss: 1.4289
Epoch [60/100], Loss: 1.4291
Epoch [70/100], Loss: 1.4285
Epoch [80/100], Loss: 1.4279
Epoch [90/100], Loss: 1.4280
Epoch [100/100], Loss: 1.4279
Model predictions:
from implicitnet.plotting import plot_model

plot_model(x=x, y=y, predicted=final_pred, title="Linear Model - Epoch 100")

from implicitnet.plotting import plot_model_predictions

plot_model_predictions(
    x=x,
    y=y,
    predictions=preds,
    title="Linear Model",
    function_name="y = x^2",
    figsize=(24, 8),
    rows_cols=(1, 3),
)

Creating the visual animation:
from implicitnet.plotting import plot_animation

predictions = {}
n_iters = list(range(0, 101))
for epoch in n_iters:
    predictions[epoch] = preds[epoch]

plot_animation(
    x=x,
    y=y,
    preds=predictions,
    file_name="linear",
    folder_name="linear_plots",
)
from implicitnet.plotting import create_gif

create_gif(folder_name="linear_plots", file_name="linear", n_iters=n_iters)
Visualizing the learning process:
from IPython.display import Image

Image(filename="../animations/linear_animation.gif")

We observe that the linear model doesn’t approximate the function $f(x)=x^{2}$ well, even after 100 epochs. This is because linear transformations are great for scaling, rotating, and translating data. However, no matter how many linear layers we stack together, the final transformation will always be linear. This means that the expressive power of the network remains limited.
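We can check this directly. Two stacked linear layers compute $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is again a single affine map. The sketch below (reusing the linear_model we just trained; collapsed is a throwaway single layer introduced only for this check) folds the two layers into one and confirms the outputs match:

# sanity check: two stacked linear layers collapse into one equivalent linear layer
with torch.no_grad():
    W1, b1 = linear_model.fc1.weight, linear_model.fc1.bias
    W2, b2 = linear_model.fc2.weight, linear_model.fc2.bias

    # fold both layers into a single nn.Linear with the composed weight and bias
    collapsed = nn.Linear(1, 1)
    collapsed.weight.copy_(W2 @ W1)
    collapsed.bias.copy_(W2 @ b1 + b2)

    # the collapsed layer reproduces the two-layer model (up to floating-point error)
    print(torch.allclose(linear_model(x), collapsed(x), atol=1e-6))  # expected: True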
2. Introducing Non-linearity
Now, let’s introduce a non-linear activation function (ReLU) between the linear layers. These non-linearities allow the network to model complex, non-linear relationships in the data.
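For reference, ReLU (Rectified Linear Unit) zeroes out negative inputs and passes positive inputs through unchanged:
$$ \large \text{ReLU}(x) = \max(0, x) $$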
Defining the Non-linear network:
# define a non-linear model with ReLU activation
class NonLinearModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(NonLinearModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        hidden = self.fc1(x)
        relu = torch.relu(hidden)
        output = self.fc2(relu)
        return output
Training the non-linear model:
# instantiate model and params
nonlinear_model = NonLinearModel(input_dim=1, hidden_dim=20, output_dim=1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(nonlinear_model.parameters(), lr=0.01)
epochs = 1000

# train the model
losses, preds = train_model(
    x, y, nonlinear_model, criterion, optimizer, epochs, epoch_print_freq=100
)

# model predictions
with torch.no_grad():
    nonlinear_pred = nonlinear_model(x).numpy()
Epoch [0/1000], Loss: 4.6958
Epoch [100/1000], Loss: 0.0604
Epoch [200/1000], Loss: 0.0102
Epoch [300/1000], Loss: 0.0038
Epoch [400/1000], Loss: 0.0017
Epoch [500/1000], Loss: 0.0010
Epoch [600/1000], Loss: 0.0007
Epoch [700/1000], Loss: 0.0005
Epoch [800/1000], Loss: 0.0004
Epoch [900/1000], Loss: 0.0004
Epoch [1000/1000], Loss: 0.0004
plot_model_predictions(
    x=x,
    y=y,
    predictions=preds,
    title="Non-Linear Model",
    function_name="y = x^2",
    figsize=(24, 8),
    rows_cols=(1, 3),
)

predictions = {}
n_iters = list(range(0, 50))
n_iters += list(range(50, 100, 2))
n_iters += list(range(100, 300, 5))
n_iters += list(range(300, 1001, 20))
for epoch in n_iters:
    predictions[epoch] = preds[epoch]

file_name = "nonlinear"
folder_name = "nonlinear_plots"
plot_animation(
    x=x,
    y=y,
    preds=predictions,
    file_name=file_name,
    folder_name=folder_name,
    model_name="Non-Linear Model",
)
from implicitnet.plotting import create_gif

create_gif(folder_name=folder_name, file_name=file_name, n_iters=n_iters)
Visualizing the learning process:
Image(filename="../animations/nonlinear_animation.gif")

With the introduction of non-linearity, the network can now approximate the function $f(x)=x^{2}$ much better.
3. Combining Linear and Non-linear Layers
When we combine linear layers with non-linear activation functions, the magic happens:
Expressive Power: By interleaving multiple linear layers with non-linear activations, the network gains the capability to approximate complex functions. Each successive layer refines and builds upon the features extracted by the previous layer.
Hierarchical Feature Learning: Deep networks learn hierarchical features. Initial layers tend to learn simple patterns (such as edges in images), whereas deeper layers synthesize these simple features into more complex representations (like entire objects or shapes).
Universal Approximation Theorem: This theorem states that a feed-forward network with just a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy, under mild assumptions on the activation function. In simpler terms, even a basic neural network can learn to mimic virtually any continuous pattern, given enough neurons. The depth and width of the network determine its capacity to approximate complex functions.
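Stated a bit more precisely (this is the standard form of the theorem, not tied to any code in this post): for any continuous function $f$ on a compact set $K \subset \mathbb{R}^n$ and any tolerance $\varepsilon > 0$, there exist a width $N$ and parameters $\alpha_i, w_i, b_i$ such that the single-hidden-layer network
$$ \large F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^{\top} x + b_i) $$
satisfies $|F(x) - f(x)| < \varepsilon$ for all $x \in K$, provided the activation $\sigma$ is not a polynomial (sigmoid, ReLU, and Swish all qualify).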
To illustrate this, let’s consider an even more complex non-linear function, $f(x)=\sin(x)$, and build a deeper feedforward neural network to approximate it. The network architecture consists of a sequence of linear layers interspersed with activation functions. Here’s a step-by-step breakdown:
Creating data:
# create a sine function dataset
x = torch.unsqueeze(torch.linspace(-3.5 * torch.pi, 3.5 * torch.pi, 1000), dim=1)
y = torch.sin(x)

# plot the sine function
plot_function(x, y, ylim=[-1.5, 1.5], function_name="y = sin(x)", figsize=(14, 6))

Defining a deeper non-linear model with SiLU (Swish) activation:
Note: the Swish activation function is defined as:
$$ \large \text{Swish}(x) = x \cdot \text{sigmoid}(x) $$
Swish is a smooth, non-monotonic function that is bounded below but unbounded above: it approaches zero as $x$ goes to negative infinity and grows without bound as $x$ goes to positive infinity. Unlike ReLU, which is piecewise linear, Swish is smooth everywhere, and this smoothness helps gradients (the signals used to update the network’s weights) flow more effectively during training.
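PyTorch ships this activation as nn.SiLU. A quick check (a minimal sketch reusing the torch/nn imports from earlier; x_test is a throwaway tensor so we don’t overwrite our dataset) confirms it matches the formula above:

silu = nn.SiLU()
x_test = torch.linspace(-5, 5, 11)

# nn.SiLU computes x * sigmoid(x) elementwise
print(torch.allclose(silu(x_test), x_test * torch.sigmoid(x_test)))  # expected: True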
class NeuralNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(NeuralNetwork, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.model(x)
Training the deep neural network:
# instantiate model and params
model = NeuralNetwork(input_dim=1, hidden_dim=50, output_dim=1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)
epochs = 200

# train the model
losses, preds = train_model(
    x, y, model, criterion, optimizer, epochs, epoch_print_freq=20
)

# model predictions
with torch.no_grad():
    model_pred = model(x).numpy()
Epoch [0/200], Loss: 0.6867
Epoch [20/200], Loss: 0.3592
Epoch [40/200], Loss: 0.2334
Epoch [60/200], Loss: 0.1103
Epoch [80/200], Loss: 0.0382
Epoch [100/200], Loss: 0.0417
Epoch [120/200], Loss: 0.0031
Epoch [140/200], Loss: 0.0012
Epoch [160/200], Loss: 0.0009
Epoch [180/200], Loss: 0.0008
Epoch [200/200], Loss: 0.0009
from implicitnet.plotting import plot_model

plot_model(
    x=x,
    y=y,
    predicted=model_pred,
    title="Neural Network Model",
    ylim=[-1.5, 1.5],
    function_name="y = sin(x)",
    figsize=(14, 6),
    linewidth=3,
)

from implicitnet.plotting import plot_model_predictions

plot_model_predictions(
    x=x,
    y=y,
    predictions=preds,
    title="Neural Network Model",
    ylim=[-1.5, 1.5],
    function_name="y = sin(x)",
    figsize=(10, 12),
)

Creating the visual animation:
from implicitnet.plotting import plot_animation

predictions = {}
n_iters = list(range(0, 201))
for epoch in n_iters:
    predictions[epoch] = preds[epoch]

file_name = "neuralnet"
folder_name = "neuralnet_plots"
plot_animation(
    x=x,
    y=y,
    preds=predictions,
    file_name=file_name,
    folder_name=folder_name,
    model_name="Neural Network",
    ylim=[-1.5, 1.5],
    function_name="y = sin(x)",
    figsize=(14, 6),
    linewidth=3,
)
from implicitnet.plotting import create_gif

create_gif(folder_name=folder_name, file_name=file_name, n_iters=n_iters)
Visualizing the learning process:
Image(filename="../animations/neuralnet_animation.gif")

The function $f(x)=\sin(x)$ has curves and non-linear relationships. The combination of linear layers and activation functions in the neural network allows it to “learn” these curves. By adjusting its weights through training, the network finds the best way to “bend” and “shape” its transformations to get as close as possible to this function across the input domain of interest.
Intuition
Imagine trying to fit data points with just straight lines (linear functions). You’d be quite limited. Now, introduce curves (non-linearities) to your toolkit. Suddenly, you can fit a much wider variety of shapes and patterns. In essence, the combination of linear transformations and non-linear activations gives neural networks the flexibility to “bend” and “shape” their output to approximate any given function.
Takeaways
The combination of linear and non-linear layers allows the neural network to form complex decision boundaries and represent non-linear relationships. In our examples, the non-linear models can bend and adjust their shape to fit the curves of the non-linear functions $f(x)=x^{2}$ and $f(x)=\sin(x)$, while the purely linear model can only produce straight lines and cannot fit them.
While the examples provided might seem basic, they serve a crucial purpose. Truly grasping the concept that neural networks can approximate any continuous function offers a deeper and more profound understanding of why deep learning models are so versatile and powerful.
I hope these visualizations help shed light on the underlying principles of neural networks. They were certainly helpful to me.