Fitting Models to Data Graphically¶

Author: S.K. Cheng¶

2024-08-04¶

Introduction¶

In mathematical modeling we repeatedly face the problem of analyzing data. One example: suppose we have analyzed the distance required to bring a car to a safe stop once the brakes are applied, and our assumptions led to a submodel of the form $$ d_b = Cv^2 $$ where $d_b$ is the distance required to stop the car, $v$ is the velocity of the car at the time the brakes are applied, and $C$ is a constant of proportionality. At this point we can collect and analyze sufficient data to determine the value of $C$ and to judge whether the model is reasonable.

In [ ]:
# Suppose the model of the form d_b = C * v^2
# where d_b is the braking distance, v is the velocity, and C is a constant.
import numpy as np
import matplotlib.pyplot as plt

# We make up some data for d_b and v
v = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
d_b = np.array([2, 8, 18, 32, 50, 72, 98, 128, 162, 200])

# We can plot the data
plt.plot(v, d_b, 'o')
plt.title('Braking distance vs velocity (Raw Data)')
plt.xlabel('Velocity (m/s)')
plt.ylabel('Braking distance (m)')
plt.show()

# We can see that the data is quadratic, so we can fit a quadratic model to it
# We can use the polyfit function to fit a polynomial of degree 2 to the data
coefficients = np.polyfit(v, d_b, 2)
print('Coefficients:', coefficients)

# We can use the polyval function to evaluate the polynomial at the data points
d_b_fit = np.polyval(coefficients, v)

# We can plot the data and the fit
plt.plot(v, d_b, 'o', label='Data')
plt.plot(v, d_b_fit, label='Fit')
plt.title('Braking distance vs velocity (Quadratic Fit)')
plt.xlabel('Velocity (m/s)')
plt.ylabel('Braking distance (m)')
plt.legend()
plt.show()
[Plot: Braking distance vs velocity (Raw Data)]
Coefficients: [ 2.00000000e-02  4.24758863e-15 -1.71187836e-13]
[Plot: Braking distance vs velocity (Quadratic Fit)]

As we can see in the output of the code above, the data is not linear, so a linear model is not appropriate; a quadratic model fits the data well. One may wonder why the fitted model appears to contain no linear or constant term. In fact, polyfit does return coefficients for the linear and constant terms, but their values are approximately zero (on the order of $10^{-15}$ for the linear term and $10^{-13}$ for the constant term).
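Since the submodel assumes $d_b$ is proportional to $v^2$, we can also estimate the single constant $C$ directly instead of fitting a full quadratic. The cell below is a minimal sketch (an addition, not part of the original analysis; the variable names are illustrative) that fits only $C$ by least squares; for this data the estimate should again be close to $0.02$.

In [ ]:
# A sketch (not part of the original analysis): fit only the constant C in d_b = C * v^2.
import numpy as np

# Same made-up data as above
v = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
d_b = np.array([2, 8, 18, 32, 50, 72, 98, 128, 162, 200])

# With a single unknown, least squares gives C = sum(v^2 * d_b) / sum(v^4)
C = np.sum(v**2 * d_b) / np.sum(v**4)
print('Estimated C:', C)

# The same estimate via a one-column linear least-squares solve
C_lstsq, *_ = np.linalg.lstsq(v[:, None]**2, d_b, rcond=None)
print('lstsq estimate:', C_lstsq[0])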

Though we have already assumed the model to be quadratic, we still need to verify the assumption by plotting the data and the model. This is the main purpose of this notebook.

Analytic Methods of Model Fitting¶

In this section we investigate several criteria for fitting curves to a collection of data points. Each criterion suggests a method for selecting the best curve from a given family so that according to the criterion, the curve most accurately represents the data. The criteria are:

  • Chebyshev Approximation Criterion
  • Minimizing the Sum of the Absolute Deviations
  • Least-Squares Criterion

Chebyshev Approximation Criterion¶

In the preceding section we graphically fit lines to a given collection of data points. One of the best-fit criteria used was to minimize the largest distance from the line to any corresponding data point. Let us analyze this geometric construction.

Given a collection of $m$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$, fit the collection with the line $y=ax+b$, determined by the parameters $a$ and $b$, that minimizes the distance between any data point $(x_i,y_i)$ and the corresponding point on the line $(x_i, ax_i+b)$. This distance is given by $$ d_i = |y_i - ax_i - b| $$ Now let us generalize this criterion. Given some function type $y=f(x)$ and a collection of $m$ data points $(x_i,y_i)$, minimize the largest absolute deviation $|y_i - f(x_i)|$ over the entire collection. That is, determine the parameters of the function type $y=f(x)$ that minimize the number $$ \operatorname{Maximum} |y_i - f(x_i)| \quad i=1,2,\dots,m $$ This important criterion is often called the Chebyshev Approximation Criterion. The difficulty with the Chebyshev criterion is that it is often complicated to apply in practice, at least using only elementary calculus. However, it is a useful criterion for understanding the nature of the best-fit curve.

Problem¶

  1. For the following data, formulate the mathematical model that minimizes the largest deviation between the data and the model $y=c_1x^2+c_2x+c_3$. If a computer is available, solve for the estimates of $c_1, c_2, c_3$.
$x$ $0.1$ $0.2$ $0.3$ $0.4$ $0.5$
$y$ $0.06$ $0.12$ $0.36$ $0.65$ $0.95$

Sol: The Chebyshev Approximation Criterion asks for the parameters $c_1, c_2, c_3$ that minimize the maximum deviation between the data and the model, where the deviation at each point is $$ d_i = |y_i - c_1x_i^2 - c_2x_i - c_3| $$ Introducing a new variable $r$ for the largest deviation, the problem is to minimize $r$ subject to $$ r - |y_i - c_1x_i^2 - c_2x_i - c_3| \ge 0 \quad i=1,2,\dots,m $$ Each absolute-value constraint can be split into the two linear constraints $r - (y_i - c_1x_i^2 - c_2x_i - c_3) \ge 0$ and $r + (y_i - c_1x_i^2 - c_2x_i - c_3) \ge 0$, so the formulation is a linear program in the variables $c_1, c_2, c_3, r$. The model that minimizes the maximum deviation is the best-fit model under this criterion.

In [ ]:
# Based on the problem, we have the following
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = np.array([0.06,0.12,0.36,0.65,0.95])

# We can plot the data
plt.plot(x, y, 'o')
plt.title('y vs x (Raw Data)')
plt.xlim(0, 0.6)
plt.ylim(0, 1)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# We can see that the data is quadratic, so we can fit a quadratic model to it
# We can use the polyfit function to fit a polynomial of degree 2 to the data
coefficients = np.polyfit(x, y, 2)
print('Coefficients:', coefficients)

# We can use the polyval function to evaluate the polynomial at the data points
y_fit = np.polyval(coefficients, x)

# We can plot the data and the fit
plt.plot(x, y, 'o', label='Data')
plt.plot(x, y_fit, label='Fit')
plt.title('y vs x (Quadratic Fit)')
plt.xlim(0, 0.6)
plt.ylim(0, 1)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

# Calculate the maximum absolute deviation
max_abs_deviation = np.max(np.abs(y - y_fit))
print('Maximum absolute deviation:', max_abs_deviation)
[Plot: y vs x (Raw Data)]
Coefficients: [3.78571429e+00 3.85714286e-02 6.03566116e-16]
[Plot: y vs x (Quadratic Fit)]
Maximum absolute deviation: 0.039142857142857534
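Note that np.polyfit minimizes the sum of squared deviations rather than the largest deviation, so the fit above is not, in general, the Chebyshev best fit; its maximum absolute deviation is only an upper bound on the Chebyshev optimum. The cell below is a minimal sketch (an addition, assuming scipy is available) that solves the linear program formulated in the solution above with scipy.optimize.linprog.

In [ ]:
# A sketch (an addition; assumes scipy is available) solving the Chebyshev problem as a
# linear program: minimize r subject to |y_i - (c1*x_i^2 + c2*x_i + c3)| <= r for all i.
import numpy as np
from scipy.optimize import linprog

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = np.array([0.06, 0.12, 0.36, 0.65, 0.95])

# Decision variables: [c1, c2, c3, r]; the objective is to minimize r
objective = np.array([0.0, 0.0, 0.0, 1.0])

X = np.column_stack([x**2, x, np.ones_like(x)])
# Split each absolute-value constraint into two linear inequalities:
#   y_i - f(x_i) <= r   and   f(x_i) - y_i <= r
A_ub = np.vstack([np.hstack([-X, -np.ones((len(x), 1))]),
                  np.hstack([ X, -np.ones((len(x), 1))])])
b_ub = np.concatenate([-y, y])

bounds = [(None, None)] * 3 + [(0, None)]  # c1, c2, c3 free; r >= 0
result = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
c1, c2, c3, r = result.x
print('Chebyshev estimates:', c1, c2, c3)
print('Largest absolute deviation:', r)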

Least-Squares Criterion¶

Currently, the most frequently used curve-fitting criterion is the Least-Squares Criterion. Using the same notation as before, the problem is to determine the parameters of the function type $y=f(x)$ that minimize the sum $$ \sum_{i=1}^m (y_i - f(x_i))^2 $$ Part of the popularity of this criterion stems from the ease with which the resulting optimization problem can be solved using only the calculus of several variables. However, relatively recent advances in mathematical programming techniques (such as the Simplex Method for solving many applications of the Chebyshev criterion) and advances in numerical methods for approximating solutions to the Chebyshev criterion promise to dissipate this advantage. The justification for the use of the least-squares method increases when we consider probabilistic arguments that assume the errors are distributed normally.
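To make the calculus remark concrete, the cell below is a small sketch (an addition, not part of the original notebook) of the resulting normal equations for the quadratic model $y=c_1x^2+c_2x+c_3$: setting the partial derivatives of $\sum_i (y_i - c_1x_i^2 - c_2x_i - c_3)^2$ with respect to $c_1, c_2, c_3$ to zero gives the linear system $(X^TX)\,c = X^Ty$, whose solution should agree with np.polyfit.

In [ ]:
# A sketch (an addition): least squares via the normal equations (X^T X) c = X^T y,
# using the data from the problem above.
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = np.array([0.06, 0.12, 0.36, 0.65, 0.95])

X = np.column_stack([x**2, x, np.ones_like(x)])  # design matrix for c1*x^2 + c2*x + c3
c = np.linalg.solve(X.T @ X, X.T @ y)            # normal equations
print('Normal-equation coefficients:', c)
print('np.polyfit coefficients:     ', np.polyfit(x, y, 2))  # should agree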

In [ ]:
# For the same problem and quadratic fit as above,
# compute the sum of squared deviations of the fit
min_sum = np.sum((y - y_fit)**2)
print('Sum of squared deviations:', min_sum)
Sum of squared deviations: 0.003005714285714285

Applying the Least-Squares Criterion¶

Project¶

  1. Lumber Cutters: Lumber cutters wish to use readily available measurements to estimate the number of board feet of lumber in a tree. Assume they measure the diameter of the tree in inches at waist height. Develop a model that predicts board feet as a function of diameter in inches.

Use the following data for your test:

$x$ $17$ $19$ $20$ $23$ $25$ $28$ $32$ $38$ $39$ $41$
$y$ $19$ $25$ $32$ $57$ $71$ $113$ $123$ $252$ $259$ $294$

The variable $x$ is the diameter of a ponderosa pine in inches, and $y$ is the number of board feet divided by $10$.

  • Consider two separate assumptions, allowing each to lead to a model. Completely analyze each model (a least-squares sketch of both submodels follows this list).
    • Assume that all trees are right-circular cylinders and are approximately the same height.
    • Assume that all trees are right-circular cylinders and that the height of the tree is proportional to the diameter.
  • Which model appears to be better? Why? Justify your conclusions.
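Before fitting general polynomials, note what the two assumptions predict: for a right-circular cylinder the volume is $\pi (d/2)^2 h$, so a constant height gives board feet proportional to $d^2$, while a height proportional to the diameter gives board feet proportional to $d^3$. The cell below is a minimal sketch (an addition, not part of the original analysis) fitting each one-parameter submodel, $y=k_1x^2$ and $y=k_2x^3$, by least squares.

In [ ]:
# A sketch (an addition): fit the two one-parameter submodels suggested by the assumptions,
#   constant height                 -> y = k1 * x^2
#   height proportional to diameter -> y = k2 * x^3
import numpy as np

x = np.array([17, 19, 20, 23, 25, 28, 32, 38, 39, 41], dtype=float)
y = np.array([19, 25, 32, 57, 71, 113, 123, 252, 259, 294], dtype=float)

# One-parameter least squares: k = sum(x^n * y) / sum(x^(2n))
k1 = np.sum(x**2 * y) / np.sum(x**4)
k2 = np.sum(x**3 * y) / np.sum(x**6)

print('y = k1*x^2:  k1 =', k1, ' sum of squared deviations =', np.sum((y - k1 * x**2)**2))
print('y = k2*x^3:  k2 =', k2, ' sum of squared deviations =', np.sum((y - k2 * x**3)**2))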
In [ ]:
# Assume that all trees are right-circular cylinders and are approximately the same height.
# The objective is to use the diameter of the tree to predict the number of board feet of lumber that can be obtained from the tree.
# The data is as follows:
x = np.array([17,19,20,23,25,28,32,38,39,41])
y = np.array([19,25,32,57,71,113,123,252,259,294])

# We can plot the data
plt.plot(x, y, 'o')
plt.title('The Number of Board Feet of Lumber vs Diameter of the Tree (Raw Data)')
plt.xlabel('Diameter of the Tree (in)')
plt.ylabel('The Number of Board Feet of Lumber')
plt.show()

# It is not easy to see the relationship between x and y, so we can take the logarithm of y and plot it against x again
y_log = np.log(y)
plt.plot(x, y_log, 'o')
plt.title('The Number of Board Feet of Lumber vs Diameter of the Tree (Logarithm of y)')
plt.xlabel('Diameter of the Tree (in)')
plt.ylabel('Log(The Number of Board Feet of Lumber)')
plt.show()

# We fit a quadratic model to the data
coefficients = np.polyfit(x, y_log, 2)
print('Coefficients:', coefficients)

# We can use the polyval function to evaluate the polynomial at the data points
y_log_fit = np.polyval(coefficients, x)

# Apply the Least-Squares Criterion in log space
min_sum = np.sum((y_log - y_log_fit)**2)
print('Sum of squared deviations (log scale):', min_sum)

# We can plot the data and the fit
plt.plot(x, y_log, 'o', label='Data')
plt.plot(x, y_log_fit, label='Fit')
plt.title('The Number of Board Feet of Lumber vs Diameter of the Tree (Quadratic Fit)')
plt.xlabel('Diameter of the Tree (in)')
plt.ylabel('Log(The Number of Board Feet of Lumber)')
plt.legend()
plt.show()

# We can convert the logarithm of y back to y
y_fit = np.exp(y_log_fit)

# Apply the Least-Squares Criterion on the original scale
min_sum = np.sum((y - y_fit)**2)
print('Sum of squared deviations (original scale):', min_sum)

# We can plot the data and the fit
plt.plot(x, y, 'o', label='Data')
plt.plot(x, y_fit, label='Fit')
plt.title('The Number of Board Feet of Lumber vs Diameter of the Tree (Quadratic Fit)')
plt.xlabel('Diameter of the Tree (in)')
plt.ylabel('The Number of Board Feet of Lumber')
plt.legend()
plt.show()
[Plot: The Number of Board Feet of Lumber vs Diameter of the Tree (Raw Data)]
[Plot: The Number of Board Feet of Lumber vs Diameter of the Tree (Logarithm of y)]
Coefficients: [-0.00295156  0.28393066 -1.02696457]
Sum of squared deviations (log scale): 0.08456348023205189
[Plot: Quadratic fit to Log(The Number of Board Feet of Lumber)]
Sum of squared deviations (original scale): 1288.4760286362534
[Plot: Quadratic fit transformed back to the original scale]

Another question is whether a cubic model would fit better. The cubic model is given by $$ y = c_1x^3 + c_2x^2 + c_3x + c_4 $$

In [ ]:
# We use the same data as before but with cubic model
# We fit a cubic model to the data
coefficients_cubic = np.polyfit(x, y_log, 3)
print('Coefficients:', coefficients_cubic)

# We can use the polyval function to evaluate the polynomial at the data points
y_log_fit_cubic = np.polyval(coefficients_cubic, x)

# Apply the Least-Squares Criterion in log space
min_sum = np.sum((y_log - y_log_fit_cubic)**2)
print('Sum of squared deviations (log scale):', min_sum)

# Transform the logarithm of y back to y
y_fit_cubic = np.exp(y_log_fit_cubic)

# Apply the Least-Squares Criterion on the original scale
min_sum = np.sum((y - y_fit_cubic)**2)
print('Sum of squared deviations (original scale):', min_sum)

# We can plot the data and the fit
plt.plot(x, y, 'o', label='Data')
plt.plot(x, y_fit, label='Quadratic Fit')
plt.plot(x, y_fit_cubic, label='Cubic Fit')
plt.title('The Number of Board Feet of Lumber vs Diameter of the Tree (Quadratic and Cubic Fit)')
plt.xlabel('Diameter of the Tree (in)')
plt.ylabel('The Number of Board Feet of Lumber')
plt.legend()
plt.show()
Coefficients: [ 1.56744915e-04 -1.64372710e-02  6.53678409e-01 -4.25135994e+00]
Sum of squared deviations (log scale): 0.06331130861072956
Sum of squared deviations (original scale): 1115.8476339863835
[Plot: Data with quadratic and cubic fits]

Judged by the Least-Squares Criterion alone, the cubic model fits better than the quadratic model, since its sum of squared deviations is smaller. However, one should also consider the complexity of the model: the cubic model has one more parameter than the quadratic model, and the improvement in fit is modest, so in this case the simpler quadratic model is preferable. To sum up, we may consider the following factors when choosing the best model:

  • The complexity of the model
  • The accuracy of the model

Conclusion¶

In this notebook, we have discussed the Chebyshev Approximation Criterion and the Least-Squares Criterion. We have also applied the Least-Squares Criterion to compare the quadratic and cubic models, since these are among the most common models used in fitting data, and we have discussed the factors to consider when choosing the best model.