3.1. Linear regressions#

During preliminary design and, above all, when optimising solutions, it is useful to have simple equations for calculating the main characteristics of a component. We will now look at how to transform catalogue data into algebraic equations using regression.

(Figure: Simulation Models)

For interested readers, more information can be found in the following document (Chapter 4 – Metamodels for model-based-design of mechatronic systems):

Budinger, M. (2014). Preliminary design and sizing of actuation systems (HDR dissertation, UPS Toulouse). Link

3.1.1. Statistical data analysis: main steps#

The main steps to follow in order to obtain an estimation model from data are:
1 – Obtaining data: component data can be provided by catalogues or by design models on which a set of well-chosen design parameters has been simulated (using design of experiments, for instance).
2 – Choosing parameters representing primary characteristics: the objective here is to reduce the number of inputs of the model. To do so, we can detect and select the most important parameters using a correlation analysis. A dimensional analysis can also help to reduce the number of coefficients to be considered.
3 – Choosing the mathematical model type: this means choosing the mathematical form of the equation to which the regression will be applied (polynomials, linear sums of functions, power products, …).
4 – Applying the regression: the objective is to minimize the disparity between the data and the mathematical equation of the model.

3.1.2. Gearbox Data#

The examples in this notebook are based on gear reducer data from manufacturers.

(Figure: Simulation Models)

These data are stored in Excel files, which are read here with the pandas package into dataframes. The database contains a wide range of information: mass, torsional stiffness, efficiency, dimensions, etc.

import pandas as pd

# Read the meaning of the different parameters
df_head = pd.read_excel("GearBoxHead.xlsx")
df_head
| Symbol | Meaning | Unit |
|---|---|---|
| Constructor | constructor | (-) |
| size | size | (-) |
| Ns | number of stages | (-) |
| N | ratio | (-) |
| Tout | torque output | (Nm) |
| Win | rate input | (tr/min) |
| Eta | efficiency | (%) |
| K | torsional stiffness | (Nm/arcmin) |
| Dteta | backlash | (arcmin) |
| J | inertia input | (kg cm2) |
| M | mass | (kg) |
| P | power | (W) |
| D | diameter | (mm) |
| L | length | (mm) |
# Read the data
df = pd.read_excel("GearBoxData.xlsx")

# Print the 5 first lines of the database
df.head()
|   | Constructor | size | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NEUGART | PLS 70 HP-4 | 1 | 4 | 110.0 | 1700 | 98 | 7.0 | 3.0 | 0.42 | 2.6 | 4895.648552 | 75 | 122.0 |
| 1 | NEUGART | PLS 70 HP-5 | 1 | 5 | 110.0 | 2000 | 98 | 7.0 | 3.0 | 0.37 | 2.6 | 4607.669225 | 75 | 122.0 |
| 2 | NEUGART | PLS 90 HP-4 | 1 | 4 | 220.0 | 1350 | 98 | 10.0 | 3.0 | 1.05 | 4.0 | 7775.441818 | 100 | 138.5 |
| 3 | NEUGART | PLS 90 HP-5 | 1 | 5 | 220.0 | 1600 | 98 | 10.0 | 3.0 | 0.85 | 4.0 | 7372.270760 | 100 | 138.5 |
| 4 | NEUGART | PLS 115 HP-4 | 1 | 4 | 520.0 | 800 | 98 | 22.0 | 3.0 | 2.30 | 7.5 | 10890.854532 | 130 | 189.0 |

3.1.3. Choosing parameters#

We will assume that we want to build a gearbox mass estimation model. We therefore first need to determine which parameters matter most for estimating the mass. Using the catalogue data, we can apply a correlation analysis and display the result as a table giving the correlation between each pair of parameters. This table is symmetric.

Statistics Basics: Classical statistical tools can be used to characterize a data series and the links that exist between two data series:
- The mean of each data series:
\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i\)
- The standard deviation, which gives the dispersion of a series of values around their mean value:
\(s_x=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2}\)
- The covariance, a number that makes it possible to evaluate how two variables vary together and, in that way, to decide whether these variables are independent or not. If two quantities vary in the same way, or in opposite ways, the absolute value of the covariance is high:
\(cov(x,y)=\frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})\)
- The correlation, also related to the link between two variables. It is the covariance normalized onto a scale from -1 to 1:
\(cor(x,y)=\frac{1}{n-1}\frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{s_x s_y}=\frac{cov(x,y)}{s_x s_y}\)
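These definitions can be checked directly with NumPy and compared against its built-in functions. The two series below are illustrative values only, not catalogue data.

import numpy as np

# Two small example series (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Standard deviations with the 1/(n-1) convention used above
s_x = np.sqrt(((x - x_bar) ** 2).sum() / (n - 1))
s_y = np.sqrt(((y - y_bar) ** 2).sum() / (n - 1))

# Covariance and correlation computed from their definitions
cov_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)
cor_xy = cov_xy / (s_x * s_y)

print(cov_xy, np.cov(x, y)[0, 1])        # both values match
print(cor_xy, np.corrcoef(x, y)[0, 1])   # both values match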

Dataframes can be used to calculate the correlation matrix directly and simply.

# Correlation matrix (restricted to numeric columns to avoid a FutureWarning)
df.corr(numeric_only=True)
|   | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ns | 1.000000 | 0.354501 | -0.204966 | 0.530513 | -0.734179 | -0.236948 | 0.680305 | -0.238785 | -0.103814 | -0.203182 | -0.263935 | 0.293952 |
| N | 0.354501 | 1.000000 | 0.063755 | 0.040779 | -0.608730 | 0.079991 | 0.182424 | 0.057529 | -0.000207 | -0.244010 | 0.043081 | -0.098758 |
| Tout | -0.204966 | 0.063755 | 1.000000 | -0.436716 | 0.088456 | 0.916671 | -0.320480 | 0.885686 | 0.800741 | 0.290815 | 0.798191 | 0.350260 |
| Win | 0.530513 | 0.040779 | -0.436716 | 1.000000 | -0.272927 | -0.437381 | 0.639898 | -0.301804 | -0.442977 | -0.127484 | -0.652336 | -0.139995 |
| Eta | -0.734179 | -0.608730 | 0.088456 | -0.272927 | 1.000000 | 0.063514 | -0.505211 | 0.082149 | 0.073080 | 0.357831 | 0.156639 | 0.000155 |
| K | -0.236948 | 0.079991 | 0.916671 | -0.437381 | 0.063514 | 1.000000 | -0.332987 | 0.775942 | 0.853291 | 0.219011 | 0.820234 | 0.280267 |
| Dteta | 0.680305 | 0.182424 | -0.320480 | 0.639898 | -0.505211 | -0.332987 | 1.000000 | -0.256920 | -0.288968 | -0.167897 | -0.465782 | -0.023637 |
| J | -0.238785 | 0.057529 | 0.885686 | -0.301804 | 0.082149 | 0.775942 | -0.256920 | 1.000000 | 0.587566 | 0.347930 | 0.640275 | 0.176444 |
| M | -0.103814 | -0.000207 | 0.800741 | -0.442977 | 0.073080 | 0.853291 | -0.288968 | 0.587566 | 1.000000 | 0.295261 | 0.791580 | 0.592270 |
| P | -0.203182 | -0.244010 | 0.290815 | -0.127484 | 0.357831 | 0.219011 | -0.167897 | 0.347930 | 0.295261 | 1.000000 | 0.245700 | 0.355627 |
| D | -0.263935 | 0.043081 | 0.798191 | -0.652336 | 0.156639 | 0.820234 | -0.465782 | 0.640275 | 0.791580 | 0.245700 | 1.000000 | 0.428592 |
| L | 0.293952 | -0.098758 | 0.350260 | -0.139995 | 0.000155 | 0.280267 | -0.023637 | 0.176444 | 0.592270 | 0.355627 | 0.428592 | 1.000000 |

This matrix can then be graphically formatted for easier analysis.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(16, 6))

# Correlation matrix restricted to numeric columns (avoids the FutureWarning)
corr = df.corr(numeric_only=True)

# Mask the upper triangle: ones_like builds a boolean matrix of True with the corr shape,
# triu keeps only its upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize': 18}, pad=16);
(Figure: Triangle Correlation Heatmap)

We can see that:

  • the reduction ratio and the number of stages have little influence on the mass;

  • the most important, primary parameter is the output torque.
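To confirm this reading at a glance, the correlations with the mass can also be ranked directly. This one-liner is not part of the original notebook; it only reuses the dataframe already loaded.

# Rank the parameters by the absolute value of their correlation with the mass M
df.corr(numeric_only=True)["M"].abs().sort_values(ascending=False)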

Note: the correlation matrix can also be produced in Excel using the Analysis ToolPak.

3.1.4. Linear regression#

Parametric regressions are based on models requiring the estimation of a finite number of parameters \(\theta\) that express the effects of basis functions on the response surface. For polynomial interpolation or regression, the basis functions are \(\{1, x, x^2, …, x^p\}\). The idea is to obtain a surface that is continuous and differentiable. For a polynomial development of order 2 (p = 2), the expression is:

\(y=f(\underline{x}, \underline{\theta})=\theta_0 + \sum_{i=1}^{k} \theta_i x_i + \sum_{i=1}^{k}\sum_{j=i}^{k} \theta_{ij} x_i x_j \)

with \(\underline{x}=\begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{k} \end{pmatrix}\)

which, for the \(n\) experiments, can be written with a matrix representation: \(\underline{Y}=\underline{X}.\underline{\theta}+\underline{\varepsilon}\)

with \(\underline{Y}=\begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}\), \(\underline{X}=\begin{pmatrix} 1 & x_{1}^{(1)} & \cdots & \left(x_{k}^{(1)}\right)^{2} \\ \vdots & & & \vdots \\ 1 & x_{1}^{(n)} & \cdots & \left(x_{k}^{(n)}\right)^{2} \end{pmatrix}\), \(\underline{\theta}=\begin{pmatrix} \theta_0 \\ \vdots \\ \theta_{k_\theta} \end{pmatrix}\), \(\underline{\varepsilon}=\begin{pmatrix} \varepsilon^{(1)} \\ \vdots \\ \varepsilon^{(n)} \end{pmatrix}\)

where :

  • \(k\) is the number of design parameters;

  • \(n\) is the number of experiments of the DoE;

  • \(p\) is the order of the polynomial function;

  • \(k_\theta\) is the size of the vector \(\underline{\theta}\).

If \(n=k_\theta\), the matrix \(\underline{X}\) is square and interpolation is possible. In the most common case, where \(k_\theta\) is lower than \(n\), interpolation is impossible. The objective of regression is then to find an approximation whose error has zero mean and minimum standard deviation. The errors are expressed by:
\(\underline{\varepsilon}=\underline{Y}-\underline{X}.\underline{\theta}\)

The sum of squared errors \(\underline{\varepsilon}^t\underline{\varepsilon}\), of which the standard deviation is a function, is minimum for:

\(\underline{\theta}=(\underline{X}^t\underline{X})^{-1} \underline{X}^t\underline{Y}\)
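This least-squares solution can be checked numerically. The sketch below builds a small, hypothetical design matrix and solves the normal equations with NumPy; in practice, a dedicated solver such as np.linalg.lstsq is preferred to the explicit inverse for conditioning reasons.

import numpy as np

# Hypothetical DoE with n = 4 experiments: a column of ones and one design parameter
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([[0.9], [2.1], [3.2], [3.9]])

# Normal equations: theta = (X^t X)^-1 X^t Y
theta = np.linalg.inv(X.T @ X) @ X.T @ Y

# Same estimate with the least-squares solver (no explicit inverse)
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta.ravel(), theta_lstsq.ravel())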

3.1.5. Mass and inertia regression#

The aim is to perform a regression on part of the data: the SUMITOMO cyclo-reducers. First, only the SUMITOMO rows are selected:

# Data Filtering
# Keeping only SUMITOMO type
df_S = df[df["Constructor"] == "SUMITOMO"]
df_S.head()
|   | Constructor | size | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 613 | SUMITOMO | 106 | 1 | 11 | 25.0 | 500 | 95 | 1.5 | 3.0 | 0.15 | 1.2 | 118.999722 | 85 | 130.0 |
| 614 | SUMITOMO | 106 | 1 | 17 | 25.0 | 500 | 95 | 1.6 | 3.0 | 0.14 | 1.2 | 76.999820 | 85 | 130.0 |
| 615 | SUMITOMO | 106 | 1 | 29 | 25.0 | 500 | 95 | 2.6 | 3.0 | 0.14 | 1.2 | 45.137825 | 85 | 130.0 |
| 616 | SUMITOMO | 106 | 1 | 43 | 25.0 | 500 | 95 | 2.8 | 3.0 | 0.14 | 1.2 | 30.441789 | 85 | 130.0 |
| 617 | SUMITOMO | 108 | 1 | 11 | 75.0 | 500 | 95 | 5.0 | 3.0 | 1.30 | 4.3 | 356.999165 | 118 | 180.0 |

The data show a dispersion due to several types of reducers, which differ in the type of output shaft, as the plot below illustrates. We will keep only the lightest ones here, i.e. the FC range.

# Plot the data
plt.plot(df_S["Tout"], df_S["M"], "bo")
plt.xlabel("Low speed torque (N.m)")
plt.ylabel("Mass (kg)")
plt.grid()
plt.show()
(Figure: Mass (kg) vs. low speed torque (N.m), all SUMITOMO reducers)
# Data Filtering
# Keeping only FC range type
df_S = df_S[df_S["size"].astype(str).str[0:2] == "FC"]
df_S.head()
|   | Constructor | size | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 634 | SUMITOMO | FC-A15(G) | 1 | 59 | 149.0 | 1500 | 95 | 20.0 | 2.0 | 0.313 | 2.7 | 396.692632 | 120 | 60.0 |
| 635 | SUMITOMO | FC-A15(G) | 1 | 89 | 149.0 | 1500 | 95 | 20.0 | 2.0 | 0.310 | 2.7 | 262.976014 | 120 | 60.0 |
| 636 | SUMITOMO | FC-A25(G) | 1 | 29 | 349.0 | 1500 | 95 | 53.0 | 2.0 | 1.380 | 5.2 | 1890.372131 | 150 | 75.0 |
| 637 | SUMITOMO | FC-A25(G) | 1 | 59 | 349.0 | 1500 | 95 | 70.0 | 2.0 | 1.340 | 5.2 | 929.165963 | 150 | 75.0 |
| 638 | SUMITOMO | FC-A25(G) | 1 | 89 | 349.0 | 1500 | 95 | 70.0 | 2.0 | 1.330 | 5.2 | 615.963953 | 150 | 75.0 |
# Plot the data
plt.plot(df_S["Tout"], df_S["M"], "bo")
plt.xlabel("Low speed torque (N.m)")
plt.ylabel("Mass (kg)")
plt.grid()
plt.show()
(Figure: Mass (kg) vs. low speed torque (N.m), FC range only)

Scikit-learn is a free machine learning package that can be used for linear regression. It also features various classification, regression and clustering algorithms, including support-vector machines, random forests, gradient boosting and k-means. Below is an example of using this package for a linear regression of the gear reducer data.

# Import packages
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Data
x = df_S["Tout"].values  # Torque input
y = df_S["M"].values  # Mass output

# Matrix X and Y
n = 1  # order of the polynomial regression

# The column of ones is not needed: sklearn fits the intercept itself
poly = PolynomialFeatures(degree=n, include_bias=False)
X = poly.fit_transform(x.reshape(-1, 1))  # reshape(-1, 1) turns the 1D array x into a 2D column array

Y = y.reshape(-1, 1)

# Create a new object for the linear regression
reg_M = linear_model.LinearRegression()

reg_M.fit(X, Y)

# Y vector prediction
M_est = reg_M.predict(X)

# M regression parameters
# ----
coef = reg_M.coef_
intercept = float(reg_M.intercept_)
r2 = r2_score(Y, M_est)


# Plot the data
plt.plot(x, Y, "o", label="Reference data")
plt.plot(x, M_est, "-g", label="Data prediction")
plt.xlabel("Torque (N.m)")
plt.ylabel("Mass (kg)")
plt.title("Comparison of reference data and regression")
plt.legend()
plt.grid()
plt.show()

print(f"Mass estimation model : M = {intercept:.2f} + {coef[0,0]:.2f} * Tout with R2={r2:.3f}")
(Figure: Comparison of reference data and regression)
Mass estimation model : M = 1.54 + 0.01 * Tout with R2=0.999
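Once fitted, the model can be evaluated for a new torque requirement. The torque value below is a hypothetical example, not a catalogue entry.

# Estimate the mass for a required output torque (hypothetical value)
Tout_req = 300.0  # N.m

M_req = reg_M.predict(poly.transform([[Tout_req]]))
print(f"Estimated mass: {M_req[0, 0]:.2f} kg")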

Note: the linear regression can also be produced in Excel using the Analysis ToolPak or with the trendline option on Excel charts.

Homework: Adapt the previous code to set up an inertia estimation model. Two types of models can be used and compared: an order-2 polynomial and a power law.

Note: A product of power laws can be linearized by a logarithmic transformation:
\(y=a_0 \prod_i x_i^{a_i} \longrightarrow \log(y)=\log(a_0) + \sum_i a_i \log(x_i) \)
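A minimal sketch of this transformation, assuming generic arrays x and y that follow a single power law (the values are illustrative, not catalogue data):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data assumed to follow y = a0 * x^a1 (illustration only)
x = np.array([10.0, 20.0, 50.0, 100.0, 200.0])
y = np.array([0.5, 1.4, 5.2, 14.8, 41.9])

# Linearize with logarithms, then fit an ordinary linear model
reg = LinearRegression().fit(np.log(x).reshape(-1, 1), np.log(y))
a1 = float(reg.coef_[0])
a0 = float(np.exp(reg.intercept_))

print(f"y = {a0:.3f} * x^{a1:.3f}")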