3.1. Linear regressions#

During preliminary design and, above all, when optimising solutions, it is useful to have simple equations for calculating the main characteristics of a component. We will now look at how to transform catalogue data into algebraic equations using regression.

(Figure: Simulation Models)

For interested readers, more information can be found in the following document (Chapter 4 – Metamodels for model-based-design of mechatronic systems):

Budinger, M. (2014). Preliminary design and sizing of actuation systems (HDR dissertation, UPS Toulouse). Link

3.1.1. Statistical data analysis: main steps#

The main steps to follow in order to obtain an estimation model from data are:
1 – Obtaining data: component data can be provided by catalogues or by design models on which a set of well-chosen design parameters has been simulated (using design of experiments, for instance).
2 – Choosing parameters representing primary characteristics: the objective here is to reduce the number of inputs of the model. To do so, we can detect and select the most important parameters using a correlation analysis. A dimensional analysis can also help to reduce the number of coefficients to be considered.
3 – Choosing the mathematical model type: this means choosing the mathematical form of the equation to which the regression will be applied (polynomials, linear sums of functions, power products, …).
4 – Applying the regression: the objective is to minimize the disparity between the data and the mathematical equation of the model.

3.1.2. Gearbox Data#

The examples in this notebook are based on gear reducer data from manufacturers.

(Figure: Simulation Models)

These data are stored in Excel files, which are read here with the pandas package into dataframes. The database contains a wide range of information: mass, torsional stiffness, efficiency, dimensions, etc.

import pandas as pd

# Read the meaning of the different parameters
df_head = pd.read_excel("GearBoxHead.xlsx")
df_head
| Symbol | Meaning | Unit |
|---|---|---|
| Constructor | constructor | (-) |
| size | size | (-) |
| Ns | number of stages | (-) |
| N | ratio | (-) |
| Tout | torque output | (Nm) |
| Win | rate input | (tr/min) |
| Eta | efficiency | (%) |
| K | torsional stiffness | (Nm/arcmin) |
| Dteta | backlash | (arcmin) |
| J | inertia input | (kg cm2) |
| M | mass | (kg) |
| P | power | (W) |
| D | diameter | (mm) |
| L | length | (mm) |
# Read the data
df = pd.read_excel("GearBoxData.xlsx")

# Print the 5 first lines of the database
df.head()
|   | Constructor | size | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NEUGART | PLS 70 HP-4 | 1 | 4 | 110.0 | 1700 | 98 | 7.0 | 3.0 | 0.42 | 2.6 | 4895.648552 | 75 | 122.0 |
| 1 | NEUGART | PLS 70 HP-5 | 1 | 5 | 110.0 | 2000 | 98 | 7.0 | 3.0 | 0.37 | 2.6 | 4607.669225 | 75 | 122.0 |
| 2 | NEUGART | PLS 90 HP-4 | 1 | 4 | 220.0 | 1350 | 98 | 10.0 | 3.0 | 1.05 | 4.0 | 7775.441818 | 100 | 138.5 |
| 3 | NEUGART | PLS 90 HP-5 | 1 | 5 | 220.0 | 1600 | 98 | 10.0 | 3.0 | 0.85 | 4.0 | 7372.270760 | 100 | 138.5 |
| 4 | NEUGART | PLS 115 HP-4 | 1 | 4 | 520.0 | 800 | 98 | 22.0 | 3.0 | 2.30 | 7.5 | 10890.854532 | 130 | 189.0 |

3.1.3. Choosing parameters#

We will assume that we want to build a gearbox mass estimation model. We therefore first need to determine which parameters matter most for estimating the mass. Using the catalogue data, we can apply a correlation analysis and display the result as a table giving the correlation between each pair of parameters. This table is symmetric.

Statistics Basics: Classical statistical tools can be used to characterize a data series and the links that exist between two data series:
- The mean of each data series:
\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i\)
- The standard deviation, which gives the dispersion of a series of values around their mean value:
\(s_x=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2}\)
- The covariance, a number that makes it possible to evaluate how two variables vary together and, in that way, to decide whether these variables are independent or not. If two quantities vary in the same way, or in opposite ways, the absolute value of the covariance is high:
\(cov(x,y)=\frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})\)
- The correlation, also related to the link between two variables. It is the covariance normalized onto a scale from -1 to 1:
\(cor(x,y)=\frac{1}{n-1}\frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{s_x s_y}=\frac{cov(x,y)}{s_x s_y}\)
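These definitions can be checked directly with NumPy and compared against its built-in functions. The two series below are illustrative values only, not catalogue data.

import numpy as np

# Two small example series (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Standard deviations with the 1/(n-1) convention used above
s_x = np.sqrt(((x - x_bar) ** 2).sum() / (n - 1))
s_y = np.sqrt(((y - y_bar) ** 2).sum() / (n - 1))

# Covariance and correlation computed from their definitions
cov_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)
cor_xy = cov_xy / (s_x * s_y)

print(cov_xy, np.cov(x, y)[0, 1])        # both values match
print(cor_xy, np.corrcoef(x, y)[0, 1])   # both values match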

Dataframes can be used to calculate the correlation matrix directly and simply.

# Correlation matrix (restricted to numeric columns to avoid a FutureWarning)
df.corr(numeric_only=True)
|   | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ns | 1.000000 | 0.354501 | -0.204966 | 0.530513 | -0.734179 | -0.236948 | 0.680305 | -0.238785 | -0.103814 | -0.203182 | -0.263935 | 0.293952 |
| N | 0.354501 | 1.000000 | 0.063755 | 0.040779 | -0.608730 | 0.079991 | 0.182424 | 0.057529 | -0.000207 | -0.244010 | 0.043081 | -0.098758 |
| Tout | -0.204966 | 0.063755 | 1.000000 | -0.436716 | 0.088456 | 0.916671 | -0.320480 | 0.885686 | 0.800741 | 0.290815 | 0.798191 | 0.350260 |
| Win | 0.530513 | 0.040779 | -0.436716 | 1.000000 | -0.272927 | -0.437381 | 0.639898 | -0.301804 | -0.442977 | -0.127484 | -0.652336 | -0.139995 |
| Eta | -0.734179 | -0.608730 | 0.088456 | -0.272927 | 1.000000 | 0.063514 | -0.505211 | 0.082149 | 0.073080 | 0.357831 | 0.156639 | 0.000155 |
| K | -0.236948 | 0.079991 | 0.916671 | -0.437381 | 0.063514 | 1.000000 | -0.332987 | 0.775942 | 0.853291 | 0.219011 | 0.820234 | 0.280267 |
| Dteta | 0.680305 | 0.182424 | -0.320480 | 0.639898 | -0.505211 | -0.332987 | 1.000000 | -0.256920 | -0.288968 | -0.167897 | -0.465782 | -0.023637 |
| J | -0.238785 | 0.057529 | 0.885686 | -0.301804 | 0.082149 | 0.775942 | -0.256920 | 1.000000 | 0.587566 | 0.347930 | 0.640275 | 0.176444 |
| M | -0.103814 | -0.000207 | 0.800741 | -0.442977 | 0.073080 | 0.853291 | -0.288968 | 0.587566 | 1.000000 | 0.295261 | 0.791580 | 0.592270 |
| P | -0.203182 | -0.244010 | 0.290815 | -0.127484 | 0.357831 | 0.219011 | -0.167897 | 0.347930 | 0.295261 | 1.000000 | 0.245700 | 0.355627 |
| D | -0.263935 | 0.043081 | 0.798191 | -0.652336 | 0.156639 | 0.820234 | -0.465782 | 0.640275 | 0.791580 | 0.245700 | 1.000000 | 0.428592 |
| L | 0.293952 | -0.098758 | 0.350260 | -0.139995 | 0.000155 | 0.280267 | -0.023637 | 0.176444 | 0.592270 | 0.355627 | 0.428592 | 1.000000 |

This matrix can then be graphically formatted for easier analysis.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(16, 6))

# Correlation matrix restricted to numeric columns (avoids the FutureWarning)
corr = df.corr(numeric_only=True)

# Mask the upper triangle: ones_like builds a boolean matrix of True with the corr shape,
# triu keeps only its upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize': 18}, pad=16);
(Figure: Triangle Correlation Heatmap)

We can see that:

  • the reduction ratio and the number of stages have little influence on the mass;

  • the most important, primary parameter is the output torque.
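To confirm this reading at a glance, the correlations with the mass can also be ranked directly. This one-liner is not part of the original notebook; it only reuses the dataframe already loaded.

# Rank the parameters by the absolute value of their correlation with the mass M
df.corr(numeric_only=True)["M"].abs().sort_values(ascending=False)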

Note: the correlation matrix can also be produced in Excel using the Analysis ToolPak.

3.1.4. Linear regression#

Parametric regressions are based on models requiring the estimation of a finite number of parameters \(\theta\) that express the effects of basis functions on the response surface. For polynomial interpolation or regression, the basis functions are \(\{1, x, x^2, …, x^p\}\). The idea is to obtain a surface that is continuous and differentiable. For a polynomial development of order 2 (p = 2), the expression is:

\(y=f(\underline{x}, \underline{\theta})=\theta_0 + \sum_{i=1}^{k} \theta_i x_i + \sum_{i=1}^{k}\sum_{j=i}^{k} \theta_{ij} x_i x_j \)

with \(\underline{x}=\begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{k} \end{pmatrix}\)

which, for the \(n\) experiments, can be written with a matrix representation: \(\underline{Y}=\underline{X}.\underline{\theta}+\underline{\varepsilon}\)

with \(\underline{Y}=\begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}\), \(\underline{X}=\begin{pmatrix} 1 & x_{1}^{(1)} & \cdots & \left(x_{k}^{(1)}\right)^{2} \\ \vdots & & & \vdots \\ 1 & x_{1}^{(n)} & \cdots & \left(x_{k}^{(n)}\right)^{2} \end{pmatrix}\), \(\underline{\theta}=\begin{pmatrix} \theta_0 \\ \vdots \\ \theta_{k_\theta} \end{pmatrix}\), \(\underline{\varepsilon}=\begin{pmatrix} \varepsilon^{(1)} \\ \vdots \\ \varepsilon^{(n)} \end{pmatrix}\)

where :

  • \(k\) is the number of design parameters;

  • \(n\) is the number of experiments of the DoE;

  • \(p\) is the order of the polynomial function;

  • \(k_\theta\) is the size of the vector \(\underline{\theta}\).

If \(n=k_\theta\), the matrix \(\underline{X}\) is square and interpolation is possible. In the most common case, where \(k_\theta\) is lower than \(n\), interpolation is impossible. The objective of regression is then to find an approximation whose error has zero mean and minimum standard deviation. The errors are expressed by:
\(\underline{\varepsilon}=\underline{Y}-\underline{X}.\underline{\theta}\)

The sum of squared errors \(\underline{\varepsilon}^t\underline{\varepsilon}\), of which the standard deviation is a function, is minimum for:

\(\underline{\theta}=(\underline{X}^t\underline{X})^{-1} \underline{X}^t\underline{Y}\)
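This least-squares solution can be checked numerically. The sketch below builds a small, hypothetical design matrix and solves the normal equations with NumPy; in practice, a dedicated solver such as np.linalg.lstsq is preferred to the explicit inverse for conditioning reasons.

import numpy as np

# Hypothetical DoE with n = 4 experiments: a column of ones and one design parameter
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([[0.9], [2.1], [3.2], [3.9]])

# Normal equations: theta = (X^t X)^-1 X^t Y
theta = np.linalg.inv(X.T @ X) @ X.T @ Y

# Same estimate with the least-squares solver (no explicit inverse)
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta.ravel(), theta_lstsq.ravel())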

3.1.5. Mass and inertia regression#

The aim is to perform a regression on part of the data: the SUMITOMO cyclo-reducers. First, only the SUMITOMO rows are selected:

# Data Filtering
# Keeping only SUMITOMO type
df_S = df[df["Constructor"] == "SUMITOMO"]
df_S.head()
|   | Constructor | size | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 613 | SUMITOMO | 106 | 1 | 11 | 25.0 | 500 | 95 | 1.5 | 3.0 | 0.15 | 1.2 | 118.999722 | 85 | 130.0 |
| 614 | SUMITOMO | 106 | 1 | 17 | 25.0 | 500 | 95 | 1.6 | 3.0 | 0.14 | 1.2 | 76.999820 | 85 | 130.0 |
| 615 | SUMITOMO | 106 | 1 | 29 | 25.0 | 500 | 95 | 2.6 | 3.0 | 0.14 | 1.2 | 45.137825 | 85 | 130.0 |
| 616 | SUMITOMO | 106 | 1 | 43 | 25.0 | 500 | 95 | 2.8 | 3.0 | 0.14 | 1.2 | 30.441789 | 85 | 130.0 |
| 617 | SUMITOMO | 108 | 1 | 11 | 75.0 | 500 | 95 | 5.0 | 3.0 | 1.30 | 4.3 | 356.999165 | 118 | 180.0 |

The data show a dispersion due to several types of reducers, which differ in the type of output shaft, as the plot below illustrates. We will keep only the lightest ones here, i.e. the FC range.

# Plot the data
plt.plot(df_S["Tout"], df_S["M"], "bo")
plt.xlabel("Low speed torque (N.m)")
plt.ylabel("Mass (kg)")
plt.grid()
plt.show()
(Figure: Mass (kg) vs. low speed torque (N.m), all SUMITOMO reducers)
# Data Filtering
# Keeping only FC range type
df_S = df_S[df_S["size"].astype(str).str[0:2] == "FC"]
df_S.head()
|   | Constructor | size | Ns | N | Tout | Win | Eta | K | Dteta | J | M | P | D | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 634 | SUMITOMO | FC-A15(G) | 1 | 59 | 149.0 | 1500 | 95 | 20.0 | 2.0 | 0.313 | 2.7 | 396.692632 | 120 | 60.0 |
| 635 | SUMITOMO | FC-A15(G) | 1 | 89 | 149.0 | 1500 | 95 | 20.0 | 2.0 | 0.310 | 2.7 | 262.976014 | 120 | 60.0 |
| 636 | SUMITOMO | FC-A25(G) | 1 | 29 | 349.0 | 1500 | 95 | 53.0 | 2.0 | 1.380 | 5.2 | 1890.372131 | 150 | 75.0 |
| 637 | SUMITOMO | FC-A25(G) | 1 | 59 | 349.0 | 1500 | 95 | 70.0 | 2.0 | 1.340 | 5.2 | 929.165963 | 150 | 75.0 |
| 638 | SUMITOMO | FC-A25(G) | 1 | 89 | 349.0 | 1500 | 95 | 70.0 | 2.0 | 1.330 | 5.2 | 615.963953 | 150 | 75.0 |
# Plot the data
plt.plot(df_S["Tout"], df_S["M"], "bo")
plt.xlabel("Low speed torque (N.m)")
plt.ylabel("Mass (kg)")
plt.grid()
plt.show()
(Figure: Mass (kg) vs. low speed torque (N.m), FC range only)

Scikit-learn is a free machine learning package that can be used for linear regression. It also features various classification, regression and clustering algorithms, including support-vector machines, random forests, gradient boosting and k-means. Below is an example of using this package for a linear regression of the gear reducer data.

# Import packages
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Data
x = df_S["Tout"].values  # Torque input
y = df_S["M"].values  # Mass output

# Matrix X and Y
n = 1  # order of the polynomial regression

# The column of ones is not needed: sklearn fits the intercept itself
poly = PolynomialFeatures(degree=n, include_bias=False)
X = poly.fit_transform(x.reshape(-1, 1))  # reshape(-1, 1) turns the 1D array x into a 2D column array

Y = y.reshape(-1, 1)

# Create a new object for the linear regression
reg_M = linear_model.LinearRegression()

reg_M.fit(X, Y)

# Y vector prediction
M_est = reg_M.predict(X)

# M regression parameters
# ----
coef = reg_M.coef_
intercept = float(reg_M.intercept_)
r2 = r2_score(Y, M_est)


# Plot the data
plt.plot(x, Y, "o", label="Reference data")
plt.plot(x, M_est, "-g", label="Data prediction")
plt.xlabel("Torque (N.m)")
plt.ylabel("Mass (kg)")
plt.title("Comparison of reference data and regression")
plt.legend()
plt.grid()
plt.show()

print(f"Mass estimation model : M = {intercept:.2f} + {coef[0,0]:.2f} * Tout with R2={r2:.3f}")
(Figure: Comparison of reference data and regression)
Mass estimation model : M = 1.54 + 0.01 * Tout with R2=0.999
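Once fitted, the model can be evaluated for a new torque requirement. The torque value below is a hypothetical example, not a catalogue entry.

# Estimate the mass for a required output torque (hypothetical value)
Tout_req = 300.0  # N.m

M_req = reg_M.predict(poly.transform([[Tout_req]]))
print(f"Estimated mass: {M_req[0, 0]:.2f} kg")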

Note: the linear regression can also be produced in Excel using the Analysis ToolPak or with the trendline option on Excel charts.

Homework: Adapt the previous code to set up an inertia estimation model. Two types of models can be used and compared: an order-2 polynomial and a power law.

Note: A product of power laws can be linearized by a logarithmic transformation:
\(y=a_0 \prod_i x_i^{a_i} \longrightarrow \log(y)=\log(a_0) + \sum_i a_i \log(x_i) \)
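A minimal sketch of this transformation, assuming generic arrays x and y that follow a single power law (the values are illustrative, not catalogue data):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data assumed to follow y = a0 * x^a1 (illustration only)
x = np.array([10.0, 20.0, 50.0, 100.0, 200.0])
y = np.array([0.5, 1.4, 5.2, 14.8, 41.9])

# Linearize with logarithms, then fit an ordinary linear model
reg = LinearRegression().fit(np.log(x).reshape(-1, 1), np.log(y))
a1 = float(reg.coef_[0])
a0 = float(np.exp(reg.intercept_))

print(f"y = {a0:.3f} * x^{a1:.3f}")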