## Statsmodels: a general statistical modeling library for Python

Understanding oneself 2020-11-13 10:09:26

As noted before, sklearn's linear models don't report R², the F test, t tests on the regression coefficients, and similar statistics, which is why I turned to the statsmodels library. Its summary output looks reassuringly familiar (very R-like).

# 1 Installation

```
pip install statsmodels
```

But the import may then fail with:

```
ImportError: cannot import name 'factorial' from 'scipy.misc'
(E:\Anaconda3.7\lib\site-packages\scipy\misc\__init__.py)
```

This is a scipy version mismatch. I removed the old install with `pip uninstall statsmodels` and reinstalled a newer build:

```
pip install --pre statsmodels -i https://pypi.tuna.tsinghua.edu.cn/simple
```
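After reinstalling, a quick import check confirms that the `ImportError` above is gone and which version is now active:

```python
import statsmodels

# if the scipy/statsmodels mismatch were still present,
# this import itself would raise ImportError
print(statsmodels.__version__)
```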

# 2 Overview of the available models

The relevant documentation is at: https://www.statsmodels.org/stable/examples/index.html

Among the model families covered:

## 2.1 Discrete choice models (Discrete Choice Model, DCM)

Reference: "A brief introduction to discrete choice models (DCM), part 1"

Discrete choice models (Discrete Choice Model, DCM) are widely used in economics and sociology.
For example, a consumer buying a car usually compares several brands: Ford, Honda, Volkswagen, and so on.
If choosing Ford is coded Y=1, Honda Y=2, and Volkswagen Y=3, then when studying which brand consumers choose, the dependent variable is not continuous (Y = 1, 2, 3), so the traditional linear regression model runs into limitations (see part 2 of the DCM series).
Another example comes from traffic safety research, where accident severity is usually divided into 3 categories:

- (1) Property Damage Only (PDO),
- (2) Injury,
- (3) Fatality.

When studying how various factors (road gradient, curve curvature, vehicle age, lighting, weather conditions, etc.) affect accident severity, the dependent variable (severity) is discrete (only 3 options), and a discrete choice model provides an effective modeling approach.

# 3 Model demos

## 3.1 Linear regression (OLS)

See: https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html

```
# Linear model
import numpy as np
import statsmodels.api as sm

x = np.linspace(0, 10, 100)
X = sm.add_constant(x)              # add an intercept column
y = 3 * x + 10 + np.random.randn(100)

# Fit and summarize the OLS model
mod = sm.OLS(y, X)
result = mod.fit()
print('Parameters: ', result.params)
print('Standard errors: ', result.bse)
print('Predicted values: ', result.predict())
print(result.summary())

# Predict on (the first few rows of) the design matrix
print(result.predict(X[:5]))
```

The output format is very familiar.

- `result.params` holds the regression coefficients
- `result.summary()` prints the full model report with the associated statistics

Note that when predicting, if `result.predict()` is called with no arguments, it defaults to the `X` the model was fit on.

## 3.2 Generalized linear models (GLM)

Reference: https://www.statsmodels.org/stable/examples/notebooks/generated/glm.html

```
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Star98 example dataset shipped with statsmodels
star98 = sm.datasets.star98.load_pandas().data

formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
dta = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
              'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
              'PERSPENK', 'PTRATIO', 'PCTAF']].copy()
endog = dta['NABOVE'] / (dta['NABOVE'] + dta.pop('NBELOW'))
del dta['NABOVE']
dta['SUCCESS'] = endog
mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod1.summary()
mod1.predict(dta)
```

`formula` is an R-style formula; all the X/Y columns it refers to live in a single DataFrame.
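A minimal, self-contained illustration of the formula interface on made-up data (the column names `x1`, `x2`, `y` are invented for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = 2.0 * df['x1'] - 1.0 * df['x2'] + rng.normal(scale=0.1, size=200)

# names in the formula refer to columns of `data`;
# an intercept term is included automatically
res = smf.ols('y ~ x1 + x2', data=df).fit()
print(res.params)
```

The fitted coefficients come back indexed by the column names from the formula, which is much more readable than positional indexing.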

```
print('Parameters: ', mod1.params)
print('T-values: ', mod1.tvalues)
```

These give the regression coefficients and their t statistics.

## 3.3 Robust regression (RLM)

Reference: https://www.statsmodels.org/stable/examples/notebooks/generated/robust_models_0.html

```
import numpy as np
import statsmodels.api as sm

nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1 - 5)**2))
X = sm.add_constant(X)            # intercept, linear and quadratic terms
sig = 0.3   # smaller error variance makes the OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig * 1. * np.random.normal(size=nsample)
y2[[39, 41, 43, 45, 48]] -= 5     # add some outliers (10% of nsample)

X2 = X[:, [0, 1]]                 # drop the quadratic term
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
print(resrlm2.summary())
```

# 4 Miscellaneous

## 4.1 Can the model results be exported to CSV?

The model summary can be exported with `as_csv()`:

```
resrlm2 = sm.RLM(y, x).fit()
resrlm2.summary()
with open('model_rlm.csv', 'w') as fh:
    fh.write(resrlm2.summary().as_csv())
```

But the exported format is a little strange: it is essentially the printed summary layout dumped as CSV, not a tidy table.

## 4.2 Plotting the fit and saving the figure

```
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Prepare the data
x = np.linspace(0, 10, 100)
X = sm.add_constant(x)
y = 3 * x + 10 + np.random.randn(100)

# Fit and summarize the OLS model
res = sm.OLS(y, X).fit()
print(res.params)
print(res.summary())
resrlm = sm.RLM(y, X).fit()

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label="true y")
ax.plot(x, res.predict(), 'o', label="ols")      # res.predict(X) == res.predict()
ax.plot(x, resrlm.predict(), 'b-', label="rlm")  # resrlm.predict(X) == resrlm.predict()
ax.legend(loc="best")

# Save the figure
plt.savefig('image.jpg')
```

## 4.3 Quickly pulling out the key statistics: R², the F test, p-values, t values

```
def get_model_param(res2, name='all'):
    model_param_dict = {'name': name,               # model name
                        'rsquared': res2.rsquared,  # R squared
                        'fvalue': res2.fvalue,      # F statistic, whole model
                        'f_pvalue': res2.f_pvalue,  # p-value, whole model
                        'params': res2.params[0],   # regression coefficient
                        'pvalues': res2.pvalues[0], # p-value of the coefficient
                        'tvalues': res2.tvalues[0]} # t statistic of the coefficient
    return model_param_dict
```