Author: Jason Brownlee. Translation: Wang Khan. Proofreading: wwl.
This article is about 7,000 words and has a recommended reading time of 16 minutes.
This article shows you how to develop and evaluate blending ensembles in Python, and how to use them for classification and regression problems.
Blending is an ensemble machine learning algorithm.
It is a colloquial name for stacked generalization, or stacking, where the meta-model is fit on the predictions that the base models make on a holdout dataset, rather than on out-of-fold predictions.
Blending was used to describe the stacking models that competitors built from hundreds of predictive models in the $1 million Netflix machine learning prize, and it remains a popular technique and name in competitive machine learning circles, such as the Kaggle community.
In this tutorial, you will discover how to develop and evaluate blending ensembles in Python.
After completing this tutorial, you will know:
Blending ensembles are a type of stacking where the meta-model is fit on predictions made on a holdout validation dataset instead of on out-of-fold predictions.
How to develop a blending ensemble, including functions for training the model and making predictions on new data.
How to evaluate blending ensembles for classification and regression predictive modeling problems.
Let's get started.
Tutorial Overview
This tutorial is divided into four parts; they are:
Blending Ensemble
Developing a Blending Ensemble
Blending Ensemble for Classification
Blending Ensemble for Regression
Blending Ensemble
Blending is an ensemble machine learning technique that uses a machine learning model to learn how to best combine the predictions from multiple contributing member models.
Broadly conceived, blending is the same as stacked generalization, or stacking. Blending and stacking are often used interchangeably in the same paper or model description.
“Many machine learning practitioners have had success using stacking and related techniques to boost prediction accuracy beyond the level obtained by any of the individual models. In some contexts, stacking is also referred to as blending, and we will use the terms interchangeably here.”
Feature-Weighted Linear Stacking, 2009.
The architecture of a stacking model involves two or more base models (often referred to as level-0 models) and a meta-model (referred to as a level-1 model) that combines the predictions of the base models. The meta-model is trained on the predictions that the base models make on out-of-sample data.
Level-0 models (base models): models fit on the training data and whose predictions are compiled.
Level-1 model (meta-model): the model that learns how to best combine the predictions of the base models.
Nevertheless, blending has specific connotations for how to construct a stacking ensemble model.
Blending may suggest developing a stacking ensemble where the base models are machine learning models of any type, and the meta-model is a linear model that “blends” the predictions of the base models.
For example, a linear regression model when predicting a numerical value, or a logistic regression model when predicting a class label, would calculate a weighted sum of the predictions made by the base models and would be considered a blending of the predictions; the short sketch after the definition below makes this concrete.
Blending ensemble: use of a linear model, such as linear regression or logistic regression, as the meta-model in a stacking ensemble.
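To make the weighted sum concrete, here is a minimal sketch of what a fitted linear meta-model computes. The three base-model predictions and the weights are hypothetical numbers for illustration, not values from this tutorial.
# a linear blend is just a weighted sum of base-model predictions (hypothetical numbers)
import numpy as np
base_preds = np.array([2.1, 1.9, 2.4])  # predictions from three base models for one example
weights = np.array([0.5, 0.3, 0.2])     # coefficients learned by the linear meta-model
intercept = 0.05                        # intercept learned by the meta-model
blended = intercept + np.dot(weights, base_preds)
print(blended)  # about 2.15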
Blending was the term commonly used for a stacking ensemble during the Netflix Prize in 2009. The prize involved teams seeking movie recommendation predictions that performed better than Netflix's own algorithm, with a $1 million prize awarded to the team achieving a 10 percent performance improvement.
“Our RMSE=0.8643^2 solution is a linear blend of over 100 results. […] Throughout the description of the methods, we highlight the specific predictors that participated in the final blended solution.”
The BellKor 2008 Solution to the Netflix Prize, 2008.
As such, blending is a colloquial term for ensemble learning with a stacking-type architecture. It is rarely used in textbooks or academic papers, other than those related to competitive machine learning.
Most commonly, blending is used to describe the specific application of stacking where the meta-model is trained on the predictions made by the base models on a holdout validation dataset. In this context, stacking is reserved for a meta-model that is trained on out-of-fold predictions made during a cross-validation procedure.
Blending: a stacking-type ensemble where the meta-model is trained on predictions made on a holdout validation dataset.
Stacking: a stacking-type ensemble where the meta-model is trained on out-of-fold predictions made during k-fold cross-validation.
This distinction is common among the Kaggle competitive machine learning community.
“Blend(ing) is a word introduced by the Netflix winners. It is very close to stacked generalization, but a bit simpler and less risk of an information leak. […] With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.”
Kaggle Ensemble Guide, MLWave, 2015.
We will use this latter definition of blending; the short sketch below contrasts how the two approaches produce the meta-model's training inputs.
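As a minimal sketch of the distinction (using a generic decision tree as a stand-in base model; the dataset and split sizes here are illustrative, not those used later in this tutorial), blending fits the base model once and collects its predictions on a holdout split, while stacking collects out-of-fold predictions across the whole training set:
# blending vs. stacking: how the meta-model's training inputs are produced (sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=1000, random_state=1)
# blending: fit the base model once, predict on a small holdout validation split
X_sub, X_val, y_sub, y_val = train_test_split(X, y, test_size=0.1, random_state=1)
base = DecisionTreeClassifier()
base.fit(X_sub, y_sub)
blend_inputs = base.predict(X_val)  # meta-model would train on (blend_inputs, y_val)
# stacking: collect out-of-fold predictions over the full training set via k-fold CV
stack_inputs = cross_val_predict(DecisionTreeClassifier(), X, y, cv=5)
# meta-model would train on (stack_inputs, y)
print(blend_inputs.shape, stack_inputs.shape)  # (100,) and (1000,)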
Next, let's look at how we can implement a blending ensemble.
Developing a Blending Ensemble
At the time of writing, the scikit-learn library does not natively support blending. (Its StackingClassifier and StackingRegressor classes implement stacking with out-of-fold predictions.) However, we can implement it ourselves using scikit-learn models.
First, we need to create a number of base models. These can be any models we like for a regression or classification problem. We can define a function get_models() that returns a list of models, where each model is defined as a tuple with a name and the configured classifier or regressor instance.
For example, for a classification problem, we might use a logistic regression, kNN, decision tree, SVM, and naive Bayes model.
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
Next, we need to fit the blending model. Recall that the base models are fit on a training dataset, and the meta-model is fit on the predictions that each base model makes on a holdout dataset. First, we can enumerate the list of models and fit each in turn on the training dataset. Also in this loop, we can use each fit model to make a prediction on the holdout validation dataset and store the predictions for later use.
...
# fit all models on the training set and predict on hold out set
meta_X = list()
for name, model in models:
    # fit on training set
    model.fit(X_train, y_train)
    # predict on hold out set
    yhat = model.predict(X_val)
    # reshape predictions into a matrix with one column
    yhat = yhat.reshape(len(yhat), 1)
    # store predictions as input for blending
    meta_X.append(yhat)
We now have “meta_X”, which represents the input data that can be used to train the meta-model. Each column or feature represents the output of one base model.
Each row represents one example from the holdout dataset. We can use the hstack() function to ensure this dataset is a 2D NumPy array, as expected by a machine learning model.
...
# create 2d array from predictions, each set is an input feature
meta_X = hstack(meta_X)
We can now train our meta-model. This can be any machine learning model we like, such as logistic regression for classification.
...
# define blending model
blender = LogisticRegression()
# fit on predictions from base models
blender.fit(meta_X, y_val)
We can tie all of this together into a function named fit_ensemble() that trains the blending model using a training dataset and a holdout validation dataset. Because the base models are fit in place, the same list of models can be reused later to make predictions.
The next step is to use the blending ensemble to make predictions on new data. This is a two-step process: first, each base model makes a prediction; then the predictions are gathered together and used as input to the blending model to make the final prediction. We can use the same looping structure as when training the model. That is, we can collect the predictions of each base model into a meta-level dataset, stack the predictions together, and call predict() on the blender model with this meta-level dataset. The predict_ensemble() function below implements this.
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit on training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
Given a list of fit base models, a fit blender ensemble, and a dataset (such as a test dataset or new data), it will return a set of predictions for the dataset.
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
We now have all of the elements needed to implement a blending ensemble for classification or regression predictive modeling problems.
Blending Ensemble for Classification
In this section, we will look at using blending for a classification problem.
First, we can use the make_classification() function to create a synthetic binary classification problem with 10,000 examples and 20 input features.
The complete example is listed below.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the input and output components.
(10000, 20) (10000,)
Next, we need to split the dataset, first into train and test sets, and then split the training set into a subset used to train the base models and a subset used to train the meta-model.
In this case, we will use a 50-50 split for the train and test sets, then a 67-33 split of the training set for the train and validation sets. With 10,000 examples, this gives 3,350 rows for training, 1,650 for validation, and 5,000 for testing.
...
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
We can then use the get_models() function from the previous section to create the classification models used in the ensemble.
The fit_ensemble() function can then be called to fit the blending ensemble on the train and validation datasets, and the predict_ensemble() function can be used to make predictions on the holdout test dataset.
...
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
Finally, we can evaluate the performance of the blending model by reporting the classification accuracy on the test dataset.
...
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))
Tying this together, the complete example of evaluating a blending ensemble on the synthetic binary classification problem is listed below.
# blending ensemble for classification using hard voting
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC()))
    models.append(('bayes', GaussianNB()))
    return models
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit on training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))
Running the example first reports the shape of the train, validation, and test datasets, followed by the classification accuracy of the ensemble on the test dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the blending ensemble achieved a classification accuracy of about 97.900 percent.
Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending Accuracy: 97.900
In the previous example, crisp class label predictions were combined using the blending model. This is a type of hard voting.
An alternative is to have each model predict class probabilities and use the meta-model to blend the probabilities. This is a type of soft voting and can result in better performance in some cases.
First, we must configure the models to return probabilities, such as the SVM model.
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
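Unlike predict(), which returns one crisp label per example, predict_proba() returns one probability column per class. Here is a tiny standalone sketch, using an illustrative dataset rather than the one from this tutorial:
# predict_proba returns an (n_samples, n_classes) array of probabilities (sketch)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=100, random_state=1)
model = LogisticRegression().fit(X, y)
print(model.predict_proba(X[:3]))  # three rows, two columns that each sum to 1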
Next, we must change the base models to predict probabilities instead of crisp class labels.
This can be achieved by calling the predict_proba() function in the fit_ensemble() function when fitting the base models.
...
# fit all models on the training set and predict on hold out set
meta_X = list()
for name, model in models:
    # fit on training set
    model.fit(X_train, y_train)
    # predict on hold out set
    yhat = model.predict_proba(X_val)
    # store predictions as input for blending
    meta_X.append(yhat)
This means that the meta-dataset used to train the meta-model will have n columns per base model, where n is the number of classes in the prediction problem, two in our case; a quick shape check follows below.
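As a sanity check, here is a sketch of the shape we would expect after stacking the probability columns, assuming the five base models and the 1,650-row validation split used in this tutorial:
...
# after stacking the probability predictions column-wise, the meta-dataset
# has (n_models * n_classes) columns
meta_X = hstack(meta_X)
print(meta_X.shape)  # expect (1650, 10): 5 base models x 2 class probabilities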
The predictions of the base models must also be changed to probabilities when using the blending model to make predictions on new data.
...
# make predictions with base models
meta_X = list()
for name, model in models:
    # predict with base model
    yhat = model.predict_proba(X_test)
    # store prediction
    meta_X.append(yhat)
Tying this together, the complete example of blending predicted class probabilities on the synthetic binary classification problem is listed below.
# blending ensemble for classification using soft voting
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit on training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict_proba(X_val)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict_proba(X_test)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))
Running the example first reports the shape of the train, validation, and test datasets, then the accuracy of the ensemble on the test dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that blending the class probabilities improved the classification accuracy to about 98.240 percent.
Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending Accuracy: 98.240
A blending ensemble is only effective if it can outperform any single contributing model.
We can confirm this by evaluating each of the base models in isolation. Each base model can be fit on the entire training dataset (unlike the blending ensemble) and evaluated on the test dataset (just like the blending ensemble).
The example below demonstrates this, evaluating each base model in isolation.
# evaluate base models on the entire training dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# summarize data split
print('Train: %s, Test: %s' % (X_train_full.shape, X_test.shape))
# create the base models
models = get_models()
# evaluate standalone model
for name, model in models:
    # fit the model on the training dataset
    model.fit(X_train_full, y_train_full)
    # make a prediction on the test dataset
    yhat = model.predict(X_test)
    # evaluate the predictions
    score = accuracy_score(y_test, yhat)
    # report the score
    print('>%s Accuracy: %.3f' % (name, score*100))
Running the example first reports the shape of the full train and test datasets, then the accuracy of each base model on the test dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that all of the models perform worse than the blending ensemble.
Interestingly, we can see that the SVM comes very close, with an accuracy of 98.200 percent, compared to 98.240 percent for the blending ensemble.
Train: (5000, 20), Test: (5000, 20)
>lr Accuracy: 87.800
>knn Accuracy: 97.380
>cart Accuracy: 88.200
>svm Accuracy: 98.200
>bayes Accuracy: 87.300
We may choose to use a blending ensemble as our final model.
This involves fitting the ensemble on all available data and using it to make predictions on new examples. Specifically, the full dataset is split into train and validation sets to fit the base models and the meta-model respectively, then the ensemble can be used to make a prediction.
The complete example of making a prediction on new data with a blending ensemble for classification is listed below.
# example of making a prediction with a blending ensemble for classification
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for _, model in models:
        # fit on training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict_proba(X_val)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for _, model in models:
        # predict with base model
        yhat = model.predict_proba(X_test)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s' % (X_train.shape, X_val.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make a prediction on a new row of data
row = [0.30335011, 2.68066314, 2.07794281, 1.15253537, 2.0583897, 2.51936601, 0.67513028, 3.20651939, 1.60345385, 3.68820714, 0.05370913, 1.35804433, 0.42011397, 1.4732839, 2.89997622, 1.61119399, 7.72630965, 2.84089477, 1.83977415, 1.34381989]
yhat = predict_ensemble(models, blender, [row])
# summarize prediction
print('Predicted Class: %d' % (yhat))
Running the example fits the blending ensemble model on the dataset and then uses it to make a prediction on a new row of data, just as we might when using the model in an application.
Train: (6700, 20), Val: (3300, 20)
Predicted Class: 1
Next, let's explore how we might evaluate a blending ensemble for regression.
Blending Ensemble for Regression
In this section, we will look at using blending for a regression problem.
First, we can use the make_regression() function to create a synthetic regression problem with 10,000 examples and 20 input features.
The complete example is listed below.
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the input and output components.
Next, we can define the list of regression models to use as base models. In this case, we will use linear regression, kNN, decision tree, and SVM models.
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models
The fit_ensemble() function used to train the blending ensemble is unchanged from classification, except that the model used for blending must be changed to a regression model.
In this case, we will use a linear regression model.
...
# define blending model
blender = LinearRegression()
Because this is a regression problem, we will evaluate the performance of the model using an error metric, in this case the mean absolute error, or MAE for short.
...
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('Blending MAE: %.3f' % score)
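Under the hood, MAE is simply the average of the absolute differences between the predicted and true values. Here is a minimal sketch of what mean_absolute_error() computes, using made-up numbers:
# mean absolute error in one line (sketch, hypothetical values)
import numpy as np
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(np.mean(np.abs(y_true - y_pred)))  # 0.333...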
Tying this together, the complete example of evaluating a blending ensemble on the synthetic regression predictive modeling problem is listed below.
# evaluate blending ensemble for regression
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit on training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LinearRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('Blending MAE: %.3f' % score)
Running the example first reports the shape of the train, validation, and test datasets, then the MAE of the ensemble on the test dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the blending ensemble achieved a MAE of about 0.237.
Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending MAE: 0.237
As with classification, the blending ensemble is only useful if it performs better than any of the base models that contribute to it.
We can check this by evaluating each of the base models in isolation, first fitting it on the entire training dataset (unlike the blending ensemble) and then making predictions on the test dataset (like the blending ensemble).
The example below evaluates each base model in isolation on the synthetic regression predictive modeling dataset.
# evaluate base models in isolation on the regression dataset
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# summarize data split
print('Train: %s, Test: %s' % (X_train_full.shape, X_test.shape))
# create the base models
models = get_models()
# evaluate standalone model
for name, model in models:
    # fit the model on the training dataset
    model.fit(X_train_full, y_train_full)
    # make a prediction on the test dataset
    yhat = model.predict(X_test)
    # evaluate the predictions
    score = mean_absolute_error(y_test, yhat)
    # report the score
    print('>%s MAE: %.3f' % (name, score))
Running the example first reports the shape of the full train and test datasets, then the MAE of each base model on the test dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the linear regression model performs slightly better than the blending ensemble, achieving a MAE of 0.236 compared to 0.237 for the ensemble. This may be related to how the synthetic dataset was constructed (see the sketch after the results below). Nevertheless, in this case we would choose to use the linear regression model directly on this problem. This highlights the importance of checking the performance of the contributing models before adopting an ensemble model as the final model.
Train: (5000, 20), Test: (5000, 20)
>lr MAE: 0.236
>knn MAE: 100.169
>cart MAE: 133.744
>svm MAE: 138.195
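A likely explanation, and my assumption rather than a claim from the original article, is that make_regression() constructs the target as a linear combination of the informative input features plus Gaussian noise, so an unregularized linear model is almost ideally matched to the problem. A quick sketch that supports this:
# make_regression targets are linear in the inputs, so LinearRegression
# fits them almost perfectly (sketch)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
model = LinearRegression().fit(X, y)
print('R^2: %.5f' % model.score(X, y))  # expect a value very close to 1.0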
Again, we may choose to use a blending ensemble as our final model for regression.
This involves fitting the ensemble on the entire dataset, split into train and validation sets to fit the base models and the meta-model respectively; the ensemble can then be used to make predictions on new rows of data.
The complete example of making a prediction on new data with a blending ensemble for regression is listed below.
# example of making a prediction with a blending ensemble for regression
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for _, model in models:
        # fit on training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LinearRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for _, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s' % (X_train.shape, X_val.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make a prediction on a new row of data
row = [0.24038754, 0.55423865, 0.48979221, 1.56074459, 1.16007611, 1.10049103, 1.18385406, 1.57344162, 0.97862519, 0.03166643, 1.77099821, 1.98645499, 0.86780193, 2.01534177, 2.51509494, 1.04609004, 0.19428148, 0.05967386, 2.67168985, 1.07182911]
yhat = predict_ensemble(models, blender, [row])
# summarize prediction
print('Predicted: %.3f' % (yhat[0]))
Running the example fits the blending ensemble model on the dataset and then uses it to make a prediction on a new row of data, just as we might when using the model in an application.
Train: (6700, 20), Val: (3300, 20)
Predicted: 359.986
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Related Tutorials
Stacking Ensemble Machine Learning With Python
How to Implement Stacked Generalization (Stacking) From Scratch With Python
Papers
Feature-Weighted Linear Stacking, 2009.
The BellKor 2008 Solution to the Netflix Prize, 2008.
Kaggle Ensemble Guide, MLWave, 2015.
Articles
Netflix Prize, Wikipedia.
Summary
In this tutorial, you discovered how to develop and evaluate blending ensembles in Python.
Specifically, you learned:
Blending ensembles are a type of stacking where the meta-model is fit on predictions made on a holdout validation dataset instead of on out-of-fold predictions.
How to develop a blending ensemble, including functions for training the model and making predictions on new data.
How to evaluate blending ensembles for classification and regression predictive modeling problems.
Original title: Blending Ensemble Machine Learning With Python
Original link: https://machinelearningmastery.com/blending-ensemble-machine-learning-with-python/
About the translator: Wang Khan is a direct-track PhD student in the Department of Mechanical Engineering at Tsinghua University. With a background in physics, he developed a strong interest in data science in graduate school and is full of curiosity about machine learning and AI. He looks forward to seeing artificial intelligence collide with mechanical engineering and computational physics on the road of research, and hopes to make friends, share more stories about data science, and look at the world through the lens of data science.