Recommended: Blending Ensemble Machine Learning With Python (link attached)

osc_ 7owgvpdx 2021-01-21 10:13:27


Author: Jason Brownlee    Translation: Wang Khan    Proofreading: wwl

This article is about 7,000 words; recommended reading time: 16 minutes.

This article shows you how to develop and evaluate blending ensembles in Python, and how to use them for classification and regression problems.

Blending is an ensemble machine learning technique.

It is a colloquial name for stacked generalization (stacking) ensembles in which, instead of fitting the meta-model on out-of-fold predictions made by the base models, it is fit on predictions made on a held-out validation dataset.

Blending was used to describe stacking models that combined perhaps hundreds of predictive models by competitors in the $1 million Netflix machine learning prize, and it remains a popular technique and name in competitive machine learning circles such as the Kaggle community.

In this tutorial, you will discover how to develop and evaluate blending ensembles in Python.

After completing this tutorial, you will know:

  • Blending ensembles are a type of stacking where the meta-model is fit using predictions on a holdout validation dataset instead of out-of-fold predictions.

  • How to develop a blending ensemble, including functions for training the model and making predictions on new data.

  • How to evaluate blending ensembles for classification and regression predictive modeling problems.

Let's get started.

Tutorial Overview

This tutorial is divided into four parts; they are:

  • Blending Ensemble

  • Develop a Blending Ensemble

  • Blending Ensemble for Classification

  • Blending Ensemble for Regression

Blending Ensemble

Blending is an ensemble machine learning technique that uses a machine learning model to learn how to best combine the predictions from multiple contributing ensemble member models.

Broadly conceived, blending and stacked generalization (stacking) are the same thing, and the two terms are often used interchangeably within the same paper or model description.

“Many machine learning practitioners have had success using stacking and related techniques to boost prediction accuracy beyond the level obtained by any of the individual models. In some contexts, stacking is also referred to as blending, and we will use the terms interchangeably here.”

— Feature-Weighted Linear Stacking, 2009.

The architecture of a stacking model involves two or more base models (often referred to as level-0 models) and a meta-model that combines the predictions of the base models (referred to as a level-1 model). The meta-model is trained on the predictions made by the base models on out-of-sample data.

  • Level-0 models (base models): Models fit on the training data and whose predictions are combined.

  • Level-1 model (meta-model): Model that learns how to best combine the predictions of the base models.

Nevertheless, blending carries specific connotations for how a stacking ensemble model is constructed.

Blending may suggest developing a stacking ensemble where the base models are machine learning models of any type, and the meta-model is a linear model that “blends” the predictions of the base models.

For example, a linear regression model when predicting a numerical value, or a logistic regression model when predicting a class label, would calculate a weighted sum of the predictions made by the base models and would be considered a blending of the predictions.

  • Blending ensemble: Use of a linear model, such as linear regression or logistic regression, as the meta-model in a stacking ensemble.
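As a small illustration of that idea, here is a self-contained toy sketch (the data and base-model predictions below are made up for illustration and are not part of the tutorial's worked examples) showing that the coefficients learned by a linear meta-model act as the blend weights assigned to each base model's predictions:

# toy sketch: a linear meta-model's coefficients are the blend weights
from numpy import hstack
from numpy.random import RandomState
from sklearn.linear_model import LinearRegression
rng = RandomState(1)
# hypothetical hold-out targets
y_val = rng.rand(100)
# hypothetical predictions from three base models (one column each)
pred_a = (y_val + rng.normal(0, 0.05, 100)).reshape(-1, 1)  # accurate model
pred_b = (y_val + rng.normal(0, 0.20, 100)).reshape(-1, 1)  # noisier model
pred_c = rng.rand(100).reshape(-1, 1)                       # uninformative model
meta_X = hstack((pred_a, pred_b, pred_c))
blender = LinearRegression()
blender.fit(meta_X, y_val)
# the largest weight goes to the most accurate base model
print(blender.coef_)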

Blending was the term commonly used for stacking ensembles during the Netflix prize in 2009. The prize involved teams seeking movie recommendation predictions that performed better than Netflix's own algorithm, and a US$1 million prize was awarded to the team that achieved a 10 percent improvement in performance.

“Our RMSE=0.8643^2 solution is a linear blend of over 100 results. … Throughout the description of the methods, we highlight the specific predictors that participated in the final blended solution.”

— The BellKor 2008 Solution to the Netflix Prize, 2008.

As such, blending is a colloquial term for ensemble learning with a stacking-type architecture. It is rarely, if ever, used in textbooks or academic papers, other than those related to competitive machine learning.

Most commonly, blending is used to describe the specific application of stacking where the meta-model is trained on the predictions made by the base models on a held-out validation dataset. In this context, stacking is reserved for the case where the meta-model is trained on out-of-fold predictions made during a cross-validation procedure.

  • Blending: Stacking-type ensemble where the meta-model is trained on predictions made on a holdout validation dataset.

  • Stacking: Stacking-type ensemble where the meta-model is trained on out-of-fold predictions made during k-fold cross-validation.

This distinction is common among the Kaggle competitive machine learning community.

“Blend is a word introduced by the Netflix winners. It is very close to stacked generalization, but a bit simpler and with less risk of an information leak. … With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The stacker model then trains on this holdout set only.”

— Kaggle Ensemble Guide, MLWave, 2015.

We will use the latter definition of blending here.
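To make the distinction concrete, here is a rough, illustrative sketch (the small synthetic dataset and single decision tree base model below are assumptions made purely for illustration): blending trains the meta-model on predictions made on one held-out split, whereas stacking trains it on out-of-fold predictions gathered during cross-validation, for example with scikit-learn's cross_val_predict().

# illustrative contrast: holdout predictions (blending) vs out-of-fold predictions (stacking)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
base = DecisionTreeClassifier()
# blending: the meta-model would train on predictions for a small holdout set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=1)
base.fit(X_train, y_train)
blend_meta = base.predict(X_val)
# stacking: the meta-model would train on out-of-fold predictions for every training row
stack_meta = cross_val_predict(base, X, y, cv=5)
print(blend_meta.shape, stack_meta.shape)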

Next, let's look at how we can implement blending.

Develop a Blending Ensemble

At the time of writing, the scikit-learn library does not natively support blending. Instead, we can implement it ourselves using scikit-learn models.

First, we need to create a number of base models. These can be any models we like for a regression or classification problem. We can define a function get_models() that returns a list of models, where each model is defined as a tuple with a name and the configured classifier or regressor.

For example, for a classification problem, we might use a logistic regression, kNN, decision tree, support vector machine, and naive Bayes model.

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

Next, we need to fit the blending model. Recall that the base models are fit on a training dataset, and the meta-model is fit on the predictions made by each base model on a held-out (validation) dataset. First, we can enumerate the list of models and fit each in turn on the training dataset. Also in this loop, we can use the fit model to make predictions on the holdout validation dataset and store the predictions for later.

...

# fit all models on the training set and predict on hold out set
meta_X = list()
for name, model in models:
    # fit in training set
    model.fit(X_train, y_train)
    # predict on hold out set
    yhat = model.predict(X_val)
    # reshape predictions into a matrix with one column
    yhat = yhat.reshape(len(yhat), 1)
    # store predictions as input for blending
    meta_X.append(yhat)

We now have “meta_X”, which represents the input data that can be used to train the meta-model. Each column or feature represents the output of one base model.

Each row represents one example from the holdout dataset. We can use the hstack() function to ensure this dataset is a 2D numpy array, as expected by a machine learning model.

...

# create 2d array from predictions, each set is an input feature
meta_X = hstack(meta_X)

We can now train our meta-model. This can be any machine learning model we like, such as logistic regression for classification.

...

# define blending model
blender = LogisticRegression()
# fit on predictions from base models
blender.fit(meta_X, y_val)

We can tie all of this together into a function named fit_ensemble() that trains the blending model using a training dataset and a holdout validation dataset.

The next step is to use the blending ensemble to make predictions on new data. This is a two-step process: first, each base model makes a prediction, then these predictions are gathered together and used as input to the blending model to make the final prediction. We can use the same looping structure as when training the model; that is, we collect the predictions of each base model into a meta-level dataset, stack the predictions together, and call predict() on the blender model with this meta-level dataset. The predict_ensemble() function, shown after fit_ensemble() below, implements this.

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

Given a list of fit base models, the fit blender ensemble, and a dataset (such as a test dataset or new data), it will return a set of predictions for that dataset.

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

We now have all of the elements needed to implement a blending ensemble for classification or regression predictive modeling problems.

Blending Ensemble for Classification

In this section, we will look at using blending for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 10,000 examples and 20 input features.

The complete example is listed below.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.


(10000, 20) (10000,)

Next, we need to split the dataset up: first into train and test sets, and then the training set into a subset used to train the base models and a subset used to train the meta-model.

In this case, we will use a 50-50 split for the train and test sets, then a 67-33 split of the training set for the train and validation sets. With 10,000 rows, that leaves about 3,350 rows for training the base models, 1,650 rows for the validation set, and 5,000 rows for the test set.

...

# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))

We can then use the get_models() function from the previous section to create the classification models used in the ensemble.

The fit_ensemble() function can then be called to fit the blending ensemble on the train and validation datasets, and the predict_ensemble() function can be used to make predictions on the holdout test dataset.

...
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)

Finally, we can evaluate the performance of the blending model by reporting the classification accuracy on the test dataset.

...

# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))

Tying this all together, the complete example of evaluating a blending ensemble on the synthetic binary classification problem is listed below.



# blending ensemble for classification using hard voting
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC()))
    models.append(('bayes', GaussianNB()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))

Running the example first reports the shape of the train, validation, and test datasets, then the accuracy of the ensemble on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the blending ensemble achieved a classification accuracy of about 97.900%.

Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending Accuracy: 97.900

In the previous example, crisp class label predictions were combined using the blending model. This is a type of hard voting.

An alternative is to have each base model predict class probabilities and use the meta-model to blend the probabilities. This is a type of soft voting and can result in better performance in some cases.

First, we must configure the models to return probabilities, such as the SVM model.

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

Next, we must change the base models to predict probabilities instead of crisp class labels.

This can be achieved by calling the predict_proba() function when fitting the base models in the fit_ensemble() function.

...

# fit all models on the training set and predict on hold out set
meta_X = list()
for name, model in models:
    # fit in training set
    model.fit(X_train, y_train)
    # predict on hold out set
    yhat = model.predict_proba(X_val)
    # store predictions as input for blending
    meta_X.append(yhat)

This means that the meta dataset used to train the meta-model will have n columns per base model, where n is the number of classes in the prediction problem (two in our case), giving 5 * 2 = 10 columns in total for our five base models.
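As a quick illustrative check of that shape (the numbers below simply mirror this tutorial's setup and are not produced by the examples above): five base models, each contributing two class-probability columns, yield a meta dataset with 10 input features.

# illustrative shape check for the probability-based meta dataset
from numpy import hstack, zeros
n_val, n_models, n_classes = 1650, 5, 2
fake_probs = [zeros((n_val, n_classes)) for _ in range(n_models)]
meta_X = hstack(fake_probs)
print(meta_X.shape)  # (1650, 10)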

The predictions made by the base models must also be changed when using the blending model to make predictions on new data.

...

# make predictions with base models
meta_X = list()
for name, model in models:
# predict with base model
yhat = model.predict_proba(X_test)
# store prediction
meta_X.append(yhat)

Tying this together, the complete example of using blending on predicted class probabilities for the synthetic binary classification problem is listed below.



# blending ensemble for classification using soft voting
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict_proba(X_val)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict_proba(X_test)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))

Running the example first reports the shape of the train, validation, and test datasets, then the accuracy of the ensemble on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that blending class probabilities resulted in a lift in classification accuracy to about 98.240%.

Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending Accuracy: 98.240

A blending ensemble is only effective if it can out-perform any single contributing model.

We can confirm this by evaluating each of the base models in isolation. Each base model can be fit on the entire training dataset (unlike the blending ensemble) and evaluated on the test dataset (just like the blending ensemble).

The example below demonstrates this, evaluating each of the base models in isolation.

# evaluate base models on the entire training dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# summarize data split
print('Train: %s, Test: %s' % (X_train_full.shape, X_test.shape))
# create the base models
models = get_models()
# evaluate standalone model
for name, model in models:
    # fit the model on the training dataset
    model.fit(X_train_full, y_train_full)
    # make a prediction on the test dataset
    yhat = model.predict(X_test)
    # evaluate the predictions
    score = accuracy_score(y_test, yhat)
    # report the score
    print('>%s Accuracy: %.3f' % (name, score*100))

Running the example first reports the shape of the full train and test datasets, then the accuracy of each base model on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that all of the models perform worse than the blending ensemble.

Interestingly, we can see that the SVM comes very close, achieving an accuracy of 98.200% compared to 98.240% for the blending ensemble.

Train: (5000, 20), Test: (5000, 20)
>lr Accuracy: 87.800
>knn Accuracy: 97.380
>cart Accuracy: 88.200
>svm Accuracy: 98.200
>bayes Accuracy: 87.300

We may choose to use a blending ensemble as our final model.

This involves fitting the ensemble on all available data and using it to make predictions on new examples. Specifically, the whole dataset is split into train and validation sets to fit the base models and the meta-model respectively, then the ensemble can be used to make a prediction.

The complete example of making a prediction on new data with a blending ensemble for classification is listed below.

# example of making a prediction with a blending ensemble for classification
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for _, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict_proba(X_val)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for _, model in models:
        # predict with base model
        yhat = model.predict_proba(X_test)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s' % (X_train.shape, X_val.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make a prediction on a new row of data
row = [-0.30335011, 2.68066314, 2.07794281, 1.15253537, -2.0583897, -2.51936601, 0.67513028, -3.20651939, -1.60345385, 3.68820714, 0.05370913, 1.35804433, 0.42011397, 1.4732839, 2.89997622, 1.61119399, 7.72630965, -2.84089477, -1.83977415, 1.34381989]
yhat = predict_ensemble(models, blender, [row])
# summarize prediction
print('Predicted Class: %d' % (yhat))

Running the example fits the blending ensemble model on the dataset, which is then used to make a prediction on a new row of data, just as we might when using the model in an application.

Train: (6700, 20), Val: (3300, 20)
Predicted Class: 1

Next, let's explore how we might evaluate a blending ensemble for regression.

Blending Ensemble for Regression

In this section, we will look at using blending for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 10,000 examples and 20 input features.

The complete example is listed below.

# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can define the list of regression models to use as base models. In this case, we will use linear regression, kNN, decision tree, and SVM models.

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models

The fit_ensemble() function used to train the blending ensemble is unchanged from classification, except that the model used for blending must be changed to a regression model.

In this case, we will use a linear regression model.

...

# define blending model
blender = LinearRegression()

Given that it is a regression problem, we will evaluate the performance of the model using an error metric, in this case the mean absolute error, or MAE for short.

...

# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('Blending MAE: %.3f' % score)
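As a reminder of what this metric measures (the values below are made up purely for illustration), MAE is simply the average of the absolute differences between predicted and true values:

# illustrative check: MAE is the mean of the absolute errors
from numpy import abs, array, mean
from sklearn.metrics import mean_absolute_error
y_true = array([3.0, -0.5, 2.0, 7.0])
y_pred = array([2.5, 0.0, 2.0, 8.0])
print(mean(abs(y_true - y_pred)))           # 0.5
print(mean_absolute_error(y_true, y_pred))  # 0.5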

Tying this together, the complete example of evaluating a blending ensemble on the synthetic regression predictive modeling problem is listed below.



# evaluate blending ensemble for regression
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LinearRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('Blending MAE: %.3f' % score)

Running the example first reports the shape of the train, validation, and test datasets, then the MAE of the ensemble on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the blending ensemble achieved a MAE of about 0.237 on the test dataset.

Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending MAE: 0.237

As with classification, the blending ensemble is only useful if it performs better than any of the base models that contribute to the ensemble.

We can check this by evaluating each of the base models in isolation, first fitting each on the entire training dataset (unlike the blending ensemble) and then making predictions on the test dataset (just like the blending ensemble).

The example below evaluates each of the base models in isolation on the synthetic regression predictive modeling dataset.



# evaluate base models in isolation on the regression dataset
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models
# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# summarize data split
print('Train: %s, Test: %s' % (X_train_full.shape, X_test.shape))
# create the base models
models = get_models()
# evaluate standalone model
for name, model in models:
    # fit the model on the training dataset
    model.fit(X_train_full, y_train_full)
    # make a prediction on the test dataset
    yhat = model.predict(X_test)
    # evaluate the predictions
    score = mean_absolute_error(y_test, yhat)
    # report the score
    print('>%s MAE: %.3f' % (name, score))

Running the example first reports the shape of the full train and test datasets, then the MAE of each base model on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the linear regression model performs slightly better than the blending ensemble, achieving a MAE of 0.236 compared to 0.237 for the ensemble. This may be related to the way the synthetic dataset was constructed.

Nevertheless, in this case we would choose to use the linear regression model directly on this problem. This highlights the importance of checking the performance of the contributing models before adopting an ensemble model as the final model.

Train: (5000, 20), Test: (5000, 20)
>lr MAE: 0.236
>knn MAE: 100.169
>cart MAE: 133.744
>svm MAE: 138.195

Again, we may choose to use a blending ensemble as our final model for regression.

This involves fitting the ensemble on all available data: the dataset is split into train and validation sets to fit the base models and the meta-model respectively, then the ensemble can be used to make a prediction on a new row of data.

The complete example of making a prediction on new data with a blending ensemble for regression is listed below.

# example of making a prediction with a blending ensemble for regression
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for _, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LinearRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for _, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# define dataset
X, y = get_dataset()
# split dataset set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s' % (X_train.shape, X_val.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make a prediction on a new row of data
row = [-0.24038754, 0.55423865, -0.48979221, 1.56074459, -1.16007611, 1.10049103, 1.18385406, -1.57344162, 0.97862519, -0.03166643, 1.77099821, 1.98645499, 0.86780193, 2.01534177, 2.51509494, -1.04609004, -0.19428148, -0.05967386, -2.67168985, 1.07182911]
yhat = predict_ensemble(models, blender, [row])
# summarize prediction
print('Predicted: %.3f' % (yhat[0]))

Running the example fits the blending ensemble model on the dataset, which is then used to make a prediction on a new row of data, just as we might when using the model in an application.

Train: (6700, 20), Val: (3300, 20)
Predicted: 359.986

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Stacking Ensemble Machine Learning With Python

How to Implement Stacked Generalization (Stacking) From Scratch With Python

Papers

Feature-Weighted Linear Stacking, 2009.

The BellKor 2008 Solution to the Netflix Prize, 2008.

Kaggle Ensemble Guide, MLWave, 2015.

Articles

Netflix Prize, Wikipedia.

Summary

In this tutorial, you discovered how to develop and evaluate a blending ensemble in Python.

Specifically, you learned:

Blending ensembles are a type of stacking where the meta-model is fit using predictions on a holdout validation dataset instead of out-of-fold predictions.

How to develop a blending ensemble, including functions for training the model and making predictions on new data.

How to evaluate blending ensembles for classification and regression predictive modeling problems.

Original title:

Blending Ensemble Machine Learning With Python

Original link:

https://machinelearningmastery.com/blending-ensemble-machine-learning-with-python/

About the translator: Wang Khan is a direct-entry PhD student in the Department of Mechanical Engineering at Tsinghua University. Coming from a physics background, he developed a strong interest in data science during graduate study and is full of curiosity about machine learning and AI. Along the road of research, he looks forward to seeing artificial intelligence collide with mechanical engineering and computational physics to create new sparks, and hopes to make like-minded friends, share more stories about data science, and look at the world with a data science mindset.

END


Copyright notice
This article was created by [osc_ 7owgvpdx]. Please include the original link when reposting. Thank you.
https://pythonmana.com/2021/01/20210121095532241s.html
