Related articles:
R+Python | XGBoost extreme gradient boosting, a double case study with forecastxgb (forecasting) + xgboost (regression)
Python | Assorted sklearn tips (training-set splitting, pipelines, cross-validation, etc.)
XGBoost and LightGBM both evolved from the same GBDT vine.
Some good practice code:
Differences and connections between the GBDT model, XGBoost, and LightGBM
After pre-sorting, XGBoost has to store each feature's sorted values together with an index into the corresponding samples' statistics. LightGBM's histogram algorithm instead converts feature values into bin values and needs no feature-to-sample index at all, cutting the space complexity from O(2 × #data) down to O(#bin) and greatly reducing memory consumption;
LightGBM's histogram algorithm stores bin values rather than raw feature values, reducing memory consumption;
During training, LightGBM's Exclusive Feature Bundling (EFB) algorithm bundles mutually exclusive features to reduce the feature count, reducing memory consumption.
LightGBM's histogram algorithm turns traversing the samples into traversing the histogram bins, greatly reducing the time complexity (a toy sketch of histogram-based split finding follows this list);
During training, LightGBM's Gradient-based One-Side Sampling (GOSS) filters out samples with small gradients, saving a large amount of computation;
LightGBM grows trees with the leaf-wise strategy, avoiding a large amount of unnecessary computation;
LightGBM accelerates training with optimized feature-parallel and data-parallel schemes, and can switch to voting-based parallelism when the data volume is very large;
LightGBM also optimizes the cache, improving the cache hit rate.
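The following is a minimal, self-contained sketch of the idea behind histogram-based split finding. All names are invented for illustration, not LightGBM internals; the gain formula is the usual second-order GBDT split gain:

import numpy as np

def best_split_histogram(feature, grad, hess, max_bin=255, reg_lambda=1.0):
    # 1) bucket the raw feature values into at most max_bin bins (quantile-style here)
    edges = np.unique(np.quantile(feature, np.linspace(0, 1, max_bin + 1)[1:-1]))
    bins = np.digitize(feature, edges)                # bin index per sample

    # 2) accumulate gradient/hessian sums per bin: O(#data) once, after which
    #    the split search scans O(#bin) entries instead of O(#data)
    n_bins = bins.max() + 1
    G = np.bincount(bins, weights=grad, minlength=n_bins)
    H = np.bincount(bins, weights=hess, minlength=n_bins)

    G_tot, H_tot = G.sum(), H.sum()
    best_gain, best_bin, G_l, H_l = 0.0, None, 0.0, 0.0
    for b in range(n_bins - 1):                       # scan the histogram
        G_l += G[b]; H_l += H[b]
        G_r, H_r = G_tot - G_l, H_tot - H_l
        gain = (G_l ** 2 / (H_l + reg_lambda) + G_r ** 2 / (H_r + reg_lambda)
                - G_tot ** 2 / (H_tot + reg_lambda))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
g = x + rng.normal(scale=0.1, size=1000)              # fake gradients, correlated with x
h = np.ones(1000)                                     # fake hessians
print(best_split_histogram(x, g, h))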
Those are LightGBM's comparative advantages.
Disadvantages:
1) Leaf-wise growth may produce much deeper decision trees and hence over-fitting; LightGBM therefore adds a maximum depth limit on top of leaf-wise growth, preventing over-fitting while preserving the efficiency gain.
2) Boosting algorithms work by reducing bias, which makes them comparatively sensitive to noise.
3) When searching for the best split, it relies on the single best split feature at each node and does not consider that the optimal solution might be a combination of all features.
Level-wise and Leaf-wise
Level-wise:
In XGBoost, trees grow level by level, so-called level-wise tree growth: all nodes on the same level are split, and pruning is done at the end.
Leaf-wise:
On top of the histogram algorithm, LightGBM optimizes further with the more efficient leaf-wise strategy: each round, among all current leaves, find the one whose split yields the largest gain, split it, and repeat.
On top of leaf-wise growth, LightGBM adds a maximum depth limit, preventing over-fitting while preserving the efficiency gain. A toy contrast of the two strategies follows.
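This toy sketch imitates leaf-wise growth with split gains faked by a random generator, purely so it stays self-contained; it is not LightGBM's actual implementation:

import heapq, random

random.seed(0)
fake_gain = lambda leaf: random.random()        # stand-in for the real best-split gain

def grow_leaf_wise(num_leaves):
    # the heap always holds the current leaves, keyed by negative gain
    # so the most profitable leaf pops first
    heap, next_id, leaves = [(-fake_gain(0), 0)], 1, 1
    while leaves < num_leaves:
        _, leaf = heapq.heappop(heap)           # leaf whose split gains the most
        for child in (next_id, next_id + 1):    # split it into two children
            heapq.heappush(heap, (-fake_gain(child), child))
        next_id, leaves = next_id + 2, leaves + 1   # one split adds one net leaf
    return sorted(leaf for _, leaf in heap)

print(grow_leaf_wise(8))   # ids of the 8 leaves a leaf-wise tree ends with
# a level-wise tree would instead split every leaf of each level, high gain or not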
In fact, most machine learning tools cannot handle categorical features directly; the categorical features usually have to be converted into multi-dimensional one-hot encodings, which hurts both space and time efficiency.
Yet categorical features are very common in practice.
With this in mind, LightGBM has optimized support for categorical features: they can be fed in directly, with no extra one-hot expansion, and the decision tree algorithm is extended with split rules for categorical features.
In experiments on the Expo dataset, compared with 0/1 expansion, training runs about 8 times faster at the same accuracy.
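A minimal sketch of feeding a categorical column straight into LightGBM; the DataFrame and the column name 'city' are made up for illustration:

import numpy as np
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({
    'city': pd.Categorical(['bj', 'sh', 'sz'] * 100),   # categorical feature, no one-hot needed
    'x1': np.random.rand(300),
})
y = np.random.randint(0, 2, 300)

dtrain = lgb.Dataset(df, label=y, categorical_feature=['city'])
booster = lgb.train({'objective': 'binary', 'verbosity': -1}, dtrain, num_boost_round=10)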
The parameters can be tuned toward three goals: faster speed, better accuracy, and dealing with over-fitting; the official parameters-tuning guide tabulates which parameters to adjust for each goal.
Chinese documentation:
https://lightgbm.apachecn.org/#/docs/3
Dependencies:
pip install setuptools wheel numpy scipy scikit-learn -U
pip install lightgbm
To verify that the installation succeeded, try importing lightgbm in Python:
import lightgbm as lgb
Binary classification
Reference: [lightgbm, xgboost, nn code collection 1] LightGBM for binary, multi-class, and regression tasks (with Python source code)
Official parameter documentation
Choice of parameters:
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# train_x, train_y, test, and the feature-name list `features` are assumed to
# be prepared beforehand; num_round is the maximum number of boosting rounds
num_round = 1000

params = {'num_leaves': 60,  # strongly affects the result; larger often scores better but over-fits easily
          'min_data_in_leaf': 30,
          'objective': 'binary',  # binary-classification objective
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,  # fraction of features sampled per tree
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,  # L1 regularization
          # 'lambda_l2': 0.001,  # L2 regularization
          "verbosity": -1,
          "nthread": -1,  # number of threads; -1 means all threads (more threads, faster runs)
          'metric': {'binary_logloss', 'auc'},  # binary-classification evaluation metrics
          "random_state": 2019,  # random seed, keeps runs reproducible
          # 'device': 'gpu'  # if the GPU build of lightgbm is installed, this speeds up training
          }
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    # record per-fold feature importance
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    # average the test predictions over the folds
    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

# turn averaged probabilities into 0/1 labels
threshold = 0.5
result = (test_pred_prob > threshold).astype(int)
The objective is 'binary' and the evaluation metrics are {'binary_logloss', 'auc'}; the metrics can be adjusted as needed, and you may set one or several of them. 'num_leaves' has a large impact on the final result: set it too high and over-fitting appears.
Two 5-fold splitters are commonly used: StratifiedKFold and KFold. The biggest difference is that StratifiedKFold splits with stratified sampling, ensuring the class proportions in the training and validation folds match those of the original dataset; in practice you can benchmark both on your specific data, as in the small comparison below.
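A self-contained comparison on an imbalanced toy label vector (10% positives): StratifiedKFold keeps every fold at the global positive ratio, while KFold may not.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

for name, cv in [('KFold', KFold(n_splits=5, shuffle=True, random_state=2019)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5, shuffle=True, random_state=2019))]:
    ratios = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]
    print(name, np.round(ratios, 2))   # StratifiedKFold prints 0.1 for every fold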
Multi-class classification
Reference: [lightgbm, xgboost, nn code collection 1] LightGBM for binary, multi-class, and regression tasks (with Python source code)
Official parameter documentation
params = {'num_leaves': 60,
          'min_data_in_leaf': 30,
          'objective': 'multiclass',  # note: multi-class objective
          'num_class': 33,  # note: the number of classes must be set for multi-class
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 15,
          'metric': 'multi_logloss',  # note: multi-class metric
          "random_state": 2019,
          # 'device': 'gpu'
          }
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], 33))        # out-of-fold class probabilities
test_pred_prob = np.zeros((test.shape[0], 33))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

# the predicted class is the one with the highest averaged probability
result = np.argmax(test_pred_prob, axis=1)
Regression
Reference: [lightgbm, xgboost, nn code collection 1] LightGBM for binary, multi-class, and regression tasks (with Python source code)
Official parameter documentation
params = {'num_leaves': 38,
          'min_data_in_leaf': 50,
          'objective': 'regression',  # regression objective
          'max_depth': -1,
          'learning_rate': 0.02,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.7,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 4,
          'metric': 'mae',  # regression metric
          "random_state": 2019,
          # 'device': 'gpu'
          }
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape_func(preds, dtrain):
    # custom SMAPE metric for lgb.train (pass via feval); get_label() returns a numpy array
    label = dtrain.get_label()
    epsilon = 0.1
    summ = np.maximum(0.5 + epsilon, np.abs(label) + np.abs(preds) + epsilon)
    smape = np.mean(np.abs(label - preds) / summ) * 2
    return 'smape', float(smape), False  # (name, value, is_higher_better)
from sklearn.metrics import mean_squared_error, mean_absolute_error

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof = np.zeros(train_x.shape[0])
predictions = np.zeros(test.shape[0])

train_y = np.log1p(train_y)  # log-smooth the target
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

# metrics are computed on the log scale here
print('mse %.6f' % mean_squared_error(train_y, oof))
print('mae %.6f' % mean_absolute_error(train_y, oof))

result = np.expm1(predictions)  # invert the log1p transform
# result = predictions          # use this instead if no log1p transform was applied
For this regression task we added a log-smoothing step (log1p) to the target; when the values to be predicted span a wide range, log smoothing often improves the results noticeably.
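A quick sanity check that expm1 exactly inverts the log1p transform used above:

import numpy as np
y = np.array([0.0, 10.0, 1e6])
assert np.allclose(np.expm1(np.log1p(y)), y)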
LightGBM_Regression_pm25 Case study
import lightgbm as lgbm
from sklearn.metrics import mean_squared_error as mse

def Train(data, modelcount, censhu, yanzhgdata):
    # modelcount = number of trees (n_estimators), censhu = maximum tree depth
    model = lgbm.LGBMRegressor(boosting_type='gbdt', objective='regression', num_leaves=1200,
                               learning_rate=0.17, n_estimators=modelcount, max_depth=censhu,
                               metric='rmse', bagging_fraction=0.8, feature_fraction=0.8, reg_lambda=0.9)
    model.fit(data[:, :-1], data[:, -1])

    # predictions on the training data
    train_out = model.predict(data[:, :-1])
    # training MSE
    train_mse = mse(data[:, -1], train_out)

    # predictions on the validation data
    add_yan = model.predict(yanzhgdata[:, :-1])
    # validation MSE
    add_mse = mse(yanzhgdata[:, -1], add_yan)
    print(train_mse, add_mse)
    return train_mse, add_mse
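A hypothetical call to Train(); the arrays and shapes here are made up purely for illustration. The last column of each array is treated as the target, matching the function body:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 11))         # 10 features + 1 target column
yanzhgdata = rng.random((100, 11))   # validation split in the same layout
train_mse, val_mse = Train(data, modelcount=100, censhu=6, yanzhgdata=yanzhgdata)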
Reference: Hands-on! LightGBM algorithm principles, training, and prediction
A native Spark version of LightGBM is integrated into Microsoft's open-source project MMLSpark (Microsoft Machine Learning for Apache Spark), Microsoft's implementation for the Apache Spark big-data framework, developed on top of the Microsoft Cognitive Toolkit (previously named CNTK). Because mmlspark bundles a large number of machine learning and deep learning algorithms, pulling it in as a maven dependency makes the packaged jar enormous (400 MB+); mmlspark therefore had to be stripped down to keep only the LightGBM algorithm (classification and regression are supported) and recompiled.
While developing the prediction code the author stepped into plenty of pits and shed bitter tears. Several ways of serving predictions were tried, including a PMML solution, MMLSpark's native prediction solution, and a prediction solution re-implemented in Java. The Java re-implementation was chosen in the end; the first two were abandoned for these reasons:
1. The PMML solution shows a certain scoring error, and its scoring latency could not meet the current business requirements.
2. The code in MMLSpark's native prediction solution depends on the underlying C++ dynamic link library and still leaves room for optimization; scoring takes a long time (every scoring call re-initializes some data objects that the C++ layer depends on).
https://www.zhihu.com/question/344433472/answer/959927756
LightGBM is used at every big company; the "rarely used" you describe probably refers to online models? As far as I know, only Meituan and Alibaba have some online models that use a modified LightGBM for ranking, combined with a pair-wise loss. Its most common use is for offline modelling, because although the original LightGBM uses cache acceleration and histogram approximation and needs no pre-sorted storage, it does not scale out.
In other words, running LightGBM directly on extremely large datasets is unwise; no company would use it that way.
It is used far more to quickly validate whether the data and the idea are correct and feasible: many teams first run LightGBM over a small dataset, and only move on to deep models and algorithmic improvements once the effect looks right.
Finally, although LightGBM supports categorical variables directly and can even output the binning, feature engineering still matters a great deal, and parameter tuning also takes time. Since this is not an innovative application, naturally no company promotes it deliberately.
Author: Turing's cat
Link: https://www.zhihu.com/question/344433472/answer/959927756
Source: Zhihu
Copyright belongs to the author. For commercial reuse please contact the author for authorization; for non-commercial reuse please credit the source.
Reference: https://lightgbm.apachecn.org/#/docs/5
Below are two rows from the MSLR-WEB10K dataset (the first column is the graded relevance label, qid is the query id, and the rest are feature_index:value pairs):
0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0
2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0
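The rows above are in svmlight/ranklib format; here is a sketch of parsing them with scikit-learn, feeding the two sample lines through an in-memory buffer so it stays self-contained:

from io import BytesIO
import numpy as np
from sklearn.datasets import load_svmlight_file

rows = (b"0 qid:1 1:3 2:0 3:2 4:2 135:0 136:0\n"
        b"2 qid:1 1:3 2:3 3:0 4:0 135:0 136:0\n")
X, y, qid = load_svmlight_file(BytesIO(rows), query_id=True)
# docs per query (assumes consecutive qids starting at 1) -> LightGBM's `group`
group = np.bincount(qid.astype(int))[1:]
print(X.shape, y, group)               # (2, 136) [0. 2.] [2]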
Using LightGBM for ranking
jiangnanboy/learning_to_rank
1.raw_train.txt
0 qid:10002 1:0.007477 2:0.000000 ... 45:0.000000 46:0.007042 #docid = GX008-86-4444840 inc = 1 prob = 0.086622
0 qid:10002 1:0.603738 2:0.000000 ... 45:0.333333 46:1.000000 #docid = GX037-06-11625428 inc = 0.0031586555555558 prob = 0.0897452 ...
Model parameters:
params = {
    'task': 'train',              # the task to perform
    'boosting_type': 'gbrt',      # base learner ('gbrt' is an alias of 'gbdt')
    'objective': 'lambdarank',    # ranking objective
    'metric': 'ndcg',             # evaluation metric
    'max_position': 10,           # NDCG position to optimize
    'metric_freq': 1,             # output the metric every this many iterations
    'train_metric': True,         # also report the metric on the training set
    'ndcg_at': [10],
    'max_bin': 255,               # integer, maximum number of bins (default 255); lightgbm compresses
                                  # memory accordingly, e.g. with max_bin=255 each feature value fits in a uint8
    'num_iterations': 200,        # number of iterations, i.e. number of trees grown
    'learning_rate': 0.01,        # learning rate
    'num_leaves': 31,             # number of leaves
    'max_depth': 6,
    'tree_learner': 'serial',     # for parallel learning; 'serial' = single-machine tree learner
    'min_data_in_leaf': 30,       # minimum number of samples in a leaf node
    'verbose': 2                  # print training information
}
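A minimal lambdarank training sketch with the sklearn-style LGBMRanker API on made-up data; `group` lists the number of documents per query, in order:

import numpy as np
import lightgbm as lgb

X = np.random.rand(30, 5)             # 30 documents, 5 features
y = np.random.randint(0, 3, 30)       # graded relevance labels 0-2
group = [10, 10, 10]                  # three queries with 10 documents each

ranker = lgb.LGBMRanker(objective='lambdarank', metric='ndcg', n_estimators=50)
ranker.fit(X, y, group=group)
scores = ranker.predict(X[:10])       # relevance scores for the first query's documents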
Error 1:
LightGBMError: Do not support non-ASCII characters in feature name
(This occurs when column names contain non-ASCII characters such as Chinese; rename the columns to ASCII before building the Dataset.)
Error 2:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields xxxx
(LightGBM only accepts int, float, or bool columns; convert object-dtype columns to numeric values, or cast them to the pandas category dtype.)
After hitting these errors, it later turned out that the lightgbm version should be rolled back to 2.2.3; a sketch of data-side workarounds follows.
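Hedged sketches of data-side workarounds for the two errors above, assuming the training features live in a pandas DataFrame (a toy frame is used here):

import pandas as pd

df = pd.DataFrame({'城市': ['bj', 'sh'], 'x1': [0.1, 0.2]})   # non-ASCII column name

# Error 1: replace non-ASCII feature names with ASCII placeholders
df.columns = ['f{}'.format(i) for i in range(df.shape[1])]

# Error 2: cast object-dtype columns to `category` so LightGBM accepts them
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')
print(df.dtypes)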