Practice of Python machine learning: LightGBM

Understanding oneself 2020-11-13 10:09:10


Related articles:
R + Python | XGBoost extreme gradient boosting and forecasting: xgb (forecasting) + xgboost (regression), a double case study
Python | sklearn tips and notes (training-set splitting / pipeline / cross-validation, etc.)

GBDT came first; XGBoost and LightGBM evolved from it.
Some good practice code follows.



0 Related theory

Differences and connections between the GBDT model, XGBoost, and LightGBM

0.1 Lower memory usage

XGBoost's pre-sorting requires storing, for each feature value, an index back to the statistics of the corresponding sample, while LightGBM's histogram algorithm converts feature values into discrete bin values and needs no feature-to-sample indices, reducing the space complexity from [formula] to [formula] and greatly cutting memory consumption;
LightGBM stores the binned values instead of the raw feature values, further reducing memory;
LightGBM's Exclusive Feature Bundling algorithm merges mutually exclusive features during training, reducing the number of features and therefore the memory footprint.
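The core of the histogram trick is discretizing each continuous feature into a small number of integer bins. A minimal NumPy sketch (the data and equal-width binning are illustrative; LightGBM actually chooses smarter, frequency-aware bin boundaries):

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)          # one continuous feature, 1000 samples

# Discretize into at most 255 bins (LightGBM's default max_bin is 255).
max_bin = 255
edges = np.linspace(feature.min(), feature.max(), max_bin + 1)
bins = np.clip(np.digitize(feature, edges) - 1, 0, max_bin - 1).astype(np.uint8)

# Each value is now a uint8 bin id instead of a float64: 1 byte vs 8 bytes.
print(bins.dtype, bins.min(), bins.max())
```

Split finding then only has to scan the 255 bin statistics per feature instead of all sorted sample values.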

0.2 Faster training

LightGBM's histogram algorithm replaces traversing every sample with traversing histogram bins, greatly reducing the time complexity;
LightGBM's Gradient-based One-Side Sampling (GOSS) filters out samples with small gradients during training, saving a large amount of computation;
LightGBM grows trees with a leaf-wise strategy, avoiding a lot of unnecessary computation;
LightGBM accelerates computation with optimized feature-parallel and data-parallel schemes, and can switch to voting parallelism when the data is very large;
LightGBM is also optimized for cache locality, improving the cache hit rate.
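A rough sketch of the GOSS idea: keep the top-`a` fraction of samples by gradient magnitude, randomly sample a `b` fraction of the rest, and up-weight the sampled small-gradient rows so the gradient sum stays approximately unbiased. The `a` and `b` values and data below are illustrative, not LightGBM defaults:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Return (kept indices, per-sample weights) following the GOSS scheme."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_k = int(a * n)
    rand_k = int(b * n)
    top_idx = order[:top_k]                  # always keep large-gradient samples
    rand_idx = rng.choice(order[top_k:], size=rand_k, replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    # Up-weight the sampled small-gradient rows by (1 - a) / b.
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))   # 200 top-gradient + 100 sampled = 300 samples kept
```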

Comparative advantages:

  • Faster training: reportedly up to 16× faster than XGBoost, with roughly 1/6 of its memory usage
  • Low memory consumption
  • Better accuracy (in my experience, not much different from XGBoost)
  • Support for parallel learning
  • Handles large-scale data

Shortcomings:
1) Leaf-wise growth may produce very deep trees and overfit; LightGBM therefore adds a maximum-depth limit on top of leaf-wise growth, keeping efficiency while preventing overfitting
2) As a bias-reduction (boosting) algorithm, it is more sensitive to noise
3) When searching for the best split it optimizes one variable at a time and does not consider combinations of all features jointly

    1. Tree-growth strategies differ: XGB is level-wise, LGB is leaf-wise. Level-wise treats every leaf on the current level the same, so leaves with tiny split gains still get split, wasting computation. Leaf-wise is more accurate but overfits easily, so the maximum tree depth must be controlled.
    2. Split-point selection differs: XGB pre-sorts feature values, which is space-hungry; LGB uses the histogram algorithm, needs no pre-sorting, and uses less memory.
    3. Parallelization strategies differ: XGB focuses mainly on feature parallelism, while LGB supports feature parallelism, data parallelism, and voting parallelism.

Level-wise and Leaf-wise

Level-wise:

In XGBoost, trees grow level by level (level-wise tree growth): all nodes on the same level are split, and pruning happens afterwards.

On top of the histogram algorithm, LightGBM adds a further optimization: the more efficient leaf-wise strategy. Each round it picks, among all current leaves, the one with the largest split gain, splits it, and repeats.

LightGBM adds a maximum-depth limit on top of leaf-wise growth, keeping efficiency while preventing overfitting.

0.3 Native support for categorical features (no one-hot encoding needed)

Most machine-learning tools cannot consume categorical features directly; the usual workaround is to expand them into multidimensional one-hot encodings, which wastes both space and time, even though categorical features are very common in practice.
With this in mind, LightGBM supports categorical features natively: they can be fed in directly, without one-hot expansion, and the decision-tree algorithm includes split rules specific to categorical features.
In experiments on the Expo dataset, compared with 0/1 expansion, training is about 8× faster with the same accuracy.
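In the Python API this usually means converting object columns to the pandas `category` dtype (or passing `categorical_feature=` to `lgb.Dataset`). A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["beijing", "shanghai", "beijing", "shenzhen"],
    "price": [1.0, 2.5, 1.7, 3.2],
})

# LightGBM consumes pandas 'category' columns directly -- no one-hot needed.
df["city"] = df["city"].astype("category")
print(df["city"].cat.codes.tolist())   # internal integer codes: [0, 1, 0, 2]

# With lightgbm installed, this is then enough:
# import lightgbm as lgb
# dtrain = lgb.Dataset(df[["city"]], label=df["price"])
# (or pass categorical_feature=["city"] explicitly)
```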

0.4 LightGBM parameter tuning

LightGBM practical summary (the parameter-tuning tables were provided as images in the original post)

0.5 Tuning experience

LightGBM practical summary

The tables (provided as images in the original post) list the adjustable parameters for three goals: faster speed, better accuracy, and controlling over-fitting.

0.6 Installation

Chinese documentation:
https://lightgbm.apachecn.org/#/docs/3
Dependencies:

pip install setuptools wheel numpy scipy scikit-learn -U
pip install lightgbm

To verify that the installation succeeded, try importing lightgbm in Python:

import lightgbm as lgb

1 Binary classification: parameter selection

[lightgbm, xgboost, nn code notes 1] Using lightgbm for binary classification, multi-class classification, and regression tasks (with Python source code)
Official parameter documentation

Choice of parameters:

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# train_x, train_y, test, features and num_round are assumed to be defined upstream
params = {'num_leaves': 60,              # strongly affects the final result; too large -> overfitting
          'min_data_in_leaf': 30,
          'objective': 'binary',         # binary-classification objective
          'max_depth': -1,
          'learning_rate': 0.03,
          'min_sum_hessian_in_leaf': 6,
          'boosting': 'gbdt',
          'feature_fraction': 0.9,       # fraction of features sampled per tree
          'bagging_freq': 1,
          'bagging_fraction': 0.8,
          'bagging_seed': 11,
          'lambda_l1': 0.1,              # L1 regularization
          # 'lambda_l2': 0.001,          # L2 regularization
          'verbosity': -1,
          'nthread': -1,                 # number of threads; -1 uses all of them
          'metric': {'binary_logloss', 'auc'},  # evaluation metrics for binary classification
          'random_state': 2019,          # seed, so repeated runs give the same result
          # 'device': 'gpu'              # with the GPU build of lightgbm, speeds up training
          }
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

# train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])
    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

threshold = 0.5
result = (test_pred_prob > threshold).astype(int)   # final 0/1 labels

The objective function is binary and the evaluation metrics are {'binary_logloss', 'auc'}; one or several metrics can be set as needed. 'num_leaves' has a large impact on the final result: set it too high and the model overfits.
The two common 5-fold splitters are StratifiedKFold and KFold. The main difference is that StratifiedKFold performs stratified sampling, so each training/validation split keeps the same class proportions as the full dataset; in practice, try both on your data and compare.
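The difference is easy to see on imbalanced toy labels (a small demonstration, not from the original post):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 zeros, 10 ones.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

kf_ratios = [y[val].mean() for _, val in
             KFold(5, shuffle=True, random_state=0).split(X)]
skf_ratios = [y[val].mean() for _, val in
              StratifiedKFold(5, shuffle=True, random_state=0).split(X, y)]

print("KFold positives per fold:          ", kf_ratios)
print("StratifiedKFold positives per fold:", skf_ratios)
# StratifiedKFold keeps exactly 10% positives in every fold; plain KFold drifts.
```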

2 Multi-class classification: parameter selection

[lightgbm, xgboost, nn code notes 1] Using lightgbm for binary classification, multi-class classification, and regression tasks (with Python source code)
Official parameter documentation

params = {'num_leaves': 60,
          'min_data_in_leaf': 30,
          'objective': 'multiclass',   # note: multi-class objective
          'num_class': 33,             # note: the number of classes must be set
          'max_depth': -1,
          'learning_rate': 0.03,
          'min_sum_hessian_in_leaf': 6,
          'boosting': 'gbdt',
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.8,
          'bagging_seed': 11,
          'lambda_l1': 0.1,
          'verbosity': -1,
          'nthread': 15,
          'metric': 'multi_logloss',   # note: multi-class metric
          'random_state': 2019,
          # 'device': 'gpu'
          }
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], 33))
test_pred_prob = np.zeros((test.shape[0], 33))

# train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])
    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

result = np.argmax(test_pred_prob, axis=1)   # predicted class per test row

3 Regression: parameter settings

3.1 Case 1

[lightgbm, xgboost, nn code notes 1] Using lightgbm for binary classification, multi-class classification, and regression tasks (with Python source code)
Official parameter documentation

from sklearn.metrics import mean_absolute_error, mean_squared_error

params = {'num_leaves': 38,
          'min_data_in_leaf': 50,
          'objective': 'regression',   # regression objective
          'max_depth': -1,
          'learning_rate': 0.02,
          'min_sum_hessian_in_leaf': 6,
          'boosting': 'gbdt',
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.7,
          'bagging_seed': 11,
          'lambda_l1': 0.1,
          'verbosity': -1,
          'nthread': 4,
          'metric': 'mae',             # regression metric
          'random_state': 2019,
          # 'device': 'gpu'
          }

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape_func(preds, dtrain):
    label = dtrain.get_label()   # get_label() already returns a numpy array
    epsilon = 0.1
    summ = np.maximum(0.5 + epsilon, np.abs(label) + np.abs(preds) + epsilon)
    smape = np.mean(np.abs(label - preds) / summ) * 2
    return 'smape', float(smape), False

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof = np.zeros(train_x.shape[0])
predictions = np.zeros(test.shape[0])
train_y = np.log1p(train_y)      # log-smooth the target

feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])
    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

print('mse %.6f' % mean_squared_error(train_y, oof))
print('mae %.6f' % mean_absolute_error(train_y, oof))
result = np.expm1(predictions)   # undo the log1p transform

In this regression task a log transform is applied to the target; when the values to be predicted span a wide range, log smoothing is a good way to improve results.
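The transform pair used above is log1p/expm1, which is exactly invertible (a minimal demonstration):

```python
import numpy as np

y = np.array([1.0, 10.0, 1000.0, 1e6])   # target spanning many orders of magnitude
y_log = np.log1p(y)                       # train the model on this compressed scale
y_back = np.expm1(y_log)                  # invert after predicting

print(np.allclose(y, y_back))             # expm1 undoes log1p
```

log1p is preferred over plain log because it stays well-defined for zero targets.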

3.2 Case 2

LightGBM_Regression_pm25 case study

import lightgbm as lgbm
from sklearn.metrics import mean_squared_error as mse

def Train(data, modelcount, censhu, yanzhgdata):
    # modelcount = number of trees, censhu = max depth, yanzhgdata = validation data
    model = lgbm.LGBMRegressor(boosting_type='gbdt', objective='regression', num_leaves=1200,
                               learning_rate=0.17, n_estimators=modelcount, max_depth=censhu,
                               metric='rmse', bagging_fraction=0.8, feature_fraction=0.8, reg_lambda=0.9)
    model.fit(data[:, :-1], data[:, -1])
    # predictions on the training data
    train_out = model.predict(data[:, :-1])
    # training MSE
    train_mse = mse(data[:, -1], train_out)
    # predictions on the validation data
    add_yan = model.predict(yanzhgdata[:, :-1])
    # validation MSE
    add_mse = mse(yanzhgdata[:, -1], add_yan)
    print(train_mse, add_mse)
    return train_mse, add_mse

4 Other topics

4.1 Spark and LightGBM

Reference: "Hands-on! LightGBM algorithm principles, training and prediction"

The native Spark version of the LightGBM algorithm is integrated into Microsoft's open-source project MMLSpark (Microsoft Machine Learning for Apache Spark), a big-data implementation built on Apache Spark on top of the Microsoft Cognitive Toolkit (formerly CNTK). Because mmlspark bundles a large number of machine-learning and deep-learning algorithms, depending on it through Maven yields a huge jar (400 MB+); the project therefore has to be stripped down, keeping only the LightGBM algorithm (classification and regression are supported), and recompiled.

While developing the prediction code the author hit many pitfalls. Several scoring approaches were tried, including a PMML solution, the MMLSpark native prediction solution, and a Java reimplementation. The Java reimplementation was chosen in the end; the first two were abandoned for the following reasons:

1. The PMML solution has some scoring error, and its scoring latency did not meet the business requirements.

2. The MMLSpark native prediction code depends on an underlying C++ dynamic library and leaves room for optimization; scoring is slow because some C++ data objects have to be re-initialized on every call.


4.2 LightGBM is used a lot in competitions. Why so little elsewhere?

https://www.zhihu.com/question/344433472/answer/959927756

LightGBM is used in every big company. What you perceive as "very little" probably refers to online models? As far as I know, only Meituan and Alibaba have some online models that use a modified LightGBM for ranking, combined with a pairwise loss. It is mostly used for offline modelling, because although vanilla LightGBM uses cache acceleration and histogram approximation and needs no pre-sorted storage, it does not scale out well.

That is to say, running LightGBM directly on very large datasets is unwise, and no company does so.
It is mostly used to quickly validate whether the data and the idea are correct and feasible: many teams first run LightGBM on a small sample, and move on to deep models and algorithmic improvements once the effect is confirmed.
Finally, although LightGBM supports categorical variables directly and can output binnings, feature engineering still matters a lot, and parameter tuning takes time. Since it is not a novel application, no company promotes it deliberately.

Author: Turing's cat
Link: https://www.zhihu.com/question/344433472/answer/959927756
Source: Zhihu
Copyright belongs to the author. For commercial reprints please contact the author for authorization; for non-commercial reprints please credit the source.


5 Ranking algorithms and LightGBM

5.1 Case 1

Reference: https://lightgbm.apachecn.org/#/docs/5

Below are two rows from the MSLR-WEB10K dataset:

0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0
2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0
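This is the standard SVMlight/LETOR format: the first column is the relevance label, qid groups documents by query, and the rest are feature:value pairs. scikit-learn can parse it; a small sketch that writes two short rows (truncated to four features for illustration) to a temporary file:

```python
import tempfile
from sklearn.datasets import load_svmlight_file

rows = b"""0 qid:1 1:3 2:0 3:2 4:2
2 qid:1 1:3 2:3 3:0 4:0
"""

with tempfile.NamedTemporaryFile(suffix=".txt") as f:
    f.write(rows)
    f.flush()
    # query_id=True also returns the qid column, needed for ranking tasks.
    X, y, qid = load_svmlight_file(f.name, query_id=True)

print(y.tolist(), qid.tolist())   # labels and query ids
print(X.toarray())                # dense view of the sparse feature matrix
```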

5.2 Case 2

Using lightgbm for learning to rank
jiangnanboy/learning_to_rank

1. raw_train.txt

0 qid:10002 1:0.007477 2:0.000000 ... 45:0.000000 46:0.007042 #docid = GX008-86-4444840 inc = 1 prob = 0.086622
0 qid:10002 1:0.603738 2:0.000000 ... 45:0.333333 46:1.000000 #docid = GX037-06-11625428 inc = 0.0031586555555558 prob = 0.0897452 ...

Model parameters:

train_params = {
    'task': 'train',             # type of task to perform
    'boosting_type': 'gbrt',     # base learner
    'objective': 'lambdarank',   # ranking objective
    'metric': 'ndcg',            # evaluation metric
    'max_position': 10,          # optimize NDCG at this position
    'metric_freq': 1,            # output the metric every N iterations
    'train_metric': True,        # also report the metric on the training set
    'ndcg_at': [10],
    'max_bin': 255,              # integer; maximum number of bins (default 255). LightGBM compresses
                                 # memory accordingly: with max_bin=255, each feature value fits in a uint8.
    'num_iterations': 200,       # number of iterations, i.e. number of trees
    'learning_rate': 0.01,       # learning rate
    'num_leaves': 31,            # number of leaves
    'max_depth': 6,
    'tree_learner': 'serial',    # for parallel learning; 'serial' = single-machine tree learner
    'min_data_in_leaf': 30,      # minimum number of samples per leaf
    'verbose': 2                 # show training information
}
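The 'ndcg' metric above measures how well the predicted scores order documents within a query. scikit-learn's `ndcg_score` computes the same quantity (toy relevance labels, for illustration only):

```python
import numpy as np
from sklearn.metrics import ndcg_score

true_relevance = np.asarray([[3, 2, 0, 0, 1]])            # graded labels for one query
perfect_scores = np.asarray([[5.0, 4.0, 1.0, 0.0, 2.0]])  # ranks items in label order
bad_scores = np.asarray([[0.0, 1.0, 4.0, 5.0, 2.0]])      # roughly reversed order

print(ndcg_score(true_relevance, perfect_scores, k=10))   # perfect ranking -> 1.0
print(ndcg_score(true_relevance, bad_scores, k=10))       # well below 1.0
```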

6 Debugging

6.1 Non-ASCII characters and version problems

Error:
LightGBMError: Do not support non-ASCII characters in feature name
Error 2:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields xxxx

When these errors appear, comments further down suggest rolling the lightgbm version back to 2.2.3
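An alternative to downgrading is to sanitize the column names before building the Dataset (a common workaround, not from the original post; beware that distinct names can collide after sanitizing):

```python
import re
import pandas as pd

df = pd.DataFrame({"价格": [1, 2], "city name": [3, 4], "ok_col": [5, 6]})

# Replace anything that is not an ASCII letter, digit, or underscore.
df.columns = [re.sub(r"[^0-9a-zA-Z_]", "_", col) for col in df.columns]
print(list(df.columns))   # ['__', 'city_name', 'ok_col']
```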

Copyright notice
This article was written by [Understanding oneself]; when reprinting, please include a link to the original. Thanks.
