Author: Louis Chan | Compiled by: VK | Source: Towards Data Science
Python is the coolest programming language today (thanks to machine learning and data science), but compared with C, one of the best programming languages, it is not very efficient.
When developing machine learning models, we very often need to hard-code rules derived from statistical analysis or from the results of a previous iteration, and then update them programmatically. There is no shame in admitting it: I kept writing this kind of code with Pandas apply until one day I got so tired of the nesting that I decided to research (that is, Google) more maintainable and more efficient alternatives.
Demo dataset
The dataset we are going to use is the iris dataset, which you can get for free through pandas or seaborn.
import numpy as np  # needed for the np.where / np.select sections
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# import seaborn as sns
# iris = sns.load_dataset("iris")
[Figure: the first 5 rows of the iris dataset]
[Figure: summary statistics of the dataset]
Suppose that after the initial analysis we want to label the dataset with the following logic:
- If sepal length < 5.1, the label is 0;
- otherwise, if sepal width > 3.3 and sepal length < 5.8, the label is 1;
- otherwise, if sepal width > 3.3 and petal length > 5.1, the label is 2;
- otherwise, if sepal width > 3.3, petal length < 1.6, and either sepal length < 6.4 or petal width < 1.3, the label is 3;
- otherwise, if sepal width > 3.3 and either sepal length < 6.4 or petal width < 1.3, the label is 4;
- otherwise, if sepal width > 3.3, the label is 5;
- otherwise, the label is 6.
Before diving into the code, let's quickly set a new label column to None:
iris['label'] = None
Pandas .iterrows() + nested if-else blocks
If you're still using this , This post is definitely the right place for you !
%%timeit
for idx, row in iris.iterrows():
    if row['sepal_length'] < 5.1:
        iris.loc[idx, 'label'] = 0
    elif row['sepal_width'] > 3.3:
        if row['sepal_length'] < 5.8:
            iris.loc[idx, 'label'] = 1
        elif row['petal_length'] > 5.1:
            iris.loc[idx, 'label'] = 2
        elif (row['sepal_length'] < 6.4) or (row['petal_width'] < 1.3):
            if row['petal_length'] < 1.6:
                iris.loc[idx, 'label'] = 3
            else:
                iris.loc[idx, 'label'] = 4
        else:
            iris.loc[idx, 'label'] = 5
    else:
        iris.loc[idx, 'label'] = 6
1min 29s ± 8.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
That took a long time... OK, let's move on.
Pandas .apply
Pandas .apply applies a function along an axis of a DataFrame, or to a Series. For example, if we have a function f that sums a sequence of numbers (it could be a list, an np.array, a tuple, and so on) and pass it to a DataFrame as shown here, we get the sum across each row:
def f(numbers):
    return sum(numbers)
df['Row Subtotal'] = df.apply(f, axis=1)
Here the function is applied with axis=1. By default, apply uses axis=0, which passes each column to the function; axis=1 passes each row to the function instead, which is what a per-row subtotal needs.
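A minimal sketch of the two axis settings, using a made-up two-column frame (the names df, a, and b are illustrative and are not part of the article's data):

```python
import pandas as pd

# Hypothetical toy frame, not the iris data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def f(numbers):
    return sum(numbers)

# axis=0 (the default): f receives each column as a Series,
# so we get one result per column.
col_totals = df.apply(f, axis=0)

# axis=1: f receives each row as a Series,
# so we get one result per row.
row_totals = df.apply(f, axis=1)

print(col_totals.tolist())  # [6, 60]
print(row_totals.tolist())  # [11, 22, 33]
```

For a rule function that reads several columns of the same record, axis=1 is the setting that hands it one row at a time.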
Now that we have a basic understanding of pandas.apply, let's code the labeling logic and see how long it takes to run:
%%timeit
def rules(row):
    if row['sepal_length'] < 5.1:
        return 0
    elif row['sepal_width'] > 3.3:
        if row['sepal_length'] < 5.8:
            return 1
        elif row['petal_length'] > 5.1:
            return 2
        elif (row['sepal_length'] < 6.4) or (row['petal_width'] < 1.3):
            if row['petal_length'] < 1.6:
                return 3
            return 4
        return 5
    return 6
iris['label'] = iris.apply(rules, axis=1)
1.43 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Taking only 1.43 s for 150,000 rows is a big improvement over the previous approach, but it is still very slow.
Imagine you need to process a dataset of millions of transactions or credit approvals: every time we apply a set of rules to engineer a column, it will take more than 14 seconds. Run enough columns and there goes your afternoon.
Pandas.loc[]
If you are familiar with SQL, assigning a new column with .loc[] is essentially an UPDATE statement with a WHERE condition. It should therefore be much better than applying a function to each row or column.
%%timeit
iris['label'] = 6
iris.loc[iris['sepal_width'] > 3.3, 'label'] = 5
iris.loc[
    (iris['sepal_width'] > 3.3) &
    ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3)),
    'label'] = 4
iris.loc[
    (iris['sepal_width'] > 3.3) &
    ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3)) &
    (iris['petal_length'] < 1.6),
    'label'] = 3
iris.loc[
    (iris['sepal_width'] > 3.3) &
    (iris['petal_length'] > 5.1),
    'label'] = 2
iris.loc[
    (iris['sepal_width'] > 3.3) &
    (iris['sepal_length'] < 5.8),
    'label'] = 1
iris.loc[
    (iris['sepal_length'] < 5.1),
    'label'] = 0
13.3 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now we are spending only about a hundredth of the time of the previous approach (13.3 ms versus 1.43 s), which means you have run out of excuses to leave your desk when working from home. However, so far we have only used pandas built-in functions. Although pandas gives us a very convenient high-level interface to tabular data, efficiency can suffer from the layers of abstraction.
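One detail of the .loc[] version is easy to miss: the assignments run from the lowest-priority rule (label 6) to the highest-priority rule (label 0), because each later assignment overwrites the earlier ones. A minimal sketch of why the order matters, using a made-up column x and made-up band names:

```python
import pandas as pd

# Hypothetical toy frame with one made-up column.
df = pd.DataFrame({"x": [1, 5, 9]})

# Target logic: if x < 3 -> "low"; elif x < 7 -> "mid"; else -> "high".
# With .loc, assign in REVERSE priority so later writes win.
df["band"] = "high"                  # the final else branch goes first
df.loc[df["x"] < 7, "band"] = "mid"  # broader rule next
df.loc[df["x"] < 3, "band"] = "low"  # highest-priority rule last

# Same rules in if/elif order: the broad rule clobbers the narrow one.
wrong = pd.Series("high", index=df.index)
wrong[df["x"] < 3] = "low"
wrong[df["x"] < 7] = "mid"  # overwrites the "low" row as well

print(df["band"].tolist())  # ['low', 'mid', 'high']
print(wrong.tolist())       # ['mid', 'mid', 'high']
```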
Numpy.where
NumPy offers a lower-level interface that allows more efficient interaction with n-dimensional iterables (vectors, matrices, tensors, and so on). Its methods are usually implemented in C, and for more complex computations it uses optimized algorithms, which makes it faster than our reinvented wheels.
According to NumPy's official documentation, np.where() accepts the following syntax:
np.where(condition, return value if True, return value if False)
Essentially, it is a binary choice: the condition is evaluated as a Boolean and a value is returned accordingly. The trick is that the condition can itself be an iterable (namely a Boolean ndarray). This means we can pass df['feature'] == 1 as the condition and write the where logic as:
np.where(
    df['feature'] == 1,
    'It is one',
    'It is not one'
)
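To make the snippet above runnable, here is a small sketch with a made-up feature column (the values are illustrative only). It also shows that np.where hands back a NumPy array rather than a pandas Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values for the 'feature' column.
df = pd.DataFrame({"feature": [1, 2, 1, 3]})

labels = np.where(
    df["feature"] == 1,  # Boolean condition, evaluated element-wise
    "It is one",         # value used where the condition is True
    "It is not one",     # value used where the condition is False
)

print(type(labels).__name__)  # ndarray
print(labels.tolist())
# ['It is one', 'It is not one', 'It is one', 'It is not one']
```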
So you might ask: how do we implement the logic above with a binary function like np.where()? The answer is simple, yet disturbing: nest np.where().
%%timeit
iris['label'] = np.where(
    iris['sepal_length'] < 5.1,
    0,
    np.where(
        iris['sepal_width'] > 3.3,
        np.where(
            iris['sepal_length'] < 5.8,
            1,
            np.where(
                iris['petal_length'] > 5.1,
                2,
                np.where(
                    (iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3),
                    np.where(
                        iris['petal_length'] < 1.6,
                        3,
                        4
                    ),
                    5
                )
            )
        ),
        6
    )
)
3.6 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Congratulations, you made it. I can't tell you how much time I spent counting closing brackets, but hey, it's done! We shaved roughly another 10 ms off pandas .loc[] (13.3 ms down to 3.6 ms). However, this snippet is simply unmaintainable, which means it is not acceptable.
Numpy.select
Unlike np.where, np.select is a function designed for multi-branch logic.
np.select(condlist, choicelist, default=0)
Its syntax is similar to np.where, but the first argument is now a list of conditions, which must be the same length as the list of choices. One thing to remember when using np.select is that a choice is selected as soon as the first matching condition is met.
This means that if a superset rule appears before a subset rule in the list, the subset choice will never be selected. Concretely:
condlist = [
    df['A'] <= 1,
    df['A'] < 1
]
choicelist = ['<=1', '<1']
selection = np.select(condlist, choicelist, default='>1')
Because every row that satisfies df['A'] < 1 is also captured by df['A'] <= 1, no row ends up labeled '<1'. To avoid this, make sure the more specific rule comes before the less specific one:
condlist = [
    df['A'] < 1,   # <──┬ swapped
    df['A'] <= 1   # <──┘
]
choicelist = ['<1', '<=1']  # Remember to update this as well!
selection = np.select(condlist, choicelist, default='>1')
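The effect of the ordering can be checked on a tiny frame. This is a sketch with made-up values for column A, not data from the article:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one below 1, one equal to 1, one above 1.
df = pd.DataFrame({"A": [0.5, 1.0, 2.0]})

# Superset rule first: '<1' can never be chosen.
superset_first = np.select(
    [df["A"] <= 1, df["A"] < 1],
    ["<=1", "<1"],
    default=">1",
)

# Subset (more specific) rule first: every label is reachable.
subset_first = np.select(
    [df["A"] < 1, df["A"] <= 1],
    ["<1", "<=1"],
    default=">1",
)

print(superset_first.tolist())  # ['<=1', '<=1', '>1']
print(subset_first.tolist())    # ['<1', '<=1', '>1']
```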
As you can see above, you need to update both condlist and choicelist to keep the code running correctly. But seriously, that extra step costs us time too. By changing this into a dictionary, we get roughly the same time and memory complexity, but with a snippet that is easier to maintain:
%%timeit
# Each key is a label and each value is its condition; dicts keep insertion
# order, so the rules are checked from top to bottom.
rules = {
    0: (iris['sepal_length'] < 5.1),
    1: (iris['sepal_width'] > 3.3) & (iris['sepal_length'] < 5.8),
    2: (iris['sepal_width'] > 3.3) & (iris['petal_length'] > 5.1),
    3: (
        (iris['sepal_width'] > 3.3) &
        ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3)) &
        (iris['petal_length'] < 1.6)
    ),
    4: (
        (iris['sepal_width'] > 3.3) &
        ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3))
    ),
    5: (iris['sepal_width'] > 3.3),
}
# Wrap the dict views in list() so np.select receives plain lists.
iris['label'] = np.select(list(rules.values()), list(rules.keys()), default=6)
6.29 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is about half as fast as nested np.where (6.29 ms versus 3.6 ms), but it not only saves you from debugging nested brackets, it also keeps each choice next to its condition, so choicelist can no longer drift out of sync. I have forgotten to update choicelist too many times, and it cost me more than four times as long debugging my machine learning models. Trust me, np.select with a dict is a very good choice.
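When rewriting nested if-else rules as a dictionary for np.select, it is worth confirming that both versions assign the same labels. The sketch below does that on synthetic data: the column names match iris, but the values are randomly generated stand-ins, and the names demo and rule_map are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for iris (random values, same column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sepal_length": rng.uniform(4.0, 8.0, 500),
    "sepal_width": rng.uniform(2.0, 4.5, 500),
    "petal_length": rng.uniform(1.0, 7.0, 500),
    "petal_width": rng.uniform(0.1, 2.5, 500),
})

def rules(row):
    # Row-wise reference implementation of the labeling logic.
    if row["sepal_length"] < 5.1:
        return 0
    elif row["sepal_width"] > 3.3:
        if row["sepal_length"] < 5.8:
            return 1
        elif row["petal_length"] > 5.1:
            return 2
        elif (row["sepal_length"] < 6.4) or (row["petal_width"] < 1.3):
            if row["petal_length"] < 1.6:
                return 3
            return 4
        return 5
    return 6

wide = demo["sepal_width"] > 3.3
short_or_thin = (demo["sepal_length"] < 6.4) | (demo["petal_width"] < 1.3)

# Label -> condition, listed from most to least specific.
rule_map = {
    0: demo["sepal_length"] < 5.1,
    1: wide & (demo["sepal_length"] < 5.8),
    2: wide & (demo["petal_length"] > 5.1),
    3: wide & short_or_thin & (demo["petal_length"] < 1.6),
    4: wide & short_or_thin,
    5: wide,
}

expected = demo.apply(rules, axis=1).to_numpy()
actual = np.select(list(rule_map.values()), list(rule_map.keys()), default=6)

print(bool((expected == actual).all()))  # True
```

A check like this is cheap to keep next to the rules, and it catches an ordering mistake as soon as a rule is added or moved.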
Honorable mentions
- NumPy's vectorized operations: if your code involves loops that evaluate unary functions, binary functions, or functions that operate on a sequence of numbers, you should refactor it by converting the data into NumPy ndarrays and make the most of NumPy's vectorized operations to greatly speed up your script. See examples of unary, binary, and sequential functions here: https://www.pythonlikeyoumeanit.com/Module3_IntroducingNumpy/VectorizedOperations.html#NumPy%E2%80%99s-Mathematical-Functions
- np.vectorize: don't be fooled by the name. It is only a convenience function and does not make your code run faster. To use it, first code your logic as a callable function, then run np.vectorize(your_function)(your_data_series). Another big drawback is that you need to convert the DataFrame into one-dimensional iterables to pass to the "vectorized" function. Verdict: unless you want it purely for convenience, don't use np.vectorize.
- numba.njit: now this is the real speed-up. It tries to move any NumPy values as close to C as possible to improve efficiency. While it can speed up numerical computation, it is also limited to numerical computation, which means no pandas Series and no string indexing, only NumPy ndarrays of int, float, datetime, bool, and category types. Verdict: if you are comfortable working only with NumPy ndarrays and can express your logic as purely numerical computation, it is a very good choice. Learn more here: https://numba.pydata.org/numba-doc/dev/user/5minguide.html
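As a rough illustration of the np.vectorize point, the sketch below wraps a made-up two-column rule (a simplified stand-in, not the article's full rule set) and compares it with an array-level np.select that gives the same labels. The arrays are made-up values.

```python
import numpy as np

def simple_rule(sepal_length, sepal_width):
    # Scalar logic written with ordinary if/elif (hypothetical rule).
    if sepal_length < 5.1:
        return 0
    elif sepal_width > 3.3:
        return 1
    return 2

# Made-up one-dimensional inputs, one array per column.
sepal_length = np.array([5.0, 6.0, 6.0])
sepal_width = np.array([3.0, 3.5, 3.0])

# np.vectorize calls simple_rule once per element in a Python-level loop:
# convenient to write, but not a performance tool.
via_vectorize = np.vectorize(simple_rule)(sepal_length, sepal_width)

# Array-level equivalent: each condition is evaluated on whole arrays.
via_select = np.select(
    [sepal_length < 5.1, sepal_width > 3.3],
    [0, 1],
    default=2,
)

print(via_vectorize.tolist())  # [0, 1, 2]
print(via_select.tolist())     # [0, 1, 2]
```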
Conclusion
If possible, go for numba.njit; otherwise, np.select with a dict will take you a long way. Remember, every bit of improvement helps!
Original article: https://towardsdatascience.com/efficient-implementation-of-conditional-logic-on-pandas-dataframes-4afa61eb7fce