1 Introduction
When doing data analysis with pandas, try to avoid fragmenting your code too much, especially by creating lots of unnecessary intermediate variables. They waste memory, create variable-naming headaches, and hurt the readability of the overall analysis workflow. It is therefore worth organizing your code in a pipeline style.
In some of my previous articles, I introduced query() and related APIs that help us write chained code and build practical data-analysis workflows. Combined with pipe(), covered below, we can organize virtually any pandas code into a clean pipeline.
2 Using pipe() flexibly in pandas
As its name suggests, pipe() is an API designed for pipeline-style transformations of a DataFrame. Its job is to turn a nested function-call process into a chained one. Its first parameter, func, takes the function to apply to the corresponding DataFrame.
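To make the nested-versus-chained contrast concrete, here is a minimal sketch (the toy frame and the helper functions double and add_one are hypothetical, just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

def double(data):
    return data * 2

def add_one(data):
    return data + 1

# nested calls must be read inside-out
nested = add_one(double(df))

# pipe() expresses the same computation as a left-to-right chain
chained = df.pipe(double).pipe(add_one)

print(chained['a'].tolist())  # -> [3, 5, 7]
```

Both forms produce the same result; the chained version simply reads in the order the operations happen.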
There are two ways to use pipe(). In the first, the first positional parameter of the passed-in function must be the target DataFrame, and any other parameters are passed in as regular keyword arguments. In the example below, we write our own function to perform some basic feature-engineering steps on the Titanic dataset:
import pandas as pd

train = pd.read_csv('train.csv')

def do_something(data, dummy_columns):
    '''Sample user-defined function'''
    data = (
        pd
        # generate dummy variables for the specified columns,
        # dropping the first level of each
        .get_dummies(data,
                     columns=dummy_columns,
                     drop_first=True)
    )
    return data

# chained pipeline
(
    train
    # convert Pclass to string so it is dummy-encoded later
    .eval('Pclass=Pclass.astype("str")', engine='python')
    # drop the specified columns
    .drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'])
    # use pipe() to call the custom function inside the chain
    .pipe(do_something, dummy_columns=['Pclass', 'Sex', 'Embarked'])
    # drop rows with missing values
    .dropna()
)
As you can see, immediately after drop() we call pipe(), passing our custom function as its first argument, which neatly embeds a whole series of operations into the chain.
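Since the Titanic example above depends on a local train.csv, here is a self-contained sketch of the same first-style usage on a tiny hypothetical frame (the column names cls and val and the helper encode are made up for illustration):

```python
import pandas as pd

# tiny illustrative frame (hypothetical data, not the Titanic set)
df = pd.DataFrame({
    'cls': ['a', 'b', 'a', 'b'],
    'val': [1.0, 2.0, None, 4.0],
})

def encode(data, dummy_columns):
    """Dummy-encode the given columns, dropping the first level of each."""
    return pd.get_dummies(data, columns=dummy_columns, drop_first=True)

out = (
    df
    .pipe(encode, dummy_columns=['cls'])  # df arrives as the first argument
    .dropna()                             # drop the row with the missing val
)

print(list(out.columns))  # -> ['val', 'cls_b']
```

Because the target frame is the first parameter of encode, pipe() only needs the remaining arguments as keywords.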
The second way to use it applies when the target DataFrame is not the first parameter of the passed-in function. For example, in the following case we assume the target data should arrive as the second parameter; the first argument to pipe() should then be a tuple of the form (function, 'parameter name'):
def do_something(data1, data2, axis):
    '''Sample user-defined function'''
    data = (
        pd
        .concat([data1, data2], axis=axis)
    )
    return data

# the second way to use pipe()
(
    train
    .pipe((do_something, 'data2'), data1=train, axis=0)
)
With this design we can avoid many layers of nested function calls and streamline our code freely.
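The tuple form can be verified with a self-contained sketch (the frame df and the helper stack are hypothetical stand-ins for train and do_something above):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2]})

def stack(data1, data2, axis):
    """Concatenate two frames; the target frame is NOT the first parameter."""
    return pd.concat([data1, data2], axis=axis)

# (stack, 'data2') tells pipe() to feed the chained frame in as data2
out = df.pipe((stack, 'data2'), data1=df, axis=0)

print(len(out))  # -> 4
```

Here pipe() calls stack(data1=df, data2=df, axis=0), stacking the frame on top of itself, so the result has four rows.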
That's all for this article; you're welcome to discuss it with me in the comments section!