Of the people who set out to learn Pandas, more than 60% end up back in the arms of Excel. Mostly this is because, when they first start processing data with Python,
selecting the rows and columns they want is too painful, with none of Excel's click-wherever-you-want pleasure.
In our first look at Pandas, for reasons of length, we covered only the most basic column indexing, but that clearly cannot meet everyone's growing demand for personalized selection. To ease the pain, increase the pleasure and meet that demand, this second part takes
indexing out on its own and introduces two common indexing methods in detail:
The first is position-based (integer) indexing. The cases are short and straightforward, and a rough idea is enough. It is useful in practice, but not as widely used as the second.
The second is name-based (label) indexing. This is the part to pay close attention to and practice, because it will be an important cornerstone for data cleaning and analysis later on.
First, a brief introduction to the case data used in the exercises:
| Source of flow | Source details | Number of visitors | Payment conversion rate | Customer unit price |
As with the first dataset, it records the different traffic sources and, for the source details of each channel, the corresponding number of visitors, payment conversion rate and customer unit price. The dataset is short (a more complex case dataset will arrive at the end of the basics series), but it is representative enough. Let's start our indexing show.
Let's see how position-based indexing works:
df.iloc[row index, column index]
The first position is the row index: enter the position parameters of the rows we want to take.
The second position is the column index: enter the position parameters of the columns we want to take.
We fill in the row and column parameters according to the actual situation.
Scene 1. Goal: select all rows where Source of flow equals first level.
Idea: count down the screen with a finger. The first-level channels run from row 1 to row 13, so the corresponding row indexes are 0 to 12. Python slicing includes the start and excludes the end by default, so to select the rows with index 0 to 12 we enter 0:13. For the columns we want everything, so a single colon, :, will do.
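As a minimal sketch of this first scene, the snippet below builds a small stand-in table, since the real case data is not reproduced here. The column names follow the header above; the 18 rows, the value 'first level' and every number are made-up placeholders, chosen only so that the first-level rows sit at positions 0 to 12.

```python
import pandas as pd

# Hypothetical stand-in data: 13 first-level rows, then 5 second/third-level rows.
levels = ['first level'] * 13 + ['second level'] * 3 + ['Level three'] * 2
df2 = pd.DataFrame({
    'Source of flow': levels,
    'Source details': [f'channel {i}' for i in range(18)],
    'Number of visitors': list(range(100, 1900, 100)),
    'Payment conversion rate': [0.01 * (i % 5 + 1) for i in range(18)],
    'Customer unit price': [50 + 3 * i for i in range(18)],
})

# Rows at positions 0 to 12 (the slice end, 13, is excluded), all columns.
first_level = df2.iloc[0:13, :]
print(first_level.shape)  # (13, 5)
```

The slice 0:13 only works because the first-level rows happen to be the first 13 rows; if the table were sorted differently, the positions would have to be counted again.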
Scene 2. Goal: look at the Source of flow and Customer unit price columns for all channels.
Idea: all traffic channels means all rows, so in the row parameter position we enter :. Now the columns: Source of flow is the 1st column and Customer unit price is the 5th column, so the corresponding column indexes are 0 and 4.
Note that to select columns that are not adjacent, the positional parameters have to be built into a list, here [0, 4]. For a contiguous selection no list is needed; simply enter a slice such as
0:5 (which selects the columns from index 0 through index 4).
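The difference between a list of positions and a slice can be sketched as follows. The three-row table is a made-up stand-in with the article's column names; only the column positions matter here.

```python
import pandas as pd

# Hypothetical stand-in data; values are placeholders.
df2 = pd.DataFrame({
    'Source of flow': ['first level', 'second level', 'Level three'],
    'Source details': ['channel 0', 'channel 1', 'channel 2'],
    'Number of visitors': [300, 120, 80],
    'Payment conversion rate': [0.05, 0.02, 0.03],
    'Customer unit price': [60, 75, 90],
})

# Non-adjacent columns: the positions go in a list.
source_and_price = df2.iloc[:, [0, 4]]

# Adjacent columns: a slice is enough; 0:5 covers positions 0 through 4.
first_five = df2.iloc[:, 0:5]

print(list(source_and_price.columns))
```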
Scene 3. Goal: look at the number of visitors and payment conversion rate for the second-level and third-level traffic sources, together with their source details.
Idea: rows first. The second-level and third-level channels have row indexes 13 to 17; by the same
include-the-start, exclude-the-end rule, the row parameter we pass in is 13:18. We need Source of flow, Source details, Number of visitors and Payment conversion rate, which are the first 4 columns, so the column parameter is 0:4.
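A sketch of this scene, again on a made-up 18-row stand-in whose second-level and third-level rows are placed at positions 13 to 17 to match the description (all values are placeholders):

```python
import pandas as pd

# Hypothetical stand-in data: rows 13 to 17 hold the second and third levels.
levels = ['first level'] * 13 + ['second level'] * 3 + ['Level three'] * 2
df2 = pd.DataFrame({
    'Source of flow': levels,
    'Source details': [f'channel {i}' for i in range(18)],
    'Number of visitors': list(range(100, 1900, 100)),
    'Payment conversion rate': [0.01 * (i % 5 + 1) for i in range(18)],
    'Customer unit price': [50 + 3 * i for i in range(18)],
})

# Rows 13 to 17 and the first four columns; both slice ends are excluded.
subset = df2.iloc[13:18, 0:4]
print(subset.shape)  # (5, 4)
```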
To allow a side-by-side comparison, we reuse the same three scenes for name-based (label) indexing with loc.
Scene 1. Idea: this time we don't have to count positions one by one. To select all rows where the traffic source is first level, we only need to make a judgment on the Source of flow column: which of its values are equal to first level.
The result of that judgment is made up of True and False (Boolean) values, which here mean equal to first level and not equal to first level respectively. In the loc method, we can pass this column of values to the row parameter position. Pandas then returns the rows where the result is True (here, the rows with index 0 to 12) and drops the rows where it is False.
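The judgment and the selection can be sketched in two steps. The four-row table and the value 'first level' are made-up placeholders; only the pattern of building a Boolean column and handing it to loc comes from the text.

```python
import pandas as pd

# Hypothetical stand-in data; values are placeholders.
df2 = pd.DataFrame({
    'Source of flow': ['first level', 'first level', 'second level', 'Level three'],
    'Source details': ['channel 0', 'channel 1', 'channel 2', 'channel 3'],
    'Number of visitors': [300, 120, 80, 40],
})

# Step 1: the judgment yields one True/False value per row.
mask = df2['Source of flow'] == 'first level'
print(mask.tolist())  # [True, True, False, False]

# Step 2: pass the Boolean column as the row parameter; ':' keeps all columns.
first_level = df2.loc[mask, :]
```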
Scene 2. Idea: all channels means all rows, so in the row parameter we simply enter :. To extract the Source of flow and Customer unit price columns, enter their names directly in the column parameter position. Because two columns are involved, the names have to be wrapped in a list.
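A sketch of selecting columns by name, on a made-up three-row stand-in (values are placeholders):

```python
import pandas as pd

# Hypothetical stand-in data; values are placeholders.
df2 = pd.DataFrame({
    'Source of flow': ['first level', 'second level', 'Level three'],
    'Source details': ['channel 0', 'channel 1', 'channel 2'],
    'Number of visitors': [300, 120, 80],
    'Payment conversion rate': [0.05, 0.02, 0.03],
    'Customer unit price': [60, 75, 90],
})

# All rows, two columns picked by name; the names are wrapped in a list.
source_and_price = df2.loc[:, ['Source of flow', 'Customer unit price']]
print(list(source_and_price.columns))
```

Unlike the position-based version, this keeps working if the columns are later reordered, because it does not depend on where each column sits.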
Scene 3. Idea: extract the rows with a judgment, and extract the columns by entering their names.
df2.loc[df2['Source of flow'].isin(['second level', 'Level three']), ['Source of flow', 'Source details', 'Number of visitors', 'Payment conversion rate']]
Here is a quick plug for the
isin function. It lets us quickly check whether each value of a column (Series) in the source data is equal to one of the values in a list. In this case, df2['Source of flow'].isin(['second level', 'Level three']) checks whether each value of the Source of flow column equals "second level" or "Level three". If it equals either of them, the result is True; otherwise it is False. Passing this Boolean result to the row parameter easily gives us the channels whose traffic source is second level or third level.
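To make the behaviour of isin concrete, the sketch below compares it with the same test written as two equality checks joined by |. The four-row column is a made-up stand-in.

```python
import pandas as pd

# Hypothetical stand-in column; values are placeholders.
source = pd.Series(
    ['first level', 'second level', 'Level three', 'first level'],
    name='Source of flow',
)

# True wherever the value matches any item in the list.
mask = source.isin(['second level', 'Level three'])
print(mask.tolist())  # [False, True, True, False]

# The same test written out by hand, one comparison per value.
by_hand = (source == 'second level') | (source == 'Level three')
print(mask.equals(by_hand))  # True
```

With two values the hand-written form is still manageable; isin mainly pays off as the list of accepted values grows.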
Since loc has more application scenarios, it deserves a bonus: let's practice on a down-to-earth scene.
Before getting to the scene, let's spend 30 seconds going over how to compute statistics on a column (Series) in Pandas. The operations are as follows:
df2['Number of visitors'].mean()
df2['Number of visitors'].std()
df2['Number of visitors'].median()
df2['Number of visitors'].max()
df2['Number of visitors'].min()
Just add a suffix, and the mean, standard deviation and other statistics come out. With that covered, we enter scene four.
Scene 4: For traffic channel data, what we should really focus on are the high-quality channels. If we define a high-quality channel as one whose number of visitors, conversion rate and customer unit price are all higher than average, how do we find these channels?
Idea: a high-quality channel has to be above average on visitors, conversion rate and customer unit price at the same time, and this is the key to solving the problem. Start by looking at the means, calling mean() on each of the three columns as shown above.
Then check whether each metric column is greater than its mean:
df2['Number of visitors'] > df2['Number of visitors'].mean()
df2['Payment conversion rate'] > df2['Payment conversion rate'].mean()
df2['Customer unit price'] > df2['Customer unit price'].mean()
The three conditions must hold at the same time, so they are in an "and" relationship. In pandas, conditions that must all hold are joined with the
& symbol, and each condition should be wrapped in parentheses, because & binds more tightly than comparison operators such as >. If the relationship is
"or" (any one condition holds), join them with the
| symbol instead. Joined with &, our three conditions become:
(df2['Number of visitors'] > df2['Number of visitors'].mean()) & (df2['Payment conversion rate'] > df2['Payment conversion rate'].mean()) & (df2['Customer unit price'] > df2['Customer unit price'].mean())
After joining them this way, a True result means the channel is above average on visitors, conversion rate and customer unit price at the same time. Next, we only need to pass these values to the row parameter position.
df2.loc[(df2['Number of visitors'] > df2['Number of visitors'].mean()) & (df2['Payment conversion rate'] > df2['Payment conversion rate'].mean()) & (df2['Customer unit price'] > df2['Customer unit price'].mean()), :]
At this point, we have directly screened out the 4 high-quality channels whose key indicators are all higher than average.
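The long one-line filter can also be written with one named condition per metric, which some readers may find easier to check. This is only a sketch on a made-up five-row stand-in, so it yields 2 matching rows rather than the 4 found in the real data; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in data; values are placeholders.
df2 = pd.DataFrame({
    'Source details': ['channel 0', 'channel 1', 'channel 2', 'channel 3', 'channel 4'],
    'Number of visitors': [500, 100, 400, 50, 450],
    'Payment conversion rate': [0.05, 0.06, 0.01, 0.02, 0.04],
    'Customer unit price': [80, 90, 70, 20, 90],
})

# One Boolean column per metric: is the value above that column's mean?
many_visitors = df2['Number of visitors'] > df2['Number of visitors'].mean()
good_conversion = df2['Payment conversion rate'] > df2['Payment conversion rate'].mean()
high_price = df2['Customer unit price'] > df2['Customer unit price'].mean()

# '&' keeps only the rows where all three conditions are True.
high_quality = df2.loc[many_visitors & good_conversion & high_price, :]
print(high_quality['Source details'].tolist())  # ['channel 0', 'channel 4']
```

Because each condition is already stored in its own variable, no extra parentheses are needed when joining them with &.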
There is also a third method, df.ix[row, column], but newer versions of pandas no longer recommend it (it was deprecated in pandas 0.20 and removed in pandas 1.0), so it is better to stick with method 1 or 2.
For both indexing methods, position-based (integer) indexing and name-based (label) indexing, the key is to take the rows and columns you want to select in your mind and map them to the corresponding row and column parameters.
With a little practice, we can process and analyze data with pandas however we like. Once you reach that point, you'll find that, compared with Excel, Python is really quite beautiful.