Python: the pandas module

LinMz 2020-11-19 03:47:52


Pandas is a powerful tool for data analysis; you can think of it as Excel for Python. It is an extension library of NumPy, and combined with NumPy it solves most everyday data problems. I will introduce some of pandas' common functions and features from the following aspects:

  • Series
  • DataFrames
  • Missing Data (null / invalid values)
  • GroupBy (grouping)
  • Merging, Joining, and Concatenating
  • Operations
  • Data Input and Output

Series

A Series is much like a NumPy array. The differences: a Series carries axis labels, which need not be numbers, and the Series can be indexed by those labels, whereas an array can only be indexed by integer position. In addition, the data in a Series can be any type of object, while an array holds a single (typically numeric) dtype. In that respect a Series behaves a bit like a hash table.

Let's look at a few examples:

import numpy as np
import pandas as pd

Create a Series

You can convert a list, a NumPy array, or a dictionary to a Series:

labels = ['a','b','c'] #list
my_list = [10,20,30] #list
arr = np.array([10,20,30]) # numpy array
d = {'a':10,'b':20,'c':30} # dictionary

Using lists

pd.Series(data=my_list)
0 10
1 20
2 30
dtype: int64
pd.Series(data=my_list,index=labels) # the two main Series parameters: data first, index second
a 10
b 20
c 30
dtype: int64
pd.Series(my_list,labels) # in a formal project, it's best to pass each argument by parameter name

I'll introduce more data science tools later. Some of their functions are similar to NumPy's or pandas', but more specialized and more helpful for business work. So in projects without strict code-format and reporting requirements you can be casual, but in serious business projects it's better to follow some format conventions.

a 10
b 20
c 30
dtype: int64

NumPy Arrays

pd.Series(arr) # axis labels default to the integer positions 0..n-1
0 10
1 20
2 30
dtype: int64
pd.Series(arr,labels)
a 10
b 20
c 30
dtype: int64

Dictionary

pd.Series(d)
a 10
b 20
c 30
dtype: int64

Data types in a Series

A pandas Series can hold a variety of data types:

pd.Series(data=labels)
0 a
1 b
2 c
dtype: object
# even functions work
pd.Series([sum,print,len])
0 <built-in function sum>
1 <built-in function print> # in Python 2 this requires 'from __future__ import print_function' at the top of the file
2 <built-in function len>
dtype: object

The full list of built-in functions can be found in the Python documentation.

Using the index

The key to using a Series is its index. A Series has two key parameters, data and index, so the index is just as important as the data, and knowing how to use it makes lookups faster. Here are some examples of using the index with a Series. First we create two Series, ser1 and ser2:

ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2
USA 1
Germany 2
Italy 5
Japan 4
dtype: int64
ser1['USA']
1

Arithmetic operations are also aligned on the index

ser1 + ser2
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64 # an arithmetic operation converts the result to float, and the result is sorted by index

A brief aside: to add data to a Series, set_value is typically used, passing the index label and then the value. The append and add methods can only combine with another Series.

ser2.set_value('Canada',7)
USA 1
Germany 2
Italy 5
Japan 4
China 6
Canada 7
dtype: int64
ser1 + ser2
Canada NaN
China NaN
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
# only index labels that both Series have in common yield a sum; the rest come back as NaN
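
Note: set_value was deprecated and then removed in newer pandas releases (it is gone as of pandas 1.0). A minimal sketch of the modern equivalent, plain label assignment on the same ser2 as above:

ser2 = pd.Series([1,2,5,4],index=['USA', 'Germany','Italy', 'Japan'])
ser2.loc['Canada'] = 7 # assigning to a new label appends it, replacing ser2.set_value('Canada',7)
ser2['China'] = 6      # plain bracket assignment enlarges the Series the same way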

DataFrames

DataFrames were inspired by the R language. A DataFrame is simply a collection of Series objects that share the same index. When I was learning R, the meaning of a DataFrame was hard to grasp; in short, you can think of it as a sheet in Excel, or a table in a database. Let's look at a few examples:

import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101) # seeding is like a reset: the generator derives each number from its internal state, so after the same seed the "random" sequence is identical and therefore predictable; a quick demonstration follows the table below
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
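
As promised above, a quick sketch of what seeding means: resetting the seed replays the exact same "random" sequence.

np.random.seed(101)
first = randn(2)    # array([2.70684984, 0.62813271]), the first two values in df above
np.random.seed(101) # reset the seed...
second = randn(2)   # ...and the same two numbers come back
(first == second).all() # True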

Selection and Indexing

Selecting data from a DataFrame

df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# pass a list of column names; note the double brackets
df[['W','Z']]
  W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509

Attribute-style (SQL-like) syntax (not recommended)

df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64

DataFrame columns are just Series

type(df['W'])
pandas.core.series.Series

Add column

df['new'] = df['W'] + df['Y']
df
  W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762

Remove column

df.drop('new',axis=1) # axis=1 refers to columns, axis=0 to rows; a 2-D DataFrame has only these two axes
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Note: the inplace parameter (whether to modify the original data in place) defaults to False. The drop above did not set inplace, so the data in df is unchanged:

df
  W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762

Now we set the inplace parameter:

df.drop('new',axis=1,inplace=True)
df
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

You can also drop rows:

df.drop('E',axis=0)
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057

Select row

df.loc['A'] # you must use the loc or iloc accessor
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64

Or select by position instead of label:

df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64

Select subset

df.loc['B','Y']
-0.84807698340363147

Note: using df['B','Y'] directly raises an error, and so does df['Y','B']. You have to use df.loc['B','Y'].

df.loc[['A','B'],['W','Y']]
  W Y
A 2.706850 0.907969
B 0.651118 -0.848077
df.loc[['B','X','A','Y']] # X and Y are not row labels; older pandas returned NaN for them
B 0.190915
X NaN
A -0.747158
Y NaN
Name: Y, dtype: float64
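
Note: returning NaN for the missing labels X and Y is old pandas behavior; since pandas 1.0, .loc with labels missing from the index raises a KeyError instead. A minimal sketch of the modern equivalent using reindex, which still fills missing labels with NaN:

df.reindex(['B','X','A','Y']) # the rows for the missing labels X and Y come back as all-NaN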

Conditional Selection

A very important feature, similar to boolean indexing in NumPy:

df
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df>0
  W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
df[df>0]
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
# note: above, positions that were False show up as NaN placeholders (in machine learning we often replace NaN with other values); in the next case, rows that are False are simply not shown
df[df['W']>0]
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
# note the two bracket pairs side by side: the first applies the condition, the second selects the columns
  Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757
When there are two or more conditions, connect them with | (or) or & (and):
df[(df['W']>0) & (df['Y'] > 1)]
  W X Y Z
E 0.190794 1.978757 2.605967 0.683509
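
The same sketch with | in place of &: every row except C qualifies here, because C is the only row that fails both conditions.

df[(df['W']>0) | (df['Y'] > 1)]
  W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509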

More index details: let's look at more ways of using the index.

df
  W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
# reset the index to 0,1,...,n
df.reset_index()
# reset_index also takes an inplace parameter (among others, such as drop) for resetting the index
  index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df # the new column has been added
  W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States')
  W X Y Z
States        
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
df
  W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States',inplace=True)
# as you can see, inplace=True modifies the original data
df
  W X Y Z
States        
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509

Multi-level index (index hierarchy)

A multi-level (composite) index is built by turning the index labels into tuples:

# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df.index.names = ['Group','Num'] # name the index levels; xs(level='Num') below relies on this
df
    A B
Group Num    
  1 0.153661 0.167638
G1 2 -0.765930 0.962299
  3 0.902826 -0.537909
  1 -1.549671 0.435253
G2 2 1.259904 -0.447898
  3 0.266207 0.412580

Use the xs function to return a cross-section subset of the DataFrame:

df.xs('G1')

  A B
Num    
1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
df.xs(('G1',1)) # a tuple selects across both levels
A 0.153661
B 0.167638
Name: (G1, 1), dtype: float64
df.xs(1,level='Num')
  A B
Group    
G1 0.153661 0.167638
G2 -1.549671 0.435253

Missing Data (invalid values)

Pandas has several ways of dealing with missing data:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
  A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.dropna() # drops every row that contains a NaN value

  A B C
0 1.0 5.0 1
df.dropna(axis=1) # drops every column that contains a NaN value

  C
0 1
1 2
2 3
df.dropna(thresh=2) # very useful: drops every row with fewer than thresh valid (non-NaN) values
# there is also a how parameter, which chooses between dropping rows that are entirely NaN and rows containing any NaN; see the sketch after this block
  A B C
0 1.0 5.0 1
1 2.0 NaN 2
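
And the promised how sketch, on the same df (how='any' is the default):

df.dropna(how='all') # drops only rows in which every value is NaN; nothing is dropped here
df.dropna(how='any') # drops rows containing any NaN; only row 0 survives
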
df.fillna(value='FILL VALUE') # fills every NaN with the given value
  A B C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
df['A'].fillna(value=df['A'].mean())
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
df.fillna(value=df['B'].mean())
  A B C
0 1.0 5.0 1
1 2.0 5.0 2
2 5.0 5.0 3

So remember: dropna() and fillna() are the two methods for dealing with NaN data.

Groupby

Groupby lets you group your data and call aggregate functions on the groups; it works much like GROUP BY in a database.

import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
  Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350

Now let's group with the groupby() method, which returns a new DataFrameGroupBy object:

df.groupby('Company')
<pandas.core.groupby.DataFrameGroupBy object at 0x113014128>

We assign it to a new variable and then call aggregation methods on it:

by_comp = df.groupby("Company")
by_comp.mean()

  Sales
Company  
FB 296.5
GOOG 160.0
MSFT 232.0

Equivalent to:

df.groupby('Company').mean()

  Sales
Company  
FB 296.5
GOOG 160.0
MSFT 232.0

Standard deviation:

by_comp.std()

  Sales
Company  
FB 75.660426
GOOG 56.568542
MSFT 152.735065

Minimum:

by_comp.min()

  Person Sales
Company    
FB Carl 243
GOOG Charlie 120
MSFT Amy 124

Maximum:

by_comp.max()

  Person Sales
Company    
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340

Count:

by_comp.count()

  Person Sales
Company    
FB 2 2
GOOG 2 2
MSFT 2 2

Descriptive statistics (comparable to summary aggregates in a database):

by_comp.describe()

    Sales
Company    
FB count 2.000000
FB mean 296.500000
FB std 75.660426
FB min 243.000000
FB 25% 269.750000
FB 50% 296.500000
FB 75% 323.250000
FB max 350.000000
GOOG count 2.000000
GOOG mean 160.000000
GOOG std 56.568542
GOOG min 120.000000
GOOG 25% 140.000000
GOOG 50% 160.000000
GOOG 75% 180.000000
GOOG max 200.000000
MSFT count 2.000000
MSFT mean 232.000000
MSFT std 152.735065
MSFT min 124.000000
MSFT 25% 178.000000
MSFT 50% 232.000000
MSFT 75% 286.000000
MSFT max 340.000000

Note: because of a markdown formatting issue, the company name is written on every row above; in an actual notebook each company name shows only once.

by_comp.describe().transpose() # transpose: the statistics become columns
(wide output: a single Sales row with count/mean/std/min/25%/50%/75%/max columns for each of FB, GOOG, and MSFT)
by_comp.describe().transpose()['GOOG']

You can also pull out a single company's statistics this way; I won't print the output.
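
Relatedly, agg computes several aggregations in one call; a small sketch on the same by_comp:

by_comp['Sales'].agg(['mean','std','min','max'])
# one row per company, one column per statistic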

Merging, Joining, and Concatenating

These are the three main ways of combining data. I won't elaborate on the differences between the various join types; this section focuses on the differences between the three methods.

Concatenation

The most basic way of gluing DataFrames together. Note that the DataFrames being concatenated must have matching dimensions along the axis you join on; the axis parameter sets the direction of concatenation and defaults to 0 (rows are stacked one after another).

First, let's create three DataFrames:

import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
df2
  A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
df3
  A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
pd.concat([df1,df2,df3])
  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
pd.concat([df1,df2,df3],axis=1) # concatenate along columns
  A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11
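
If the original row labels don't matter, a sketch of the ignore_index parameter, which renumbers the concatenated result from 0:

pd.concat([df1,df2,df3], ignore_index=True)
# here the labels were already 0..11, but ignore_index matters when the inputs' indexes overlap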

Merging

Merging is a bit like a SQL join: it needs keys (the key columns of the two DataFrames may hold the same values or different ones). The difference from join is that merge matches on column values, which is the common practical case, and neither table is primary; join matches on the row index, with one frame primary and the other secondary.

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
  A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
right
  C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3
pd.merge(left,right,how='inner',on='key')
  A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3

Now let's merge two tables whose keys differ:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
  A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
pd.merge(right,left, on=['key1', 'key2']) # note how the result changes when left and right are swapped
  C D key1 key2 A B
0 C0 D0 K0 K0 A0 B0
1 C1 D1 K1 K0 A2 B2
2 C2 D2 K1 K0 A2 B2
pd.merge(left, right, how='outer', on=['key1', 'key2'])
  A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
pd.merge(left, right, how='right', on=['key1', 'key2'])
  A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
pd.merge(left, right, how='left', on=['key1', 'key2'])
  A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN

Joining

Joining combines two DataFrames with different indexes into a single DataFrame, matching rows by index label; the two frames' column names should not overlap.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
  A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
left.join(right, how='outer')
  A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
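
join also accepts how='inner' (and how='right'); a sketch with the same frames:

left.join(right, how='inner')
  A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2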

Operations( operation )

Finally, we come to operations. Many of these are very similar to their R counterparts.

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
  col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Unique values

df['col2'].unique() # find the unique values, returned as an array
array([444, 555, 666])
df['col2'].nunique() # Returns the number of unique values 
3
df['col2'].value_counts() # count the occurrences of each value
444 2
555 1
666 1
Name: col2, dtype: int64

Select data

# select data with compound conditions
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
  col1 col2 col3
3 4 444 xyz

Applying a function

def times2(x):
    return x*2
df['col1'].apply(times2) # use apply to run a custom function over a column
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
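
An inline lambda works just as well as a named function; a quick sketch equivalent to times2 above:

df['col1'].apply(lambda x: x*2) # same result as df['col1'].apply(times2)
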
df['col1'].sum()
10

Permanently delete a column
del df['col1']
df
  col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Getting column or index names

df.columns
 Index(['col2', 'col3'], dtype='object')
df.index
 RangeIndex(start=0, stop=4, step=1)

Sort

df
  col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
df.sort_values(by='col2') # inplace=False by default
  col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi

Checking for null values

df.isnull()
  col2 col3
0 False False
1 False False
2 False False
3 False False
# Drop rows with NaN Values
df.dropna()
  col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Replace NaN

import numpy as np
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
  col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
df.fillna('FILL')
  col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz

Building a pivot table; this one is interesting.

data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
  A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
  C x y
A B    
bar one 4.0 1.0
bar two NaN 5.0
foo one 1.0 3.0
foo two 2.0 NaN
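
pivot_table averages by default; the aggfunc parameter swaps in another aggregation. A small sketch (with this data every cell holds a single value, so the numbers happen to match; duplicated A/B/C combinations would be summed rather than averaged):

df.pivot_table(values='D',index=['A', 'B'],columns=['C'],aggfunc='sum')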

Data input and output

pandas uses its pd.read_* family of methods, which can read all kinds of data!

import numpy as np
import pandas as pd

CSV

CSV Input

df = pd.read_csv('example') # the file should be in the working directory; otherwise specify its path
df
  a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

CSV Output

df.to_csv('example',index=False)

Excel

pandas can only read the data in an Excel file; you can't import images or formulas, or an error will be reported.

Excel Input

pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1') # note that you must specify which sheet to read; newer pandas spells this parameter sheet_name
  a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

Excel Output

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

You need to install Python's HTML parsing libraries:

pip install lxml
pip install html5lib
pip install BeautifulSoup4 # I like this one a lot; it makes writing simple crawlers easier

HTML Input

pandas' read_html automatically reads the tables in a web page and returns a list of DataFrame objects:

df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]

There is too much data to list here; try it yourself.

HTML Output

df.to_html('banklist.html') # note the file extension

SQL statements

You usually don't need to read data directly from a database with pandas, for reasons of security and decoupling. However, pandas' pandas.io.sql module can read from a database, given the right driver API.

Here are some common features :

  • read_sql_table(table_name, con[, schema, …]): read a database table into a DataFrame
  • read_sql_query(sql, con[, index_col, …]): read the result of a query into a DataFrame
  • read_sql(sql, con[, index_col, …]): read either a table or a query into a DataFrame
  • DataFrame.to_sql(name, con[, flavor, …]): write the DataFrame's records to the database

Here's a simple example :

from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:') # Python ships with sqlite support
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)
sql_df
  index a b c d
0 0 0 1 2 3
1 1 4 5 6 7
2 2 8 9 10 11
3 3 12 13 14 15
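
read_sql also accepts a full query (or use read_sql_query explicitly); a sketch against the same in-memory engine:

pd.read_sql_query('SELECT a, b FROM data WHERE a > 4', con=engine)
# returns only columns a and b, for the rows matching the WHERE clause
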
Copyright notice: this article was created by [LinMz]; please include a link to the original when reposting. Thank you.
