Pandas: A Brief Tutorial

Tianyuan prodigal son, 2020-11-13 09:09:27


1. Pandas overview

1.1 Why use Pandas?

Pandas is a toolkit for analyzing structured data, built on NumPy, which supplies the underlying high-performance array processing. Pandas is widely used in data mining and data analysis, and also provides auxiliary features such as data cleaning, data I/O, and visualization.

As the foundation of scientific computing in Python, NumPy is powerful but not all-purpose, because it does not support tabular data with heterogeneous columns. So-called heterogeneous tabular data is a two-dimensional structure in which different columns are allowed to hold different data types. Although NumPy supports data structures of any dimension, in practice, whether in traditional software development or in machine learning, most of the data we face is two-dimensional with heterogeneous columns. Pandas was created precisely to handle this kind of data: it provides flexible, convenient data structures for heterogeneous tabular data like that found in SQL tables or Excel sheets, and it quickly became the core data-analysis library of the Python ecosystem.

1.2 Pandas Characteristics

Pandas was born in 2008. It was originally tailored for data workers in finance and statistics rather than for programmers, and it offers nearly every capability they might need, in the way that best matches how they think. Pandas seems to aim at hiding as many software-engineering concepts as possible, keeping only the physical properties and logic of the data. One example (there are many like it) proves my point: a single line of code fetches and parses tabular data from the web, with no need to understand the HTTP protocol or HTML parsing.

>>> data = pd.read_html('http://ditu.92cha.com/dizhen.php')
>>> data[0].head()
                 Quake time  Magnitude(M)  Longitude(°)  Latitude(°)  Depth(km)  Reference location
0 2020-05-29 03:34:23 4.2 120.80 23.63 26.0 Nantou County, Taiwan
1 2020-05-29 00:35:59 3.9 78.40 34.27 20.0 Kashmir
2 2020-05-28 21:33:47 3.1 83.43 41.25 10.0 Kuqa City, Aksu Prefecture, Xinjiang
3 2020-05-28 17:46:49 5.9 -175.35 -27.43 10.0 The kmadek Islands
4 2020-05-28 17:11:38 3.6 117.38 44.67 15.0 Xiwuzhumuqin banner, Xilin Gol League, Inner Mongolia

That simple. What more could you ask for? However, indulging and pampering users to this extent is actually a double-edged sword. As Wes McKinney, the father of Pandas, has said, Pandas has drifted away from the simplicity and ease of use he originally envisioned, becoming ever more bloated and hard to control.

I largely agree with Wes McKinney's view. I even feel that by the time Pandas abandoned the panel concept, it had already lost its way. Panel was the structure Pandas originally proposed for handling higher-dimensional data, quite close in spirit to HDF or netCDF. Pandas later switched to "hierarchical indexing" for higher-dimensional data, which complicates the structure and keeps programmers from focusing on their business logic.

Still, one flaw does not obscure the jade's luster. Even with this small regret, it is hard to hide the glory of Pandas. Pandas is not merely simple; it also has outstanding data-processing power and a complete set of auxiliary functions. To sum up, Pandas has the following five characteristics.

  1. Strong adaptability. Whether the input is a Python or NumPy data object, even irregularly structured data, it can easily be converted into a DataFrame. Pandas also handles missing data automatically, similar to NumPy's masked arrays.
  2. Fast data organization and processing, inherited from its NumPy genes. Pandas supports adding and deleting data columns at will; supports merging, joining, reshaping, and pivoting datasets; and supports aggregation, transformation, slicing, fancy indexing, subset extraction, and similar operations.
  3. Complete time-series support. Pandas supports date-range generation, frequency conversion, moving-window statistics, moving-window linear regression, date shifting, and other time-series functions.
  4. The most comprehensive I/O tools. Pandas reads text files (CSV and other delimited files), Excel files, HDF files, SQL tables, JSON, and HTML, and can even download and parse data straight from a URL; it can likewise save data to CSV or Excel files.
  5. User-friendly display. No matter how complex the data, Pandas always tries to show you the clearest possible structure: it aligns objects and labels automatically, and labels can be ignored when necessary.

1.3 Installation and use

Because Pandas depends on NumPy, it is best to install NumPy, Matplotlib, and the other members of the SciPy family first. If you do not, the installation will automatically pull in the required dependency packages anyway.

PS C:\Users\xufive> pip install pandas

Just as NumPy is conventionally imported under an abbreviation, Pandas is usually imported as pd; this is practically an unwritten rule among programmers. The following code builds a labeled two-dimensional table. Beijing, Guangzhou, Shanghai, and Hangzhou are the labels of the columns; collectively the column labels are called the column names. 2020, 2019, and 2018 are the labels of the rows; collectively the row labels are called the index. This labeled two-dimensional table is the core Pandas data structure, the DataFrame, and almost all Pandas operations and techniques target this structure.

>>> import pandas as pd
>>> idx = ['2020','2019','2018']
>>> colname = ['Beijing','Guangzhou','Shanghai','Hangzhou']
>>> data = [[35200.00, 30500.00,31800.00,26300.00],
[35500.00,31300.00,32200.00,28100.00],
[34900.00,29600.00,30100.00,24700.00]]
>>> df = pd.DataFrame(data, columns=colname, index=idx)
>>> df
Beijing Guangzhou Shanghai Hangzhou
2020 35200.0 30500.0 31800.0 26300.0
2019 35500.0 31300.0 32200.0 28100.0
2018 34900.0 29600.0 30100.0 24700.0

In all the sample code in this article, wherever the abbreviations pd or np appear, it is assumed that the Pandas and NumPy modules have already been imported with the following statements.

>>> import numpy as np
>>> import pandas as pd

2. data structure

The best place to start learning Pandas is its data structures. Many people say Pandas is simple, with only two data structures, Series and DataFrame. But do not forget: both Series and DataFrame carry an index object, Index, and Index is itself one of Pandas's fundamental data structures.

2.1 The index object: Index

An index object behaves like a one-dimensional array and serves as the labels of the other Pandas data structures. You can use Pandas without knowing much about index objects, but to use Pandas proficiently, a deep understanding of them (for example the hierarchical index object MultiIndex) is essential.

>>> pd.Index([3,4,5])
Int64Index([3, 4, 5], dtype='int64')
>>> pd.Index(['x','y','z'])
Index(['x', 'y', 'z'], dtype='object')
>>> pd.Index(range(3))
RangeIndex(start=0, stop=3, step=1)
>>> idx = pd.Index(['x','y','z'])

Index objects can be created from arrays, lists, iterators, and so on. An index object looks like a one-dimensional array, but the values of its elements cannot be changed. This matters: only immutability makes it safe for multiple data structures to share the same index.
In fact, there are many kinds of index objects. Besides the one-dimensional indexes shown here, there are nanosecond-resolution timestamp indexes, hierarchical indexes, and more. Index objects also support operations such as deletion, insertion, concatenation, intersection, and union. These will come up in later applications.
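The set-style operations just mentioned can be tried right away. A minimal sketch (the index values here are made up for illustration):

```python
import pandas as pd

a = pd.Index(['x', 'y', 'z'])
b = pd.Index(['y', 'z', 'w'])

# Set-style operations return new Index objects; the originals are unchanged
u = a.union(b)         # union of both indexes (sorted when possible)
i = a.intersection(b)  # elements common to both
d = a.difference(b)    # elements of a that are not in b

print(list(u))  # ['w', 'x', 'y', 'z']
print(list(i))  # ['y', 'z']
print(list(d))  # ['x']
```

Note that none of these calls modify a or b, which is exactly the immutability guarantee described above.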

2.2 The labeled one-dimensional homogeneous array: Series

A Series is a data structure composed of a set of same-typed data plus a set of labels (an Index object) corresponding to that data. The labels are also called the index, and index values are allowed to repeat. Pandas provides many ways to create Series objects.
In the following code, an integer list and a string list are used to create two Series objects. Because no index is specified, the Series constructor adds a default index: an integer sequence starting from 0.

>>> pd.Series([0,1,2]) # build a Series from a list, using the default index
0 0
1 1
2 2
dtype: int64
>>> pd.Series(['a','b','c']) # build a Series from a list, using the default index
0 a
1 b
2 c
dtype: object

When creating a Series, you can also specify the index, but the index length must equal the length of the data, otherwise an exception is raised. In addition, the Series constructor accepts iterable objects as input.

>>> pd.Series([0,1,2], index=['a','b','c']) # build a Series from a list with an explicit index
a 0
b 1
c 2
dtype: int64
>>> pd.Series(range(3), index=list('abc')) # build a Series from an iterable
a 0
b 1
c 2
dtype: int64

When creating a Series from a dictionary, if no index is specified, the dictionary's keys become the index; if an index is specified, it does not have to match the dictionary's keys.

>>> pd.Series({'a':1,'b':2,'c':3}) # build a Series from a dict; the keys become the index
a 1
b 2
c 3
dtype: int64
>>> pd.Series({'a':1,'b':2,'c':3}, index=list('abxy')) # specify an index explicitly
a 1.0
b 2.0
x NaN
y NaN
dtype: float64

Series objects have many attributes and methods, most of them similar or even identical to NumPy's. They will appear in later applications, and beginners need not master them all right away, but the following three attributes must be kept in mind.

>>> s = pd.Series({'a':1,'b':2,'c':3})
>>> s.dtype # the data type: first of the three key Series attributes
dtype('int64')
>>> s.values # the data array: second of the three key attributes
array([1, 2, 3], dtype=int64)
>>> s.index # the index: third of the three key attributes
Index(['a', 'b', 'c'], dtype='object')

To understand Series deeply, keep two things in mind. First, all the data in a Series share one data type, so a Series has exactly one dtype. Second, every data element corresponds to an index value, but index values are allowed to repeat.
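Both points can be seen in a quick sketch: a repeated index label is perfectly legal, and selecting it returns every matching element.

```python
import pandas as pd

# An index with a repeated label is allowed
s = pd.Series([1, 2, 3], index=['a', 'b', 'a'])

# Selecting a unique label returns a scalar
print(s['b'])  # 2

# Selecting a repeated label returns a Series of all matches
print(s['a'])
# a    1
# a    3
# dtype: int64
```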

2.3 The labeled two-dimensional heterogeneous table: DataFrame

A DataFrame can be viewed as a two-dimensional tabular structure composed of multiple Series. Each Series forms one column of the DataFrame; each has a name and may have its own data type, and all of them share a single index. The column names are the DataFrame's column labels, and the shared index provides its row labels.

It should be noted that although a DataFrame is a two-dimensional structure, that does not mean it cannot handle higher-dimensional data. In fact, with hierarchical indexing, a DataFrame handles high-dimensional data quite comfortably. We will return to this later.
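As a small taste of what is coming, here is a minimal sketch of packing three-dimensional data (year × quarter × city, with made-up numbers) into a two-dimensional DataFrame via a MultiIndex:

```python
import pandas as pd

# A two-level row index of (year, quarter) pairs; the figures are invented
idx = pd.MultiIndex.from_product([['2019', '2020'], ['Q1', 'Q2']],
                                 names=['year', 'quarter'])
df = pd.DataFrame({'Beijing': [1, 2, 3, 4],
                   'Shanghai': [5, 6, 7, 8]}, index=idx)

# Selecting one outer level collapses it, leaving a 2-D slice
print(df.loc['2020'])
#          Beijing  Shanghai
# quarter
# Q1             3         7
# Q2             4         8
```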

There are many ways to create a DataFrame: from a two-dimensional NumPy array or masked array; from a dictionary or list of arrays, lists, tuples, dictionaries, or Series objects; or even from another DataFrame. Even irregularly structured data converts easily, because the DataFrame constructor is very tolerant.
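For example, a dictionary of Series objects with only partially overlapping indexes also converts cleanly: the constructor aligns the data on the union of the indexes and fills the holes with NaN. A minimal sketch (the labels are illustrative):

```python
import numpy as np
import pandas as pd

# Two Series whose indexes overlap only at 'r2'
a = pd.Series([1.0, 2.0], index=['r1', 'r2'])
b = pd.Series([3.0, 4.0], index=['r2', 'r3'])

# The constructor aligns on the union of indexes, filling gaps with NaN
df = pd.DataFrame({'a': a, 'b': b})
print(df)
#       a    b
# r1  1.0  NaN
# r2  2.0  3.0
# r3  NaN  4.0
```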

Converting a dictionary to a DataFrame is the most common way to create one. The dictionary's keys become the DataFrame's columns, with the key names automatically used as the column names. If no index is specified, the default index is used.

>>> data = {
'East China science and technology': [1.91, 1.90, 1.86, 1.84],
'Changan automobile': [11.27, 11.14, 11.28, 11.71],
'Tibet Mining': [7.89, 7.79, 7.61, 7.50],
'Chongqing beer': [50.46, 50.17, 50.28, 50.28]
}
>>> pd.DataFrame(data)
East China science and technology Changan automobile Tibet Mining Chongqing beer
0 1.91 11.27 7.89 50.46
1 1.90 11.14 7.79 50.17
2 1.86 11.28 7.61 50.28
3 1.84 11.71 7.50 50.28

You can also specify the index when creating the DataFrame. Here date strings are used directly as the index; the more correct approach is a date index object, and we will formally introduce the DatetimeIndex class when we discuss time series.

>>> idx = ['2020-03-10','2020-03-11','2020-03-12','2020-03-13']
>>> pd.DataFrame(data, index=idx)
East China science and technology Changan automobile Tibet Mining Chongqing beer
2020-03-10 1.91 11.27 7.89 50.46
2020-03-11 1.90 11.14 7.79 50.17
2020-03-12 1.86 11.28 7.61 50.28
2020-03-13 1.84 11.71 7.50 50.28

When creating a DataFrame, even if the data is supplied as a dictionary, you can still specify the column labels, and the constructor does not require them to match the dictionary's keys. For keys that do not exist, the constructor automatically fills in NaN.

>>> data = {
'East China science and technology': [1.91, 1.90, 1.86, 1.84],
'Changan automobile': [11.27, 11.14, 11.28, 11.71],
'Tibet Mining': [7.89, 7.79, 7.61, 7.50],
'Chongqing beer': [50.46, 50.17, 50.28, 50.28]
}
>>> idx = ['2020-03-10','2020-03-11','2020-03-12','2020-03-13']
>>> colnames = ['East China science and technology', 'Changan automobile', 'HANGGANG Co., Ltd', 'Tibet Mining', 'Chongqing beer']
>>> pd.DataFrame(data, columns=colnames, index=idx)
East China science and technology Changan automobile HANGGANG Co., Ltd Tibet Mining Chongqing beer
2020-03-10 1.91 11.27 NaN 7.89 50.46
2020-03-11 1.90 11.14 NaN 7.79 50.17
2020-03-12 1.86 11.28 NaN 7.61 50.28
2020-03-13 1.84 11.71 NaN 7.50 50.28

A two-dimensional array or list can also be converted directly into a DataFrame, optionally specifying both the index and the column labels. If either is omitted, an index object starting from 0 is automatically supplied in its place.

>>> data = np.array([
[ 1.91, 11.27, 7.89, 50.46],
[ 1.9 , 11.14, 7.79, 50.17],
[ 1.86, 11.28, 7.61, 50.28],
[ 1.84, 11.71, 7.5 , 50.28]
])
>>> idx = ['2020-03-10','2020-03-11','2020-03-12','2020-03-13']
>>> colnames = ['East China science and technology', 'Changan automobile', 'Tibet Mining', 'Chongqing beer']
>>> pd.DataFrame(data, columns=colnames, index=idx)
East China science and technology Changan automobile Tibet Mining Chongqing beer
2020-03-10 1.91 11.27 7.89 50.46
2020-03-11 1.90 11.14 7.79 50.17
2020-03-12 1.86 11.28 7.61 50.28
2020-03-13 1.84 11.71 7.50 50.28

A DataFrame can be seen as a collection of Series objects, each with its own independent data type, so a DataFrame has no single data type of its own and hence no dtype attribute. Instead it has a dtypes attribute, which is itself a Series. Besides dtypes, the values, index, and columns attributes of a DataFrame are also very important and worth memorizing.

>>> df = pd.DataFrame(data, columns=colnames, index=idx)
>>> df.dtypes # dtypes: a Series made up of each column's dtype
East China science and technology float64
Changan automobile float64
Tibet Mining float64
Chongqing beer float64
dtype: object
>>> df.values # values: one of the key DataFrame attributes
array([[ 1.91, 11.27, 7.89, 50.46],
[ 1.9 , 11.14, 7.79, 50.17],
[ 1.86, 11.28, 7.61, 50.28],
[ 1.84, 11.71, 7.5 , 50.28]])
>>> df.index # index: one of the key DataFrame attributes
Index(['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13'], dtype='object')
>>> df.columns # columns: one of the key DataFrame attributes
Index(['East China science and technology', 'Changan automobile', 'Tibet Mining', 'Chongqing beer'], dtype='object')
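When the columns genuinely differ in type, dtypes shows one entry per column. A minimal sketch with made-up data:

```python
import pandas as pd

# Each column keeps its own dtype; the DataFrame as a whole has none
df = pd.DataFrame({'code': ['000625', '000762'],   # strings -> object
                   'price': [11.71, 7.50],         # floats  -> float64
                   'volume': [68771048, 7741802]}) # ints    -> integer dtype

print(df.dtypes)
# code       object
# price     float64
# volume      int64
# dtype: object
```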

3. Basic operation

The DataFrame has been made into an almost almighty little monster. It has the feel of a dictionary and the performance of a NumPy array, and even inherits many NumPy attributes and methods; it stores and processes many different data types in one structure; it looks two-dimensional yet can handle higher-dimensional data; it handles data of any type, dates and times included, and reads and writes nearly every data format; it offers countless methods, from which endless techniques derive. Covering DataFrame operations exhaustively here would be unrealistic; this section briefly introduces only the most basic, core operations.

For demonstration purposes, let us construct a DataFrame holding one day's opening price, closing price, trading volume, and other figures for several stocks, indexed by stock code.

>>> data = np.array([
[10.70, 11.95, 10.56, 11.71, 789.10, 68771048],
[7.28, 7.59, 7.17, 7.50, 57.01, 7741802],
[48.10, 50.59, 48.10, 50.28, 223.06, 4496598],
[66.70, 69.28, 66.66, 68.92, 1196.14, 17662768],
[7.00, 7.35, 6.93, 7.11, 783.15, 109975919],
[2.02, 2.10, 2.01, 2.08, 56.32, 27484360]
])
>>> colnames = ['Opening price','Highest price','The lowest price','Closing price','turnover','volume']
>>> idx = ['000625.SZ','000762.SZ','600132.SH','600009.SH','600126.SH','000882.SZ']
>>> stock = pd.DataFrame(data, columns=colnames, index=idx)

3.1 Data preview

3.1.1 Show the first 5 rows and the last 5 rows

>>> stock.head() # the first 5 rows
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0
>>> stock.tail() # the last 5 rows
Opening price Highest price The lowest price Closing price turnover volume
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0

3.1.2 Statistical summary: mean, variance, and extremes

>>> stock.describe()
Opening price Highest price The lowest price Closing price turnover volume
count 6.000000 6.000000 6.000000 6.00000 6.000000 6.000000e+00
mean 23.633333 24.810000 23.571667 24.60000 517.463333 3.935542e+07
std 26.951297 28.016756 26.975590 27.91178 472.508554 4.166194e+07
min 2.020000 2.100000 2.010000 2.08000 56.320000 4.496598e+06
25% 7.070000 7.410000 6.990000 7.20750 98.522500 1.022204e+07
50% 8.990000 9.770000 8.865000 9.60500 503.105000 2.257356e+07
75% 38.750000 40.930000 38.715000 40.63750 787.612500 5.844938e+07
max 66.700000 69.280000 66.660000 68.92000 1196.140000 1.099759e+08

3.1.3 Transposition

>>> stock.T
000625.SZ 000762.SZ ... 600126.SH 000882.SZ
Opening price 10.70 7.28 ... 7.000000e+00 2.02
Highest price 11.95 7.59 ... 7.350000e+00 2.10
The lowest price 10.56 7.17 ... 6.930000e+00 2.01
Closing price 11.71 7.50 ... 7.110000e+00 2.08
turnover 789.10 57.01 ... 7.831500e+02 56.32
volume 68771048.00 7741802.00 ... 1.099759e+08 27484360.00
[6 rows x 6 columns]

3.1.4 Sort

>>> stock.sort_index(axis=0) # sort by index
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0
>>> stock.sort_index(axis=1) # sort by column label
Opening price volume turnover Closing price The lowest price Highest price
000625.SZ 10.70 68771048.0 789.10 11.71 10.56 11.95
000762.SZ 7.28 7741802.0 57.01 7.50 7.17 7.59
600132.SH 48.10 4496598.0 223.06 50.28 48.10 50.59
600009.SH 66.70 17662768.0 1196.14 68.92 66.66 69.28
600126.SH 7.00 109975919.0 783.15 7.11 6.93 7.35
000882.SZ 2.02 27484360.0 56.32 2.08 2.01 2.10
>>> stock.sort_values(by='volume') # sort by the values of the given column
Opening price Highest price The lowest price Closing price turnover volume
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0
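sort_values() also accepts ascending=False and a list of columns for tie-breaking. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'price': [7.5, 50.28, 7.5],
                   'volume': [300, 100, 200]},
                  index=['s1', 's2', 's3'])

# Sort by price descending; break ties on volume ascending
out = df.sort_values(by=['price', 'volume'], ascending=[False, True])
print(list(out.index))  # ['s2', 's3', 's1']
```

Like sort_index(), this returns a new DataFrame and leaves the original untouched.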

3.2 Data selection

3.2.1 Row selection

A DataFrame supports slicing like an array or list, e.g. stock[2:3], but it does not support direct integer indexing like stock[2].

>>> stock[2:3] # a slice
Opening price Highest price The lowest price Closing price turnover volume
600132.SH 48.1 50.59 48.1 50.28 223.06 4496598.0
>>> stock[::2] # a slice with step 2
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.7 11.95 10.56 11.71 789.10 68771048.0
600132.SH 48.1 50.59 48.10 50.28 223.06 4496598.0
600126.SH 7.0 7.35 6.93 7.11 783.15 109975919.0

You can also slice by row labels (the index). The slice order follows the DataFrame's Index object, and the result includes both endpoint labels of the slice, like a closed interval in mathematics.

>>> stock['000762.SZ':'600009.SH']
Opening price Highest price The lowest price Closing price turnover volume
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0

3.2.2 Column selection

Selecting a single column by its label returns a Series; passing a list of labels selects several columns at once and returns a DataFrame.

>>> stock['Opening price'] # select a single column; attribute access like stock.<column name> also works
000625.SZ 10.70
000762.SZ 7.28
600132.SH 48.10
600009.SH 66.70
600126.SH 7.00
000882.SZ 2.02
Name: Opening price, dtype: float64
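The single-label versus list-of-labels distinction can be seen in a small sketch (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'open': [10.7, 7.28], 'close': [11.71, 7.5],
                   'volume': [100, 200]}, index=['s1', 's2'])

# A single label returns a Series; a list of labels returns a DataFrame
print(type(df['open']).__name__)             # Series
print(type(df[['open', 'close']]).__name__)  # DataFrame
print(list(df[['open', 'close']].columns))   # ['open', 'close']
```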

3.2.3 Row-and-column selection

The row-and-column selector loc selects rows and columns at the same time: rows are specified with a slice, columns with a list of labels.

>>> stock.loc['000762.SZ':'600009.SH', ['Opening price', 'Closing price', 'volume']]
Opening price Closing price volume
000762.SZ 7.28 7.50 7741802.0
600132.SH 48.10 50.28 4496598.0
600009.SH 66.70 68.92 17662768.0

If you want to access a DataFrame as if it were a two-dimensional array, use the at, iat, or iloc selectors.

>>> stock.at['000762.SZ', 'Opening price']
7.28
>>> stock.iat[1, 0]
7.28
>>> stock.iloc[1:4, 0:3]
Opening price Highest price The lowest price
000762.SZ 7.28 7.59 7.17
600132.SH 48.10 50.59 48.10
600009.SH 66.70 69.28 66.66

3.2.4 Conditional selection

If you are familiar with NumPy, conditional selection on a DataFrame will feel natural.

>>> stock[(stock['turnover']>500)&(stock['Opening price']>10)] # compound conditions are supported
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.7 11.95 10.56 11.71 789.10 68771048.0
600009.SH 66.7 69.28 66.66 68.92 1196.14 17662768.0
>>> stock[stock['turnover'].isin([56.32,57.01,223.06])] # isin() filters on a set of specific values
Opening price Highest price The lowest price Closing price turnover volume
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0
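The same compound condition can also be written with the query() method, which parses a boolean expression string (column names containing spaces or non-ASCII characters would need backticks). A sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'turnover': [789.10, 57.01, 1196.14],
                   'open': [10.7, 7.28, 66.7]},
                  index=['s1', 's2', 's3'])

# Equivalent to df[(df['turnover'] > 500) & (df['open'] > 10)]
out = df.query('turnover > 500 and open > 10')
print(list(out.index))  # ['s1', 's3']
```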

3.3 Change the data structure

3.3.1 Reindexing

The reindex() method redefines the row or column labels and returns a new object, without changing the original data structure. Reindexing can drop existing rows or columns and can add new ones. If no fill value is specified, new rows or columns default to NaN.

>>> idx = ['000762.SZ','000625.SZ','600132.SH','000955.SZ']
>>> colnames = ['Opening price','Closing price','turnover','volume','applies']
>>> stock.reindex(index=idx, columns=colnames)
Opening price Closing price turnover volume applies
000762.SZ 7.28 7.50 57.01 7741802.0 NaN
000625.SZ 10.70 11.71 789.10 68771048.0 NaN
600132.SH 48.10 50.28 223.06 4496598.0 NaN
000955.SZ NaN NaN NaN NaN NaN
>>> stock.reindex(index=idx, columns=colnames, fill_value=0)
Opening price Closing price turnover volume applies
000762.SZ 7.28 7.50 57.01 7741802.0 0.0
000625.SZ 10.70 11.71 789.10 68771048.0 0.0
600132.SH 48.10 50.28 223.06 4496598.0 0.0
000955.SZ 0.00 0.00 0.00 0.0 0.0

3.3.2 Deleting rows or columns

The drop() method deletes the specified items along the specified axis and returns a new object, leaving the original data structure unchanged.

>>> stock.drop(['000762.SZ', '600132.SH'], axis=0) # drop the given rows
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0
>>> stock.drop(['turnover', 'Highest price', 'The lowest price'], axis=1) # drop the given columns
Opening price Closing price volume
000625.SZ 10.70 11.71 68771048.0
000762.SZ 7.28 7.50 7741802.0
600132.SH 48.10 50.28 4496598.0
600009.SH 66.70 68.92 17662768.0
600126.SH 7.00 7.11 109975919.0
000882.SZ 2.02 2.08 27484360.0
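In recent pandas versions, drop() also accepts index= and columns= keywords, which read more clearly than axis= and can delete along both axes in one call. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                  index=['r1', 'r2', 'r3'])

# Equivalent to chaining df.drop(['r2'], axis=0).drop(['c'], axis=1)
out = df.drop(index=['r2'], columns=['c'])
print(out.shape)                          # (2, 2)
print(list(out.index), list(out.columns)) # ['r1', 'r3'] ['a', 'b']
```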

3.3.3 Appending rows

The append() method appends another DataFrame, extending the rows. The two DataFrames' column labels need not match: the result is a new structure whose column labels are the union of both, and the original is not changed. (Note that DataFrame.append() was deprecated and then removed in pandas 2.0; pd.concat(), shown below, is the recommended replacement.)

>>> idx = ['600161.SH', '600169.SH']
>>> colnames = ['Opening price', 'Closing price', 'turnover', 'volume', 'applies']
>>> data = np.array([
[31.00, 32.16, 284.02, 8932594, 0.03],
[2.02, 2.13, 115.87, 54146894, 0.05]
])
>>> s = pd.DataFrame(data, columns=colnames, index=idx)
>>> stock.append(s)
Opening price Highest price The lowest price Closing price turnover volume applies
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0 NaN
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0 NaN
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0 NaN
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0 NaN
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0 NaN
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0 NaN
600161.SH 31.00 NaN NaN 32.16 284.02 8932594.0 0.03
600169.SH 2.02 NaN NaN 2.13 115.87 54146894.0 0.05

The concat() function in the Pandas namespace also joins multiple DataFrame objects vertically, and is more convenient to use than append().

>>> pd.concat((stock, s))
Opening price Highest price The lowest price Closing price turnover volume applies
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0 NaN
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0 NaN
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0 NaN
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0 NaN
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0 NaN
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0 NaN
600161.SH 31.00 NaN NaN 32.16 284.02 8932594.0 0.03
600169.SH 2.02 NaN NaN 2.13 115.87 54146894.0 0.05
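concat() can also stitch DataFrames together horizontally with axis=1, aligning rows on the index (an outer join by default). A sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['r1', 'r2'])
right = pd.DataFrame({'b': [3, 4]}, index=['r2', 'r3'])

# axis=1 joins columns side by side, aligning on the row index;
# rows present in only one operand get NaN in the other's columns
out = pd.concat((left, right), axis=1)
print(out.shape)        # (3, 2)
print(list(out.index))  # ['r1', 'r2', 'r3']
```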

3.3.4 Adding columns

Assigning to a new column label extends the columns. The assigned data's length must match the DataFrame's. One point deserves emphasis here: the other structure-changing operations return new objects and leave the original untouched, whereas assignment modifies the original data structure in place.

>>> stock['applies'] = [0.02, 0.03, 0.05, 0.01, 0.02, 0.03]
>>> stock
Opening price Highest price The lowest price Closing price turnover volume applies
000625.SZ 10.70 11.95 10.56 11.71 789.10 68771048.0 0.02
000762.SZ 7.28 7.59 7.17 7.50 57.01 7741802.0 0.03
600132.SH 48.10 50.59 48.10 50.28 223.06 4496598.0 0.05
600009.SH 66.70 69.28 66.66 68.92 1196.14 17662768.0 0.01
600126.SH 7.00 7.35 6.93 7.11 783.15 109975919.0 0.02
000882.SZ 2.02 2.10 2.01 2.08 56.32 27484360.0 0.03

3.4 Change the data type

Like a NumPy array, a Series provides astype() to change its data type. However, astype() returns a new Series; it does not modify the original. To change a column's dtype inside a DataFrame, assign the converted column back to it (a DataFrame also has its own astype() method, which likewise returns a new object).

>>> stock['applies'].dtype
dtype('float64')
>>> stock['applies'] = stock['applies'].astype('float32').values
>>> stock['applies'].dtype
dtype('float32')
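A minimal sketch of DataFrame.astype() with a column-to-dtype dict, using made-up data; note that the original is left unchanged:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.5], 'volume': [100.0, 200.0]})

# astype() returns a new DataFrame; map column names to target dtypes
out = df.astype({'volume': 'int64'})
print(str(out['volume'].dtype))  # int64
print(str(df['volume'].dtype))   # float64 -- the original is unchanged
```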

3.5 Broadcasting and vectorization

Pandas extends NumPy arrays and inherits their broadcasting and vectorization. Operations within a Series, between Series objects, and even between DataFrame objects all support broadcasting and vectorization. Moreover, nearly all of NumPy's mathematical and statistical functions can be applied to Pandas structures.

>>> stock
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.70 11.95 10.56 11.71 789.10 34385524.0
000762.SZ 7.28 7.59 7.17 7.50 57.01 3870901.0
600132.SH 48.10 50.59 48.10 50.28 223.06 2248299.0
600009.SH 66.70 69.28 66.66 68.92 1196.14 8831384.0
600126.SH 7.00 7.35 6.93 7.11 783.15 54987959.5
000882.SZ 2.02 2.10 2.01 2.08 56.32 13742180.0
>>> stock['volume'] /= 2 # halve the trading volume
>>> stock
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.70 11.95 10.56 11.71 789.10 17192762.00
000762.SZ 7.28 7.59 7.17 7.50 57.01 1935450.50
600132.SH 48.10 50.59 48.10 50.28 223.06 1124149.50
600009.SH 66.70 69.28 66.66 68.92 1196.14 4415692.00
600126.SH 7.00 7.35 6.93 7.11 783.15 27493979.75
000882.SZ 2.02 2.10 2.01 2.08 56.32 6871090.00
>>> stock['Highest price'] += stock['The lowest price'] # add the lowest price onto the highest price
>>> stock
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 10.70 22.51 10.56 11.71 789.10 17192762.00
000762.SZ 7.28 14.76 7.17 7.50 57.01 1935450.50
600132.SH 48.10 98.69 48.10 50.28 223.06 1124149.50
600009.SH 66.70 135.94 66.66 68.92 1196.14 4415692.00
600126.SH 7.00 14.28 6.93 7.11 783.15 27493979.75
000882.SZ 2.02 4.11 2.01 2.08 56.32 6871090.00
>>> stock['Opening price'] = (stock['Opening price']-stock['Opening price'].mean())/stock['Opening price'].std() # standardize the opening price: subtract the mean, then divide by the standard deviation
>>> stock
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ -0.479878 22.51 10.56 11.71 789.10 17192762.00
000762.SZ -0.606774 14.76 7.17 7.50 57.01 1935450.50
600132.SH 0.907810 98.69 48.10 50.28 223.06 1124149.50
600009.SH 1.597944 135.94 66.66 68.92 1196.14 4415692.00
600126.SH -0.617163 14.28 6.93 7.11 783.15 27493979.75
000882.SZ -0.801940 4.11 2.01 2.08 56.32 6871090.00

Broadcasting over a DataFrame also works, and two DataFrame objects can perform arithmetic with each other. When they do, the operation aligns on both column labels and index entries, and positions without a counterpart in the other operand are filled with NaN.

>>> df_a = pd.DataFrame(np.arange(6).reshape((2,3)), columns=list('abc'))
>>> df_b = pd.DataFrame(np.arange(6,12).reshape((3,2)), columns=list('ab'))
>>> df_a
a b c
0 0 1 2
1 3 4 5
>>> df_b
a b
0 6 7
1 8 9
2 10 11
>>> df_a + 1 # broadcasting a scalar over a DataFrame
a b c
0 1 2 3
1 4 5 6
>>> df_a + df_b # arithmetic between two DataFrame objects 
a b c
0 6.0 8.0 NaN
1 11.0 13.0 NaN
2 NaN NaN NaN
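When the NaN fill-in from misaligned labels is unwanted, the arithmetic methods such as add() accept a fill_value parameter. A minimal sketch reusing the same df_a and df_b as above; NaN then only remains where both operands lack a value:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=list('abc'))
df_b = pd.DataFrame(np.arange(6, 12).reshape((3, 2)), columns=list('ab'))

# add() with fill_value=0 treats a value missing from ONE operand as 0;
# only positions missing from BOTH operands (row 2, column 'c') stay NaN
result = df_a.add(df_b, fill_value=0)
print(result)
```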

3.6 Column-level broadcast functions

Most of NumPy's mathematical and statistical functions are broadcast functions that are implicitly mapped onto the elements of an array, and NumPy arrays also support custom broadcast functions. Pandas' apply() is similar to a NumPy custom broadcast function: it maps a function onto specific rows or columns of a DataFrame object. The difference is that apply() passes an entire row or column, as a one-dimensional array, to the function, rather than passing each individual element. This is where Pandas' apply() differs from NumPy's custom broadcast functions.

>>> stock
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ -0.479878 22.51 10.56 11.71 789.10 17192762.00
000762.SZ -0.606774 14.76 7.17 7.50 57.01 1935450.50
600132.SH 0.907810 98.69 48.10 50.28 223.06 1124149.50
600009.SH 1.597944 135.94 66.66 68.92 1196.14 4415692.00
600126.SH -0.617163 14.28 6.93 7.11 783.15 27493979.75
000882.SZ -0.801940 4.11 2.01 2.08 56.32 6871090.00
>>> f = lambda x:(x-x.min())/(x.max()-x.min()) # Define the normalization function 
>>> stock.apply(f, axis=0) # normalize each column (axis=0)
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 0.134199 0.139574 0.132251 0.144075 0.642891 0.609356
000762.SZ 0.081323 0.080786 0.079814 0.081089 0.000605 0.030766
600132.SH 0.712430 0.717439 0.712916 0.721125 0.146286 0.000000
600009.SH 1.000000 1.000000 1.000000 1.000000 1.000000 0.124822
600126.SH 0.076994 0.077145 0.076102 0.075254 0.637671 1.000000
000882.SZ 0.000000 0.000000 0.000000 0.000000 0.000000 0.217936
>>> stock.apply(f, axis=1) # normalize each row (axis=1)
Opening price Highest price The lowest price Closing price turnover volume
000625.SZ 0.0 1.337183e-06 6.421236e-07 7.090122e-07 0.000046 1.0
000762.SZ 0.0 7.939634e-06 4.018068e-06 4.188571e-06 0.000030 1.0
600132.SH 0.0 8.698333e-05 4.198038e-05 4.391963e-05 0.000198 1.0
600009.SH 0.0 3.042379e-05 1.473429e-05 1.524610e-05 0.000271 1.0
600126.SH 0.0 5.418336e-07 2.745024e-07 2.810493e-07 0.000029 1.0
000882.SZ 0.0 7.148705e-07 4.092422e-07 4.194298e-07 0.000008 1.0
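To make the row-vs-column distinction concrete, here is a minimal sketch (with made-up column names) showing that the function handed to apply() receives a whole column or row as a Series:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the lambda receives each COLUMN as a Series
col_range = df.apply(lambda s: s.max() - s.min(), axis=0)

# axis=1: the lambda receives each ROW as a Series
row_sum = df.apply(lambda s: s.sum(), axis=1)

print(col_range)  # range of each column
print(row_sum)    # sum of each row
```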

4. Advanced applications

As a powerful data analysis tool, DataFrame serves two purposes: holding complex data, and providing efficient means of processing it. The basic operations of the previous section focused on holding data; the advanced applications in this section focus on processing it efficiently.

4.1 Grouping

Grouping and aggregation are the most common application scenarios in data processing. For example, analyzing the trading volume of multiple stocks over multiple trading days requires statistics grouped by stock and by trading day.

>>> data = {
' date ': ['2020-03-11','2020-03-11','2020-03-11','2020-03-11','2020-03-11',
'2020-03-12','2020-03-12','2020-03-12','2020-03-12','2020-03-12',
'2020-03-13','2020-03-13','2020-03-13','2020-03-13','2020-03-13'],
' Code ': ['000625.SZ','000762.SZ','600132.SH','600009.SH','000882.SZ',
'000625.SZ','000762.SZ','600132.SH','600009.SH','000882.SZ',
'000625.SZ','000762.SZ','600132.SH','600009.SH','000882.SZ'],
' turnover ': [422.08,73.65,207.04,510.59,63.28,
471.78,59.2,156.82,853.83,52.84,
789.1,57.01,223.06,1196.14,56.32],
' volume ': [37091400,9315300,4127800,7233100,28911100,
42471700,7724200,3143100,12350400,24828900,
68771048,7741802,4496598,17662768,27484360]
}
>>> vo = pd.DataFrame(data)
>>> vo
date Code turnover volume
0 2020-03-11 000625.SZ 422.08 37091400
1 2020-03-11 000762.SZ 73.65 9315300
2 2020-03-11 600132.SH 207.04 4127800
3 2020-03-11 600009.SH 510.59 7233100
4 2020-03-11 000882.SZ 63.28 28911100
5 2020-03-12 000625.SZ 471.78 42471700
6 2020-03-12 000762.SZ 59.20 7724200
7 2020-03-12 600132.SH 156.82 3143100
8 2020-03-12 600009.SH 853.83 12350400
9 2020-03-12 000882.SZ 52.84 24828900
10 2020-03-13 000625.SZ 789.10 68771048
11 2020-03-13 000762.SZ 57.01 7741802
12 2020-03-13 600132.SH 223.06 4496598
13 2020-03-13 600009.SH 1196.14 17662768
14 2020-03-13 000882.SZ 56.32 27484360

Using groupby() to group by date returns a grouping result that can be iterated over. Traversing it yields three tuples, each consisting of the group name (a date) and that group's DataFrame.

>>> for name, df in vo.groupby(' date '):
print(' Group name :%s'%name)
print('-------------------------------------------')
print(df)
print()
Group name :2020-03-11
-------------------------------------------
date Code turnover volume
0 2020-03-11 000625.SZ 422.08 37091400
1 2020-03-11 000762.SZ 73.65 9315300
2 2020-03-11 600132.SH 207.04 4127800
3 2020-03-11 600009.SH 510.59 7233100
4 2020-03-11 000882.SZ 63.28 28911100
Group name :2020-03-12
-------------------------------------------
date Code turnover volume
5 2020-03-12 000625.SZ 471.78 42471700
6 2020-03-12 000762.SZ 59.20 7724200
7 2020-03-12 600132.SH 156.82 3143100
8 2020-03-12 600009.SH 853.83 12350400
9 2020-03-12 000882.SZ 52.84 24828900
Group name :2020-03-13
-------------------------------------------
date Code turnover volume
10 2020-03-13 000625.SZ 789.10 68771048
11 2020-03-13 000762.SZ 57.01 7741802
12 2020-03-13 600132.SH 223.06 4496598
13 2020-03-13 600009.SH 1196.14 17662768
14 2020-03-13 000882.SZ 56.32 27484360

Using groupby() to group by stock code works the same way. Traversing the result yields five tuples, each consisting of the group name (a stock code) and that group's DataFrame.

>>> for name, df in vo.groupby(' Code '):
print(' Group name :%s'%name)
print('-------------------------------------------')
print(df)
print()
Group name :000625.SZ
-------------------------------------------
date Code turnover volume
0 2020-03-11 000625.SZ 422.08 37091400
5 2020-03-12 000625.SZ 471.78 42471700
10 2020-03-13 000625.SZ 789.10 68771048
Group name :000762.SZ
-------------------------------------------
date Code turnover volume
1 2020-03-11 000762.SZ 73.65 9315300
6 2020-03-12 000762.SZ 59.20 7724200
11 2020-03-13 000762.SZ 57.01 7741802
Group name :000882.SZ
-------------------------------------------
date Code turnover volume
4 2020-03-11 000882.SZ 63.28 28911100
9 2020-03-12 000882.SZ 52.84 24828900
14 2020-03-13 000882.SZ 56.32 27484360
Group name :600009.SH
-------------------------------------------
date Code turnover volume
3 2020-03-11 600009.SH 510.59 7233100
8 2020-03-12 600009.SH 853.83 12350400
13 2020-03-13 600009.SH 1196.14 17662768
Group name :600132.SH
-------------------------------------------
date Code turnover volume
2 2020-03-11 600132.SH 207.04 4127800
7 2020-03-12 600132.SH 156.82 3143100
12 2020-03-13 600132.SH 223.06 4496598
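When only one group is needed, iterating is unnecessary: the GroupBy object's get_group() method returns a single group's DataFrame directly. A small sketch with simplified, made-up column names:

```python
import pandas as pd

vo = pd.DataFrame({
    'date': ['2020-03-11', '2020-03-12', '2020-03-11', '2020-03-12'],
    'code': ['000625.SZ', '000625.SZ', '000762.SZ', '000762.SZ'],
    'turnover': [422.08, 471.78, 73.65, 59.20],
})

# pull one group's rows without looping over the whole GroupBy object
df_625 = vo.groupby('code').get_group('000625.SZ')
print(df_625)
```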

4.2 Aggregation

With grouping understood, the next step is to operate on the groups. For example, computing the daily totals of turnover and volume across all stocks.

>>> vo.groupby(' date ').sum() # total turnover and volume of all stocks, per date 
turnover volume
date
2020-03-11 1276.64 86678700
2020-03-12 1594.47 90518300
2020-03-13 2321.63 126156576
>>> vo.groupby(' Code ').mean() # average turnover and volume of each stock over the trading days 
turnover volume
Code
000625.SZ 560.986667 4.944472e+07
000762.SZ 63.286667 8.260434e+06
000882.SZ 57.480000 2.707479e+07
600009.SH 853.520000 1.241542e+07
600132.SH 195.640000 3.922499e+06

Functions that can be applied directly to a grouping result include: count, sum, mean, median, prod (product of valid values), var and std (variance and standard deviation), min and max, and first and last (first and last valid values).

To apply a custom function to a grouping result, or to run several statistics on it at once, use the aggregation function agg().

>>> def scope(x): # return the difference between the maximum and minimum (the fluctuation range)
return x.max()-x.min()
>>> vo.groupby(' Code ').agg(scope) # fluctuation range of each stock's turnover and volume 
turnover volume
Code
000625.SZ 367.02 31679648
000762.SZ 16.64 1591100
000882.SZ 10.44 4082200
600009.SH 685.55 10429668
600132.SH 66.24 1353498
>>> vo.groupby(' Code ').agg(['mean', scope]) # mean and fluctuation range of turnover and volume 
turnover volume
mean scope mean scope
Code
000625.SZ 560.986667 367.02 4.944472e+07 31679648
000762.SZ 63.286667 16.64 8.260434e+06 1591100
000882.SZ 57.480000 10.44 2.707479e+07 4082200
600009.SH 853.520000 685.55 1.241542e+07 10429668
600132.SH 195.640000 66.24 3.922499e+06 1353498

An aggregation can also apply different functions to different columns. For example, the following code computes the mean of the turnover column and applies the custom fluctuation-range function to the volume column.

>>> vo.groupby(' Code ').agg({' turnover ': 'mean', ' volume ': scope})
turnover volume
Code
000625.SZ 560.986667 31679648
000762.SZ 63.286667 1591100
000882.SZ 57.480000 4082200
600009.SH 853.520000 10429668
600132.SH 195.640000 1353498
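Recent pandas versions (0.25 and later) also offer "named aggregation", which flattens the two-level column header that agg(['mean', scope]) produces. A sketch with made-up data:

```python
import pandas as pd

vo = pd.DataFrame({
    'code': ['A', 'A', 'B', 'B'],
    'turnover': [10.0, 30.0, 5.0, 7.0],
    'volume': [100, 300, 50, 70],
})

# each keyword declares one output column as (source_column, function),
# so the result has a flat, self-describing header
stats = vo.groupby('code').agg(
    mean_turnover=('turnover', 'mean'),
    scope_volume=('volume', lambda x: x.max() - x.min()),
)
print(stats)
```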

4.3 Hierarchical index

In the grouping and aggregation examples, a structure like date - stock code - turnover/volume is already three-dimensional, yet it can still be held and processed in a DataFrame. Grouping such data does yield several two-dimensional DataFrame objects, but the result cannot be indexed or selected from directly. Hierarchical indexes solve this problem neatly and point the way for DataFrame to handle higher-dimensional data.

Date strings are still used as the index here; the proper approach is to use a datetime index object. The DatetimeIndex class will be formally introduced when we discuss time series.

>>> dt = ['2020-03-11', '2020-03-12','2020-03-13']
>>> sc = ['000625.SZ','000762.SZ','600132.SH','600009.SH','000882.SZ']
>>> cn = [' turnover ', ' volume ']
>>> idx = pd.MultiIndex.from_product([dt, sc], names=[' date ', ' Code '])
>>> data = np.array([
[422.08, 37091400],
[73.65, 9315300],
[207.04, 4127800],
[510.59, 7233100],
[63.28, 28911100],
[471.78, 42471700],
[59.2, 7724200],
[156.82, 3143100],
[853.83, 12350400],
[52.84, 24828900],
[789.1, 68771048],
[57.01, 7741802],
[223.06, 4496598],
[1196.14, 17662768],
[56.32, 27484360]
])
>>> vom1 = pd.DataFrame(data, index=idx, columns=cn)
>>> vom1
turnover volume
date Code
2020-03-11 000625.SZ 422.08 37091400.0
000762.SZ 73.65 9315300.0
600132.SH 207.04 4127800.0
600009.SH 510.59 7233100.0
000882.SZ 63.28 28911100.0
2020-03-12 000625.SZ 471.78 42471700.0
000762.SZ 59.20 7724200.0
600132.SH 156.82 3143100.0
600009.SH 853.83 12350400.0
000882.SZ 52.84 24828900.0
2020-03-13 000625.SZ 789.10 68771048.0
000762.SZ 57.01 7741802.0
600132.SH 223.06 4496598.0
600009.SH 1196.14 17662768.0
000882.SZ 56.32 27484360.0

The hierarchical-index data vom1 now has two index levels, date and code. There is another form of hierarchical index: using a hierarchical index object on the column labels.

>>> dt = ['2020-03-11', '2020-03-12','2020-03-13']
>>> sc = ['000625.SZ','000762.SZ','600132.SH','600009.SH','000882.SZ']
>>> cn = [' turnover ', ' volume ']
>>> cols = pd.MultiIndex.from_product([dt, cn], names=[' date ', ' data '])
>>> data = np.array([
[422.08, 37091400, 471.78, 42471700, 789.1, 68771048],
[73.65, 9315300, 59.2, 7724200, 57.01, 7741802],
[207.04, 4127800, 156.82, 3143100, 223.06, 4496598],
[510.59, 7233100, 853.83, 12350400, 1196.14, 17662768],
[63.28, 28911100, 52.84, 24828900, 56.32, 27484360]
])
>>> vom2 = pd.DataFrame(data, index=sc, columns=cols)
>>> vom2
date 2020-03-11 2020-03-12 2020-03-13
data turnover volume turnover volume turnover volume
000625.SZ 422.08 37091400.0 471.78 42471700.0 789.10 68771048.0
000762.SZ 73.65 9315300.0 59.20 7724200.0 57.01 7741802.0
600132.SH 207.04 4127800.0 156.82 3143100.0 223.06 4496598.0
600009.SH 510.59 7233100.0 853.83 12350400.0 1196.14 17662768.0
000882.SZ 63.28 28911100.0 52.84 24828900.0 56.32 27484360.0

Indexing and selecting hierarchical-index data works much like an ordinary DataFrame object.

>>> vom1.loc['2020-03-11']
turnover volume
 Code 
000625.SZ 422.08 37091400.0
000762.SZ 73.65 9315300.0
600132.SH 207.04 4127800.0
600009.SH 510.59 7233100.0
000882.SZ 63.28 28911100.0
>>> vom1.loc['2020-03-11', '000625.SZ']
turnover 422.08
volume 37091400.00
Name: (2020-03-11, 000625.SZ), dtype: float64
>>> vom1.loc['2020-03-11', '000625.SZ'][' volume ']
37091400.0
>>> vom2['2020-03-11']
data turnover volume
000625.SZ 422.08 37091400.0
000762.SZ 73.65 9315300.0
600132.SH 207.04 4127800.0
600009.SH 510.59 7233100.0
000882.SZ 63.28 28911100.0
>>> vom2['2020-03-11', ' turnover ']
000625.SZ 422.08
000762.SZ 73.65
600132.SH 207.04
600009.SH 510.59
000882.SZ 63.28
Name: (2020-03-11, turnover ), dtype: float64
>>> vom2.loc['000625.SZ']
date data
2020-03-11 turnover 422.08
volume 37091400.00
2020-03-12 turnover 471.78
volume 42471700.00
2020-03-13 turnover 789.10
volume 68771048.00
Name: 000625.SZ, dtype: float64
>>> vom2.loc['000625.SZ'][:,' turnover ']
date
2020-03-11 422.08
2020-03-12 471.78
2020-03-13 789.10
Name: 000625.SZ, dtype: float64
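Besides loc-style selection, the xs() method takes a cross-section at any level of the hierarchy by name, which is handy when the level to fix is not the outermost one. A sketch with a small made-up MultiIndex frame:

```python
import pandas as pd

# two dates x two stock codes as a hierarchical row index
idx = pd.MultiIndex.from_product(
    [['2020-03-11', '2020-03-12'], ['000625.SZ', '000762.SZ']],
    names=['date', 'code'])
vom = pd.DataFrame({'turnover': [422.08, 73.65, 471.78, 59.20]}, index=idx)

# fix the INNER 'code' level; all dates for that one stock remain
one_stock = vom.xs('000625.SZ', level='code')
print(one_stock)
```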

4.4 Table-level broadcast functions

The column-level broadcast function apply() maps a computation function onto the rows or columns of a DataFrame object, passing each row or column as a one-dimensional array. The table-level broadcast function pipe() is similar, but it applies the computation function to the DataFrame as a whole, passing the entire DataFrame as the function's first argument.

>>> def scale(x, k): # scale x by factor k
return x*k
>>> vom1.pipe(scale, 0.2) # apply the scaling function to all of vom1's data, factor 0.2
turnover volume
 date Code 
2020-03-11 000625.SZ 84.416 7418280.0
000762.SZ 14.730 1863060.0
600132.SH 41.408 825560.0
600009.SH 102.118 1446620.0
000882.SZ 12.656 5782220.0
2020-03-12 000625.SZ 94.356 8494340.0
000762.SZ 11.840 1544840.0
600132.SH 31.364 628620.0
600009.SH 170.766 2470080.0
000882.SZ 10.568 4965780.0
2020-03-13 000625.SZ 157.820 13754209.6
000762.SZ 11.402 1548360.4
600132.SH 44.612 899319.6
600009.SH 239.228 3532553.6
000882.SZ 11.264 5496872.0

As a broadcast function, pipe() offers nothing new. However, because pipe() takes the DataFrame object as its first parameter, it makes chained calls possible. Chaining is a popular coding style: the conciseness and readability of chained code have made it a favorite across many languages and programmers.

>>> def adder(x, dx):
return x+dx
>>> vom1.pipe(scale, 0.2).pipe(adder, 5) # chained calls 
turnover volume
 date Code 
2020-03-11 000625.SZ 89.416 7418285.0
000762.SZ 19.730 1863065.0
600132.SH 46.408 825565.0
600009.SH 107.118 1446625.0
000882.SZ 17.656 5782225.0
2020-03-12 000625.SZ 99.356 8494345.0
000762.SZ 16.840 1544845.0
600132.SH 36.364 628625.0
600009.SH 175.766 2470085.0
000882.SZ 15.568 4965785.0
2020-03-13 000625.SZ 162.820 13754214.6
000762.SZ 16.402 1548365.4
600132.SH 49.612 899324.6
600009.SH 244.228 3532558.6
000882.SZ 16.264 5496877.0
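The same chain can be sketched self-contained and small enough to verify by hand:

```python
import pandas as pd

def scale(x, k):    # scale x by factor k
    return x * k

def adder(x, dx):   # shift x by dx
    return x + dx

df = pd.DataFrame({'v': [10.0, 20.0]})

# each pipe() passes the previous result as the first argument,
# so the chain reads left to right: (df * 0.2) + 5
out = df.pipe(scale, 0.2).pipe(adder, 5)
print(out)
```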

4.5 Datetime index objects

Pandas also has good support for date and time data, providing many practical methods that make it easy to generate and convert datetime index objects. The DatetimeIndex class is a kind of index array and the most commonly used tool for generating and converting datetime series. A datetime index object can be created directly from a list of datetime strings, or by converting a string-typed Index or Series object.

>>> pd.DatetimeIndex(['2020-03-10', '2020-03-11', '2020-03-12'])
DatetimeIndex(['2020-03-10', '2020-03-11', '2020-03-12'], dtype='datetime64[ns]', freq=None)
>>> idx = pd.Index(['2020-03-10', '2020-03-11', '2020-03-12'])
>>> sdt = pd.Series(['2020-03-10', '2020-03-11', '2020-03-12'])
>>> idx
Index(['2020-03-10', '2020-03-11', '2020-03-12'], dtype='object')
>>> sdt
0 2020-03-10
1 2020-03-11
2 2020-03-12
dtype: object
>>> pd.DatetimeIndex(idx)
DatetimeIndex(['2020-03-10', '2020-03-11', '2020-03-12'], dtype='datetime64[ns]', freq=None)
>>> pd.DatetimeIndex(sdt)
DatetimeIndex(['2020-03-10', '2020-03-11', '2020-03-12'], dtype='datetime64[ns]', freq=None)

The conversion function pd.to_datetime() works much like the DatetimeIndex class and can also convert datetime strings of various formats into datetime index objects.

>>> pd.to_datetime(['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13'])
DatetimeIndex(['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13'], dtype='datetime64[ns]', freq=None)
>>> pd.to_datetime(idx)
DatetimeIndex(['2020-03-10', '2020-03-11', '2020-03-12'], dtype='datetime64[ns]', freq=None)
>>> pd.to_datetime(sdt)
0   2020-03-10
1   2020-03-11
2   2020-03-12
dtype: datetime64[ns]

Given a start and end time, a sequence length, or a step size, date_range() can also quickly create datetime index objects. The step size (frequency) is written with codes such as L, S, T, H, D, M for milliseconds, seconds, minutes, hours, days, and months, optionally prefixed with a number; for example, 3H means a step of 3 hours.

>>> pd.date_range(start='2020-05-12', end='2020-05-18')
DatetimeIndex(['2020-05-12', '2020-05-13', '2020-05-14', '2020-05-15',
'2020-05-16', '2020-05-17', '2020-05-18'],
dtype='datetime64[ns]', freq='D')
>>> pd.date_range(start='2020-05-12 08:00:00', periods=6, freq='3H')
DatetimeIndex(['2020-05-12 08:00:00', '2020-05-12 11:00:00',
'2020-05-12 14:00:00', '2020-05-12 17:00:00',
'2020-05-12 20:00:00', '2020-05-12 23:00:00'],
dtype='datetime64[ns]', freq='3H')
>>> pd.date_range(start='08:00:00', end='9:00:00', freq='15T')
DatetimeIndex(['2020-05-13 08:00:00', '2020-05-13 08:15:00',
'2020-05-13 08:30:00', '2020-05-13 08:45:00',
'2020-05-13 09:00:00'],
dtype='datetime64[ns]', freq='15T')
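Note that a bare time-of-day string, as in the last example, is anchored to the current date, so the result changes from day to day. Pinning the date makes a sketch reproducible ('15min' is the unabbreviated spelling of the '15T' frequency code used above):

```python
import pandas as pd

# fixed start and end dates make the output deterministic
idx = pd.date_range(start='2020-05-13 08:00:00',
                    end='2020-05-13 09:00:00', freq='15min')

# a DatetimeIndex exposes its components (year, hour, minute, ...) as arrays
minutes = idx.minute
print(idx)
```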

4.6 Data visualization

Pandas visualization is a wrapper around Matplotlib, and the wrapping is not complete: many things still depend on Matplotlib directly. For example, outside an ipython or jupyter environment you must call pyplot.show() to display the plot, and to display Chinese characters you must explicitly import the matplotlib.pyplot package, unless you manually edit Matplotlib's font configuration file. Therefore, before using Pandas' visualization features, you need to import the modules and set a default font. All examples in this section assume the following code has been run.

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> plt.rcParams['font.sans-serif'] = ['FangSong']
>>> plt.rcParams['axes.unicode_minus'] = False

The Pandas visualization API provides functions for drawing line charts, bar charts, box plots, histograms, scatter plots, pie charts, and more. When visualizing data from a Series or DataFrame object, the horizontal axis usually represents the index and the vertical axis the data.

>>> idx = pd.date_range(start='08:00:00',end='9:00:00',freq='T') # 1-minute intervals 
>>> y = np.sin(np.linspace(0,2*np.pi,61)) # sine of 61 points between 0 and 2π 
>>> s = pd.Series(y, index=idx) # create a Series object indexed by the time series 
>>> s.plot() # draw a line chart 
<matplotlib.axes._subplots.AxesSubplot object at 0x0000029D4EC95C08>
>>> plt.show() # show the plot 

The code above calls the Series object's plot() function to draw a sine curve, as shown in Figure 6-1. The Series object's index is a datetime series from 8:00 to 9:00 at 1-minute intervals.
[Figure 6-1: sine curve drawn with Series.plot()]
When visualizing data from a DataFrame object, the horizontal axis likewise represents the index. Multiple columns of data can be drawn on the same subplot (axes) of a canvas (figure), or spread across multiple subplots on the same canvas.

>>> data = np.random.randn(10,4)
>>> idx = pd.date_range('08:00:00', periods=10, freq='H')
>>> df = pd.DataFrame(data, index=idx, columns=list('ABCD'))
>>> df.plot()
<matplotlib.axes._subplots.AxesSubplot object at 0x0000029D525FA548>
>>> plt.show()

The DataFrame object's plot() method draws all four columns of data and generates the legend automatically, which is obviously much simpler than using Matplotlib directly.
[Figure: line chart of the four columns with an auto-generated legend]
With some knowledge of Matplotlib's concepts and methods, the Pandas visualization API can draw bar charts on multiple subplots of the same canvas.

>>> df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
>>> fig = plt.figure( )
>>> ax = fig.add_subplot(131)
>>> df.plot.bar(ax=ax)
<matplotlib.axes._subplots.AxesSubplot object at 0x0000029D52A4B288>
>>> ax = fig.add_subplot(132)
>>> df.plot.bar(ax=ax, stacked=True)
<matplotlib.axes._subplots.AxesSubplot object at 0x0000029D56808308>
>>> ax = fig.add_subplot(133)
>>> df.plot.barh(ax=ax, stacked=True)
<matplotlib.axes._subplots.AxesSubplot object at 0x0000029D52606B08>
>>> plt.show()

This draws an ordinary bar chart, a stacked bar chart, and a horizontal stacked bar chart on the same canvas.
[Figure: ordinary, stacked, and horizontal stacked bar charts on one canvas]

4.7 Data I/O

For a powerful data-processing tool, data input and output are essential capabilities. Pandas can read data in different formats and from different sources, and can save data to files in various formats.

4.7.1 Reading and writing CSV data files

When writing a CSV file, the index is written to the first column (column 0). When reading the data back, if no column is designated as the index, a default index is added automatically.

>>> df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD')) # Generate simulation data 
>>> df
A B C D
0 0.367409 0.542233 0.468111 0.732681
1 0.465060 0.172522 0.939913 0.654894
2 0.455698 0.487195 0.980735 0.752743
3 0.951230 0.940689 0.455013 0.682672
4 0.283269 0.421182 0.024713 0.245193
5 0.297696 0.981307 0.513994 0.698454
6 0.034707 0.688815 0.530870 0.921954
7 0.159914 0.185290 0.489379 0.299581
8 0.213631 0.950752 0.128683 0.499867
9 0.403379 0.269299 0.173059 0.939896
>>> df.to_csv('random.csv') # Save as CSV file 
>>> df = pd.read_csv('random.csv') # Read CSV file 
>>> df
Unnamed: 0 A B C D
0 0 0.367409 0.542233 0.468111 0.732681
1 1 0.465060 0.172522 0.939913 0.654894
2 2 0.455698 0.487195 0.980735 0.752743
3 3 0.951230 0.940689 0.455013 0.682672
4 4 0.283269 0.421182 0.024713 0.245193
5 5 0.297696 0.981307 0.513994 0.698454
6 6 0.034707 0.688815 0.530870 0.921954
7 7 0.159914 0.185290 0.489379 0.299581
8 8 0.213631 0.950752 0.128683 0.499867
9 9 0.403379 0.269299 0.173059 0.939896

When reading the data, the index_col parameter can designate the first column (column 0) as the index.

>>> df = pd.read_csv('random.csv', index_col=0)
>>> df
A B C D
0 0.367409 0.542233 0.468111 0.732681
1 0.465060 0.172522 0.939913 0.654894
2 0.455698 0.487195 0.980735 0.752743
3 0.951230 0.940689 0.455013 0.682672
4 0.283269 0.421182 0.024713 0.245193
5 0.297696 0.981307 0.513994 0.698454
6 0.034707 0.688815 0.530870 0.921954
7 0.159914 0.185290 0.489379 0.299581
8 0.213631 0.950752 0.128683 0.499867
9 0.403379 0.269299 0.173059 0.939896
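The same round trip can be tested without touching the file system by writing to an in-memory buffer, a useful trick when experimenting:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list('AB'))

buf = io.StringIO()
df.to_csv(buf)                        # the index goes out as the first column
buf.seek(0)
df2 = pd.read_csv(buf, index_col=0)   # ...and index_col=0 recovers it
print(df2)
```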

4.7.2 Reading and writing Excel files

Reading and writing Excel files requires the sheet_name parameter to specify the worksheet name. In addition, when writing an Excel file the index is written to the first column (column 0), and when reading, a default index is added automatically unless the first column (column 0) is designated as the index.

>>> idx = pd.date_range('08:00:00', periods=10, freq='H')
>>> df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'),index=idx)
>>> df
A B C D
2020-05-14 08:00:00 0.760846 0.926615 0.325205 0.525448
2020-05-14 09:00:00 0.845306 0.176587 0.764530 0.674024
2020-05-14 10:00:00 0.697167 0.861391 0.519662 0.443900
2020-05-14 11:00:00 0.461842 0.418028 0.844132 0.661985
2020-05-14 12:00:00 0.661543 0.619015 0.647476 0.473730
2020-05-14 13:00:00 0.941277 0.740208 0.249476 0.097356
2020-05-14 14:00:00 0.425394 0.639996 0.093368 0.904685
2020-05-14 15:00:00 0.886753 0.153370 0.820338 0.922392
2020-05-14 16:00:00 0.253917 0.068124 0.831815 0.703694
2020-05-14 17:00:00 0.999562 0.894684 0.395017 0.862102
>>> df.to_excel('random.xlsx', sheet_name=' random number ')
>>> df = pd.read_excel('random.xlsx', sheet_name=' random number ')
>>> df
Unnamed: 0 A B C D
0 2020-05-14 08:00:00 0.760846 0.926615 0.325205 0.525448
1 2020-05-14 09:00:00 0.845306 0.176587 0.764530 0.674024
2 2020-05-14 10:00:00 0.697167 0.861391 0.519662 0.443900
3 2020-05-14 11:00:00 0.461842 0.418028 0.844132 0.661985
4 2020-05-14 12:00:00 0.661543 0.619015 0.647476 0.473730
5 2020-05-14 13:00:00 0.941277 0.740208 0.249476 0.097356
6 2020-05-14 14:00:00 0.425394 0.639996 0.093368 0.904685
7 2020-05-14 15:00:00 0.886753 0.153370 0.820338 0.922392
8 2020-05-14 16:00:00 0.253917 0.068124 0.831815 0.703694
9 2020-05-14 17:00:00 0.999562 0.894684 0.395017 0.862102

When reading the data, the index_col parameter can designate the first column (column 0) as the index.

>>> df = pd.read_excel('random.xlsx', sheet_name=' random number ', index_col=0)
>>> df
A B C D
2020-05-14 08:00:00 0.760846 0.926615 0.325205 0.525448
2020-05-14 09:00:00 0.845306 0.176587 0.764530 0.674024
2020-05-14 10:00:00 0.697167 0.861391 0.519662 0.443900
2020-05-14 11:00:00 0.461842 0.418028 0.844132 0.661985
2020-05-14 12:00:00 0.661543 0.619015 0.647476 0.473730
2020-05-14 13:00:00 0.941277 0.740208 0.249476 0.097356
2020-05-14 14:00:00 0.425394 0.639996 0.093368 0.904685
2020-05-14 15:00:00 0.886753 0.153370 0.820338 0.922392
2020-05-14 16:00:00 0.253917 0.068124 0.831815 0.703694
2020-05-14 17:00:00 0.999562 0.894684 0.395017 0.862102

4.7.3 Reading and writing HDF files

When writing data to an HDF file, the key parameter must be used to name the dataset. If the HDF file already exists, to_hdf() writes the new dataset in append mode.

>>> idx = pd.date_range('08:00:00', periods=10, freq='H')
>>> df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'),index=idx)
>>> df
A B C D
2020-05-14 08:00:00 0.677705 0.644192 0.664254 0.207009
2020-05-14 09:00:00 0.211001 0.596230 0.080490 0.526014
2020-05-14 10:00:00 0.333805 0.687243 0.938533 0.524056
2020-05-14 11:00:00 0.975474 0.575015 0.717171 0.820018
2020-05-14 12:00:00 0.236850 0.955453 0.483227 0.297570
2020-05-14 13:00:00 0.945418 0.977319 0.807121 0.526502
2020-05-14 14:00:00 0.902363 0.106375 0.744314 0.445091
2020-05-14 15:00:00 0.931304 0.253368 0.567823 0.199252
2020-05-14 16:00:00 0.168369 0.916201 0.669356 0.155653
2020-05-14 17:00:00 0.511406 0.277680 0.332807 0.141315
>>> df.to_hdf('random.h5', key='random')
>>> df = pd.read_hdf('random.h5', key='random')
>>> df
A B C D
2020-05-14 08:00:00 0.677705 0.644192 0.664254 0.207009
2020-05-14 09:00:00 0.211001 0.596230 0.080490 0.526014
2020-05-14 10:00:00 0.333805 0.687243 0.938533 0.524056
2020-05-14 11:00:00 0.975474 0.575015 0.717171 0.820018
2020-05-14 12:00:00 0.236850 0.955453 0.483227 0.297570
2020-05-14 13:00:00 0.945418 0.977319 0.807121 0.526502
2020-05-14 14:00:00 0.902363 0.106375 0.744314 0.445091
2020-05-14 15:00:00 0.931304 0.253368 0.567823 0.199252
2020-05-14 16:00:00 0.168369 0.916201 0.669356 0.155653
2020-05-14 17:00:00 0.511406 0.277680 0.332807 0.141315
Copyright notice
This article was created by [Tianyuan prodigal son]. Please include a link to the original when reprinting. Thank you.
