From Novice to Master: A Pandas Getting Started Guide

A5 Huang Zhong 2021-02-22 22:16:28

Selected from Medium

Author: Rudolf Höhn, compiled by Heart of Machine

Contributors: Li Shimeng, Zhang Qian

In this article, the author starts from an introduction to Pandas and works step by step through the state of Pandas development, memory optimization, and more. This is a best-practice tutorial, suited both to readers who already use Pandas and to beginners who haven't used it yet but want to.

Through this article, you will hopefully discover one or more new ways of writing Pandas code.

This article covers the following:
  • The current state of Pandas development;
  • Memory optimization;
  • Indexing;
  • Method chaining;
  • Random tips.

While reading this article, I suggest you read the docstring of every function you don't understand. An ordinary Google search plus a few seconds in the Pandas documentation will make your reading experience much more pleasant.

The definition and current state of Pandas

What is Pandas?

Pandas is an "open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language". In short, it provides data abstractions called DataFrame and Series (for those who used Panel: it has been deprecated), manages indexes for fast data access, performs analysis and transformation operations, and can even plot (through the matplotlib backend).

At the time of writing, the latest release of Pandas is v0.25.0.


Pandas is gradually moving toward a 1.0 release, and to get there it has changed many details that people were used to. Marc Garcia, one of the core Pandas developers, gave a very interesting talk: "Towards Pandas 1.0".

Talk link:

To sum it up in one sentence, Pandas v1.0 mainly improves stability (e.g. for time series) and removes unused parts of the code base (such as SparseDataFrame).


Let's get started! As a toy dataset, we choose "Suicide rates per country from 1985 to 2016". The dataset is simple, but it's enough to get you comfortable with Pandas.

Dataset link:

Before diving into the code, if you want to reproduce the results, prepare the data by executing the code below first, which makes sure the column names and types are correct.

import pandas as pd
import numpy as np
import os

# path to the downloaded dataset folder
data_path = 'path/to/folder/'

df = (pd.read_csv(filepath_or_buffer=os.path.join(data_path, 'master.csv'))
      .rename(columns={'suicides/100k pop' : 'suicides_per_100k',
                       ' gdp_for_year ($) ' : 'gdp_year',
                       'gdp_per_capita ($)' : 'gdp_capita',
                       'country-year' : 'country_year'})
      .assign(gdp_year=lambda _df: _df['gdp_year'].str.replace(',', '').astype(np.int64))
     )

Tip: if you read a large file, set the chunksize=N parameter in read_csv(); it will then return an iterator over DataFrame objects.
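As a sketch of how chunked reading works (here using a small in-memory CSV to stand in for a large file on disk), each chunk is itself a DataFrame, so you can aggregate per chunk instead of loading everything at once:

```python
import pandas as pd
from io import StringIO

# A 1000-row CSV in memory, standing in for a large file on disk.
csv_data = StringIO('x\n' + '\n'.join(str(i) for i in range(1000)))

# read_csv with chunksize=N returns an iterator of DataFrames of N rows each.
total = 0
n_chunks = 0
for chunk in pd.read_csv(csv_data, chunksize=250):
    total += chunk['x'].sum()   # aggregate per chunk instead of all at once
    n_chunks += 1

print(n_chunks, total)  # 4 chunks of 250 rows; total = 0 + 1 + ... + 999 = 499500
```

The same pattern works for any per-chunk reduction (sums, counts, filtered appends), which keeps peak memory bounded by the chunk size.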

Here is some descriptive information about this dataset:


>>> df.columns
Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides_per_100k', 'country_year', 'HDI for year', 'gdp_year',
       'gdp_capita', 'generation'],
      dtype='object')

There are 101 countries, years from 1985 to 2016, two sexes, six generations, and six age groups. Here are a few methods to obtain this information:

Use unique() and nunique() to get the unique values in a column (or the number of unique values);
>>> df['generation'].unique()
array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)
>>> df['country'].nunique()
101

Use describe() to output different statistics for each column (e.g. minimum, maximum, mean, count); if you specify include='all', it also reports, for each object column, the number of unique values and the most frequent value;
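A minimal sketch on a toy frame (not the suicide dataset) of what include='all' adds for object columns:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Albania', 'Albania', 'Poland'],
                   'year': [1987, 1988, 1987]})

# The default describe() only covers numeric columns.
print(df.describe())

# include='all' adds object columns, with `unique` (number of distinct
# values) and `top`/`freq` (most frequent value and its count).
desc = df.describe(include='all')
print(desc.loc[['unique', 'top', 'freq'], 'country'])
```

On this toy frame, the 'country' column reports 2 unique values, with 'Albania' as the most frequent (appearing twice).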


Use head() and tail() to view a small sample of the data frame.

With these methods, you can quickly get a feel for the tabular file you are analyzing.

Memory optimization

Before processing the data, understanding it and choosing the right type for each column of the data frame is an important step.

Internally, Pandas stores a data frame as numpy arrays of different types (e.g. a float64 matrix, an int32 matrix).

Here are two ways to significantly reduce memory consumption.
import pandas as pd

def mem_usage(df: pd.DataFrame) -> str:
    """This method styles the memory usage of a DataFrame to be readable as MB.

    Parameters
    ----------
    df: pd.DataFrame
        Data frame to measure.

    Returns
    -------
    str
        Complete memory usage as a string formatted for MB.
    """
    return f'{df.memory_usage(deep=True).sum() / 1024 ** 2 : 3.2f} MB'

def convert_df(df: pd.DataFrame, deep_copy: bool = True) -> pd.DataFrame:
    """Automatically converts columns that are worth stored as
    ``categorical`` dtype.

    Parameters
    ----------
    df: pd.DataFrame
        Data frame to convert.
    deep_copy: bool
        Whether or not to perform a deep copy of the original data frame.

    Returns
    -------
    pd.DataFrame
        Optimized copy of the input data frame.
    """
    return df.copy(deep=deep_copy).astype({
        col: 'category' for col in df.columns
        if df[col].nunique() / df[col].shape[0] < 0.5})

Pandas provides a method called memory_usage() that analyzes a data frame's memory consumption. In the code, deep=True is specified to make sure the actual system usage is taken into account.


Understanding the types of your columns is very important. Two simple things can save up to 90% of memory usage:
  • Know the types your data frame uses;
  • Know which types would reduce memory for your data (for example, a price column with values between 0 and 59 and one decimal place does not need float64, which causes unnecessary memory overhead).

Besides shrinking numeric types (int32 instead of int64), Pandas also offers the categorical type:

If you are an R developer, you will recognize it as the same idea as the factor type.

The categorical type lets you replace duplicate values with an index and store the actual values elsewhere. The textbook example is countries. Instead of storing the same strings "Switzerland" or "Poland" many times, why not simply replace them with 0 and 1 and store the strings in a dictionary?
categorical_dict = {0: 'Switzerland', 1: 'Poland'}

Pandas does almost exactly that, while adding all the methods that let you actually use this type and still display the country names.
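A small sketch of what the categorical dtype does under the hood, on a hypothetical country column: the values are stored once in a lookup table, and each row only holds a small integer code.

```python
import pandas as pd

countries = pd.Series(['Switzerland', 'Poland', 'Switzerland', 'Poland'])
cat = countries.astype('category')

# The actual values are stored once, in the (sorted) categories table...
print(list(cat.cat.categories))  # ['Poland', 'Switzerland']

# ...and each row only stores a small integer index into that table.
print(list(cat.cat.codes))       # [1, 0, 1, 0]
```

Printing `cat` itself still shows the country names; the codes are an internal detail exposed through the `.cat` accessor.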

Back to the convert_df() method: it automatically converts a column's type to category if fewer than 50% of its values are unique. That number is arbitrary, but since converting a type means moving data between numpy arrays, the gain has to outweigh the cost.

Let's see what happens in the data .
>>> mem_usage(df)
10.28 MB
>>> mem_usage(df.set_index(['country', 'year', 'sex', 'age']))
5.00 MB
>>> mem_usage(convert_df(df))
1.40 MB
>>> mem_usage(convert_df(df.set_index(['country', 'year', 'sex', 'age'])))
1.40 MB

By using this "smart" converter, the data frame uses almost 10 times less memory (7.34 times, to be exact).
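The effect is easy to reproduce on a synthetic frame. Exact sizes vary by platform and pandas version, so this sketch only checks the direction of the change:

```python
import pandas as pd
import numpy as np

# 100k rows of a low-cardinality string column, as with country names.
s = pd.Series(np.random.choice(['Switzerland', 'Poland', 'France'],
                               size=100_000))

as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)

print(f'object:   {as_object / 1024 ** 2:.2f} MB')
print(f'category: {as_category / 1024 ** 2:.2f} MB')
assert as_category < as_object  # the categorical copy is much smaller
```

With only three distinct strings, the categorical version stores three strings plus 100k one-byte codes, instead of 100k full Python string objects.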


Indexing

Pandas is powerful, but that power comes at a price. When you load a DataFrame, it creates an index and stores the data in numpy arrays. What does that mean? Once the data frame is loaded, you can access the data very quickly, as long as the index is managed properly.

There are two main ways to access the data: through the index and through a query. Depending on the situation, you may only have one choice, but in most cases the index (and multi-index) is the best option. Consider the following example:
>>> %%time
>>> df.query('country == "Albania" and year == 1987 and sex == "male" and age == "25-34 years"')
CPU times: user 7.27 ms, sys: 751 µs, total: 8.02 ms
# ==================
>>> %%time
>>> mi_df.loc['Albania', 1987, 'male', '25-34 years']
CPU times: user 459 µs, sys: 1 µs, total: 460 µs

What? A 20× speed-up?

So you'll ask yourself: how long does it take to create this multi-index?
%%time
mi_df = df.set_index(['country', 'year', 'sex', 'age'])
CPU times: user 10.8 ms, sys: 2.2 ms, total: 13 ms

That's about 1.5 times the cost of accessing the data through a query. So if you only need to retrieve the data once (which rarely happens), a query is the right method. Otherwise, stick with the index; your CPU will thank you for it.

.set_index(drop=False) keeps the columns used for the new index instead of deleting them.

The .loc[]/.iloc[] methods are very good at reading data frames, but not at modifying them. If you need to build a data frame by hand (for example in a loop), consider other data structures first (a dictionary, a list, and so on) and create the DataFrame once all the data is ready. Otherwise, Pandas has to update the index for every new row of the DataFrame, and the index is not a simple hash map.
>>> (pd.DataFrame({'a': range(2), 'b': range(2)}, index=['a', 'a'])
     .loc['a'])
   a  b
a  0  0
a  1  1
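The build-then-construct advice above can be sketched as follows: collect rows in a plain Python list and create the DataFrame in one shot, instead of growing it line by line.

```python
import pandas as pd

# Collect rows in a cheap Python structure first...
rows = []
for i in range(5):
    rows.append({'a': i, 'b': i ** 2})

# ...then build the DataFrame in one shot, so the index is created once.
df = pd.DataFrame(rows)
print(df.shape)  # (5, 2)
```

Appending to the DataFrame inside the loop would force Pandas to rebuild the index on every iteration; this version defers all of that work to a single constructor call.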

Because of this, an unsorted index can degrade performance. To check whether an index is sorted, and to sort it, there are two main methods:
%%time
>>> mi_df.sort_index()
CPU times: user 34.8 ms, sys: 1.63 ms, total: 36.5 ms
>>> mi_df.index.is_monotonic
True

For more details, see:
  • the Pandas advanced indexing user guide:
  • the indexing code in the Pandas library:

Method chaining

Method chaining on a DataFrame means linking multiple methods that each return a DataFrame, so they are all methods of the DataFrame class. In current versions of Pandas, the purpose of method chaining is to avoid storing intermediate variables and to avoid code like the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a_column': [1, -999, -999],
                   'powerless_column': [2, 3, 4],
                   'int_column': [1, 1, -1]})
df['a_column'] = df['a_column'].replace(-999, np.nan)
df['power_column'] = df['powerless_column'] ** 2
df['real_column'] = df['int_column'].astype(np.float64)
df = df.apply(lambda _df: _df.replace(4, np.nan))
df = df.dropna(how='all')

Replace it with the chain below:
df = (pd.DataFrame({'a_column': [1, -999, -999],
                    'powerless_column': [2, 3, 4],
                    'int_column': [1, 1, -1]})
      .assign(a_column=lambda _df: _df['a_column'].replace(-999, np.nan))
      .assign(power_column=lambda _df: _df['powerless_column'] ** 2)
      .assign(real_column=lambda _df: _df['int_column'].astype(np.float64))
      .apply(lambda _df: _df.replace(4, np.nan))
      .dropna(how='all')
     )

Honestly, the second version is prettier and more concise.


The method-chaining toolbox consists of the methods (such as apply, assign, loc, query, pipe, groupby, and agg) whose output is a DataFrame or Series object (or a DataFrameGroupBy).

The best way to understand them is to actually use them. A simple example:
(df
 .groupby('age')
 .agg({'generation': 'unique'})
 .rename(columns={'generation': 'unique_generation'})
# Recommended from v0.25
# .agg(unique_generation=('generation', 'unique'))
)

A simple chain that gets all unique generation labels in each age range.
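The commented "recommended from v0.25" form is named aggregation. A runnable sketch on a toy frame (the generation labels here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'age': ['15-24', '15-24', '25-34'],
                   'generation': ['Gen X', 'Millennials', 'Gen X']})

# Named aggregation (pandas >= 0.25): new_column=(source_column, function).
out = df.groupby('age').agg(unique_generation=('generation', 'unique'))
print(out)
```

The named form avoids the agg-then-rename dance: the output column name, the source column, and the aggregation function are all declared in one place.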


In the resulting data frame, the "age" column is the index.

Besides learning that "Generation X" covers three age groups, let's break down the chain. The first step groups by age group. This method returns a DataFrameGroupBy object, and each group is then aggregated by selecting that group's unique generation labels.

In this case, the aggregation method is "unique", but agg can also accept any (anonymous) function.

In version 0.25, Pandas introduced a new way of using agg:
(df
 .groupby(['country', 'year'])
 .agg({'suicides_per_100k': 'sum'})
 .rename(columns={'suicides_per_100k': 'suicides_sum'})
# Recommended from v0.25
# .agg(suicides_sum=('suicides_per_100k', 'sum'))
 .sort_values('suicides_sum', ascending=False)
 .head(10)
)

Getting the 10 country/year pairs with the highest suicide counts using sort_values and head.
(df
 .groupby(['country', 'year'])
 .agg({'suicides_per_100k': 'sum'})
 .rename(columns={'suicides_per_100k': 'suicides_sum'})
# Recommended from v0.25
# .agg(suicides_sum=('suicides_per_100k', 'sum'))
 .nlargest(10, columns='suicides_sum')
)

Getting the 10 country/year pairs with the highest suicide counts using nlargest.

In both examples, the output is the same: a DataFrame with a two-level MultiIndex (country and year), plus a new column, suicides_sum, containing the 10 largest values in sorted order.


The "country" and "year" columns are the index.

nlargest(10) is more efficient than sort_values(ascending=False).head(10).
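A toy check (on made-up data) that nlargest and the sort-then-head pattern agree:

```python
import pandas as pd

df = pd.DataFrame({'suicides_sum': [5, 42, 17, 99, 3]},
                  index=['a', 'b', 'c', 'd', 'e'])

top2_sort = df.sort_values('suicides_sum', ascending=False).head(2)
top2_nlargest = df.nlargest(2, columns='suicides_sum')

# Same rows either way; nlargest just avoids sorting the whole frame.
assert top2_sort.equals(top2_nlargest)
print(top2_nlargest.index.tolist())  # ['d', 'b']
```

nlargest only has to track the top n values rather than produce a full ordering, which is where the efficiency gain comes from.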

Another interesting method is unstack, which lets you rotate an index level into columns.

(mi_df
 .loc[('Switzerland', 2000)]
 .unstack('sex')
 [['suicides_no', 'population']])


"age" is the index, and the columns "suicides_no" and "population" gain a second column level, "sex".
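A self-contained sketch of unstack on a toy two-level index, mirroring the age × sex shape above (the numbers are invented):

```python
import pandas as pd

df = pd.DataFrame({'age': ['15-24', '15-24', '25-34', '25-34'],
                   'sex': ['male', 'female', 'male', 'female'],
                   'suicides_no': [10, 4, 20, 8]}).set_index(['age', 'sex'])

# unstack('sex') rotates the 'sex' index level into a column level,
# leaving 'age' as the row index.
wide = df.unstack('sex')
print(wide)
```

After the unstack, each cell is addressed by a row label ('15-24') and a two-level column label (('suicides_no', 'male')).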

The next method, pipe, is one of the most versatile. It lets you pipeline operations (as in shell scripts) and do more than chaining alone allows.

A simple but powerful use of pipe is to log different pieces of information:
def log_head(df, head_count=10):
    print(df.head(head_count))
    return df

def log_columns(df):
    print(df.columns)
    return df

def log_shape(df):
    print(f'shape = {df.shape}')
    return df

Different logging functions to use together with pipe.
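A runnable sketch wiring one of these loggers into a chain, on toy data. The key design point is that the logger returns the frame unchanged, so the chain continues:

```python
import pandas as pd

def log_shape(df):
    print(f'shape = {df.shape}')
    return df  # returning the frame keeps the chain going

df = pd.DataFrame({'sex': ['male', 'female', 'female'], 'n': [1, 2, 3]})

out = (df
       .pipe(log_shape)            # prints shape = (3, 2)
       .query('sex == "female"')
       .pipe(log_shape))           # prints shape = (2, 2)
```

Because pipe passes the frame through, you can drop loggers into any point of a chain without changing its result.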

For example, suppose we want to verify that country_year is consistent with year:
import re

(df
 .assign(valid_cy=lambda _serie: _serie.apply(
     lambda _row: re.split(r'(?=\d{4})', _row['country_year'])[1] == str(_row['year']),
     axis=1))
 .query('valid_cy == False')
 .pipe(log_shape))

A pipe used to verify the year in the "country_year" column.

The output of the pipeline is a DataFrame, but it also prints to standard output (console/REPL):
shape = (0, 13)

You can also chain several different pipes:
(df
 .pipe(log_shape)
 .query('sex == "female"')
 .groupby(['year', 'country'])
 .agg({'suicides_per_100k': 'sum'})
 .pipe(log_shape)
 .rename(columns={'suicides_per_100k': 'sum_suicides_per_100k_female'})
# Recommended from v0.25
# .agg(sum_suicides_per_100k_female=('suicides_per_100k', 'sum'))
 .nlargest(n=10, columns=['sum_suicides_per_100k_female']))

The countries and years with the highest numbers of female suicides.

The resulting DataFrame looks like this:


The index is "year" and "country".

The standard output prints as follows:
shape = (27820, 12)
shape = (2321, 1)

Besides logging to the console, pipe can also apply functions directly to the columns of the data frame:
from sklearn.preprocessing import MinMaxScaler

def norm_df(df, columns):
    return df.assign(**{col: MinMaxScaler().fit_transform(
        df[[col]].values.astype(float)) for col in columns})

for sex in ['male', 'female']:
    print(sex)
    print(
        df
        .query(f'sex == "{sex}"')
        .groupby(['country'])
        .agg({'suicides_per_100k': 'sum', 'gdp_year': 'mean'})
        .rename(columns={'suicides_per_100k': 'suicides_per_100k_sum',
                         'gdp_year': 'gdp_year_mean'})
        # Recommended in v0.25
        # .agg(suicides_per_100k=('suicides_per_100k_sum', 'sum'),
        #      gdp_year=('gdp_year_mean', 'mean'))
        .pipe(norm_df, columns=['suicides_per_100k_sum', 'gdp_year_mean'])
        .corr(method='spearman')
    )
    print('\n')

Is the number of suicides correlated with a country's GDP? Does it depend on gender?

The code above prints the following in the console:
male
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.421218
gdp_year_mean                       0.421218       1.000000

female
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.452343
gdp_year_mean                       0.452343       1.000000

Let's dig into the code. norm_df() takes as input a DataFrame and a list of columns to scale with MinMaxScaler. Using a dictionary comprehension, it creates a dictionary {column_name: scaled_column, ...}, which is then unpacked into the arguments of assign() (column_name=scaled_column, ...).

In this particular case, min-max scaling does not change the output of the correlation (Spearman correlation is based on ranks); it is used only for the sake of the example.
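The claim that min-max scaling leaves the Spearman correlation unchanged is easy to check on toy data, since Spearman only looks at ranks and min-max scaling is a monotonic transform:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 5.0, 3.0, 9.0],
                   'y': [2.0, 8.0, 1.0, 7.0]})

def min_max(s):
    # Monotonic rescaling of a Series into [0, 1].
    return (s - s.min()) / (s.max() - s.min())

before = df.corr(method='spearman')
after = df.apply(min_max).corr(method='spearman')

# Monotonic transforms preserve ranks, so the matrices are identical.
assert before.equals(after)
```

The same would not hold for Pearson correlation only if the transform were non-linear; min-max scaling is affine, so even Pearson would survive it here, but Spearman survives any monotonic rescaling.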

In the (distant?) future, lazy evaluation may come to method chains, so investing in chaining could be a good idea.

Last (random) tips

The following tips are useful but didn't fit in any of the previous sections:

itertuples() traverses the rows of a data frame much more efficiently than iterrows():
>>> %%time
>>> for row in df.iterrows(): continue
CPU times: user 1.97 s, sys: 17.3 ms, total: 1.99 s
>>> for tup in df.itertuples(): continue
CPU times: user 55.9 ms, sys: 2.85 ms, total: 58.8 ms

Careful: tup is a namedtuple.

  • join() uses merge() under the hood;
  • in a Jupyter notebook, write %%time at the start of a cell to measure its execution time;
  • the UInt8 dtype supports integer columns with NaN values;
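A sketch of the nullable integer dtype mentioned above (note the capital "U": the plain numpy uint8 cannot hold missing values):

```python
import pandas as pd
import numpy as np

# A plain int column silently becomes float64 when a NaN appears...
floats = pd.Series([1, np.nan, 3])
print(floats.dtype)  # float64

# ...while the nullable 'UInt8' extension dtype keeps integers alongside <NA>.
ints = pd.Series([1, np.nan, 3], dtype='UInt8')
print(ints.dtype)    # UInt8
print(ints.isna().tolist())  # [False, True, False]
```

This matters for memory (one byte per value plus a mask, instead of eight) and for correctness, since the non-missing values stay exact integers.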

Remember that any intensive I/O (such as flattening a big CSV dump) will perform better with lower-level methods (use Python core functions as much as possible).

There are also some useful methods and data structures not covered in this article, and they are well worth taking the time to understand:

Pivot tables:

Time series / date functionality:

Plotting:


I hope this short article helped you better understand how Pandas works and the current state of the library's development. It also showed different tools for optimizing data frame memory usage and for quickly analyzing data. I hope the concepts of indexing and querying are clearer to you now. Finally, you can also try writing longer chains with method chaining.

A few notes:

In addition to all the code in this article, it also includes timing benchmarks of the simple-index data frame (df) and the multi-index data frame (mi_df).


Practice makes perfect, so keep honing your skills and help us build a better world.

PS: sometimes pure NumPy is faster.

Original article link:

This article was created by [A5 Huang Zhong]. Please include a link to the original when reposting. Thanks.
