A Cheat Sheet for Data Visualization in Pandas

Author: Rashida Nasrin Sucky | Compiled by: VK | Source: Towards Data Science

We mostly use Python's pandas library for data manipulation in data analysis, but pandas can also visualize data. You don't even need to import Matplotlib yourself, although it still has to be installed.

Pandas uses Matplotlib as its plotting backend and does the drawing for you, which makes it very easy to plot DataFrame columns. Pandas exposes a higher-level API than Matplotlib, so a plot takes fewer lines of code.

I'll start with basic plots on made-up data, then move on to more advanced plots with a real dataset.

In this tutorial I use a Jupyter Notebook environment. If you don't have it installed, you can simply use a Google Colab notebook, where pandas is already installed for you.

Installing Jupyter Notebook yourself is a good idea too.

It is a great package for data scientists, and it's free.

To install pandas, use:

pip install pandas

Or, in your Anaconda environment:

conda install pandas

Now you're ready.

Pandas visualization

We'll start with the basics.

Line plot

First, import pandas. Then use it to make a basic Series and draw a line plot.

import pandas as pd
a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])
a.plot()

The most basic plot is ready! See how easy that was. We can improve it.

I will:

Change the figure size to make the chart bigger

Change the default blue color

Add a title

Change the default font size of the tick labels on the axes

a.plot(figsize=(8, 6), color='green', title = 'Line Plot', fontsize=12)
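Every pandas plotting call returns the Matplotlib Axes it drew on, so anything the keyword arguments don't cover can be adjusted afterwards. The following is a minimal sketch of that idea; the axis label texts are illustrative assumptions, not part of the original example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import pandas as pd

a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])

# Series.plot returns the matplotlib Axes it drew on
ax = a.plot(figsize=(8, 6), color='green', title='Line Plot', fontsize=12)

# Further styling through the Axes (label texts are illustrative)
ax.set_xlabel('Index')
ax.set_ylabel('Value')
ax.grid(True)
```

Keeping the returned `ax` is also what lets you draw several pandas plots on the same axes later, by passing `ax=ax` to the next call.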

We'll learn more styling tricks as the tutorial goes on.

Area plot

I'll use the same data, a, to draw an area plot.

I can call .plot and pass the kind parameter to specify the type of plot I want, for example:

a.plot(kind='area')

Or I can write it like this:

a.plot.area()

Both methods create the same plot.

Area plots are more meaningful, and look better, when they contain several variables. So I'll create a couple more Series, build a DataFrame from them, and draw an area plot from it.

b = pd.Series([45, 22, 12, 9, 20, 34, 28, 19, 26, 38, 41, 24, 14, 32])
c = pd.Series([25, 38, 33, 38, 23, 12, 30, 37, 34, 22, 16, 24, 12, 9])
d = pd.DataFrame({'a':a, 'b': b, 'c': c})

Let's draw an area plot of this DataFrame, d:

d.plot.area(figsize=(8, 6), title='Area Plot')

You don't have to accept the default colors. Let's change them and add some more styling.

d.plot.area(alpha=0.4, color=['coral', 'purple', 'lightgreen'],figsize=(8, 6), title='Area Plot', fontsize=12)

The alpha parameter makes the plot translucent.

This is very useful when areas overlap, or for overlapping histograms and dense scatter plots.
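Transparency matters most when the areas really do overlap. By default `plot.area` stacks the columns on top of each other; passing `stacked=False` draws each column from zero, so the areas cover one another. A small sketch using the same three series (the title text is an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import pandas as pd

d = pd.DataFrame({
    'a': [40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3],
    'b': [45, 22, 12, 9, 20, 34, 28, 19, 26, 38, 41, 24, 14, 32],
    'c': [25, 38, 33, 38, 23, 12, 30, 37, 34, 22, 16, 24, 12, 9],
})

# stacked=False draws every column from the zero line, so areas overlap
# and alpha decides how much of the hidden areas stays visible
ax = d.plot.area(stacked=False, alpha=0.4,
                 color=['coral', 'purple', 'lightgreen'],
                 figsize=(8, 6), title='Unstacked Area Plot', fontsize=12)
```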

**plot()** can draw 11 types of plots:

  1. line
  2. area
  3. bar
  4. barh
  5. pie
  6. box
  7. hexbin
  8. hist
  9. kde
  10. density
  11. scatter
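As a quick way to compare several of these kinds, the same Series can be drawn once per kind on a grid of subplots. This is only a sketch: the grid layout is my own choice, and I leave out `kde`/`density` (two names for the same plot, which needs SciPy installed) as well as `scatter` and `hexbin` (which need a DataFrame with x and y columns).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])

# Kinds that work directly on a single Series without extra dependencies
kinds = ['line', 'area', 'bar', 'barh', 'box', 'hist']

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, kind in zip(axes.ravel(), kinds):
    # Passing ax= tells pandas to draw on an existing subplot
    a.plot(kind=kind, ax=ax, title=kind)
fig.tight_layout()
```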

I want to show how all these plot types are used, so I'll use the CDC's NHANES dataset. I downloaded the dataset and put it in the same folder as this Jupyter Notebook. Feel free to download it and follow along: https://github.com/rashida048/Datasets/blob/master/nhanes_2015_2016.csv

Import the dataset here:

df = pd.read_csv('nhanes_2015_2016.csv')
df.head()

This dataset has 30 columns and 5735 rows.

Before you start plotting, it's important to check the columns of the dataset:

df.columns

Output :

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210', 'DMDEDUC2x', 'DMDMARTLx'], dtype='object')

The column names may look strange, but don't worry: I'll explain what each column means as we go. We won't use all of them, only a few to practice these plots.

Histogram

I'll use the weight of the population to make a basic histogram:

df['BMXWT'].hist()

As a reminder, a histogram shows a frequency distribution. The plot above shows that about 1825 people weigh around 75, and that most people weigh between 49 and 99.
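Reading counts off the bars is approximate. If you need the exact numbers behind a histogram, NumPy can compute the same kind of binning. The sketch below uses synthetic weights as a stand-in for the 'BMXWT' column (the data and the column names of the result table are assumptions) and 10 bins, which is the default for `hist()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['BMXWT']; the real column comes from the CSV
rng = np.random.default_rng(0)
weights = pd.Series(rng.normal(loc=80, scale=20, size=1000))
weights.iloc[::50] = np.nan  # mimic missing measurements

# np.histogram does not accept NaN, so drop missing values first
counts, edges = np.histogram(weights.dropna(), bins=10)

# One row per bin: its range and how many values fall into it
table = pd.DataFrame({
    'bin_start': edges[:-1].round(1),
    'bin_end': edges[1:].round(1),
    'count': counts,
})
print(table)
```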

What if I want to put several histograms in one plot?

I'll use weight, height, and body mass index (BMI) to draw three histograms in one plot.

df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.hist(stacked=True, bins=20, fontsize=12, figsize=(10, 8))

But if you want three separate histograms, that also takes just one line of code:

df[['BMXWT', 'BMXHT', 'BMXBMI']].hist(bins=20,figsize=(10, 8))

It can get more dynamic!

The 'BPXSY1' column holds blood pressure data and the 'DMDEDUC2' column holds education level. If we want to examine the distribution of blood pressure for each education level, that can also be done in one line of code.

But first, I want to replace the values of the 'DMDEDUC2' column with more meaningful strings:

df["DMDEDUC2x"] = df.DMDEDUC2.replace({1: "less than 9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 7: "Refused", 9: "Don't know"})

Now draw the histograms:

df[['DMDEDUC2x', 'BPXSY1']].hist(by='DMDEDUC2x', figsize=(18, 12))

See! One line of code gives us the distribution of blood pressure for each education level.
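The per-group histograms have a numeric counterpart: grouping by the same column and calling `describe()` gives the count, mean, and quartiles for each education level. A minimal sketch on synthetic data that borrows the column names 'DMDEDUC2x' and 'BPXSY1' (the values are made up, not taken from NHANES):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two NHANES columns used above
rng = np.random.default_rng(6)
levels = ['less than 9', '9-11', 'HS/GED', 'Some college/AA', 'College']
df = pd.DataFrame({
    'DMDEDUC2x': rng.choice(levels, size=300),
    'BPXSY1': rng.normal(125, 18, size=300).round(),
})

# One row of summary statistics per education level
summary = df.groupby('DMDEDUC2x')['BPXSY1'].describe()
print(summary)
```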

Bar chart

Now let's look at how blood pressure changes with marital status. This time I'll make a bar chart. As before, I'll replace the values of the 'DMDMARTL' column with more meaningful strings.

df["DMDMARTLx"] = df.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married", 6: "Living w/partner", 77: "Refused"})

To draw a bar chart we need to preprocess the data: group it by marital status and take the mean of each group. Here a single line of code handles both the data processing and the plotting.

df.groupby('DMDMARTLx')['BPXSY1'].mean().plot(kind='bar', rot=45, fontsize=10, figsize=(8, 6))

Here the rot parameter rotates the x tick labels by 45 degrees. Otherwise they would be too cluttered.
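Because the grouped means are an ordinary Series, they can be sorted before plotting, which often makes a bar chart easier to read. A sketch on synthetic data that reuses the column names 'DMDMARTLx' and 'BPXSY1' (the values and the reduced set of categories are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

# Synthetic stand-in for the marital status and blood pressure columns
rng = np.random.default_rng(1)
statuses = ['Married', 'Widowed', 'Divorced', 'Never married']
df = pd.DataFrame({
    'DMDMARTLx': rng.choice(statuses, size=200),
    'BPXSY1': rng.normal(125, 15, size=200),
})

# Group, average, then sort so the bars go from lowest to highest mean
means = df.groupby('DMDMARTLx')['BPXSY1'].mean().sort_values()
ax = means.plot(kind='bar', rot=45, fontsize=10, figsize=(8, 6))
```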

If you like, you can also make it horizontal:

df.groupby('DMDEDUC2x')['BPXSY1'].mean().plot(kind='barh', rot=45, fontsize=10, figsize=(8, 6))

I want to draw a bar chart with multiple variables. One column holds the ethnic origin of the population. It will be interesting to see whether weight, height, and body mass index vary with ethnic origin.

To draw this plot, we need to group these three columns (weight, height, and body mass index) by ethnic origin and take the mean.

df_bmx = df.groupby('RIDRETH1')[['BMXWT', 'BMXHT', 'BMXBMI']].mean().reset_index()

This time I didn't replace the ethnic origin codes with labels; I kept the numbers as they are. Let's plot:

df_bmx.plot(x = 'RIDRETH1',
y=['BMXWT', 'BMXHT', 'BMXBMI'],
kind = 'bar',
color = ['lightblue', 'red', 'yellow'],
fontsize=10)

It seems that the fourth ethnic group is a little higher than the others, but there is no significant difference between the groups.

We can also stack the different measurements (weight, height, and body mass index) on top of each other.

df_bmx.plot(x = 'RIDRETH1',
y=['BMXWT', 'BMXHT', 'BMXBMI'],
kind = 'bar', stacked=True,
color = ['lightblue', 'red', 'yellow'],
fontsize=10)

Pie chart

I want to see whether there is a relationship between marital status and education level.

I need to group marital status by education level and count the people in each education-level group. Sounds too wordy, right? Let's see:

df_edu_marit = df.groupby('DMDEDUC2x')['DMDMARTL'].count()
pd.Series(df_edu_marit)

With this Series it's easy to draw a pie chart:

# Slice labels are taken from the Series index, so they always match the groups
ax = pd.Series(df_edu_marit).plot.pie(subplots=True, label='',
figsize = (8, 6),
colors = ['lightgreen', 'violet', 'coral', 'skyblue', 'yellow', 'purple'], autopct = '%.2f')

Here I added some style parameters. Feel free to try more of them.
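When the slices are plain category counts, `value_counts()` is a shorter route to the same kind of Series, and the slice labels come straight from its index. A minimal sketch with a handful of made-up education values (the data is an assumption, not taken from the dataset):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import pandas as pd

# Made-up education levels standing in for df['DMDEDUC2x']
edu = pd.Series(['College', 'HS/GED', 'College', '9-11', 'Some college/AA',
                 'HS/GED', 'College', 'less than 9', 'Some college/AA', 'College'])

# value_counts returns counts indexed by category, most frequent first
counts = edu.value_counts()

# The index values become the slice labels; autopct prints percentages
ax = counts.plot.pie(autopct='%.2f', figsize=(8, 6))
ax.set_ylabel('')  # remove the default y-axis label next to the pie
```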

Box plot

For example, I'll make a box plot from the body mass index, leg length, and arm length data.

color = {'boxes': 'DarkBlue', 'whiskers': 'coral',
'medians': 'Black', 'caps': 'Green'}
df[['BMXBMI', 'BMXLEG', 'BMXARML']].plot.box(figsize=(8, 6),color=color)

Scatter plot

For a simple scatter plot, I want to see whether there is any relationship between body mass index ('BMXBMI') and blood pressure ('BPXSY1').

df.head(300).plot(x='BMXBMI', y= 'BPXSY1', kind = 'scatter')

I only use the first 300 rows, because with all the data the scatter plot becomes too dense to read. Alternatively, you can use the alpha parameter to make the points translucent.
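Both ways of taming a dense scatter plot can be sketched as follows. The data is synthetic (only the column names 'BMXBMI' and 'BPXSY1' are borrowed), and the alpha and marker size values are arbitrary starting points. Note that `sample()` picks random rows, while `head()` takes the first rows in file order.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

# Synthetic stand-in for the BMI and blood pressure columns
rng = np.random.default_rng(2)
n = 5000
bmi = rng.normal(29, 7, size=n)
df = pd.DataFrame({
    'BMXBMI': bmi,
    'BPXSY1': 100 + 0.8 * bmi + rng.normal(0, 15, size=n),
})

# Option 1: plot every row, but with small translucent points
ax = df.plot.scatter(x='BMXBMI', y='BPXSY1', alpha=0.1, s=10, figsize=(8, 6))

# Option 2: plot a random subset instead of the first rows
subset = df.sample(n=300, random_state=0)
ax_subset = subset.plot.scatter(x='BMXBMI', y='BPXSY1', figsize=(8, 6))
```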

Now let's draw a slightly more advanced scatter plot, still in one line of code.

This time I'll add color shading. I'll put weight on the x-axis and height on the y-axis.

I'll also add leg length, shown as the shade of each point: the longer the leg, the darker the shade.

df.head(500).plot.scatter(x= 'BMXWT', y = 'BMXHT', c ='BMXLEG', s=50, figsize=(8, 6))

It shows the relationship between weight and height. You can also see whether leg length is related to height and weight.

Another way to add a third variable is to vary the size of the points. Here I put height on the x-axis, weight on the y-axis, and use body mass index as the point size.

df.head(200).plot.scatter(x= 'BMXHT', y = 'BMXWT',
s =df['BMXBMI'][:200] * 7,
alpha=0.5, color='purple',
figsize=(8, 6))

Smaller dots indicate a lower BMI and larger dots a higher BMI.

Hexbin

This is another beautiful visualization, in which the points are hexagons. When the data is too dense, it's very useful to put it into bins. As you can see, in the previous two plots I used only 500 and 200 rows, because with the whole dataset the plot becomes too dense to understand or to get any information from.

In that case it's very useful to plot the spatial distribution instead. I'm using hexbin, which represents the data as hexagons. Each hexagon is a bin, and its color represents the density of points in that bin. Here is a basic hexbin example.

df.plot.hexbin(x='BMXARMC', y='BMXLEG', gridsize= 20)

Here, darker colors indicate higher data density and lighter colors lower density.

Does that sound like a histogram? Yes, right? Only it's expressed with color instead of bars.

If we add the extra parameter C, the distribution changes. It's no longer like a histogram.

The parameter C specifies a value at each (x, y) coordinate. These values are accumulated for each hexagonal bin and then reduced with reduce_C_function. If reduce_C_function is not specified, np.mean is used by default. You can set it to np.mean, np.max, np.sum, np.std, and so on.

For more information, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html

Here is an example:

import numpy as np
df.plot.hexbin(x='BMXARMC', y='BMXLEG', C = 'BMXHT',
reduce_C_function=np.max,
gridsize=15,
figsize=(8,6))

Darker hexagons mean a higher np.max value, since I used np.max as the reduce_C_function. We can also use a color map instead of the default shading:

df.plot.hexbin(x='BMXARMC', y='BMXLEG', C = 'BMXHT',
reduce_C_function=np.max,
gridsize=15,
figsize=(8,6),
cmap = 'viridis')

It looks beautiful, right? And it carries a lot of information.
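To check what reduce_C_function actually does, the reduced value of each hexagon can be read back from the plot. The sketch below uses synthetic data with hypothetical column names 'x', 'y', and 'c', and it assumes the hexagons are the first collection on the returned Axes. With np.max as the reducer, the largest bin value has to equal the largest value in the C column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

# Synthetic points; 'c' plays the role of the C column (e.g. height)
rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    'x': rng.normal(size=n),
    'y': rng.normal(size=n),
    'c': rng.uniform(0, 100, size=n),
})

ax = df.plot.hexbin(x='x', y='y', C='c', reduce_C_function=np.max,
                    gridsize=10, cmap='viridis', figsize=(8, 6))

# Each non-empty hexagon stores one reduced value (here: the max of 'c' in that bin)
bin_values = np.asarray(ax.collections[0].get_array(), dtype=float)
print(len(bin_values), 'non-empty bins, largest value', np.nanmax(bin_values))
```

Swapping np.max for np.mean or np.sum changes only how the values inside each hexagon are combined; the binning itself stays the same.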

Some advanced visualizations

Above I explained some of the basic plots people use in day-to-day work with data. But data scientists need more. The pandas library also has some more advanced visualizations that provide even more information in a single line of code.

Scatter matrix

Scatter matrices are very useful. They provide a lot of information in one plot, and can be used for general data analysis or for feature engineering in machine learning. Let's start with an example; I'll explain it afterwards.

from pandas.plotting import scatter_matrix
scatter_matrix(df[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML']], alpha = 0.2, figsize=(10, 8), diagonal = 'kde')

I used five features here and got the pairwise relationships between all five variables. On the diagonal, it gives the density plot of each individual feature. We'll talk more about density plots in the next example.
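A scatter matrix pairs naturally with a correlation matrix, which puts a number on each of the pairwise relationships. A sketch on synthetic data, where weight is derived from height and BMI so that the columns are related (the values are assumptions; only the column names are borrowed):

```python
import numpy as np
import pandas as pd

# Synthetic body measurements: weight follows from BMI and height
rng = np.random.default_rng(7)
n = 400
height = rng.normal(166, 10, size=n)   # cm
bmi = rng.normal(29, 6, size=n)
df = pd.DataFrame({
    'BMXHT': height,
    'BMXBMI': bmi,
    'BMXWT': bmi * (height / 100) ** 2,  # kg, from the BMI definition
})

# Pearson correlation between every pair of columns
corr = df[['BMXWT', 'BMXHT', 'BMXBMI']].corr()
print(corr.round(2))
```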

KDE or density plot

A KDE (kernel density estimate) plot shows the estimated probability distribution of a Series or of a DataFrame column. Let's look at the probability distribution of the weight variable ('BMXWT').

df['BMXWT'].plot.kde()
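To make the idea behind a kernel density estimate concrete, here is a simplified hand-rolled version: a Gaussian bump is placed on every data point and the bumps are averaged. This is a sketch, not what pandas runs internally (pandas delegates to SciPy); the helper name `gaussian_kde_1d`, the synthetic data, and the use of Scott's rule of thumb for the bandwidth are my assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

def gaussian_kde_1d(values, grid, bandwidth):
    """Average of Gaussian kernels centred on each value (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(values) * bandwidth)

# Synthetic stand-in for df['BMXWT']
rng = np.random.default_rng(4)
weights = rng.normal(80, 20, size=500)

# Scott's rule of thumb for one-dimensional data
bandwidth = weights.std(ddof=1) * len(weights) ** (-1 / 5)

grid = np.linspace(weights.min() - 4 * bandwidth,
                   weights.max() + 4 * bandwidth, 1000)
density = gaussian_kde_1d(weights, grid, bandwidth)

# A density integrates to 1; check with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(f'area under the curve: {area:.4f}')

# Plot the estimated density as an ordinary pandas line plot
ax = pd.Series(density, index=grid).plot(figsize=(8, 6), title='Manual KDE')
```

A smaller bandwidth gives a wigglier curve and a larger one a smoother curve, which is the main knob to experiment with.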

You can show several probability distributions in one plot. Here I plot the distributions of height, weight, and BMI together:

df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.kde(figsize = (8, 6))

You can also use the other style parameters described earlier. I like to keep it simple.

Parallel coordinates

This is a great way to show multidimensional data. It clearly shows clusters, if there are any. For example, I want to see whether men and women differ in height, weight, and body mass index. Let's check.

from pandas.plotting import parallel_coordinates
parallel_coordinates(df[['BMXWT', 'BMXHT', 'BMXBMI', 'RIAGENDR']].dropna().head(200), 'RIAGENDR', color=['blue', 'violet'])

You can see a clear difference between men and women in weight, height, and BMI. Here 1 is male and 2 is female.

Bootstrap plot

This is a very important plot for research and statistical analysis, and it saves a lot of time. bootstrap_plot is used to assess the uncertainty of a statistic of a given dataset.

The function draws a random sample of the specified size and computes the mean, median, and mid-range of that sample. This process is repeated a specified number of times.

Here I create a bootstrap plot from the BMI data:

from pandas.plotting import bootstrap_plot
bootstrap_plot(df['BMXBMI'], size=100, samples=1000, color='skyblue')

Here the sample size is 100 and the number of samples is 1000. So 100 data points are drawn at random to compute the mean, median, and mid-range, and the process is repeated 1000 times.
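The procedure described above can be sketched by hand with NumPy to see which numbers end up in the plot. This illustrates the idea under my own assumptions (synthetic BMI values, subsamples drawn without replacement), not pandas' exact implementation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['BMXBMI']
rng = np.random.default_rng(5)
bmi = pd.Series(rng.normal(29, 7, size=2000)).clip(lower=15)

size, samples = 100, 1000          # same settings as the call above
values = bmi.dropna().to_numpy()

rows = []
for _ in range(samples):
    # Draw one random subsample and record its three statistics
    sample = rng.choice(values, size=size, replace=False)
    rows.append({
        'mean': sample.mean(),
        'median': np.median(sample),
        'midrange': (sample.min() + sample.max()) / 2,
    })

results = pd.DataFrame(rows)

# The spread of each column shows how uncertain that statistic is
print(results.std())
```

Plotting `results.hist()` would give histograms comparable to the bottom row of the bootstrap plot.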

For statisticians and researchers this is an extremely important process, and a time saver.

Conclusion

I wanted to make a cheat sheet for data visualization in pandas. If you use Matplotlib and Seaborn, there are many more options and visualization types. But these basic plot types are the ones we use in day-to-day work with data, and plotting them with pandas keeps your code simpler and saves a lot of lines.

Link to the original text: https://towardsdatascience.com/an-ultimate-cheat-sheet-for-data-visualization-in-pandas-4010e1b16b5c

Copyright notice
This article was created by [Artificial intelligence meets pioneer]. Please include a link to the original when reposting. Thank you.