Author: Rashida Nasrin Sucky | Compiled by: VK | Source: Towards Data Science
Python's pandas library is mainly used for data manipulation in data analysis, but it can also be used for data visualization. You do not even need to import the Matplotlib library yourself.
pandas uses Matplotlib as its backend and does the plotting for you, which makes it very easy to plot DataFrame columns. pandas offers a higher-level API than Matplotlib, so a plot takes fewer lines of code.
I will start with basic plots using arbitrary data, then move on to more advanced plots with a real dataset.
In this tutorial I use a Jupyter Notebook environment. If you do not have it installed, you can simply use a Google Colab notebook, where pandas is already installed for you.
Installing Jupyter Notebook locally is a good idea too.
For data scientists it is a great package, and it is free.
To install pandas, use:
pip install pandas
Or, in an Anaconda environment:
conda install pandas
Now you are ready.
Let's start with the basics.
First, import pandas. Then use it to create a basic Series and draw a line plot.
import pandas as pd
a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])
a.plot()
The most basic plot is ready. See how easy that was! We can improve it a little.
I will:
Change the figure size to make the plot bigger
Change the default blue color
Add a title
Change the default font size of the numbers on the axes
a.plot(figsize=(8, 6), color='green', title = 'Line Plot', fontsize=12)
We will learn more styling techniques throughout this tutorial.
Next, I will use the same data, a, to draw an area plot.
I can call the .plot method and pass the kind parameter to specify the type of plot I want, for example:
a.plot(kind='area')
Or I can write it like this:
a.plot.area()
Both of the approaches above create this plot:
Area plots are more meaningful and look better when they contain several variables. So I will create a couple more Series, combine them into a DataFrame, and draw an area plot from it.
b = pd.Series([45, 22, 12, 9, 20, 34, 28, 19, 26, 38, 41, 24, 14, 32])
c = pd.Series([25, 38, 33, 38, 23, 12, 30, 37, 34, 22, 16, 24, 12, 9])
d = pd.DataFrame({'a':a, 'b': b, 'c': c})
Let's draw an area plot from this DataFrame, d:
d.plot.area(figsize=(8, 6), title='Area Plot')
You do not have to accept the default colors. Let's change them and add some more styling.
d.plot.area(alpha=0.4, color=['coral', 'purple', 'lightgreen'], figsize=(8, 6), title='Area Plot', fontsize=12)
The alpha parameter makes the plot semi-transparent.
This is very useful for overlapping areas, histograms, and dense scatter plots.
plot() can draw 11 types of plots, selected with the kind parameter: line, bar, barh, hist, box, kde, density, area, pie, scatter, and hexbin.
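As a rough sketch of how the kind argument selects the plot type, the snippet below draws several kinds from one small, made-up DataFrame (the name demo and its values are assumptions for illustration only). The kde and density kinds are left out because they additionally require SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside a notebook
import pandas as pd

# Small made-up frame, for illustration only
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 3, 2, 1]})

# Kinds that work directly on numeric columns without extra arguments
simple_kinds = ["line", "bar", "barh", "hist", "box", "area"]
axes = {kind: demo.plot(kind=kind, title=kind) for kind in simple_kinds}

# scatter and hexbin need explicit x and y columns
axes["scatter"] = demo.plot(kind="scatter", x="x", y="y", title="scatter")
axes["hexbin"] = demo.plot(kind="hexbin", x="x", y="y", gridsize=5, title="hexbin")

# pie is drawn from a single column
axes["pie"] = demo["y"].plot(kind="pie", title="pie")
```

In a notebook the backend line can be dropped, since the figures are displayed inline.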
I want to show how to use all of these plot types. For that, I will use the CDC's NHANES dataset. I downloaded it and put it in the same folder as this Jupyter Notebook. Feel free to download the dataset and follow along: https://github.com/rashida048/Datasets/blob/master/nhanes_2015_2016.csv
Import the dataset:
df = pd.read_csv('nhanes_2015_2016.csv')
df.head()
This dataset has 30 columns and 5,735 rows.
Before you start plotting, it is important to check the columns of the dataset:
df.columns
Output:
Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210', 'DMDEDUC2x', 'DMDMARTLx'], dtype='object')
The column names may look strange, but don't worry. I will explain what each column means as we go. We will not use all of them, only a few to practice these plots.
I will start with a basic histogram of the weight of the population:
df['BMXWT'].hist()
As a reminder, a histogram shows a frequency distribution. The plot above shows that about 1,825 people weigh around 75 kg, and that most people weigh between 49 and 99 kg.
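To read such frequencies as numbers rather than off the picture, one option is to compute the bin counts directly. This is a minimal sketch on made-up weights (the weights Series is an assumption; with the real data you would use df['BMXWT'].dropna() instead). It uses 10 bins because that is the default of Series.hist().

```python
import numpy as np
import pandas as pd

# Made-up weights for illustration; with the real data use df['BMXWT'].dropna()
weights = pd.Series([52.0, 61.5, 68.0, 70.2, 74.9, 75.3, 80.1, 88.4, 95.0, 120.7])

# Series.hist() uses 10 bins by default, so use the same number here
counts, edges = np.histogram(weights, bins=10)

# Print each bin range with the number of values that fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```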
What if I want to put several histograms on one plot?
I will use weight, height, and body mass index (BMI) to draw three histograms in a single plot.
df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.hist(stacked=True, bins=20, fontsize=12, figsize=(10, 8))
If you want three separate histograms instead, that also takes just one line of code:
df[['BMXWT', 'BMXHT', 'BMXBMI']].hist(bins=20,figsize=(10, 8))
It can be even more dynamic!
The 'BPXSY1' column contains blood pressure data and the 'DMDEDUC2' column contains education data. If we want to examine the distribution of blood pressure for each education level, that can also be done in one line of code.
But first, I want to replace the values of the 'DMDEDUC2' column with more meaningful strings:
df["DMDEDUC2x"] = df.DMDEDUC2.replace({1: "less than 9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 7: "Refused", 9: "Don't know"})
Now draw the histograms:
df[['DMDEDUC2x', 'BPXSY1']].hist(by='DMDEDUC2x', figsize=(18, 12))
See! One line of code gives us the distribution of blood pressure for each education level.
Now let's look at how blood pressure varies with marital status. This time I will make a bar plot. As before, I will replace the values of the 'DMDMARTL' column with more meaningful strings.
df["DMDMARTLx"] = df.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married", 6: "Living w/partner", 77: "Refused"})
A bar plot requires some preprocessing: group the data by marital status and take the mean of each group. Here the data processing and the plotting happen in the same line of code.
df.groupby('DMDMARTLx')['BPXSY1'].mean().plot(kind='bar', rot=45, fontsize=10, figsize=(8, 6))
Here the rot parameter rotates the x-axis tick labels by 45 degrees. Otherwise they would be too cluttered.
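To see what the bar plot is actually drawing, it can help to look at the intermediate result of the groupby step on its own. The sketch below uses a tiny made-up sample (the sample frame and its values are assumptions; only the column names mirror the dataset).

```python
import pandas as pd

# Tiny made-up sample; only the column names mirror the dataset
sample = pd.DataFrame({
    "DMDMARTLx": ["Married", "Married", "Divorced", "Never married", "Divorced"],
    "BPXSY1": [120.0, 130.0, 140.0, 110.0, 150.0],
})

# Step 1: group the rows by marital status
# Step 2: average the blood pressure within each group
mean_bp = sample.groupby("DMDMARTLx")["BPXSY1"].mean()
print(mean_bp)
```

The result is a Series indexed by marital status, with one mean per group. Calling .plot(kind='bar') on it draws one bar per index entry.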
If you prefer, you can also make the bars horizontal:
df.groupby('DMDEDUC2x')['BPXSY1'].mean().plot(kind='barh', rot=45, fontsize=10, figsize=(8, 6))
I want to draw a bar plot with multiple variables. One column holds the ethnic origin of the population. It will be interesting to see whether weight, height, and BMI vary with ethnic origin.
To draw this plot, we need to group these three columns (weight, height, and BMI) by ethnic origin and take the mean.
df_bmx = df.groupby('RIDRETH1')[['BMXWT', 'BMXHT', 'BMXBMI']].mean().reset_index()
This time I did not replace the ethnic-origin codes; I kept the numbers as they are. Now let's plot:
df_bmx.plot(x='RIDRETH1',
            y=['BMXWT', 'BMXHT', 'BMXBMI'],
            kind='bar',
            color=['lightblue', 'red', 'yellow'],
            fontsize=10)
Ethnic group 4 appears slightly higher than the others, but the differences between groups are small.
We can also stack the different measures (weight, height, and BMI) on top of each other.
df_bmx.plot(x='RIDRETH1',
            y=['BMXWT', 'BMXHT', 'BMXBMI'],
            kind='bar', stacked=True,
            color=['lightblue', 'red', 'yellow'],
            fontsize=10)
I want to see whether there is a relationship between marital status and education level.
I need to group the data by education level and count the people with a recorded marital status at each level. That sounds wordy, right? Let's see:
df_edu_marit = df.groupby('DMDEDUC2x')['DMDMARTL'].count()
pd.Series(df_edu_marit)
A pie chart is easy to draw from this Series:
ax = pd.Series(df_edu_marit).plot.pie(subplots=True, label='',
    labels=['College Education', 'high school',
            'less than high school', 'Some college',
            'HS/GED', 'Unknown'],
    figsize=(8, 6),
    colors=['lightgreen', 'violet', 'coral', 'skyblue', 'yellow', 'purple'], autopct='%.2f')
I added some style parameters here. Feel free to try more of them.
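The count above is taken per education level only. If the goal is to look at marital status and education level together, one possible approach, which the original does not use, is a cross-tabulation with pd.crosstab. The sketch below runs on a small made-up sample (the sample frame is an assumption; the column names mirror the recoded columns).

```python
import pandas as pd

# Made-up sample; the column names mirror the recoded columns in the dataset
sample = pd.DataFrame({
    "DMDEDUC2x": ["College", "College", "HS/GED", "HS/GED", "9-11"],
    "DMDMARTLx": ["Married", "Never married", "Married", "Married", "Divorced"],
})

# Rows: education level, columns: marital status, cells: number of people
table = pd.crosstab(sample["DMDEDUC2x"], sample["DMDMARTLx"])
print(table)
```

A table like this could then be passed to .plot(kind='bar', stacked=True) to compare the marital status mix across education levels.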
For the box plot, I will use BMI, leg length, and arm length data.
color = {'boxes': 'DarkBlue', 'whiskers': 'coral',
         'medians': 'Black', 'caps': 'Green'}
df[['BMXBMI', 'BMXLEG', 'BMXARML']].plot.box(figsize=(8, 6), color=color)
For a simple scatter plot, I want to see whether there is any relationship between BMI ('BMXBMI') and blood pressure ('BPXSY1').
df.head(300).plot(x='BMXBMI', y= 'BPXSY1', kind = 'scatter')
I used only 300 rows because with all the data the scatter plot becomes too dense to read. Alternatively, you can use the alpha parameter to make the points semi-transparent.
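Note that head(300) takes the first 300 rows in file order. If a random subset is preferred, DataFrame.sample is one alternative. The sketch below compares the two on made-up data (the demo frame and its distributions are assumptions; only the column names mirror the dataset).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the dataset; only the column names mirror it
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "BMXBMI": rng.normal(28, 6, 1000),
    "BPXSY1": rng.normal(125, 18, 1000),
})

first_rows = demo.head(300)                        # the first 300 rows, in file order
random_rows = demo.sample(n=300, random_state=0)   # 300 rows drawn at random, reproducibly

# Either subset can be plotted the same way, for example:
# random_rows.plot(x='BMXBMI', y='BPXSY1', kind='scatter', alpha=0.3)
```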
Now let's draw a slightly more advanced scatter plot, still in a single line of code.
This time I will add color shading. I will put weight on the x-axis and height on the y-axis.
I will also add leg length, shown as the shade of each point: the longer the leg, the darker the shade.
df.head(500).plot.scatter(x= 'BMXWT', y = 'BMXHT', c ='BMXLEG', s=50, figsize=(8, 6))
This shows the relationship between weight and height, and you can also check whether leg length relates to height and weight.
Another way to add a third variable is to vary the size of the points. Here I put height on the x-axis and weight on the y-axis, with BMI determining the point size.
df.head(200).plot.scatter(x='BMXHT', y='BMXWT',
                          s=df['BMXBMI'][:200] * 7,
                          alpha=0.5, color='purple',
                          figsize=(8, 6))
Smaller dots indicate a lower BMI and larger dots indicate a higher BMI.
The hexbin plot is another attractive visualization, in which the points are replaced by hexagons. When the data is too dense, binning it is very useful. As you can see, in the previous two plots I used only 500 and 200 rows, because with the whole dataset the plots become too dense to read.
In such cases, showing the spatial distribution is very useful. With hexbin, the data is represented as hexagons: each hexagon is a bin whose color represents the density of points in it. Here is a basic hexbin example.
df.plot.hexbin(x='BMXARMC', y='BMXLEG', gridsize= 20)
Here, darker colors indicate a higher data density and lighter colors a lower density.
Does that sound like a histogram? It is, except that the counts are expressed as colors instead of bar heights.
If we add the extra parameter C, the result changes and is no longer like a histogram.
The C parameter specifies a value for each (x, y) coordinate. These values are accumulated for each hexagonal bin and then reduced with reduce_C_function. If reduce_C_function is not specified, np.mean is used by default. You can set it to np.mean, np.max, np.sum, np.std, and so on.
For more information, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html
Here is an example:
import numpy as np
df.plot.hexbin(x='BMXARMC', y='BMXLEG', C='BMXHT',
               reduce_C_function=np.max,
               gridsize=15,
               figsize=(8, 6))
Darker hexagons are bins where np.max returns a higher value, since I used np.max as reduce_C_function. We can also choose a different colormap:
df.plot.hexbin(x='BMXARMC', y='BMXLEG', C='BMXHT',
               reduce_C_function=np.max,
               gridsize=15,
               figsize=(8, 6),
               cmap='viridis')
It looks beautiful, right? And it carries a lot of information.
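To make the role of C and reduce_C_function more concrete, here is a simplified analogy that uses square bins and a pandas groupby in place of hexagons. This is not how hexbin is implemented. The pts frame and its values are made up, and the point is only to show how one bin yields a different number depending on the reducing function.

```python
import pandas as pd

# Made-up points: x and y are positions, c plays the role of the C column
pts = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 1.6, 1.7],
    "y": [0.1, 0.3, 0.2, 1.5, 1.9],
    "c": [10.0, 20.0, 60.0, 5.0, 7.0],
})

# Square bins of width 1 stand in for the hexagons (a simplification)
pts["x_bin"] = (pts["x"] // 1).astype(int)
pts["y_bin"] = (pts["y"] // 1).astype(int)

# count: what the color encodes when C is not given
# mean:  what the color encodes with C and the default reducing function
# max:   what the color encodes with C and reduce_C_function=np.max
per_bin = pts.groupby(["x_bin", "y_bin"])["c"].agg(["count", "mean", "max"])
print(per_bin)
```

In this sample, bin (0, 0) holds three points with values 10, 20, and 60, so it would be colored by 3 without C, by 30 with the mean, and by 60 with the maximum.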
The plots above are the basic ones people use when working with data day to day. But data scientists need more, and pandas also has some more advanced visualizations that provide a lot of information in a single line of code.
Scatter matrices are very useful: they provide a lot of information in one plot. They can be used for general data analysis or for feature engineering in machine learning. Let's start with an example; I will explain it afterward.
from pandas.plotting import scatter_matrix
scatter_matrix(df[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML']], alpha = 0.2, figsize=(10, 8), diagonal = 'kde')
I used five features here, and the plot shows the relationship between every pair of them. On the diagonal, it gives the density plot of each individual feature. The next example covers density plots in more detail.
A KDE plot, or kernel density plot, shows the probability distribution of a Series or of a DataFrame column. Let's look at the probability distribution of the weight variable ('BMXWT').
df['BMXWT'].plot.kde()
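For intuition about what a kernel density estimate computes, here is a minimal NumPy sketch of a Gaussian KDE. The data values and the hand-picked bandwidth are assumptions for illustration. pandas relies on SciPy and chooses the bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

# Made-up weights for illustration
data = np.array([60.0, 65.0, 70.0, 72.0, 75.0, 80.0, 95.0])
bandwidth = 5.0  # smoothing width, picked by hand here (an assumption)

# Evaluation grid that extends well beyond the data on both sides
grid = np.linspace(data.min() - 4 * bandwidth, data.max() + 4 * bandwidth, 500)

# One Gaussian bump per data point, evaluated at every grid position
z = (grid[:, None] - data[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# Sanity check: a probability density should integrate to about 1
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data points more closely.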
You can show several probability distributions in one plot. Here are the distributions of height, weight, and BMI in the same plot:
df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.kde(figsize = (8, 6))
You can also use the style parameters described earlier. I like to keep it simple.
Parallel coordinates are a great way to show multidimensional data, and they clearly show clusters, if there are any. For example, I want to see whether men and women differ in height, weight, and BMI. Let's check.
from pandas.plotting import parallel_coordinates
parallel_coordinates(df[['BMXWT', 'BMXHT', 'BMXBMI', 'RIAGENDR']].dropna().head(200), 'RIAGENDR', color=['blue', 'violet'])
You can see clear differences between men and women in weight, height, and BMI. Here, 1 is male and 2 is female.
The bootstrap plot is very important for research and statistical analysis, and it saves a lot of time. bootstrap_plot is used to assess the uncertainty of a statistic for a given dataset.
The function takes a random sample of a specified size and computes the mean, median, and mid-range of that sample. This process is repeated a specified number of times.
Here I create a bootstrap plot from the BMI data:
from pandas.plotting import bootstrap_plot
bootstrap_plot(df['BMXBMI'], size=100, samples=1000, color='skyblue')
Here the sample size is 100 and the number of samples is 1000. So each time, 100 data points are drawn at random to compute the mean, median, and mid-range, and the process is repeated 1000 times.
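The resampling procedure described above can be sketched by hand with NumPy. The values array below is made-up stand-in data, not the real BMI column, and the draws are made with replacement as in the classical bootstrap. The sampling details inside pandas' own bootstrap_plot may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up stand-in for df['BMXBMI'].dropna().to_numpy()
values = rng.normal(loc=29.0, scale=7.0, size=5000)

size, samples = 100, 1000  # same meaning as in the bootstrap_plot call
means, medians, midranges = [], [], []

for _ in range(samples):
    # Draw one random sample of the given size
    draw = rng.choice(values, size=size, replace=True)
    means.append(draw.mean())
    medians.append(np.median(draw))
    midranges.append((draw.min() + draw.max()) / 2)  # mid-range: halfway between min and max

# The spread of each list indicates how uncertain that statistic is
print(np.mean(means), np.std(means))
```

Plotting histograms of means, medians, and midranges would give a picture similar in spirit to the lower row of the bootstrap plot.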
For statisticians and researchers, this is an extremely important process, and a time-saving one.
I wanted to make a cheat sheet for data visualization in pandas. If you use Matplotlib and seaborn, there are many more options and visualization types, but these basic types are the ones we use day to day when working with data. Using pandas for them keeps your code simpler and saves many lines of code.
Original article: https://towardsdatascience.com/an-ultimate-cheat-sheet-for-data-visualization-in-pandas-4010e1b16b5c