**Descriptive statistics** is about describing and aggregating data. It uses two main approaches:

The quantitative approach describes and summarizes data numerically.

The visual approach illustrates data through charts, diagrams, histograms, and other graphics.

In a typical data analysis workflow, you don't jump straight from getting the data to modeling. You first do a descriptive analysis to get a general grasp of the data, and many subsequent modeling decisions are driven by that descriptive analysis. So besides Excel and R, we can also do descriptive analysis in Python.

This article explains in detail how to do the quantitative part of descriptive analysis with Python:

Mean

Median

Variance

Standard deviation

Skewness

Percentiles

Correlation

For the visualization part, see my earlier articles on pyecharts; echarts and ggplot2 will of course be introduced later.

### Python libraries involved

Python's built-in `statistics` module provides descriptive statistics. You can use it if your data set is not too large, or if you can't rely on importing other libraries.

NumPy is a third-party library for numerical computing, optimized for working with one-dimensional and multi-dimensional arrays. Its main type is the array type `ndarray`, and the library contains many methods for statistical analysis.

SciPy is a third-party library for scientific computing built on NumPy. Compared with NumPy, it provides additional functionality, including statistical analysis in `scipy.stats`.

Pandas is a third-party library for numerical computing built on NumPy. It excels at handling labeled one-dimensional (1D) data with `Series` objects and two-dimensional (2D) data with `DataFrame` objects.

Matplotlib is a third-party library for data visualization, usually used in combination with NumPy, SciPy, and Pandas.

**Start**

First, import all the packages:

```python
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd
```

**Create the data**

`x` and `x_with_nan` are both lists. The difference is that `x_with_nan` contains a `nan` value, that is, a null (missing) value; such data is very common in analysis. In Python, a `nan` value can be created in any of the following ways:

```python
float('nan')
math.nan
np.nan
```

The null values created by these three methods are all equivalent.

But are they really equal? Two `nan` values are not equal to each other; in other words, they cannot be compared. More on that later.
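A quick sketch of that behavior: `nan` compares unequal even to itself, so equality checks cannot detect missing values; use `math.isnan()` (or `np.isnan()`) instead.

```python
import math

import numpy as np

# All three spellings produce the same kind of float nan value
a, b, c = float('nan'), math.nan, np.nan

# nan never compares equal -- not even to itself
print(a == b)  # False
print(c == c)  # False

# Test for missing values with isnan instead of ==
print(math.isnan(a), np.isnan(c))  # True True
```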

Next, we use numpy and pandas to create the corresponding numpy arrays and pandas Series.
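The data-creation code did not survive formatting, so here is a minimal sketch; the values are inferred from the results quoted later in this article (mean 8.7, median 4, variance 123.2):

```python
import math

import numpy as np
import pandas as pd

# Plain Python lists; one of them contains a missing value
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# 1D numpy arrays
y, y_with_nan = np.array(x), np.array(x_with_nan)

# pandas Series
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
```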

**Mean**

The definition of the mean needs no introduction. In R, `mean()` does it directly. But in Python, without importing any package, how do you compute it?
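The pure-Python snippet did not survive formatting; a minimal sketch, using the sample list from this article:

```python
# The sample list used throughout this article
x = [8.0, 1, 2.5, 4, 28.0]

# The mean is the sum of the values divided by their count
mean_ = sum(x) / len(x)
print(mean_)  # 8.7
```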

You can also use Python's built-in statistics module.

But if the data contains nan, it will return nan:

```python
>>> mean_ = statistics.mean(x_with_nan)
>>> mean_
nan
```

With numpy:

```python
>>> mean_ = np.mean(y)
>>> mean_
8.7
```

In the example above, `mean()` is a function, but you can also use the corresponding method:

```python
>>> mean_ = y.mean()
>>> mean_
8.7
```

If the data contains nan, numpy will also return nan. So if you want to ignore nan, use np.nanmean():

```python
>>> np.mean(y_with_nan)
nan
>>> np.nanmean(y_with_nan)
8.7
```

pandas has a corresponding method too; however, by default, `.mean()` in pandas ignores nan values:

```python
>>> mean_ = z.mean()
>>> mean_
8.7
>>> z_with_nan.mean()
8.7
```

**Median**

Comparing the mean and the median is one way to detect outliers and asymmetry in the data. Whether the mean or the median is more useful depends on the context of the particular problem. First, without using any package:

```python
>>> n = len(x)
>>> if n % 2:
...     median_ = sorted(x)[round(0.5*(n-1))]
... else:
...     x_ord, index = sorted(x), round(0.5 * n)
...     median_ = 0.5 * (x_ord[index-1] + x_ord[index])
...
>>> median_
4
```

Other ways to do it:

```python
>>> median_ = np.median(y)
>>> median_
4.0
>>> np.nanmedian(y_with_nan)
4.0
```

**Variance**

The meaning of variance needs no lengthy explanation. In Excel you just call the built-in function directly, but how do you compute it in Python? I remember being asked in a postgraduate re-examination: how do you compute the variance in Python without importing any package?

```python
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> var_
123.19999999999999
```

Of course, a simpler way is to use the function directly; however, if there is a nan it will return nan:

```python
>>> var_ = statistics.variance(x)
>>> var_
123.2
>>> statistics.variance(x_with_nan)
nan
```

It's even simpler in numpy: use np.var() or the .var() method:

```python
>>> var_ = np.var(y, ddof=1)
>>> var_
123.19999999999999
>>> var_ = y.var(ddof=1)
>>> var_
123.19999999999999
```

Here ddof is the delta degrees of freedom; setting it to 1 gives the unbiased estimate, i.e. the denominator uses n-1 instead of n. What if there is a nan? You get nan back, but you can use np.nanvar() to skip nan; just remember to still set ddof to 1:

```python
>>> np.var(y_with_nan, ddof=1)
nan
>>> y_with_nan.var(ddof=1)
nan
>>> np.nanvar(y_with_nan, ddof=1)
123.19999999999999
```

**Standard deviation**

Once you have the variance, the standard deviation is easy to compute:

```python
# Direct calculation
>>> std_ = var_ ** 0.5
>>> std_
11.099549540409285

# Using the built-in package
>>> std_ = statistics.stdev(x)
>>> std_
11.099549540409287
```

It's also very easy to compute with numpy:

```python
>>> np.std(y, ddof=1)
11.099549540409285
>>> y.std(ddof=1)
11.099549540409285
>>> np.std(y_with_nan, ddof=1)
nan
>>> y_with_nan.std(ddof=1)
nan
>>> np.nanstd(y_with_nan, ddof=1)  # skips nan; ddof is still 1
11.099549540409285
```

**Skewness (skew)**

Skewness, also called the coefficient of skewness, measures the direction and degree of asymmetry of a statistical distribution; it is a numerical characteristic of how asymmetric the data's distribution is. Skewness is defined using the third moment; the formula is:
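Reconstructed from the pure-Python calculation later in this section, the formula is the adjusted Fisher-Pearson sample skewness:

```latex
g_1 = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}
```

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation (computed with $n-1$ in the denominator).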

The data we studied before were fairly symmetric, but the image above shows an asymmetric pair of data sets: the first group plotted as green dots, the second as white dots. Generally, a negative skewness value indicates a dominant tail on the left, as in the first group. A positive skewness value corresponds to a longer or fatter tail on the right, as in the second group. If the skewness is close to 0 (for example, between -0.5 and 0.5), the data set is considered reasonably symmetric.

So without relying on third-party packages, how do you compute skewness? You can first compute the size of the data set n, the sample mean, and the standard deviation, then apply the formula:

```python
>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> std_ = var_ ** 0.5
>>> skew_ = (sum((item - mean_)**3 for item in x)
...          * n / ((n - 1) * (n - 2) * std_**3))
>>> skew_
1.9470432273905929
```

The skewness is positive, so the tail of x is on the right.

You can also compute it with third-party packages:

```python
>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> scipy.stats.skew(y, bias=False)
1.9470432273905927
>>> scipy.stats.skew(y_with_nan, bias=False)
nan
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
>>> z.skew()
1.9470432273905924
>>> z_with_nan.skew()
1.9470432273905924
```

**Percentiles**

If you sort a data set from smallest to largest and compute the corresponding cumulative percentages, the value at a given percentage is called the percentile for that percentage. In other words, for a group of n observations sorted by value, the value at the p% position is the p-th percentile. Every data set has three quartiles, the percentiles that divide it into four parts:

The first quartile (Q1), also called the lower quartile, is the value below which 25% of the sorted sample falls.

The second quartile (Q2), also called the median, is the value below which 50% of the sorted sample falls.

The third quartile (Q3), also called the upper quartile, is the value below which 75% of the sorted sample falls.

The difference between the third quartile and the first quartile is called the interquartile range (InterQuartile Range, IQR).
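As a sketch, the IQR can be computed from the quartiles with numpy (using the nine-element sample introduced in the next code snippet):

```python
import numpy as np

# The nine-element sample from the percentiles section
y = np.array([-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0])

# Quartiles via numpy's default linear interpolation
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1  # interquartile range
print(q1, q3, iqr)  # 0.1 21.0 20.9
```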

So how do you compute quantiles in Python? You can use statistics.quantiles():

```python
>>> x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
>>> statistics.quantiles(x, n=2)
[8.0]
>>> statistics.quantiles(x, n=4, method='inclusive')
[0.1, 8.0, 21.0]
```

In the first line, 8.0 is the median of x, while in the second case, 0.1 and 21.0 are the 25% and 75% quantiles of the sample. The third-party package numpy can also be used:

```python
>>> np.percentile(y, [25, 50, 75])
array([ 0.1,  8. , 21. ])
>>> np.median(y)
8.0

# skipping nan
>>> y_with_nan = np.insert(y, 2, np.nan)
>>> y_with_nan
array([-5. , -1.1,  nan,  0.1,  2. ,  8. , 12.8, 21. , 25.8, 41. ])
>>> np.nanpercentile(y_with_nan, [25, 50, 75])
array([ 0.1,  8. , 21. ])
```

pandas can also compute this with `.quantile()`; you need to pass the quantile value(s) as a parameter, which can be a number or a sequence of numbers between 0 and 1:

```python
>>> z, z_with_nan = pd.Series(y), pd.Series(y_with_nan)
>>> z.quantile(0.05)
-3.44
>>> z.quantile(0.95)
34.919999999999995
>>> z.quantile([0.25, 0.5, 0.75])
0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64
>>> z_with_nan.quantile([0.25, 0.5, 0.75])
0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64
```

**Range**

The range of the data is the difference between the largest and smallest elements of the data set. You can get it with the function np.ptp():

```python
>>> np.ptp(y)
27.0
>>> np.ptp(z)
27.0
>>> np.ptp(y_with_nan)
nan
>>> np.ptp(z_with_nan)
27.0
```

**Descriptive statistics summary**

Both SciPy and Pandas provide a single function or method call to obtain descriptive statistics quickly.

```python
>>> result = scipy.stats.describe(y, ddof=1, bias=False)
>>> result
DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)
```

`describe()` returns the following information:

nobs: the number of observations (elements) in the data set

minmax: the minimum and maximum values of the data

mean: the mean of the data set

variance: the variance of the data set

skewness: the skewness of the data set

kurtosis: the kurtosis of the data set

```python
>>> result.nobs
9
>>> result.minmax[0]  # min
-5.0
>>> result.minmax[1]  # max
41.0
>>> result.mean
11.622222222222222
>>> result.variance
228.75194444444446
>>> result.skewness
0.9249043136685094
>>> result.kurtosis
0.14770623629658886
```

pandas has a similar function, .describe():

```python
>>> result = z.describe()
>>> result
count     9.000000   # number of elements in the data set
mean     11.622222   # mean of the data set
std      15.124548   # standard deviation of the data set
min      -5.000000
25%       0.100000   # quartiles of the data set
50%       8.000000
75%      21.000000
max      41.000000
dtype: float64
```

**Correlation**

The statistical meaning of correlation won't be belabored here, but be careful: correlation can only be judged from the data; it cannot establish causation!

Measuring correlation mainly uses the covariance and the correlation coefficient:
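Reconstructed from the pure-Python calculation below, the sample covariance formula is:

```latex
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```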

Let's first recreate the data:

```python
>>> x = list(range(-10, 11))
>>> y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
>>> x_, y_ = np.array(x), np.array(y)
>>> x__, y__ = pd.Series(x_), pd.Series(y_)
```

Computing the covariance:

```python
>>> n = len(x)
>>> mean_x, mean_y = sum(x) / n, sum(y) / n
>>> cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
...           / (n - 1))
>>> cov_xy
19.95
```

numpy and pandas both have cov() functions that return the covariance matrix:

```python
# numpy
>>> cov_matrix = np.cov(x_, y_)
>>> cov_matrix
array([[38.5       , 19.95      ],
       [19.95      , 13.91428571]])

# pandas
>>> cov_xy = x__.cov(y__)
>>> cov_xy
19.95
>>> cov_xy = y__.cov(x__)
>>> cov_xy
19.95
```

Computing the correlation coefficient

What we're talking about here is the Pearson correlation coefficient. The Pearson correlation coefficient measures whether two data sets lie on one line, i.e. the strength of the linear relationship between two interval variables. The formula is:
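Reconstructed from the standard definition, the Pearson correlation coefficient is the covariance scaled by the two standard deviations:

```latex
r_{xy} = \frac{s_{xy}}{s_x s_y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

In practice it can be computed with `scipy.stats.pearsonr(x_, y_)`, `np.corrcoef(x_, y_)`, or the pandas method `x__.corr(y__)`.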