Descriptive statistics is about describing and summarizing data. It uses two main approaches:
The quantitative approach describes and summarizes data numerically.
The visualization approach illustrates data with charts, plots, histograms and other graphics.
Generally, in data analysis you don't jump straight into modeling once you have the data; you first do a descriptive analysis to get an overall feel for it, and many later modeling decisions are guided by that descriptive analysis. So besides Excel/R, we can also do descriptive analysis in Python.
This article explains in detail how to do the quantitative part of descriptive analysis in Python:
Mean
Median
Variance
Standard deviation
Skewness
Percentiles
Correlation
For the visualization part, please refer to my earlier article on pyecharts; echarts and ggplot2 approaches will be introduced later.
Python libraries involved
Python's built-in statistics module provides functions for descriptive statistics. If your data set is not too large, or if you cannot rely on third-party libraries, it is a good choice.
NumPy is a third-party library for numerical computing, optimized for working with one-dimensional and multi-dimensional arrays. Its main data type is the array type ndarray, and it contains many methods for statistical analysis.
SciPy is a third-party library for scientific computing built on top of NumPy. Compared with NumPy it offers additional functionality, including scipy.stats for statistical analysis.
Pandas is a third-party library for numerical computing built on top of NumPy. It excels at handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
Matplotlib is a third-party library for data visualization. It is usually used in combination with NumPy, SciPy and Pandas.
Start
First, import all the packages:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd
Create data
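Throughout the article we work with a small sample list; a minimal sketch of the data (the same list reappears verbatim in the skewness section, while the exact position of the nan inside x_with_nan is an assumption):
>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]  # position of the nan is assumed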
x and x_with_nan are both lists. The only difference is that x_with_nan contains a nan value, i.e. a null (missing) value, which is very common in real data. In Python, a nan value can be created in any of the following ways:
float('nan')
math.nan
np.nan
The nan values created by these three ways are equivalent. But are they really equal? In fact two nan values are never equal to each other; in other words, they cannot be compared, as the quick check below shows.
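A quick check of that claim (a nan never compares equal to anything, not even itself; use math.isnan() or np.isnan() to test for it):
>>> float('nan') == math.nan
False
>>> np.nan == np.nan
False
>>> math.isnan(np.nan)
True
>>> np.isnan(math.nan)
True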
Next, we use numpy and pandas to create the corresponding numpy arrays and pandas Series:
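A minimal sketch of that step (the same two lines reappear later in the skewness section, so only their placement here is assumed):
>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)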
Mean
The definition of the mean needs no introduction. In R you can simply call mean(); in Python, how do you compute it without importing any package?
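A minimal sketch using the list created above:
>>> mean_ = sum(x) / len(x)
>>> mean_
8.7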
You can also use Python's built-in statistics module:
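For example, with the same data:
>>> mean_ = statistics.mean(x)
>>> mean_
8.7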
But if the data contains nan, it will return nan:
>>> mean_ = statistics.mean(x_with_nan)
>>> mean_
nan
If you use numpy
>>> mean_ = np.mean(y)
>>> mean_
8.7
In the example above, mean() is used as a function, but you can also use the corresponding method:
>>> mean_ = y.mean()
>>> mean_
8.7
If the data contains nan, numpy will also return nan, so if you want to ignore nan you can use np.nanmean():
>>> np.mean(y_with_nan)
nan
>>> np.nanmean(y_with_nan)
8.7
pandas has a corresponding method as well; note that by default, .mean() in pandas ignores nan values:
>>> mean_ = z.mean()
>>> mean_
8.7
>>> z_with_nan.mean()
8.7
Median
Comparing the mean and the median is one way to detect outliers and asymmetry in the data. Whether the mean or the median is more useful depends on the context of the particular problem. First, without using any package:
>>> n = len(x)
>>> if n % 2:
... median_ = sorted(x)[round(0.5*(n-1))]
... else:
... x_ord, index = sorted(x), round(0.5 * n)
... median_ = 0.5 * (x_ord[index-1] + x_ord[index])
...
>>> median_
4
Other methods
>>> median_ = np.median(y)
>>> median_
4.0
>>> np.nanmedian(y_with_nan)
4.0
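pandas has a .median() method too; like .mean(), it skips nan by default (a quick check with the same data):
>>> z.median()
4.0
>>> z_with_nan.median()
4.0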
Variance
The meaning of variance needs little explanation; in Excel you can call the VAR (or STDEV, for the standard deviation) function directly, but how do you compute it in Python? I remember being asked in a postgraduate admission interview how to compute the variance in Python without importing any package:
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> var_
123.19999999999999
Of course, it is simpler to use a function directly; note, however, that if the data contains nan it will return nan:
>>> var_ = statistics.variance(x)
>>> var_
123.2
>>> statistics.variance(x_with_nan)
nan
It is even simpler with numpy: you can use np.var() or the .var() method:
>>> var_ = np.var(y, ddof=1)
>>> var_
123.19999999999999
>>> var_ = y.var(ddof=1)
>>> var_
123.19999999999999
Here ddof is the delta degrees of freedom; setting it to 1 gives the unbiased estimate, i.e. the denominator uses n - 1 instead of n. What if there is a nan? These calls return nan, but you can use np.nanvar() to skip nan values; ddof should still be set to 1:
>>> np.var(y_with_nan, ddof=1)
nan
>>> y_with_nan.var(ddof=1)
nan
>>> np.nanvar(y_with_nan, ddof=1)
123.19999999999999
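pandas Series have a .var() method as well; in pandas, ddof already defaults to 1 and nan values are skipped by default, so the output below is assumed to match the numpy result above:
>>> z.var()
123.19999999999999
>>> z_with_nan.var()
123.19999999999999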
Standard deviation
Once you have the variance, the standard deviation is easy to calculate:
# Direct calculation
>>> std_ = var_ ** 0.5
>>> std_
11.099549540409285
# Use the built-in package
>>> std_ = statistics.stdev(x)
>>> std_
11.099549540409287
It is just as easy to calculate with numpy:
>>> np.std(y, ddof=1)
11.099549540409285
>>> y.std(ddof=1)
11.099549540409285
>>> np.std(y_with_nan, ddof=1)
nan
>>> y_with_nan.std(ddof=1)
nan
>>> np.nanstd(y_with_nan, ddof=1) # skips nan; ddof is still 1
11.099549540409285
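The same goes for pandas: .std() uses ddof=1 and skips nan by default, so the output is assumed to match the numpy result above:
>>> z.std()
11.099549540409285
>>> z_with_nan.std()
11.099549540409285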
Skewness (skew)
Skewness, also called the coefficient of skewness, measures the direction and degree of asymmetry of the distribution of a data set; it is a numerical characterization of how asymmetric the distribution is. Skewness is defined using the third-order moment; the sample skewness can be written as n / ((n - 1)(n - 2)) * Σ(xᵢ - x̄)³ / s³, where x̄ is the sample mean and s is the sample standard deviation (this is the form used in the code below).
The data we have looked at so far were fairly symmetric. For asymmetric data sets, a negative skewness value indicates a dominant tail on the left, while a positive skewness value corresponds to a longer or fatter tail on the right. If the skewness is close to 0 (for example, between -0.5 and 0.5), the data set is considered fairly symmetric.
So how do you calculate skewness without relying on a third-party package? First compute the size of the data set n, the sample mean and the standard deviation, then apply the formula:
>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> std_ = var_ ** 0.5
>>> skew_ = (sum((item - mean_)**3 for item in x)
... * n / ((n - 1) * (n - 2) * std_**3))
>>> skew_
1.9470432273905929
The skewness is positive, so x has a longer tail on the right.
You can also calculate it with third-party packages:
>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> scipy.stats.skew(y, bias=False)
1.9470432273905927
>>> scipy.stats.skew(y_with_nan, bias=False)
nan
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
>>> z.skew()
1.9470432273905924
>>> z_with_nan.skew()
1.9470432273905924
Percentiles (Percentiles)
If you sort a data set from smallest to largest and compute the cumulative percentage at each value, then the value located at a given cumulative percentage p% is called the p-th percentile. In other words, for a group of n observations arranged in numerical order, the value at the p% position is the p-th percentile. Every data set has three quartiles, the percentiles that divide the data into four parts:
The first quartile (Q1), also called the lower quartile, is the value below which 25% of the sorted sample values fall.
The second quartile (Q2), also called the median, is the value below which 50% of the sorted sample values fall.
The third quartile (Q3), also called the upper quartile, is the value below which 75% of the sorted sample values fall.
The difference between the third and the first quartile is called the interquartile range (InterQuartile Range, IQR); a short example of computing it follows the quantile code below.
So how do you compute quantiles in Python? You can use statistics.quantiles() (available since Python 3.8):
>>> x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
>>> statistics.quantiles(x, n=2)
[8.0]
>>> statistics.quantiles(x, n=4, method='inclusive')
[0.1, 8.0, 21.0]
In the first call, 8.0 is the median of x; in the second, 0.1 and 21.0 are the 25% and 75% sample quantiles. You can also compute them with the third-party package numpy:
>>> np.percentile(y, [25, 50, 75])
array([ 0.1, 8. , 21. ])
>>> np.median(y)
8.0
# skip nan
>>> y_with_nan = np.insert(y, 2, np.nan)
>>> y_with_nan
array([-5. , -1.1, nan, 0.1, 2. , 8. , 12.8, 21. , 25.8, 41. ])
>>> np.nanpercentile(y_with_nan, [25, 50, 75])
array([ 0.1, 8. , 21. ])
pandas can also compute them with the .quantile() method; you need to pass the quantile value(s) as an argument, either a number between 0 and 1 or a sequence of such numbers.
>>> z, z_with_nan = pd.Series(y), pd.Series(y_with_nan)
>>> z.quantile(0.05)
-3.44
>>> z.quantile(0.95)
34.919999999999995
>>> z.quantile([0.25, 0.5, 0.75])
0.25 0.1
0.50 8.0
0.75 21.0
dtype: float64
>>> z_with_nan.quantile([0.25, 0.5, 0.75])
0.25 0.1
0.50 8.0
0.75 21.0
dtype: float64
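The interquartile range (IQR) mentioned earlier is just the difference between the third and the first quartile; a minimal sketch using the same data:
>>> q25, q75 = np.percentile(y, [25, 75])
>>> q75 - q25
20.9
>>> z.quantile(0.75) - z.quantile(0.25)
20.9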
Range (Ranges)
The range of the data is the difference between the maximum and minimum elements of the data set. You can get it with the np.ptp() function:
>>> np.ptp(y)
27.0
>>> np.ptp(z)
27.0
>>> np.ptp(y_with_nan)
nan
>>> np.ptp(z_with_nan)
27.0
Descriptive statistical summary
SciPy and Pandas each provide a single function or method call that quickly produces a set of descriptive statistics.
>>> result = scipy.stats.describe(y, ddof=1, bias=False)
>>> result
DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)
describe() returns an object containing the following information:
nobs: The number of observations or elements in a dataset
minmax: Maximum and minimum values of data
mean: The average of the data set
variance: The variance of the data set
skewness: Skewness of the data set
kurtosis: Kurtosis of data sets
>>> result.nobs
9
>>> result.minmax[0] # Min
-5.0
>>> result.minmax[1] # Max
41.0
>>> result.mean
11.622222222222222
>>> result.variance
228.75194444444446
>>> result.skewness
0.9249043136685094
>>> result.kurtosis
0.14770623629658886
pandas has a similar method, .describe():
>>> result = z.describe()
>>> result
count 9.000000 # The number of elements in the dataset
mean 11.622222 # The average of the data set
std 15.124548 # The standard deviation of the data set
min -5.000000
25% 0.100000 # The quartile of the data set
50% 8.000000
75% 21.000000
max 41.000000
dtype: float64
Correlation
We won't dwell on the statistical meaning of correlation here, but be careful: correlation can only be judged from the data; it cannot establish causation!
Correlation is mainly measured with the covariance and the correlation coefficient:
So let's first create some new data:
>>> x = list(range(-10, 11))
>>> y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
>>> x_, y_ = np.array(x), np.array(y)
>>> x__, y__ = pd.Series(x_), pd.Series(y_)
Compute the covariance:
>>> n = len(x)
>>> mean_x, mean_y = sum(x) / n, sum(y) / n
>>> cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
... / (n - 1))
>>> cov_xy
19.95
numpy and pandas both provide cov(): np.cov() returns the covariance matrix, while the pandas Series method .cov() returns the covariance directly:
# numpy
>>> cov_matrix = np.cov(x_, y_)
>>> cov_matrix
array([[38.5 , 19.95 ],
[19.95 , 13.91428571]])
# pandas
>>> cov_xy = x__.cov(y__)
>>> cov_xy
19.95
>>> cov_xy = y__.cov(x__)
>>> cov_xy
19.95
Compute the correlation coefficient
Here we mean the Pearson correlation coefficient (Pearson Correlation Coefficient), which measures how close two data sets are to lying on a straight line, i.e. the strength of the linear relationship between two interval-scaled variables. In terms of the covariance and the standard deviations, the formula is r = s_xy / (s_x * s_y), where s_xy is the covariance of x and y and s_x, s_y are their standard deviations; r always lies between -1 and 1.
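As a sketch, the coefficient can be computed for the data above by hand or with scipy, numpy and pandas; the helper names std_x, std_y, r_scipy and so on are only illustrative, and for this data the value is roughly 0.862:
>>> # by hand: the covariance divided by the product of the two standard deviations
>>> std_x = (sum((item - mean_x)**2 for item in x) / (n - 1)) ** 0.5
>>> std_y = (sum((item - mean_y)**2 for item in y) / (n - 1)) ** 0.5
>>> r = cov_xy / (std_x * std_y)                      # about 0.862
>>> # third-party alternatives
>>> r_scipy, p_value = scipy.stats.pearsonr(x_, y_)   # scipy also returns a p-value
>>> r_numpy = np.corrcoef(x_, y_)[0, 1]               # off-diagonal entry of the correlation matrix
>>> r_pandas = x__.corr(y__)                          # pandas Series method, Pearson by default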