How to describe your data



Descriptive statistics is about describing and aggregating data. It uses two main approaches:

  1. The quantitative approach describes and summarizes data numerically.

  2. The visual approach illustrates the data with charts, plots, histograms, and other graphics.

Generally, in the process of data analysis, you don't go straight from getting the data to modeling. You first do a descriptive analysis to get an overall grasp of the data; many subsequent modeling decisions are guided by it. So, apart from Excel and R, we can also do descriptive analysis in Python.

This article explains in detail how to do the quantitative part of descriptive analysis with Python:

  • Mean

  • Median

  • Variance

  • Standard deviation

  • Skewness

  • Percentiles

  • Correlation

For the visualization part, please refer to my earlier article explaining pyecharts; echarts and ggplot2 approaches will be introduced later.

Python libraries involved

  • statistics is Python's built-in library for descriptive statistics. You can use it if your dataset is not too large, or if you can't rely on importing other libraries.

  • NumPy is a third-party library for numerical computing, optimized for working with one-dimensional and multi-dimensional arrays. Its main data type is the array type ndarray. The library contains many routines for statistical analysis.

  • SciPy is a third-party library for scientific computing based on NumPy. Compared with NumPy, it provides additional functionality, including statistical analysis in scipy.stats (see Getting started - SciPy.org).

  • Pandas is a third-party library for numerical computing based on NumPy. It excels at handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.

  • Matplotlib is a third-party library for data visualization. It is usually used in combination with NumPy, SciPy, and Pandas.


Start

First import all the packages

import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd



Create data

x and x_with_nan are both lists (reconstructed below). The difference is that x_with_nan contains a nan value, i.e., a null (missing) value; such data is very common in analysis. In Python, a nan value can be created in any of the following ways:

float('nan')
math.nan
np.nan

The null values created by these three methods are, of course, equivalent.

But are they really equal? In fact, two nan values are never equal; in other words, they cannot be compared. More on that story later.
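A quick check confirms this (nan never compares equal, not even to itself):

>>> math.nan == math.nan
False
>>> float('nan') == np.nan
False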

Next, we use numpy and pandas to create the corresponding numpy arrays and pandas Series:
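The defining snippet is missing from the extracted article, so here is a minimal reconstruction consistent with all the results shown below (mean 8.7, median 4, sample variance 123.2):

>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)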

Mean

The definition of the mean needs no repeating. In R, a single mean() call does it. But in Python, how do you compute it without importing any packages?
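A minimal version using only built-ins, assuming the x reconstructed above:

>>> mean_ = sum(x) / len(x)
>>> mean_
8.7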

You can also use Python's built-in statistics module:
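For example, with the same x:

>>> mean_ = statistics.mean(x)
>>> mean_
8.7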

But if the data contains nan, it will return nan:

>>> mean_ = statistics.mean(x_with_nan)
>>> mean_
nan

If you use numpy:

>>> mean_ = np.mean(y)
>>> mean_
8.7

In the example above, mean() was used as a function, but you can also call the corresponding method:

>>> mean_ = y.mean()
>>> mean_
8.7

If the data contain nan, numpy will likewise return nan, so if you want to ignore nan you can use np.nanmean():

>>> np.mean(y_with_nan)
nan
>>> np.nanmean(y_with_nan)
8.7


pandas has a corresponding method as well. Note that, by default, .mean() in Pandas ignores nan values:

>>> mean_ = z.mean()
>>> mean_
8.7
>>> z_with_nan.mean()
8.7


Median

Comparing the mean and the median is one way to detect outliers and asymmetry in the data. Whether the mean or the median is more useful to you depends on the context of your particular problem. First, computing the median without any packages:

>>> n = len(x)
>>> if n % 2:
...     median_ = sorted(x)[round(0.5*(n-1))]
... else:
...     x_ord, index = sorted(x), round(0.5 * n)
...     median_ = 0.5 * (x_ord[index-1] + x_ord[index])
...
>>> median_
4

numpy provides the corresponding functions, including np.nanmedian() for data containing nan:

>>> median_ = np.median(y)
>>> median_
4.0
>>> np.nanmedian(y_with_nan)
4.0
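pandas Series have a .median() method too; like .mean(), it skips nan by default. A quick sketch with the z and z_with_nan defined earlier:

>>> z.median()
4.0
>>> z_with_nan.median()
4.0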



Variance

I won't dwell on the meaning of variance. In Excel you can call the STDEV function directly, so why compute it by hand in Python? Because it's a classic exercise: I remember being asked in a postgraduate admission interview how to calculate the variance in Python without importing any package:

>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> var_
123.19999999999999



Of course, a simpler way is to use the library function directly; however, if a nan is present it will return nan:

>>> var_ = statistics.variance(x)
>>> var_
123.2
>>> statistics.variance(x_with_nan)
nan



In numpy it's even simpler: you can use np.var() or the .var() method:

>>> var_ = np.var(y, ddof=1)
>>> var_
123.19999999999999
>>> var_ = y.var(ddof=1)
>>> var_
123.19999999999999

Here ddof is the delta degrees of freedom; set it to 1 for the unbiased estimator, i.e., so the denominator uses n - 1 instead of n. What if there is a nan? You will get nan back, but you can use np.nanvar() to skip nan; note that ddof should still be set to 1:

>>> np.var(y_with_nan, ddof=1)
nan
>>> y_with_nan.var(ddof=1)
nan
>>> np.nanvar(y_with_nan, ddof=1)
123.19999999999999
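pandas Series offer .var() as well; unlike numpy it skips nan by default (and its ddof already defaults to 1). A sketch with the z and z_with_nan from earlier:

>>> z.var(ddof=1)
123.19999999999999
>>> z_with_nan.var(ddof=1)
123.19999999999999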

Standard deviation

Once you have the variance, the standard deviation is easy to compute:

# Direct calculation 
>>> std_ = var_ ** 0.5
>>> std_
11.099549540409285
# Use the built-in package
>>> std_ = statistics.stdev(x)
>>> std_
11.099549540409287

It's also very easy to calculate with numpy:

>>> np.std(y, ddof=1)
11.099549540409285
>>> y.std(ddof=1)
11.099549540409285
>>> np.std(y_with_nan, ddof=1)
nan
>>> y_with_nan.std(ddof=1)
nan
>>> np.nanstd(y_with_nan, ddof=1)  # skip nan; ddof is still 1
11.099549540409285
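The pandas equivalent again skips nan by default; a sketch with z and z_with_nan:

>>> z.std(ddof=1)
11.099549540409285
>>> z_with_nan.std(ddof=1)
11.099549540409285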

Skewness (skew)

Skewness, also called the coefficient of skewness, measures the direction and degree of asymmetry in the distribution of data; it is a numerical characterization of how asymmetric the distribution is. Skewness is defined using the third-order moment. The sample skewness formula, which the code below implements, is:

    skew = n * Σ(xᵢ - mean_)³ / ((n - 1) * (n - 2) * std_³)

where mean_ is the sample mean and std_ is the sample standard deviation.

The datasets we studied above were fairly symmetric. For asymmetric datasets the sign of the skewness tells you which side the tail is on: a negative skewness value indicates a dominant tail on the left, while a positive skewness value corresponds to a longer or heavier tail on the right. If the skewness is close to 0 (for example, between -0.5 and 0.5), the dataset is considered quite symmetric.

So, without relying on a third-party package, how do you compute skewness? First compute the dataset size n, the sample mean mean_, and the standard deviation std_, then apply the formula:

>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> std_ = var_ ** 0.5
>>> skew_ = (sum((item - mean_)**3 for item in x)
...          * n / ((n - 1) * (n - 2) * std_**3))
>>> skew_
1.9470432273905929

We can see that the skewness is positive, so x has a tail on the right.

You can also use third-party packages to calculate it. Note below that scipy.stats.skew() returns nan when the data contain nan, while pandas' .skew() skips nan by default and still returns a number:

>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> scipy.stats.skew(y, bias=False)
1.9470432273905927
>>> scipy.stats.skew(y_with_nan, bias=False)
nan
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
>>> z.skew()
1.9470432273905924
>>> z_with_nan.skew()
1.9470432273905924

Percentiles

If you sort a dataset from smallest to largest and compute the corresponding cumulative percentages, the value at a given percentage is called the percentile of that percentage. That is, in a group of n observations sorted by value, the value at the p% position is called the p-th percentile. Every dataset has three quartiles, which are the percentiles that divide the dataset into four parts:

  • The first quartile (Q1), also called the lower quartile, is the value below which 25% of the sorted sample values fall.

  • The second quartile (Q2), also called the median, is the value below which 50% of the sorted sample values fall.

  • The third quartile (Q3), also called the upper quartile, is the value below which 75% of the sorted sample values fall.

The difference between the third and the first quartile is called the interquartile range (IQR).

So how do you compute quantiles in Python? You can use statistics.quantiles() (available since Python 3.8):

>>> x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
>>> statistics.quantiles(x, n=2)
[8.0]
>>> statistics.quantiles(x, n=4, method='inclusive')
[0.1, 8.0, 21.0]



You can see that in the first line 8.0 is the median of x, while in the second case 0.1 and 21.0 are the 25% and 75% quantiles of the sample. The third-party package numpy can also compute them:
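The article doesn't show it, but the numpy outputs below imply that y was rebuilt from this new x; a one-line reconstruction:

>>> y = np.array(x)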

>>> np.percentile(y, [25, 50, 75])
array([ 0.1,  8. , 21. ])
>>> np.median(y)
8.0
# skip nan
>>> y_with_nan = np.insert(y, 2, np.nan)
>>> y_with_nan
array([-5. , -1.1,  nan,  0.1,  2. ,  8. , 12.8, 21. , 25.8, 41. ])
>>> np.nanpercentile(y_with_nan, [25, 50, 75])
array([ 0.1,  8. , 21. ])

pandas can also compute quantiles with the .quantile() method. You need to pass the quantile level as an argument: a number between 0 and 1, or a sequence of such numbers:

>>> z, z_with_nan = pd.Series(y), pd.Series(y_with_nan)
>>> z.quantile(0.05)
-3.44
>>> z.quantile(0.95)
34.919999999999995
>>> z.quantile([0.25, 0.5, 0.75])
0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64
>>> z_with_nan.quantile([0.25, 0.5, 0.75])
0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64

Range

The range of the data is the difference between the largest and smallest elements in the dataset. You can get it with the function np.ptp(). (Note: the outputs below, 27.0 = 28 - 1, correspond to the original five-element dataset rather than the nine-element one defined in the percentile section; also, the numpy array containing nan yields nan, while the pandas Series skips the missing value.)

>>> np.ptp(y)
27.0
>>> np.ptp(z)
27.0
>>> np.ptp(y_with_nan)
nan
>>> np.ptp(z_with_nan)
27.0
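The percentile section mentioned the interquartile range (IQR). As a small sketch, it can be computed from the quartiles of the nine-element y used there:

>>> quartiles = np.quantile(y, [0.25, 0.75])
>>> quartiles[1] - quartiles[0]
20.9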

Descriptive statistics summary

SciPy and Pandas provide a single function or method call for quickly obtaining descriptive statistics:

>>> result = scipy.stats.describe(y, ddof=1, bias=False)
>>> result
DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)

The result of describe() contains the following information:

  • nobs: the number of observations or elements in the dataset

  • minmax: the minimum and maximum values of the data

  • mean: the mean of the dataset

  • variance: the variance of the dataset

  • skewness: the skewness of the dataset

  • kurtosis: the kurtosis of the dataset

>>> result.nobs
9
>>> result.minmax[0]  # Min
-5.0
>>> result.minmax[1]  # Max
41.0
>>> result.mean
11.622222222222222
>>> result.variance
228.75194444444446
>>> result.skewness
0.9249043136685094
>>> result.kurtosis
0.14770623629658886

pandas has a similar method, .describe():

>>> result = z.describe()
>>> result
count     9.000000 # The number of elements in the dataset
mean     11.622222 # The average of the data set
std      15.124548 # The standard deviation of the data set
min      -5.000000
25%       0.100000 # The quartile of the data set
50%       8.000000
75%      21.000000
max      41.000000
dtype: float64

Correlation

I won't go into the statistical meaning of correlation too much here, but be careful: correlation can only be judged from the data; it cannot establish causation!

Measuring correlation mainly uses the covariance and the correlation coefficient:

Let's first recreate the data:

>>> x = list(range(-10, 11))
>>> y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
>>> x_, y_ = np.array(x), np.array(y)
>>> x__, y__ = pd.Series(x_), pd.Series(y_)


Computing the covariance

The sample covariance, which the snippet below computes by hand, is cov(x, y) = Σ(xᵢ - mean_x)(yᵢ - mean_y) / (n - 1):

>>> n = len(x)
>>> mean_x, mean_y = sum(x) / n, sum(y) / n
>>> cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
...           / (n - 1))
>>> cov_xy
19.95

numpy and pandas both provide cov(): numpy's np.cov() returns the covariance matrix, while the pandas Series method returns the covariance itself:

# numpy
>>> cov_matrix = np.cov(x_, y_)
>>> cov_matrix
array([[38.5       , 19.95      ],
      [19.95      , 13.91428571]])
# pandas
>>> cov_xy = x__.cov(y__)
>>> cov_xy
19.95
>>> cov_xy = y__.cov(x__)
>>> cov_xy
19.95

Computing the correlation coefficient

What we're talking about here is the Pearson correlation coefficient. The Pearson Correlation Coefficient measures how close two datasets are to lying on a straight line, i.e., the strength of the linear relationship between interval-scale variables. The formula is

    r = cov(x, y) / (std_x * std_y)

where std_x and std_y are the sample standard deviations of x and y; r always lies between -1 and 1.
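The original article is truncated here. As a hedged sketch consistent with the data above, the coefficient can be computed by hand from the covariance and standard deviations, or with the libraries already imported:

>>> # by hand: r = cov / (std_x * std_y), reusing cov_xy from above
>>> std_x, std_y = x__.std(), y__.std()
>>> r = cov_xy / (std_x * std_y)
>>> r
0.861950005631606
>>> # scipy also returns a p-value
>>> r, p = scipy.stats.pearsonr(x_, y_)
>>> r
0.861950005631606
>>> # numpy returns the correlation matrix
>>> np.corrcoef(x_, y_)
array([[1.        , 0.86195001],
       [0.86195001, 1.        ]])
>>> # pandas
>>> x__.corr(y__)
0.8619500056316061

A value this close to 1 indicates a strong positive linear relationship between x and y.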

版权声明
本文为[osc_ np3y0rbq]所创,转载请带上原文链接,感谢
https://pythonmana.com/2021/01/20210123123954585b.html

  1. Mandatory conversion of Python data type
  2. Django reported an error: 'key' ID 'not found in' xxx '. Choices are: xxx'
  3. Python 400 sets of large video, starting from the right direction to learn, a complete set to you
  4. 只需十四步:从零开始掌握Python机器学习(附资源)
  5. Just 14 steps: Master Python machine learning from scratch (resources attached)
  6. Python|文件读写
  7. 安利一个Python界神奇得网站
  8. Python | file reading and writing
  9. Amway is a marvelous website in Python world
  10. 第二热门语言:从入门到精通,Python数据科学简洁教程
  11. The second popular language: from introduction to mastery, python data science concise tutorial
  12. 以我的亲身经历,聊聊学python的流程,同时推荐学python的书
  13. With my own experience, I'd like to talk about the process of learning Python and recommend books for learning python
  14. 以我的亲身经历,聊聊学python的流程,同时推荐学python的书
  15. With my own experience, I'd like to talk about the process of learning Python and recommend books for learning python
  16. Django url 路由匹配过程
  17. Django URL routing matching process
  18. 强者一出,谁与争锋?与Python相比,C++的运行速度究竟有多快?
  19. Who will fight against the strong? How fast is C + + running compared with Python?
  20. python 学习体会
  21. Experience of learning Python
  22. python7、8章
  23. Chapter 7 and 8 of Python
  24. python bool和str转换
  25. python——循环(for循环、while循环)及练习
  26. python变量和常量命名、注释规范
  27. python自定义异常捕获异常处理异常
  28. python 类型转换与数值操作
  29. python 元组(tuple)和列表(list)区别
  30. 解决python tkinter 与 sleep 延迟问题
  31. python字符串截取操作
  32. Python bool and STR conversion
  33. Python -- loop (for loop, while loop) and Practice
  34. Specification for naming and annotating variables and constants in Python
  35. Python custom exception capture exception handling exception
  36. Python type conversion and numerical operation
  37. The difference between tuple and list in Python
  38. Solve the delay problem of Python Tkinter and sleep
  39. Python string interception operation
  40. Python 100天速成中文教程,GitHub标星7700
  41. Python 100 day quick Chinese course, GitHub standard star 7700
  42. 以我的親身經歷,聊聊學python的流程,同時推薦學python的書
  43. With my own experience, I'd like to talk about the process of learning Python and recommend books for learning python
  44. python爬虫获取起点中文网人气排行Top100(快速入门,新手必备!)
  45. Python crawler to get the starting point of Chinese network popularity ranking Top100 (quick start, novice necessary!)
  46. 【Python常用包】itertools
  47. Itertools
  48. (国内首发)最新python初学者上手练习
  49. (国内首发)最新python初学者上手练习
  50. (first in China) the latest practice for beginners of Python
  51. (first in China) the latest practice for beginners of Python
  52. (数据科学学习手札104)Python+Dash快速web应用开发——回调交互篇(上)
  53. (data science learning notes 104) Python + dash rapid web application development -- callback interaction (Part 1)
  54. (数据科学学习手札104)Python+Dash快速web应用开发——回调交互篇(上)
  55. (data science learning notes 104) Python + dash rapid web application development -- callback interaction (Part 1)
  56. (資料科學學習手札104)Python+Dash快速web應用開發——回撥互動篇(上)
  57. (materials science learning notes 104) Python + dash rapid web application development -- callback interaction (Part 1)
  58. Python OpenCV 图片高斯模糊
  59. Python OpenCV image Gaussian blur
  60. Stargan V2: converse image synthesis for multiple domains reading notes and Python code analysis