A detailed explanation of the Q-Q plot: principle and Python implementation

Data cobbler 2021-01-23 16:40:02


Reading guide: In the earlier article on Chapter 2 of "Data Mining: Concepts and Techniques", we introduced the concept of the Q-Q plot, drew one by calling an off-the-shelf Python function, and verified its two main uses: 1. testing whether a column of data follows a normal distribution; 2. testing whether two columns of data follow the same distribution. This article gives a more complete introduction to the principle behind the Q-Q plot and implements the drawing process with hand-written functions.
The code file (Jupyter) and data files for this article can be obtained by replying "QQ chart" to our official account "Data cobbler".

What is a Q-Q plot

Q-Q plot is short for quantile-quantile plot. It has two main uses:
1. Testing whether a column of data follows a normal distribution
2. Testing whether two columns of data follow the same distribution
 
The principle of the Q-Q plot
To understand the principle of the Q-Q plot, we first need the concept of a quantile. Here we quote the introduction from Baidu Encyclopedia:
A quantile is a point of a continuous distribution function that corresponds to a probability α. For a probability 0<α<1, the quantile Zα of a random variable X (or of its probability distribution) is the real number that satisfies P(X≤Zα)=α.
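As a small illustration of this definition (a sketch, not taken from the article's own code), Python's standard-library statistics.NormalDist can compute a quantile of the standard normal distribution and confirm that P(X≤Zα)=α. The choice α = 0.975 is arbitrary.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.975  # arbitrary probability, chosen only for illustration

# inv_cdf returns the quantile Z_alpha, the value with P(X <= Z_alpha) = alpha.
z_alpha = std_normal.inv_cdf(alpha)

# Feeding the quantile back into the CDF recovers alpha.
recovered = std_normal.cdf(z_alpha)

print(f"Z_alpha = {z_alpha:.4f}")
print(f"P(X <= Z_alpha) = {recovered:.4f}")
```

The quantile function is simply the inverse of the cumulative distribution function, which is why the round trip returns the original probability.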
 
Does that feel a little abstract? Don't worry. Let's look at a concrete kind of quantile: the percentile.
 
Percentile is a statistical term. If a set of data is sorted from smallest to largest and the cumulative percentage of each value is computed, then the value found at a given percentage is called the percentile for that percentage. In other words, when a group of n observations is arranged in numerical order, the value at the p% position is called the p-th percentile.
For example: the third grade of a junior high school has 1000 students, and their final exam scores are sorted from low to high. The 10th student sits at the 1% position among the 1000 students, so that student's score is the 1st percentile of the final exam, written P1. Similarly, the score of the 20th student is the 2nd percentile P2, ..., and the score of the 990th student is the 99th percentile P99.
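The example can be sketched with NumPy. The scores below are random integers invented for illustration, not real exam data. Note that np.percentile interpolates linearly between neighbouring ranks by default, so P1 may fall between the 10th and 11th lowest scores rather than exactly on the 10th.

```python
import numpy as np

# Hypothetical exam scores for 1000 students (random integers, not real data).
rng = np.random.default_rng(42)
scores = rng.integers(0, 101, size=1000)

# Sort from low to high, as in the definition of a percentile.
ordered = np.sort(scores)

# 1st and 99th percentiles, using NumPy's default linear interpolation.
p1 = np.percentile(scores, 1)
p99 = np.percentile(scores, 99)

print("10th lowest score: ", ordered[9], "  P1  =", p1)
print("990th lowest score:", ordered[989], "  P99 =", p99)
```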
 
The principle of the Q-Q plot, then, is to compare the quantiles of a column of sample data with the quantiles of a column of data whose distribution is known, and so test the distribution of the sample. Both uses of the Q-Q plot therefore come down to checking whether the quantiles of two columns of data fall on the straight line y=x. When the two columns have the same number of rows, sort both columns in the same order and draw a scatter plot of the sorted values. When the two columns have different numbers of rows, compute the percentiles of each column separately, draw a scatter plot of the two sets of percentiles, and check whether the points lie near the line y=x.
 
Testing whether data follow a normal distribution
The example data and references below come from Kaggle's Students Performance in Exams; you can reply "QQ chart" to our official account to get them.
Let's start with the dataset. It has 1000 rows and 8 columns. Each row holds the attributes of one student, and the last three columns are the scores in three subjects: 'math score', 'reading score', 'writing score'. We use only these three score columns to check whether the students' scores are normally distributed.
First we call a Python package to draw the Q-Q plot and test for normality.
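One off-the-shelf option is scipy.stats.probplot; whether this is the function used in the original notebook is an assumption. The sketch below also uses synthetic scores as a stand-in for the Kaggle file, with an arbitrary mean of 66 and standard deviation of 15, so its numbers will not match the real data. Only the name score_tol follows the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for 'math score', 'reading score', 'writing score':
# 1000 students, each score limited to the range 0..100 (parameters are arbitrary).
scores = np.clip(rng.normal(loc=66, scale=15, size=(1000, 3)), 0, 100).round()

# Total score of the three subjects.
score_tol = scores.sum(axis=1)

# probplot returns the Q-Q coordinates and a least-squares line through them.
(theoretical_q, ordered_scores), (slope, intercept, r) = stats.probplot(
    score_tol, dist="norm"
)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

Passing plot=plt (after import matplotlib.pyplot as plt) makes probplot draw the points and the fitted line in addition to returning them. A correlation r close to 1 means the points lie close to a straight line.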
As you can see, in the scatter plot of the three-subject total score against values from the standard normal distribution, the points lie close to a straight line. We can conclude that the students' scores follow a normal distribution, though not the standard normal distribution; we analyze this in detail later.
Next, we draw the Q-Q plot by hand to test whether the data follow a normal distribution.
When the two columns have the same number of rows, we directly draw a scatter plot of the two sorted columns.
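A minimal sketch of this hand-drawn version, again with synthetic scores standing in for the real data. The names val and score_tol follow the article; draw_qq is a hypothetical helper introduced here, and matplotlib is imported inside it so the numeric part runs even without a plotting backend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the three score columns (parameters are arbitrary).
scores = np.clip(rng.normal(loc=66, scale=15, size=(1000, 3)), 0, 100).round()
score_tol = scores.sum(axis=1)

# Reference column with the same number of rows, drawn from the standard normal.
val = rng.standard_normal(score_tol.size)

# Sorting both columns pairs the k-th smallest value of one with the
# k-th smallest value of the other, i.e. matching sample quantiles.
x = np.sort(val)
y = np.sort(score_tol)


def draw_qq(x, y, path="qq_sorted.png"):
    """Hypothetical helper: save a scatter plot of the paired quantiles."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x, y, s=8)
    ax.set_xlabel("sorted standard normal sample (val)")
    ax.set_ylabel("sorted total score (score_tol)")
    fig.savefig(path)
    plt.close(fig)


print("correlation of paired quantiles:", np.corrcoef(x, y)[0, 1])
```

Calling draw_qq(x, y) writes the figure to qq_sorted.png.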
As you can see, the plot we drew is almost the same as the Q-Q plot from the statistics package. (The differences at the far left and far right arise because the normally distributed variable is randomly generated and differs on every run.)
 
Directly scattering the two sorted columns does not seem to reflect the essence of the Q-Q plot, so next we take 500 quantiles between 0 and 100 and plot them to see what happens.
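The quantile-based version might look like the sketch below, which assumes np.percentile with 500 evenly spaced levels and the same synthetic stand-in data as before. The resulting arrays val_q and score_q (names chosen here) are the x and y coordinates to scatter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the three score columns (parameters are arbitrary).
scores = np.clip(rng.normal(loc=66, scale=15, size=(1000, 3)), 0, 100).round()
score_tol = scores.sum(axis=1)
val = rng.standard_normal(1000)

# 500 evenly spaced percentile levels from 0 to 100.
levels = np.linspace(0, 100, 500)

# Quantiles of each column at the same levels.
val_q = np.percentile(val, levels)
score_q = np.percentile(score_tol, levels)

# The largest quantile can never exceed the maximum possible total of 300.
print("largest score quantile:", score_q[-1])
```

Because both columns are evaluated at the same levels, this approach also works when the two columns have different lengths.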
The plot is almost the same as the one above, but notice that the points on the right fall below the line. This is intuitive: normally distributed data should include some relatively large values at the far right, but the students' scores are capped by the maximum total of 300. It confirms the saying that a top student scores 100 only because the full mark is 100.
 
Testing whether two columns of data follow the same distribution
When the two columns have the same number of rows
When the two columns have different numbers of rows
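Both cases can be handled by one small helper. In the sketch below, qq_points is a hypothetical function name, the two score columns are synthetic samples drawn from the same arbitrary distribution, and the shorter column is simply the first 600 rows of one of them.

```python
import numpy as np


def qq_points(a, b, n_levels=101):
    """Return the x and y coordinates of a two-sample Q-Q plot."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.size == b.size:
        # Same number of rows: the sorted values are already matching quantiles.
        return np.sort(a), np.sort(b)
    # Different numbers of rows: compare both columns at the same percentile levels.
    levels = np.linspace(0, 100, n_levels)
    return np.percentile(a, levels), np.percentile(b, levels)


rng = np.random.default_rng(1)

# Synthetic stand-ins for 'math score' and 'reading score' (parameters are arbitrary).
math_score = np.clip(rng.normal(loc=66, scale=15, size=1000), 0, 100)
reading_score = np.clip(rng.normal(loc=66, scale=15, size=1000), 0, 100)

x_same, y_same = qq_points(math_score, reading_score)
x_diff, y_diff = qq_points(math_score, reading_score[:600])

# Average vertical distance from the line y = x in each case.
print("same length:     ", np.mean(np.abs(y_same - x_same)))
print("different length:", np.mean(np.abs(y_diff - x_diff)))
```

A small average distance from y=x relative to the spread of the data is what "lying near the line" means numerically; the scatter plot shows the same thing visually.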
As you can see, the quantiles of the 'math score' and 'reading score' columns lie near the line y=x, so we can conclude that the two columns follow the same distribution.
 
The difference between the normal distribution and the standard normal distribution
When testing above whether the data follow a normal distribution, we said that the students' scores follow a normal distribution but not the standard normal distribution. If you look closely, you will find that the points are not distributed along the line y=x but along y=ax+b, that is, a straight line with a slope and an intercept.
When the Q-Q scatter lies along y=x, the data follow the standard normal distribution.
When the Q-Q scatter lies along y=ax+b (with a≠1 or b≠0), the data follow a normal distribution, but not the standard normal distribution.
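The slope a and intercept b can be estimated by fitting a straight line through the Q-Q points. The sketch below is one way to do that under stated assumptions: synthetic scores stand in for the real score_tol column, the theoretical quantiles come from the standard library's NormalDist, and np.polyfit performs an ordinary least-squares fit.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Synthetic stand-in for the three score columns (parameters are arbitrary).
scores = np.clip(rng.normal(loc=66, scale=15, size=(1000, 3)), 0, 100).round()
score_tol = scores.sum(axis=1)

# Compare the 1st..99th percentiles of the sample with the
# corresponding quantiles of the standard normal distribution.
levels = np.arange(1, 100)
score_q = np.percentile(score_tol, levels)
std_normal = NormalDist()
normal_q = np.array([std_normal.inv_cdf(p / 100) for p in levels])

# Least-squares line score_q = a * normal_q + b.
a, b = np.polyfit(normal_q, score_q, deg=1)

print(f"a = {a:.2f} (sample std  = {score_tol.std():.2f})")
print(f"b = {b:.2f} (sample mean = {score_tol.mean():.2f})")
```

For data that are close to normal, the fitted slope is close to the sample standard deviation and the fitted intercept is close to the sample mean, which is why the line differs from y=x.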
As you can see, the points lie close to the line y=ax+b, so we can now say that the score_tol column follows a normal distribution, but not the standard normal distribution.
It is commonly said that the points of a Q-Q plot must lie near the line y=x for the data to be considered normal. Why, then, can we still say that score_tol is normally distributed when its points lie near y=ax+b? Because, as the plot shows, score_tol can be written as a linear function of the normally distributed column val: score_tol = a * val + b. By a property of the normal distribution, if a variable x follows a normal distribution, then the linear function ax+b (with a≠0) also follows a normal distribution.
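This property is easy to check numerically. If x has mean μ and standard deviation σ, then ax+b has mean aμ+b and standard deviation |a|σ. The sketch below uses arbitrary values a = 25 and b = 200 and shows that the Q-Q points of a*val+b against val fall exactly on the line y=ax+b when a is positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# A standard normal column, named val as in the article.
val = rng.standard_normal(1000)

# Arbitrary slope and intercept, chosen only for illustration.
a, b = 25.0, 200.0
transformed = a * val + b

# Q-Q coordinates: sorted values of each column.
x = np.sort(val)
y = np.sort(transformed)

# With a > 0 the transform preserves order, so every point satisfies y = a*x + b.
print("points on y = a*x + b:", np.allclose(y, a * x + b))
print("mean ~ b:  ", transformed.mean())
print("std  ~ |a|:", transformed.std())
```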

Author: Fan Xiaojiang

Reviewer: Plasterer

Editor: Mori craftsman

Copyright notice
This article was written by [Data cobbler]. Please include the original link when reposting. Thank you.
https://pythonmana.com/2021/01/20210123163613402R.html
