[Reading guide] In the earlier article "Data Mining: Concepts and Techniques, Chapter 2", we introduced the concept of the Q-Q plot, drew one by calling an off-the-shelf Python function, and verified its two main uses: 1. testing whether a column of data follows a normal distribution; 2. testing whether two columns of data follow the same distribution. This article gives a fuller account of the principle behind the Q-Q plot and draws it with functions we write ourselves.
The code file (Jupyter) and data files for this article can be obtained by replying "QQ chart" to our official account "Data cobbler".
What is a Q-Q plot
"Q-Q plot" is short for quantile-quantile plot. It has two main uses:
1. Test whether a column of data follows a normal distribution
2. Test whether two columns of data follow the same distribution
The principle of the Q-Q plot
To understand how the Q-Q plot works, we first need the concept of a quantile. Here is the definition from Baidu Encyclopedia:
Quantile: a point of a continuous distribution function that corresponds to a probability α. For 0<α<1, the α-quantile Zα of a random variable X (or of its probability distribution) is the real number that satisfies P(X≤Zα)=α.
Does that feel a little abstract? Don't worry. Let's look at a concrete kind of quantile: the percentile.
Percentile: a statistical term. If a set of data is sorted from smallest to largest and the cumulative percentage of each position is computed, the data value at a given percentage is called the percentile for that percentage. Put another way: arrange n observations in ascending order; the value at the p% position is called the p-th percentile.
For example: a school has 1000 students in the third year of junior high. Sort their final exam scores from low to high. The student ranked 10th sits at the 1% position among the 1000 students, so that student's score is the 1st percentile of the final exam scores, written P1. Likewise, the score of the 20th student is the 2nd percentile P2, ..., and the score of the 990th student is the 99th percentile P99.
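The ranking example above can be sketched in a few lines. This is only an illustration: the helper name `percentile_by_rank` and the made-up scores 1 to 1000 are our own assumptions, not part of the original notebook.

```python
import math
import numpy as np

def percentile_by_rank(values, p):
    """Return the value at the p% position of the data sorted in ascending order."""
    ordered = np.sort(np.asarray(values))
    # 1-based rank of the p% position, e.g. 1% of 1000 students -> rank 10
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 1000 students with distinct scores 1..1000
scores = np.arange(1, 1001)
p1 = percentile_by_rank(scores, 1)    # score of the 10th student
p2 = percentile_by_rank(scores, 2)    # score of the 20th student
p99 = percentile_by_rank(scores, 99)  # score of the 990th student
```

Note that `np.percentile` interpolates between neighboring values by default, so it can return slightly different numbers from this rank-based rule.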
The principle of the Q-Q plot is to compare the quantiles of a column of sample data with the quantiles of a column of data whose distribution is known, and so test the distribution of the sample. Both uses of the Q-Q plot therefore come down to checking whether the quantiles of the two columns fall on the line y=x. When the two columns have the same number of rows, sort both columns in the same order and draw a scatter plot of the paired values. When the numbers of rows differ, compute the percentiles of each column separately, draw a scatter plot of the two sets of percentiles, and check whether the points lie near the line y=x.
Testing data for a normal distribution
The example data and references below come from the Kaggle dataset Students Performance in Exams; you can also get the data by replying "QQ chart" to our official account.
Let's look at the dataset first. It has 1000 rows and 8 columns. Each row holds the attributes of one student, and the last three columns are the scores in three subjects: 'math score', 'reading score', 'writing score'. We use only these three score columns to check whether students' scores are normally distributed.
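As a rough sketch of the preparation step, the three subject scores can be summed into a total column. The column name score_tol is the one used later in this article; the helper `add_total_score`, the file name in the comment, and the three demo rows are assumptions for illustration only.

```python
import pandas as pd

SCORE_COLS = ["math score", "reading score", "writing score"]

def add_total_score(df):
    """Return a copy of df with the three subject scores summed into score_tol."""
    out = df.copy()
    out["score_tol"] = out[SCORE_COLS].sum(axis=1)
    return out

# With the real file this would look like (the file name is an assumption):
#     df = add_total_score(pd.read_csv("StudentsPerformance.csv"))
# A three-row stand-in with made-up values keeps the sketch runnable:
demo = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})
demo = add_total_score(demo)
```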
Let's start by calling a Python package to draw the Q-Q plot and test for a normal distribution.
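The article does not say here which package it calls; one common choice is `scipy.stats.probplot`, so the sketch below assumes that. The score_tol values are simulated stand-ins (mean 200, standard deviation 40, capped at 300), not the real Kaggle data.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the score_tol column, capped at the maximum total of 300
rng = np.random.default_rng(0)
score_tol = np.clip(rng.normal(loc=200, scale=40, size=1000), 0, 300)

# probplot pairs the sorted sample with standard normal quantiles
# and fits a least-squares line through the points
(theoretical_q, ordered_scores), (slope, intercept, r) = stats.probplot(
    score_tol, dist="norm"
)

print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.3f}")
# To draw the figure as well, pass a matplotlib target:
#     stats.probplot(score_tol, dist="norm", plot=plt)
```

A correlation r close to 1 means the points lie close to a straight line.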
As the plot shows, the scatter of the three-subject total score against the values of a standard normal distribution lies essentially along a straight line. The students' scores can therefore be considered normally distributed, though not standard normal; we analyze this in detail later.
Next, we draw the Q-Q plot by hand to test whether the data follow a normal distribution.
When the two columns have the same number of rows, directly draw a scatter plot of the two sorted columns.
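A minimal sketch of this hand-drawn version, assuming matplotlib for the figure. The score_tol values are again simulated stand-ins, and val is a randomly generated standard normal column with the same number of rows.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Simulated stand-in for score_tol (the real column comes from the dataset)
score_tol = np.clip(rng.normal(200, 40, size=1000), 0, 300)
# Randomly generated standard normal column with the same number of rows
val = rng.standard_normal(len(score_tol))

# Sort both columns the same way so the i-th smallest values are paired
x = np.sort(val)
y = np.sort(score_tol)

fig, ax = plt.subplots()
ax.scatter(x, y, s=5)
ax.set_xlabel("sorted standard normal sample")
ax.set_ylabel("sorted score_tol")
ax.set_title("Hand-drawn Q-Q plot")
# plt.show()  # uncomment to display the figure
```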
As you can see, our plot is almost identical to the one produced by the statistics package. (The differences at the far left and far right arise because the normally distributed variable is randomly generated, so it differs on every run.)
Directly scattering the two sorted columns does not seem to show the essence of the Q-Q plot, so next we take 500 quantiles between 0 and 100 and plot those to see what happens.
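The quantile version might look like the sketch below. It assumes `np.percentile` with 500 evenly spaced levels from 0 to 100; the two columns are the same simulated stand-ins as before, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
score_tol = np.clip(rng.normal(200, 40, size=1000), 0, 300)  # simulated stand-in
val = rng.standard_normal(1000)                              # random standard normal column

# 500 evenly spaced percentile levels from 0 to 100
levels = np.linspace(0, 100, 500)
q_val = np.percentile(val, levels)
q_score = np.percentile(score_tol, levels)

# Scattering q_val (x axis) against q_score (y axis) gives the Q-Q plot,
# for example with plt.scatter(q_val, q_score)
```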
The plot is almost the same as the one above, but notice that the points at the right end fall below the line. This makes intuitive sense: a normal distribution should have some fairly large values at the far right, but the students' scores are capped by the maximum total of 300. It confirms the saying that a top student only scores 100 because the full mark is only 100.
Testing whether two columns of data follow the same distribution
When the two columns have the same number of rows
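For the equal-length case, a sketch along these lines pairs the sorted values directly. The two score columns are simulated stand-ins with assumed means and spread, and the `mean_gap` summary is our own addition for checking closeness to y=x without a figure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated stand-ins for two score columns of equal length (not the real data)
math_score = np.clip(rng.normal(66, 15, size=1000), 0, 100)
reading_score = np.clip(rng.normal(69, 15, size=1000), 0, 100)

# Pair the i-th smallest value of one column with the i-th smallest of the other
x = np.sort(math_score)
y = np.sort(reading_score)

# Average vertical distance of the Q-Q points from the line y = x
mean_gap = np.mean(np.abs(y - x))
print(f"mean |y - x| = {mean_gap:.2f}")
```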
When the two columns have different numbers of rows
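When the lengths differ, the sorted values can no longer be paired one to one, so we compare percentiles taken at the same levels. The helper `qq_points` and the simulated columns of 1000 and 800 rows below are illustrative assumptions.

```python
import numpy as np

def qq_points(a, b, n_levels=100):
    """Percentiles of a and b at the same levels; works for unequal lengths."""
    levels = np.linspace(0, 100, n_levels)
    return np.percentile(a, levels), np.percentile(b, levels)

rng = np.random.default_rng(2)
# Simulated stand-ins with different numbers of rows (not the real data)
math_score = np.clip(rng.normal(66, 15, size=1000), 0, 100)
reading_score = np.clip(rng.normal(69, 15, size=800), 0, 100)

# qx and qy have the same length even though the inputs do not
qx, qy = qq_points(math_score, reading_score)
```

Scattering qx against qy and adding the line y=x then gives the two-sample Q-Q plot.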
As the plot shows, the quantiles of the 'math score' and 'reading score' columns lie near the line y=x, so we can consider the two columns to follow the same distribution.
The difference between a normal distribution and the standard normal distribution
When testing above whether the data follow a normal distribution, we said that the students' scores are normally distributed but not standard normal. If you look closely, you will find that the scatter does not lie along the line y=x but along y=ax+b, that is, a straight line with its own slope and intercept.
When the Q-Q scatter lies along y=x, the data follow the standard normal distribution.
When the Q-Q scatter lies along y=ax+b, the data follow a normal distribution, but not the standard normal distribution.
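To make a and b concrete, the sketch below fits a straight line through the Q-Q points. It swaps the randomly generated column for theoretical standard normal quantiles from `scipy.stats.norm.ppf` so the result is stable, and it uses simulated scores with an assumed mean of 200 and standard deviation of 40.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
score_tol = rng.normal(200, 40, size=1000)  # simulated stand-in: mean 200, sd 40

# Theoretical standard normal quantiles at plotting positions (i - 0.5) / n
n = len(score_tol)
probs = (np.arange(1, n + 1) - 0.5) / n
z = stats.norm.ppf(probs)

# Least-squares line y = a*z + b through the Q-Q points
a, b = np.polyfit(z, np.sort(score_tol), deg=1)
print(f"a={a:.1f} (close to the sd), b={b:.1f} (close to the mean)")
```

Under these assumptions the slope a estimates the standard deviation of the data and the intercept b estimates its mean, which is why the line is y=x only for standard normal data.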
As the plot shows, the scatter lies essentially along y=ax+b, so we can now say that the score_tol column follows a normal distribution, but not the standard normal distribution.
It is generally held that the points of a Q-Q plot must lie near the line y=x for the data to be considered normally distributed. Why, then, can we still say that score_tol is normally distributed when the points lie near y=ax+b? Because, as the plot shows, the score_tol column can be written as a linear function of the normally distributed column val: score_tol = a * val + b. By a property of the normal distribution, if a variable x follows a normal distribution, then the linear function ax+b (with a≠0) also follows a normal distribution.
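The property used here can be written out as a short derivation. The symbols μ, σ, Z, z_p and y_p are our own notation, with Z standard normal and z_p, y_p the p-quantiles of Z and Y.

```latex
% Linear functions of a normal variable stay normal
X \sim N(\mu, \sigma^2),\ a \neq 0
  \;\Longrightarrow\; aX + b \sim N(a\mu + b,\ a^2\sigma^2)

% Quantile form: let Z \sim N(0, 1) and Y = \sigma Z + \mu with \sigma > 0
P(Y \le \sigma z_p + \mu) = P(Z \le z_p) = p
  \;\Longrightarrow\; y_p = \sigma z_p + \mu
```

So the Q-Q points (z_p, y_p) of a normal column against the standard normal fall on a line with slope σ and intercept μ, and that line is y=x only when μ=0 and σ=1.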
Author: Fan Xiaojiang
Reviewer: Plasterer
Editor: Mori craftsman