Reading guide: In our earlier article on Chapter 2 of 《Data Mining: Concepts and Techniques》 we introduced the concept of the Q-Q plot, drew Q-Q plots by calling off-the-shelf Python functions, and verified the two main uses of the Q-Q plot: 1. testing whether a column of data follows a normal distribution; 2. testing whether two columns of data follow the same distribution. This article gives a more complete introduction to the principle of the Q-Q plot and implements the plotting process with our own hand-written functions.
The code file (jupyter) and data files for this article can be obtained by replying "QQ chart" to our official account "Data cobbler".

What is a Q-Q plot

"Q-Q plot" is short for quantile-quantile plot. It has two main uses:
1. Test whether a column of data conforms to the normal distribution
2. Test whether two columns of data conform to the same distribution
The principle of the Q-Q plot
To understand the principle of the Q-Q plot, we first need the concept of a quantile. Here we quote the introduction from Baidu Encyclopedia:
Quantile: a point on a continuous distribution function that corresponds to a probability α. If 0<α<1, the α-quantile Zα of a random variable X (or of its probability distribution) is the real number that satisfies P(X≤Zα)=α.
What? Feels a bit abstract? Don't worry, let's look at a concrete kind of quantile: percentiles.
Percentile, a statistical term: if a set of data is sorted from smallest to largest and the corresponding cumulative percentage is computed, the data value at a given percentage is called the percentile for that percentage. Put another way: arrange n observations in numerical order; the value at the p% position is called the p-th percentile.
Let me give you an example: grade three of a junior high school has 1000 students, and the final exam scores are sorted from low to high. The student in 10th place sits at the 1% position among the 1000 students, so that score is the 1st percentile of the final exam, written P1. Likewise, the score of the student in 20th place is the 2nd percentile P2, ..., and the score of the student in 990th place is the 99th percentile P99.
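The student example above can be sketched in a few lines. This is only an illustration using the standard library: the helper name `percentile_by_rank` and the made-up scores 1 to 1000 are assumptions, and the nearest-rank rule used here is just one of several common percentile conventions.

```python
import math

def percentile_by_rank(sorted_scores, p):
    """Return the score at the p% position of an ascending list (nearest-rank rule)."""
    n = len(sorted_scores)
    rank = max(1, math.ceil(p * n / 100))  # 1-based position in the sorted list
    return sorted_scores[rank - 1]

# Hypothetical data: 1000 students whose scores happen to be 1, 2, ..., 1000,
# so the score equals the student's position when sorted from low to high.
scores = sorted(range(1, 1001))

p1 = percentile_by_rank(scores, 1)    # score of the student in 10th place
p2 = percentile_by_rank(scores, 2)    # score of the student in 20th place
p99 = percentile_by_rank(scores, 99)  # score of the student in 990th place
print(p1, p2, p99)  # 10 20 990
```

Because the made-up scores equal their own positions, the printed values show directly which place each percentile picks out.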
So the principle of the Q-Q plot is to compare the quantiles of a column of sample data with the quantiles of a column of data whose distribution is known, and in this way test the distribution of the sample. Both uses of the Q-Q plot therefore come down to checking whether the quantiles of two columns of data fall on the straight line y=x. When the two columns have the same number of rows, sort both columns in the same order and draw a scatter plot directly. When the numbers of rows differ, compute the percentiles of each column separately, scatter the percentiles of the two columns against each other, and check whether the points lie near the line y=x.
Testing data for normal distribution
The example data and references below come from the Kaggle dataset Students Performance in Exams. You can reply "QQ chart" to our official account to get it.
Let's start with our dataset. It has 1000 rows and 8 columns. Each row holds the attributes of one student, and the last three columns are the scores in three subjects: 'math score', 'reading score', 'writing score'. We use only these three score columns, to check whether students' scores are normally distributed.
We start by calling a Python package to draw the Q-Q plot and test for normality.
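The text here does not show which package call was used, so the following is only a sketch of one possibility: `scipy.stats.probplot`, which draws a normal Q-Q plot when given a plotting target. The file name `StudentsPerformance.csv` and the helper names are assumptions; the three column names and the name `score_tol` come from the article.

```python
import csv

SCORE_COLUMNS = ["math score", "reading score", "writing score"]

def load_rows(path):
    """Read the CSV into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def total_scores(rows):
    """Sum the three subject scores of each student."""
    return [sum(float(row[col]) for col in SCORE_COLUMNS) for row in rows]

def plot_qq_with_scipy(values):
    """Draw a normal Q-Q plot of `values` with scipy.stats.probplot."""
    # Imported here so the helpers above also work without the plotting stack.
    import matplotlib.pyplot as plt
    from scipy import stats

    stats.probplot(values, dist="norm", plot=plt)
    plt.show()

# Usage (assumes the Kaggle file was saved as StudentsPerformance.csv):
# score_tol = total_scores(load_rows("StudentsPerformance.csv"))
# plot_qq_with_scipy(score_tol)
```

`probplot` puts the theoretical standard normal quantiles on the x axis and the sorted sample on the y axis, and also draws a fitted straight line for reference.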
You can see that the scatter of the three-subject total score against the values of the standard normal distribution lies roughly along a straight line, so the students' scores can be considered normally distributed, although not standard normal. We come back to this in detail later.
Next, we draw the Q-Q plot by hand to test whether the data follow a normal distribution.
When the two columns have the same number of rows, simply draw a scatter plot of the two sorted columns.
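A minimal sketch of the same-length case, using only the standard library for the numbers. The synthetic `score_tol` column (drawn from a normal distribution with mean 200 and standard deviation 40) is a hypothetical stand-in for the real total scores, and the helper names are assumptions.

```python
import random

def qq_pairs_same_length(sample, reference):
    """Pair the i-th smallest reference value with the i-th smallest sample value."""
    if len(sample) != len(reference):
        raise ValueError("both columns must have the same number of rows")
    return list(zip(sorted(reference), sorted(sample)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the total score column.
score_tol = [rng.gauss(200, 40) for _ in range(1000)]
# Randomly generated standard normal column with the same number of rows.
val = [rng.gauss(0, 1) for _ in range(len(score_tol))]

pairs = qq_pairs_same_length(score_tol, val)

def plot_pairs(pairs):
    """Scatter the paired values; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    xs, ys = zip(*pairs)
    plt.scatter(xs, ys, s=5)
    plt.xlabel("standard normal sample (sorted)")
    plt.ylabel("score_tol (sorted)")
    plt.show()

# plot_pairs(pairs)
```

Changing the seed changes the random normal column, which is why the two ends of such a plot look a little different on every run.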
You can see that our plot is almost identical to the one drawn by the statistical package. (The differences at the far left and far right arise because the normally distributed variable is randomly generated, so it is different on every run.)
Scattering the two sorted columns directly does not seem to show the essence of the Q-Q plot, so next we take 500 quantiles between 0 and 100 and plot those to see what happens.
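The quantile-based version can be sketched as follows. The `percentile` helper uses linear interpolation between neighbouring sorted values, which is one common convention and an assumption here; the synthetic data, capped at 300 to mimic the maximum total score, is hypothetical.

```python
import random

def percentile(sorted_values, q):
    """q-th percentile (0 <= q <= 100) of an ascending list, linearly interpolated."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def qq_pairs_by_percentile(sample, reference, n_points=500):
    """Pair the percentiles of two columns at n_points levels from 0 to 100."""
    s, r = sorted(sample), sorted(reference)
    qs = [100 * i / (n_points - 1) for i in range(n_points)]
    return [(percentile(r, q), percentile(s, q)) for q in qs]

rng = random.Random(0)
# Hypothetical total scores: roughly normal, but capped at the full mark of 300.
score_tol = [min(300.0, rng.gauss(200, 40)) for _ in range(1000)]
# The reference column may have a different length, since only quantiles are compared.
val = [rng.gauss(0, 1) for _ in range(2000)]

pairs = qq_pairs_by_percentile(score_tol, val)
print(len(pairs))  # 500
```

Because only the quantiles are compared, the two columns no longer need the same number of rows.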
The plot is almost the same as the one above, but you can see that the points on the right fall below the line. This is intuitive: normally distributed data would need some relatively large values at the far right, but students' scores are capped by the maximum total of 300. This confirms the saying that a top student only scores 100 because the full mark is only 100.
Test whether two columns of data conform to the same distribution
When the two columns have the same number of rows
When the two columns have different numbers of rows
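For columns of different lengths, one way to get comparable points is to compute the 1st to 99th percentiles of each column and pair them up. The sketch below uses `statistics.quantiles` from the standard library. The two synthetic columns (mean 66, standard deviation 15, with 1000 and 800 rows) are hypothetical stand-ins, not the real 'math score' and 'reading score' data.

```python
import random
import statistics

def percentile_pairs(col_a, col_b):
    """Pair the 1st..99th percentiles of two columns that may differ in length."""
    pa = statistics.quantiles(col_a, n=100, method="inclusive")
    pb = statistics.quantiles(col_b, n=100, method="inclusive")
    return list(zip(pa, pb))

rng = random.Random(1)
# Hypothetical columns drawn from the same distribution, with different row counts.
math_score = [rng.gauss(66, 15) for _ in range(1000)]
reading_score = [rng.gauss(66, 15) for _ in range(800)]

pairs = percentile_pairs(math_score, reading_score)

# Largest vertical distance of any point from the line y = x.
max_gap = max(abs(a - b) for a, b in pairs)
print(len(pairs), round(max_gap, 2))
```

If the two columns follow the same distribution, the paired percentiles stay close to the line y=x, with the largest gaps usually at the extreme percentiles.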
You can see that the quantiles of the 'math score' and 'reading score' columns lie near the line y=x, so we can consider that the two columns follow the same distribution.
The difference between normal distribution and standard normal distribution
When we tested above whether the data follow a normal distribution, we said that the students' scores are normally distributed, but not standard normal. If you look closely, you will find that the scatter does not lie along the line y=x but along y=ax+b, that is, a straight line with its own slope and intercept.
When the Q-Q scatter lies along y=x, the data follow the standard normal distribution.
When the Q-Q scatter lies along y=ax+b, the data follow a normal distribution, but not the standard normal distribution.
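To make the y=ax+b case concrete, the sketch below fits a least-squares line through the points (standard normal quantile, sorted sample value). For data drawn from a normal distribution with mean μ and standard deviation σ, the slope a comes out close to σ and the intercept b close to μ. The synthetic `score_tol` column (mean 200, standard deviation 40), the plotting positions (i + 0.5) / n and the helper name are assumptions.

```python
import random
import statistics

def normal_qq_line(sample):
    """Fit y = a*x + b through (standard normal quantile, sorted sample value) points."""
    n = len(sample)
    std_normal = statistics.NormalDist()  # mean 0, standard deviation 1
    # Theoretical quantiles at the plotting positions (i + 0.5) / n.
    xs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    ys = sorted(sample)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope, close to the standard deviation
    b = mean_y - a * mean_x  # intercept, close to the mean
    return a, b

rng = random.Random(2)
# Hypothetical stand-in for the total score column.
score_tol = [rng.gauss(200, 40) for _ in range(1000)]

a, b = normal_qq_line(score_tol)
print(round(a, 1), round(b, 1))
```

A slope near 1 and an intercept near 0 would indicate standard normal data; any other straight line still indicates a normal distribution, just with a different mean and standard deviation.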
You can see that the scatter lies roughly along y=ax+b. Now we can say that the score_tol column follows a normal distribution, but not the standard normal distribution.
It is generally said that the points of a Q-Q plot must lie near the line y=x for the data to be considered normal. Why, then, can we still say that score_tol is normally distributed when the points lie near y=ax+b? Because, as the plot shows, the score_tol column can be written as a linear function of the normally distributed column val, score_tol = a * val + b, and a property of the normal distribution settles the matter: if a variable x follows a normal distribution, then its linear function ax+b (with a≠0) also follows a normal distribution.

Author: Fan Xiaojiang

Reviewer: Plasterer

Editor: Mori craftsman
