Hands-on pandas: playing with data (4) -- data cleaning

Look at the future 2021-04-08 10:02:36



Descriptive methods

Before we get to data cleaning, I want to run through some descriptive methods first; otherwise the cleaning discussion later would feel a bit abrupt.

S.No.  Function    Description
1      count()     Number of non-null observations
2      sum()       Sum of the values
3      mean()      Mean of the values
4      median()    Median of the values
5      mode()      Mode of the values
6      std()       Standard deviation of the values
7      min()       Minimum value
8      max()       Maximum value
9      abs()       Absolute value
10     prod()      Product of the values
11     cumsum()    Cumulative sum
12     cumprod()   Cumulative product

Any of these you don't recognize?

Let's take a look at an example:

import pandas as pd

# sample data, for illustration only
d = {'Age': [25, 26, 25, 23, 30, 29, 23],
     'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.60, 3.80]}
df = pd.DataFrame(d)
print(df.std())

This prints the standard deviation of each numeric column.

Notes:

Because a DataFrame is a heterogeneous data structure, not every function applies to every column.

Functions such as sum() and cumsum() can handle both numeric and string elements without raising an error: summing strings simply concatenates them. Aggregating strings this way is rarely useful in practice, but these functions won't throw an exception.

When a DataFrame contains character or string data, functions such as abs() and cumprod() will throw an exception, because those operations cannot be performed on strings.
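A quick sketch of that difference, using a plain Series for clarity:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

# sum() on string data concatenates instead of raising
print(s.sum())  # 'abc'

# abs() has no meaning for strings, so it raises
try:
    s.abs()
except TypeError as e:
    print("abs() failed:", e)
```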


Summarizing data

The describe() function computes summary statistics for the columns of a DataFrame.

Running it (for example, print(df.describe()) on a DataFrame with numeric Age and Rating columns) gives output like this:

 Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000

Just look at the left column; it tells you what each row is: count, mean, std, min, the 25%/50%/75% quartiles, and max.

The function reports the mean, the standard deviation, and the quartiles that make up the IQR. It also excludes character columns, summarizing only the numeric ones.

One parameter worth noting: include.
As the name suggests, it controls which columns are included. If you have many columns and don't want a description of every one of them, that would be a lot of noise, so include lets you choose. By default, describe() summarizes only the numeric columns, but the following options are available:


object - summarize string columns
number - summarize numeric columns
all - summarize every column together ('all' is passed as a plain string, not inside a list)
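A small illustration of those options on a made-up mixed DataFrame (the column names here are just examples):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "James", "Ricky"],
                   "Age": [25, 26, 25]})

print(df.describe())                    # default: numeric columns only (Age)
print(df.describe(include=["object"]))  # string columns: count, unique, top, freq
print(df.describe(include="all"))       # everything; note 'all' is a string, not a list
```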

pandas text data handling

Pandas provides a set of string functions that make it easy to manipulate string data. Most importantly, these functions ignore (or exclude) missing/NaN values.

S.No.  Function             Description
1      lower()              Converts the strings in the Series/Index to lowercase.
2      upper()              Converts the strings in the Series/Index to uppercase.
3      len()                Computes the length of each string.
4      strip()              Removes whitespace (including line breaks) from both sides of each string in the Series/Index.
5      split(' ')           Splits each string on the given pattern.
6      cat(sep=' ')         Concatenates the Series/Index elements with the given separator.
7      get_dummies()        Returns a DataFrame of one-hot encoded values.
8      contains(pattern)    Returns True for each element that contains the substring, otherwise False.
9      replace(a, b)        Replaces the value a with the value b.
10     repeat(value)        Repeats each element the specified number of times.
11     count(pattern)       Returns the number of occurrences of the pattern in each element.
12     startswith(pattern)  Returns True if the Series/Index element starts with the pattern.
13     endswith(pattern)    Returns True if the Series/Index element ends with the pattern.
14     find(pattern)        Returns the position of the first occurrence of the pattern.
15     findall(pattern)     Returns a list of all occurrences of the pattern.
16     swapcase()           Swaps the case (lower to upper and vice versa).
17     islower()            Checks whether all characters in each string of the Series/Index are lowercase. Returns a Boolean.
18     isupper()            Checks whether all characters in each string of the Series/Index are uppercase. Returns a Boolean.
19     isnumeric()          Checks whether all characters in each string of the Series/Index are numeric. Returns a Boolean.
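Note that these are reached through the .str accessor on a Series or Index. A few of them on a toy Series:

```python
import pandas as pd

s = pd.Series(["Tom ", " William", "John", "Alber@t"])

print(s.str.lower())           # lowercase each element
print(s.str.strip())           # trim whitespace on both sides
print(s.str.len())             # length of each string (before stripping)
print(s.str.contains("o"))     # boolean mask: element contains 'o'
print(s.str.replace("@", ""))  # replace '@' with ''
```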

Check for missing values

To make missing values easier to detect (and consistent across arrays of different dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects (pandas uses NaN, Not a Number, to represent missing data):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,5,8],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
print(df)
     0    1    2
0  1.0  5.0  8.0
1  2.0  NaN  NaN
2  2.0  3.0  NaN
3  NaN  NaN  NaN

Checking whether rows or columns contain missing data

In these calls, axis=0 works down each column and axis=1 works across each row:

Check each row:    df.isnull().any(axis=1)
Check each column: df.isnull().any(axis=0)

Here is the row-by-row check:

0 False
1 True
2 True
3 True
dtype: bool

Now compare with notnull():

Check each row:    df.notnull().all(axis=1)
Check each column: df.notnull().all(axis=0)

With all(), a row (or column) is True only when every one of its values is non-null.


Any of the above can be inverted with the negation operator "~":

print(~df.isnull().any(axis = 1))

The same selection can be done through loc[].

Say you want to keep only the rows with no missing data:

df = df.loc[~df.isnull().any(axis = 1)]
     0    1    2
0  1.0  5.0  8.0

More on loc[] later.


You can also check a single column for null values:

print(df[1].isnull())                 # null-value mask for one column
print(df[1].isnull().value_counts())  # counts of null vs non-null in that column
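A common shortcut (not used above, but standard pandas) is to count the missing values in every column at once by chaining isnull() with sum():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

print(df.isnull().sum())  # NaN count per column: 1, 2, 3
```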

Cleaning and filling missing values

Dropping rows/columns with missing values

The most direct approach is to drop every row or column that contains a null value:

df = pd.DataFrame([[1,5,8],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
df = df.dropna()
print(df)

With no extra arguments, any row containing a null value is dropped entirely.

 0 1 2
0 1.0 5.0 8.0

What if you want to drop by column instead? Add axis=1:

df = pd.DataFrame([[1,5,8],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
df = df.dropna(axis=1)
print(df)

Ah, sorry to tell you: everything got dropped. There's nothing left, because every column contains at least one null value...

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

Dropping rows/columns with a tolerance for bad values

Now suppose your business can tolerate one or two bad values in a row. What then? dropna has a thresh parameter for exactly that:

# Keep a row as long as it has at least n non-null values:
df = pd.DataFrame([[1,5,8],[np.nan,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
df = df.dropna(thresh=1)  # n = 1
print(df)
 0 1 2
0 1.0 5.0 8.0
2 2.0 3.0 NaN

Right. If that's still not what you want, there isn't much more dropna can do for you.
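Actually, between "drop on any NaN" and a thresh count there is one more standard option worth knowing: how='all', which drops a row only when every value in it is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8],
                   [np.nan, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

print(df.dropna(how="all"))  # drops only the all-NaN rows 1 and 3
print(df.dropna(thresh=2))   # keeps rows with at least 2 non-NaN values
```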


Dropping a specific row/column

What about deleting a specific column, or a specific row? Let's try it and see:

df = pd.DataFrame([[1,5,8],[np.nan,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
df = df.drop(labels=1)
print(df)
 0 1 2
0 1.0 5.0 8.0
2 2.0 3.0 NaN
3 NaN NaN NaN

There: row 1 is gone. With labels alone, drop() operates on the index, i.e. on rows.

Now watch this:

df = pd.DataFrame([[1,5,8],[np.nan,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
df = df.drop(columns=2)
print(df)

So the earlier call deleted a row, and this one deletes a column:

 0 1
0 1.0 5.0
1 NaN NaN
2 2.0 3.0
3 NaN NaN

Filling missing values

Now let's switch the dataset back and fill in the missing values:

df = pd.DataFrame([[1,5,np.nan],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
df = df.fillna(value=0) # Fill in the missing value with the specified value 
print(df)
 0 1 2
0 1.0 5.0 0.0
1 2.0 0.0 0.0
2 2.0 3.0 0.0
3 0.0 0.0 0.0

Fill a column's missing values with that column's mean:

df = pd.DataFrame([[1,5,np.nan],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
print(df)
df[1] = df[1].fillna(df[1].mean())  # the mean of [5, 3] is 4.0
print(df)
     0    1    2
0  1.0  5.0  NaN
1  2.0  NaN  NaN
2  2.0  3.0  NaN
3  NaN  NaN  NaN
     0    1    2
0  1.0  5.0  NaN
1  2.0  4.0  NaN
2  2.0  3.0  NaN
3  NaN  4.0  NaN

Why not try another column yourself?

Now try it without specifying a column:

df = pd.DataFrame([[1,5,np.nan],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
print(df)
df = df.fillna(df.mean())
print(df)
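Spelling out what that does: df.mean() is computed per column, and each column's NaNs are filled with that column's own mean. Column 2 is entirely NaN, so its mean is NaN and it stays untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, np.nan],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

filled = df.fillna(df.mean())
print(filled)
# column 0 mean = 5/3, column 1 mean = 4.0, column 2 mean = NaN (stays NaN)
```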

Forward filling, top to bottom (starting again from the original dataset):

df = pd.DataFrame([[1,5,np.nan],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
print(df)
df = df.fillna(method='ffill')
print(df)
 0 1 2
0 1.0 5.0 NaN
1 2.0 NaN NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
0 1 2
0 1.0 5.0 NaN
1 2.0 5.0 NaN
2 2.0 3.0 NaN
3 2.0 3.0 NaN

And where there is top-down, there is also bottom-up:

df = pd.DataFrame([[1,5,np.nan],[2,np.nan,np.nan],[2,3,np.nan],[np.nan,np.nan,np.nan]])
print(df)
df = df.fillna(method='bfill')
print(df)
 0 1 2
0 1.0 5.0 NaN
1 2.0 NaN NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
0 1 2
0 1.0 5.0 NaN
1 2.0 3.0 NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
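One note for newer environments: as far as I know, recent pandas releases (2.1 and later) deprecate fillna(method=...) in favor of the dedicated ffill()/bfill() methods, which behave the same way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, np.nan],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

print(df.ffill())  # same result as fillna(method='ffill')
print(df.bfill())  # same result as fillna(method='bfill')
```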

Remove white space from data

# Create data containing stray spaces
dict1 = {"name": [" Xiaohong ", " Xiao Ming ", " Xiao Zhang "],
         "age": [16, 17, 18],
         "city": [" Beijing ", " Hangzhou ", " Shanghai "]}
df2 = pd.DataFrame(dict1, columns=["name", "age", "city"])
print(df2)
# Strip the spaces
df2["city"] = df2["city"].map(str.strip)
print(df2)
 name age city
0 Xiaohong 16 Beijing
1 Xiao Ming 17 Hangzhou
2 Xiao Zhang 18 Shanghai
name age city
0 Xiaohong 16 Beijing
1 Xiao Ming 17 Hangzhou
2 Xiao Zhang 18 Shanghai
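map(str.strip) works here because the column has no missing values; it would crash on NaN. The .str.strip() accessor (from the string-function table earlier) skips NaN for you, so it is the safer habit:

```python
import numpy as np
import pandas as pd

s = pd.Series(["  Beijing ", " Hangzhou", "Shanghai ", np.nan])

print(s.str.strip())  # trims both sides, passes NaN through untouched
```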

Removing duplicates

Suppose you get a very large dataset, you suspect it contains a lot of duplicates, and you want to deduplicate it. What do you do?

drop_duplicates to the rescue.

Let's use a different dataset this time; the old one is getting stale.

df = pd.DataFrame({
    'Country': [1, 1, 2, 12, 34, 23, 45, 34, 23, 12, 2, 3, 4, 1],
    'Income': [1, 1, 2, 10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000, 3000, 15666, 1],
    'Age': [1, 1, 2, 50, 43, 34, 40, 25, 25, 45, 32, 12, 32, 1],
    'group': [1, 1, 2, 'a', 'b', 's', 'd', 'f', 'g', 'h', 'a', 'd', 'a', 1]})
 Country Income Age group
0 1 1 1 1
1 1 1 1 1
2 2 2 2 2
3 12 10000 50 a
4 34 10000 43 b
5 23 5000 34 s
6 45 5002 40 d
7 34 40000 25 f
8 23 50000 25 g
9 12 8000 45 h
10 2 5000 32 a
11 3 3000 12 d
12 4 15666 32 a
13 1 1 1 1

Deduplicate directly:

df.drop_duplicates(inplace=True)  # inplace=True modifies the DataFrame in place
 Country Income Age group
0 1 1 1 1
2 2 2 2 2
3 12 10000 50 a
4 34 10000 43 b
5 23 5000 34 s
6 45 5002 40 d
7 34 40000 25 f
8 23 50000 25 g
9 12 8000 45 h
10 2 5000 32 a
11 3 3000 12 d
12 4 15666 32 a

Two duplicate rows are gone.

But notice the index: when drop_duplicates removes duplicate rows, their index labels go with them, leaving gaps in the index.

How do we fix that? Reset the index:

df.drop_duplicates(inplace=True)
df = df.reset_index(drop=True)
print(df)
 Country Income Age group
0 1 1 1 1
1 2 2 2 2
2 12 10000 50 a
3 34 10000 43 b
4 23 5000 34 s
5 45 5002 40 d
6 34 40000 25 f
7 23 50000 25 g
8 12 8000 45 h
9 2 5000 32 a
10 3 3000 12 d
11 4 15666 32 a

If you want to control which of the duplicate rows is kept (the default is the first one), use the keep parameter. There isn't much to choose from: either 'first' or 'last'.
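A minimal sketch of the choices for keep; besides 'first' (the default) and 'last', keep=False discards every copy of a duplicated row:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 1, 2, 2, 3]})

print(toy.drop_duplicates(keep="first"))  # keeps rows 0, 2, 4
print(toy.drop_duplicates(keep="last"))   # keeps rows 1, 3, 4
print(toy.drop_duplicates(keep=False))    # keeps only row 4 (value 3 never repeats)
```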


Deduplicating on a specified column (running on a fresh copy of the original 14-row dataset):

df.drop_duplicates(inplace=True, subset=['Age'], keep='last')
df = df.reset_index(drop=True)
print(df)
   Country  Income  Age group
0 2 2 2 2
1 12 10000 50 a
2 34 10000 43 b
3 23 5000 34 s
4 45 5002 40 d
5 23 50000 25 g
6 12 8000 45 h
7 3 3000 12 d
8 4 15666 32 a
9 1 1 1 1

What if you want to deduplicate on several columns at once? Think of it like a composite primary key in a database (again starting from a fresh copy of the original dataset):

df.drop_duplicates(inplace=True,subset = ['Age','group'],keep='last')
df = df.reset_index(drop=True)
print(df)
 Country Income Age group
0 2 2 2 2
1 12 10000 50 a
2 34 10000 43 b
3 23 5000 34 s
4 45 5002 40 d
5 34 40000 25 f
6 23 50000 25 g
7 12 8000 45 h
8 3 3000 12 d
9 4 15666 32 a
10 1 1 1 1

If nothing has gone wrong, that's all for this article. See you!!!

Copyright notice
This article was written by [Look at the future]. When reposting, please include a link to the original. Thanks.
https://pythonmana.com/2021/04/20210408100052149N.html
