The dataset used in this article is a CSV table describing football players' skills and market value. It contains more than 60 fields. Dataset download link: Data sets
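1. DataFrame.info()
This function prints a summary of the table that was read in: the number of rows, each column's name, non-null count and dtype, and the memory usage. This is very helpful for planning data preprocessing, because missing values and unexpected types show up immediately.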
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
print(data.info())  # info() prints the summary itself and returns None, so print() adds the trailing "None"
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10441 entries, 0 to 10440
Data columns (total 65 columns):
id 10441 non-null int64
club 10441 non-null int64
league 10441 non-null int64
birth_date 10441 non-null object
height_cm 10441 non-null int64
weight_kg 10441 non-null int64
nationality 10441 non-null int64
potential 10441 non-null int64
...
dtypes: float64(12), int64(50), object(3)
memory usage: 5.2+ MB
None
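Because info() writes its report to standard output and returns None, the text cannot be stored by assigning the return value. The following is a minimal sketch of capturing it through the buf parameter. The small `players` table is a made-up stand-in for train.csv, and only its column names are borrowed from the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv; the values are invented for illustration.
players = pd.DataFrame({
    "id": [1, 2, 3],
    "birth_date": ["02/29/96", "07/01/84", None],
    "height_cm": [180, 175, 188],
})

# info() returns None; passing buf= redirects the report into a text buffer.
buffer = io.StringIO()
result = players.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```

The captured string can then be logged or written to a file alongside other preprocessing notes.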
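2. DataFrame.query()
This function filters rows using a boolean expression written as a string, in which column names can be used directly.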
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
print(data.query('lw>cf'))  # rows where lw is greater than cf, written as an expression string
print(data[data.lw > data.cf])  # the equivalent boolean-mask form
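A query expression can also combine several conditions and refer to ordinary Python variables by prefixing them with @. The sketch below assumes a small made-up table. The columns lw, cf and potential come from the article, while the values and the `min_potential` threshold are hypothetical.

```python
import pandas as pd

# Hypothetical ratings; the values are invented for illustration.
players = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "lw": [70, 55, 80, 60],
    "cf": [65, 60, 82, 50],
    "potential": [76, 72, 81, 64],
})

min_potential = 70  # local variable, referenced inside the query with @

# Expression-string form with two conditions.
by_query = players.query("lw > cf and potential >= @min_potential")

# Equivalent boolean-mask form; each condition needs its own parentheses.
by_mask = players[(players.lw > players.cf) & (players.potential >= min_potential)]

print(by_query)
```

The string form tends to stay readable as conditions accumulate, while the mask form is easier to build programmatically.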
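3. DataFrame.value_counts()
This function counts how often each distinct value occurs in a column. In the example it is called on a single column (a Series), and the result is sorted from most to least frequent.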
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
print(data.work_rate_att.value_counts())
Medium 7155
High 2762
Low 524
Name: work_rate_att, dtype: int64
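Two parameters of value_counts() are often useful here: normalize=True returns fractions instead of raw counts, and dropna=False also counts missing values. A minimal sketch on a made-up column (the values are hypothetical, only the column name comes from the article):

```python
import pandas as pd

# Hypothetical work-rate labels, including one missing value.
work_rate = pd.Series(
    ["Medium", "High", "Medium", "Low", "Medium", "High", None],
    name="work_rate_att",
)

counts = work_rate.value_counts()                    # missing values are skipped by default
shares = work_rate.value_counts(normalize=True)      # fractions of the non-missing values
with_missing = work_rate.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```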
4. DataFrame.sort_values()
This function sorts the rows by the values of one or more columns and returns the sorted result. The order is ascending by default.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
print(data.sort_values(['sho']).head(5))
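Sorting can also be descending and can use a second column to break ties. The sketch below assumes a small made-up table. The columns sho and potential appear in the article, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings; one player has no sho value.
players = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sho": [60, 75, 75, None],
    "potential": [70, 68, 80, 90],
})

# Highest sho first; ties are broken by potential, also descending.
# Rows with a missing sort key are placed at the end.
ranked = players.sort_values(
    ["sho", "potential"],
    ascending=[False, False],
    na_position="last",
)

print(ranked.head(3))
```

sort_values() returns a new DataFrame and leaves the original row order of `players` untouched.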
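5. DataFrame.groupby()
This function splits the data into groups according to the values of one or more keys, so that each group can be aggregated separately.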
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
potential_mean = data['potential'].groupby(data['nationality']).mean().head(5)
print(potential_mean)
nationality
1 74.945338
2 72.914286
3 67.892857
4 69.000000
5 70.024242
Name: potential, dtype: float64
Grouping by several keys at once is also possible:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
potential_mean = data['potential'].head(20).groupby([data['nationality'], data['club']]).mean()
print(potential_mean)
nationality club
1 148 76
461 72
5 83 64
29 593 68
43 213 67
51 258 62
52 112 68
54 604 81
63 415 70
64 359 74
78 293 73
90 221 70
96 80 72
101 458 67
111 365 64
379 83
584 65
138 9 72
155 543 72
163 188 71
Name: potential, dtype: int64
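The examples above group one Series by another Series. A common alternative spelling, sketched here on made-up data, groups the DataFrame by column name and then selects the column to aggregate. The values below are hypothetical, and only the column names come from the article.

```python
import pandas as pd

# Hypothetical players; the values are invented for illustration.
players = pd.DataFrame({
    "nationality": [1, 1, 5, 5, 29],
    "club": [148, 461, 83, 83, 593],
    "potential": [76, 72, 64, 70, 68],
})

# Series grouped by another Series, as in the article.
by_series = players["potential"].groupby(players["nationality"]).mean()

# The same result, grouping the DataFrame by column name.
by_name = players.groupby("nationality")["potential"].mean()

# Two keys give a result indexed by (nationality, club);
# reset_index() turns that index back into ordinary columns.
by_two_keys = players.groupby(["nationality", "club"])["potential"].mean()
flat = by_two_keys.reset_index()

print(by_name)
print(flat)
```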
It is worth noting that calling size() after the grouping function returns the number of rows in each group.
group_sizes = data['potential'].head(200).groupby([data['nationality'], data['club']]).size()
print(group_sizes)
nationality club
1 148 1
43 213 1
51 258 1
52 112 1
54 604 1
78 293 1
96 80 1
101 458 1
155 543 1
163 188 1
Name: potential, dtype: int64
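size() is easy to confuse with count(). As a small sketch on made-up data, size() counts every row in a group, while count() counts only the non-missing values. The table below is hypothetical.

```python
import pandas as pd

# Hypothetical players; two potential values are missing.
players = pd.DataFrame({
    "nationality": [1, 1, 5, 5, 5],
    "potential": [76, None, 64, 70, None],
})

grouped = players.groupby("nationality")["potential"]

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per group

print(sizes)
print(counts)
```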
6. DataFrame.agg()
This function is usually used after groupby() to compute several aggregates for each group in one call.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset/soccer/train.csv')
potential_mean = data['potential'].head(10).groupby(data['nationality']).agg(['max', 'min'])
print(potential_mean)
max min
nationality
1 76 76
43 67 67
51 62 62
52 68 68
54 81 81
78 73 73
96 72 72
101 67 67
155 72 72
163 71 71
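When the result columns need descriptive names, agg() also accepts keyword arguments of the form new_name=(column, function), known as named aggregation. A minimal sketch on made-up data follows. The output names best, worst and n_players are hypothetical.

```python
import pandas as pd

# Hypothetical players; the values are invented for illustration.
players = pd.DataFrame({
    "nationality": [1, 1, 5, 5, 5],
    "potential": [76, 72, 64, 70, 83],
})

# List form, as in the article: one output column per function name.
summary = players.groupby("nationality")["potential"].agg(["max", "min"])

# Named aggregation: each keyword becomes an output column.
named = players.groupby("nationality").agg(
    best=("potential", "max"),
    worst=("potential", "min"),
    n_players=("potential", "count"),
)

print(summary)
print(named)
```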
7. DataFrame.apply()
This function applies a function to every element of a column, or to each row or column of a DataFrame. It is usually far more convenient than writing an explicit loop.
import pandas as pd
import matplotlib.pyplot as plt
# Returns the year in the player's date of birth
def birth_date_deal(birth_date):
    year = birth_date.split('/')[2]
    return year
data = pd.read_csv('dataset/soccer/train.csv')
result = data['birth_date'].apply(birth_date_deal).head()
print(result)
0 96
1 84
2 99
3 88
4 80
Name: birth_date, dtype: object
Of course, with a lambda function the code is more concise:
data = pd.read_csv('dataset/soccer/train.csv')
result = data['birth_date'].apply(lambda x: x.split('/')[2]).head()
print(result)
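For plain string operations like this one, pandas also offers vectorized .str methods, which avoid calling a Python function once per element. The sketch below uses made-up dates and assumes they follow a month/day/two-digit-year layout. That layout is an assumption, since the article only shows the last field.

```python
import pandas as pd

# Hypothetical birth dates; the MM/DD/YY layout is an assumption.
birth_dates = pd.Series(["02/29/96", "07/01/84", "12/31/99"], name="birth_date")

# apply() with a lambda, as in the article.
via_apply = birth_dates.apply(lambda x: x.split("/")[2])

# Vectorized string methods: split every value, then take the third piece.
via_str = birth_dates.str.split("/").str[2]

# Parsing into real dates yields four-digit integer years instead of strings.
years = pd.to_datetime(birth_dates, format="%m/%d/%y").dt.year

print(via_str)
print(years)
```

Parsing the column as dates is usually the better choice when the year is later used in arithmetic, such as computing a player's age.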