[Pandas] A primer on processing CSV datasets with Pandas (data preprocessing for neural networks and machine learning algorithms)

little girl 2022-08-06 06:33:04



The data I collected together with a senior colleague is in CSV format, and I had never worked with CSV data before. I fell into a lot of pitfalls when I used it to train a neural network, so I am recording them here. Hopefully this also makes things easier for later readers.


There are probably quite a few packages for processing CSV files. This tutorial covers pandas only (I haven't used the others, hhhh). I use one of my own datasets as an example to demonstrate some common processing methods.


  1. Statement:
    import pandas as pd
    origin_data = pd.read_csv("origin_data.csv", na_values=" NaN")
  2. What is a null value (NaN) in a CSV file? This is a big pitfall. I recommend passing the na_values parameter shown above when reading a CSV, so that missing values are uniformly parsed as NaN. That way, if you need to filter out missing values by hand later, you can locate their positions. I tried it before: without this parameter, the missing values were not equal to any of False, 0, or "NaN".
  3. Result: [screenshot not preserved]
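
The reading step can be sketched as follows. This is a minimal example, assuming a small in-memory CSV in place of origin_data.csv; the column names and values are made up for illustration and are not the author's data. Missing entries are written as " NaN" with a leading space, matching the na_values argument used above.

```python
import io

import pandas as pd

# Hypothetical stand-in for origin_data.csv; the values are illustrative only.
csv_text = "Age,Sex,Weight change\n23,M,1.5\n31,F, NaN\n45, NaN,-0.7\n"

# na_values adds " NaN" (with the leading space) to the strings treated as missing.
origin_data = pd.read_csv(io.StringIO(csv_text), na_values=" NaN")

# Missing entries become float NaN, which never compares equal to anything,
# so locate them with isna() rather than with ==.
missing_mask = origin_data.isna()
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Using isna() to build a boolean mask is usually the simplest way to find the positions of missing values after loading.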

Indexing a DataFrame column

The CSV data that pandas reads in is wrapped in a structure called a DataFrame, which can be converted to a NumPy array. Let's first look at how to work with a DataFrame.

  1. Statement: use data.name to index a column by its label.
  2. Result: [screenshot not preserved]
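
A short sketch of column indexing and the NumPy conversion mentioned above. The DataFrame here is hypothetical and built inline for illustration. Attribute access (data.name) only works when the column label is a valid Python identifier; bracket access works for any label.

```python
import pandas as pd

# Hypothetical data for illustration only.
data = pd.DataFrame(
    {"Age": [23, 31], "Sex": ["M", "F"], "Weight change": [1.5, -0.7]}
)

age_by_attribute = data.Age            # attribute access: simple names only
age_by_label = data["Age"]             # bracket access: works for any label
weight_change = data["Weight change"]  # labels containing spaces need brackets

# Convert one column, or several numeric columns, to a NumPy array.
age_array = data["Age"].to_numpy()
numeric_array = data[["Age", "Weight change"]].to_numpy()
```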


  1. Statement: use the del keyword with a column label to delete that column
    del origin_data["Weight change"]
  2. Result: you can see that the "Weight change" column has been deleted
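
For comparison, here is a sketch of del next to DataFrame.drop, using a hypothetical DataFrame built inline. del modifies the DataFrame in place, while drop returns a new DataFrame and leaves the original untouched, which can be safer when the original is still needed.

```python
import pandas as pd

# Hypothetical data for illustration only.
origin_data = pd.DataFrame(
    {"Age": [23, 31], "Sex": ["M", "F"], "Weight change": [1.5, -0.7]}
)

# drop returns a new DataFrame; origin_data still has all three columns.
without_weight = origin_data.drop(columns=["Weight change"])
columns_before_del = list(origin_data.columns)

# del removes the column from origin_data in place.
del origin_data["Weight change"]
```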


For missing values, you can generally either fill them in by interpolation or simply discard the affected data. Here I demonstrate deleting the rows that contain NaN values.

  1. Statement: the .dropna() method deletes rows that contain NaN values by default. Use .dropna(axis=1) to delete the columns that contain NaN values instead. You can look up the other options yourself; this usage is the most common.
    origin_data = origin_data.dropna()
  2. Result: you can see that there are fewer rows and no NaN values remain.
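
The options described above can be compared side by side. This is a minimal sketch on hypothetical data; the interpolate() call illustrates the "fill in by interpolation" alternative and uses pandas' default linear method, which is an assumption about what kind of interpolation would suit the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age value.
origin_data = pd.DataFrame(
    {
        "Age": [23.0, np.nan, 45.0, 52.0],
        "Weight change": [1.5, 0.3, -0.7, 2.0],
    }
)

rows_dropped = origin_data.dropna()        # drop rows containing any NaN
cols_dropped = origin_data.dropna(axis=1)  # drop columns containing any NaN
interpolated = origin_data.interpolate()   # linear interpolation down each column
```

Note that rows_dropped keeps the original row labels (0, 2, 3), so its index is no longer contiguous.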


After the data has been processed, its index is likely to be left in a messy state. For example, here we deleted some rows, so the index is no longer contiguous. If we then traverse the data by index, an error is raised as soon as a missing label is looked up. It is therefore generally necessary to reset the index after processing the data.

  1. Statement: the key point here is the drop parameter. drop=True discards the old index and renumbers the rows. drop=False resets the index but keeps the old index as a column.
    origin_data = origin_data.reset_index(drop=True)
  2. Result: [screenshot not preserved]
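
The difference between the two drop settings can be sketched on a tiny hypothetical DataFrame whose index has a gap after dropna():

```python
import numpy as np
import pandas as pd

# Hypothetical data; dropping the NaN row leaves index labels [0, 2].
origin_data = pd.DataFrame({"Age": [23.0, np.nan, 45.0]}).dropna()
labels_after_dropna = list(origin_data.index)
# Label 1 is gone, so origin_data.loc[1, "Age"] would raise a KeyError here.

# drop=True: old labels are discarded, rows are renumbered 0, 1.
renumbered = origin_data.reset_index(drop=True)

# drop=False: rows are renumbered, old labels are kept in a column named "index".
kept = origin_data.reset_index(drop=False)
```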

Modify the value conditionally

During data preprocessing we need to convert some non-numeric values, such as gender or province and city, into numbers. Taking gender as an example, I want to convert M/F to 0/1 so that the neural network can process it.

  1. Statement: use .loc[row, column] to get the value to change, then modify it with a conditional expression
    # .loc looks rows up by label, so this relies on the index being 0..n-1 (hence the reset above)
    for i in range(len(origin_data)):
        origin_data.loc[i, 'Sex'] = 1 if origin_data.loc[i, 'Sex'] == "F" else 0
  2. Result: I changed the data in two columns this way. [screenshot not preserved]
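
As an alternative to the row-by-row loop, the same encoding can be sketched with Series.map, which applies a lookup dictionary to the whole column at once. The data below is hypothetical. One behavioral difference to be aware of: values that are not in the dictionary become NaN, whereas the loop turns anything other than "F" into 0.

```python
import pandas as pd

# Hypothetical data for illustration only.
origin_data = pd.DataFrame({"Sex": ["M", "F", "F", "M"]})

# Same encoding as the loop: F -> 1, M -> 0.
origin_data["Sex"] = origin_data["Sex"].map({"F": 1, "M": 0})
```

Because map does not look rows up by label, it also works when the index has gaps.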
Copyright notice: this article was written by [little girl]. Please include a link to the original when reposting. Thank you. https://pythonmana.com/2022/218/202208060519291274.html