Learning Python Quickly, Day 6: Dataset and DataLoader

The sky is full of stars_ 2020-11-13 00:39:00

PyTorch usually builds its data pipeline with two utility classes, Dataset and DataLoader.

Dataset defines the content of the dataset. It is a list-like data structure with a definite length that supports indexing into its elements.

DataLoader defines how the dataset is loaded batch by batch. It is an iterable object that implements the __iter__ method and yields one batch of data per iteration.

DataLoader controls the batch size, how the elements of a batch are sampled, and how the sampled elements are assembled into the input form the model requires. It can also read data with multiple processes.

In most cases, users only need to implement the __len__ and __getitem__ methods of Dataset to build their own dataset and load it with the default data pipeline.
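
As a minimal sketch of that last point, the hypothetical SquaresDataset below (the class name and toy data are illustrative, not from the article) implements only __len__ and __getitem__ and is then loaded with the default pipeline:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of samples in the dataset
        return self.n

    def __getitem__(self, index):
        # Return one (feature, label) pair for the given index
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index ** 2)])
        return x, y


ds = SquaresDataset(10)
dl = DataLoader(ds, batch_size=4)  # default sampler and default collate_fn

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dl]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch holds the 2 remaining samples because drop_last is left at its default.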

One, Overview of Dataset and DataLoader

1, Steps for getting one batch of data

Let's consider which steps are needed to get one batch of data from a dataset.

(Assume the features and labels of the dataset are the tensors X and Y, so the dataset can be written as (X,Y), and assume the batch size is m.)

1, First, determine the length n of the dataset.

The result looks like: n = 1000.

2, Then sample m indices (m being the batch size) from the range 0 to n-1.

Assume m=4. The result is a list like: indices = [1,4,8,9]

3, Next, fetch the elements of the dataset at those m indices.

The result is a list of tuples like: samples = [(X[1],Y[1]),(X[4],Y[4]),(X[8],Y[8]),(X[9],Y[9])]

4, Finally, assemble the result into two tensors as output.

The result is two tensors, batch = (features,labels), where

features = torch.stack([X[1],X[4],X[8],X[9]])

labels = torch.stack([Y[1],Y[4],Y[8],Y[9]])
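
The four steps can be carried out by hand in a few lines. This is only a sketch: the toy tensors X and Y (random data with 3 features per sample) are assumptions made for illustration, and torch.randperm stands in for the sampling step.

```python
import torch

# Toy dataset (assumed shapes): 1000 samples, 3 features each, binary labels
X = torch.randn(1000, 3)
Y = torch.randint(0, 2, (1000,))
m = 4  # batch size

# Step 1: determine the length of the dataset
n = len(X)

# Step 2: sample m distinct indices from the range 0 to n-1
indices = torch.randperm(n)[:m].tolist()

# Step 3: fetch the elements at those indices
samples = [(X[i], Y[i]) for i in indices]

# Step 4: assemble the result into two tensors
features = torch.stack([x for x, _ in samples])
labels = torch.stack([y for _, y in samples])
batch = (features, labels)

print(indices)
print(features.shape, labels.shape)
```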

2, Division of labor between Dataset and DataLoader

Step 1 above, determining the length of the dataset, is implemented by the __len__ method of Dataset.

Step 2, sampling m indices from the range 0 to n-1, is specified by the sampler and batch_sampler parameters of DataLoader.

The sampler parameter specifies how single elements are sampled. Users generally do not need to set it: by default, random sampling is used when the DataLoader parameter shuffle=True, and sequential sampling when shuffle=False.

The batch_sampler parameter groups the sampled elements into lists, one per batch. Users generally do not need to set it either: by default, when the DataLoader parameter drop_last=True the last batch is discarded if the dataset length is not divisible by the batch size, and when drop_last=False the last batch is kept.

Step 3, fetching the elements of the dataset by index, is implemented by the __getitem__ method of Dataset.

Step 4, assembling the result, is specified by the collate_fn parameter of DataLoader. Users generally do not need to set it.
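
To make this division of labor visible, the sketch below passes an explicit batch_sampler and collate_fn to DataLoader instead of relying on the defaults. The toy data and the helper name stack_collate are assumptions made for illustration.

```python
import torch
from torch.utils.data import (TensorDataset, DataLoader,
                              SequentialSampler, BatchSampler)

# Toy dataset: 10 samples, one feature each, label equal to the index
X = torch.arange(10).float().unsqueeze(1)
Y = torch.arange(10)
ds = TensorDataset(X, Y)

# Step 2: the sampler yields single indices, the batch sampler groups them
sampler = SequentialSampler(ds)
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
index_batches = list(batch_sampler)
print(index_batches)


# Step 4: a hand-written collate function (hypothetical helper)
def stack_collate(samples):
    xs, ys = zip(*samples)
    return torch.stack(xs), torch.stack(ys)


dl = DataLoader(ds, batch_sampler=batch_sampler, collate_fn=stack_collate)
label_batches = [y.tolist() for _, y in dl]
print(label_batches)
```

When batch_sampler is given, batch_size, shuffle, sampler and drop_last must be left at their defaults, since the batch sampler already decides all of them.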

3, Main interfaces of Dataset and DataLoader

Below is pseudo code for the core interface logic of Dataset and DataLoader. It is not fully consistent with the source code.

import torch

class Dataset(object):
    def __init__(self):
        pass

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self,index):
        raise NotImplementedError

class DataLoader(object):
    def __init__(self,dataset,batch_size,collate_fn,shuffle = True,drop_last = False):
        self.dataset = dataset
        self.collate_fn = collate_fn
        # Random sampling when shuffle=True, sequential sampling otherwise
        sampler_cls = torch.utils.data.RandomSampler if shuffle else \
            torch.utils.data.SequentialSampler
        self.sampler = sampler_cls(dataset)
        # Group the sampled indices into lists of batch_size indices
        self.batch_sampler = torch.utils.data.BatchSampler(
            self.sampler,batch_size = batch_size,drop_last = drop_last)
        self.sample_iter = iter(self.batch_sampler)

    def __iter__(self):
        # Restart the index stream on every pass over the data
        self.sample_iter = iter(self.batch_sampler)
        return self

    def __next__(self):
        indices = next(self.sample_iter)
        batch = self.collate_fn([self.dataset[i] for i in indices])
        return batch

Two, Creating datasets with Dataset

Common ways to create a dataset are:

  • Use torch.utils.data.TensorDataset to create a dataset from Tensors (a numpy array or a Pandas DataFrame must first be converted to a Tensor).

  • Use torchvision.datasets.ImageFolder to create an image dataset from an image directory.

  • Inherit from torch.utils.data.Dataset to create a custom dataset.

In addition, you can

  • use torch.utils.data.random_split to split a dataset into several parts, which is often used to split off training, validation and test sets.

  • use the addition operator (+) of Dataset to merge several datasets into one.
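
The conversion mentioned in the first item can be sketched as follows. The toy arrays are hypothetical. A Pandas DataFrame could be handled the same way after calling its to_numpy() method.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Hypothetical toy data: 5 samples with 2 features each, plus integer labels
features_np = np.array([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0],
                        [0.4, 4.0], [0.5, 5.0]])
labels_np = np.array([0, 1, 0, 1, 0])

# numpy arrays are float64 by default; most models expect float32 features
features = torch.from_numpy(features_np).float()
labels = torch.from_numpy(labels_np).long()

ds = TensorDataset(features, labels)
x0, y0 = ds[0]
print(len(ds), x0, y0)
```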

1, Create a dataset from Tensors

import numpy as np
import torch
from torch.utils.data import TensorDataset,Dataset,DataLoader,random_split

# Create a dataset from Tensors
from sklearn import datasets
iris = datasets.load_iris()
ds_iris = TensorDataset(torch.tensor(iris.data),torch.tensor(iris.target))

# Split into a training set and a validation set
n_train = int(len(ds_iris)*0.8)
n_valid = len(ds_iris) - n_train
ds_train,ds_valid = random_split(ds_iris,[n_train,n_valid])

# Load the datasets with DataLoader
dl_train,dl_valid = DataLoader(ds_train,batch_size = 8),DataLoader(ds_valid,batch_size = 8)

# Print the first batch only
for features,labels in dl_train:
    print(features,labels)
    break

tensor([[5.5000, 2.3000, 4.0000, 1.3000],
[6.0000, 2.2000, 5.0000, 1.5000],
[5.1000, 3.5000, 1.4000, 0.2000],
[5.6000, 2.8000, 4.9000, 2.0000],
[6.4000, 3.2000, 4.5000, 1.5000],
[6.9000, 3.1000, 5.4000, 2.1000],
[5.1000, 3.7000, 1.5000, 0.4000],
[4.7000, 3.2000, 1.6000, 0.2000]], dtype=torch.float64) tensor([1, 2, 0, 2, 1, 2, 0, 0])
# Demonstrate the merging effect of the addition operator (`+`)
ds_data = ds_train + ds_valid

print('len(ds_train) = ',len(ds_train))
print('len(ds_valid) = ',len(ds_valid))
print('len(ds_train+ds_valid) = ',len(ds_data))
print(type(ds_data))

len(ds_train) =  120
len(ds_valid) =  30
len(ds_train+ds_valid) =  150
<class 'torch.utils.data.dataset.ConcatDataset'>
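
Because random_split shuffles the indices, two runs generally produce different splits. A small sketch of a reproducible split follows, assuming a PyTorch version whose random_split accepts a generator argument. The 150-sample toy dataset stands in for the iris data.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, random_split

ds = TensorDataset(torch.arange(150))  # stand-in for the 150 iris samples

# A seeded generator makes the split repeatable
generator = torch.Generator().manual_seed(42)
ds_train, ds_valid = random_split(ds, [120, 30], generator=generator)

# The two subsets never share a sample
train_idx = set(int(i) for i in ds_train.indices)
valid_idx = set(int(i) for i in ds_valid.indices)

# Adding the subsets back together yields a ConcatDataset
ds_all = ds_train + ds_valid
print(len(train_idx & valid_idx), len(ds_all), isinstance(ds_all, ConcatDataset))
```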

2, Create an image dataset from an image directory

import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import transforms,datasets

# Define the image augmentation operations
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(), # random horizontal flip
    transforms.RandomVerticalFlip(), # random vertical flip
    transforms.RandomRotation(45), # random rotation within 45 degrees
    transforms.ToTensor() # convert to tensor
])

# Validation images are only converted to tensors (assumed: no augmentation)
transform_valid = transforms.Compose([
    transforms.ToTensor()
])

# Create the datasets from the image directories
ds_train = datasets.ImageFolder("/home/kesci/input/data6936/data/cifar2/train/",
    transform = transform_train,target_transform= lambda t:torch.tensor([t]).float())
ds_valid = datasets.ImageFolder("/home/kesci/input/data6936/data/cifar2/test/",
    transform = transform_valid,target_transform= lambda t:torch.tensor([t]).float())

print(ds_train.class_to_idx)

{'0_airplane': 0, '1_automobile': 1}

# Load the datasets with DataLoader
dl_train = DataLoader(ds_train,batch_size = 50,shuffle = True,num_workers=3)
dl_valid = DataLoader(ds_valid,batch_size = 50,shuffle = True,num_workers=3)

# Print the shapes of the first batch only
for features,labels in dl_train:
    print(features.shape)
    print(labels.shape)
    break

torch.Size([50, 3, 32, 32])
torch.Size([50, 1])

3, Create a custom dataset

Next, we create a custom imdb dataset for a text classification task by inheriting from the Dataset class.

The general idea is as follows. First, tokenize the training-set text and build a vocabulary. Then convert the training-set and test-set text into token ids.

Next, split the converted training-set and test-set data into multiple files by sample, one file per sample.

Finally, fetch the content of the sample with a given index through the list of file names, and build the Dataset from that.

import numpy as np
import pandas as pd
from collections import OrderedDict
import re,string

MAX_WORDS = 10000 # only consider the 10000 most frequent words
MAX_LEN = 200 # keep 200 words per sample

train_data_path = '/home/kesci/input/data6936/data/imdb/train.tsv'
test_data_path = '/home/kesci/input/data6936/data/imdb/test.tsv'
train_token_path = '/home/kesci/input/data6936/data/imdb/train_token.tsv'
test_token_path = '/home/kesci/input/data6936/data/imdb/test_token.tsv'
train_samples_path = '/home/kesci/input/data6936/data/imdb/train_samples/'
test_samples_path = '/home/kesci/input/data6936/data/imdb/test_samples/'

First we build the vocabulary, keeping only the MAX_WORDS most frequent words.

## Build the vocabulary
word_count_dict = {}

# Clean the text: lowercase, drop <br /> tags and punctuation
def clean_text(text):
    lowercase = text.lower().replace("\n"," ")
    stripped_html = re.sub('<br />', ' ',lowercase)
    cleaned_punctuation = re.sub('[%s]'%re.escape(string.punctuation),'',stripped_html)
    return cleaned_punctuation

# Count how often each word occurs in the training set
with open(train_data_path,"r",encoding = 'utf-8') as f:
    for line in f:
        label,text = line.split("\t")
        cleaned_text = clean_text(text)
        for word in cleaned_text.split(" "):
            word_count_dict[word] = word_count_dict.get(word,0)+1

df_word_dict = pd.DataFrame(pd.Series(word_count_dict,name = "count"))
df_word_dict = df_word_dict.sort_values(by = "count",ascending =False)

df_word_dict = df_word_dict[0:MAX_WORDS-2] # keep the MAX_WORDS-2 most frequent words
df_word_dict["word_id"] = range(2,MAX_WORDS) # ids 0 and 1 are reserved for unknown words <unknown> and padding <padding>

word_id_dict = df_word_dict["word_id"].to_dict()
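
For readers who want to try the vocabulary step without the imdb files, here is a small stand-alone sketch of the same idea. It swaps pandas for collections.Counter and uses a hypothetical two-sentence corpus with a tiny MAX_WORDS. It also calls split() without an argument so that empty strings are not counted as words, which differs slightly from the code above.

```python
import re
import string
from collections import Counter

MAX_WORDS = 4  # tiny vocabulary for the demo (the article uses 10000)


def clean_text(text):
    # Lowercase, replace <br /> tags with spaces, strip punctuation
    lowercase = text.lower().replace("\n", " ")
    stripped_html = re.sub("<br />", " ", lowercase)
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)


# Hypothetical toy corpus
corpus = ["Good movie.<br />Good, good!", "Bad movie, really."]

counts = Counter()
for text in corpus:
    counts.update(clean_text(text).split())

# ids 0 and 1 stay reserved for unknown words and padding
top_words = counts.most_common(MAX_WORDS - 2)
word_id_dict = {word: i for i, (word, _) in enumerate(top_words, start=2)}

print(word_id_dict)
print(word_id_dict.get("bad", 0))  # words outside the vocabulary map to 0
```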

Then we use the vocabulary to convert the text into token ids.

# Convert text to tokens

# Pad (or truncate) a token list to a fixed length
def pad(data_list,pad_length):
    padded_list = data_list.copy()
    if len(data_list)> pad_length:
        # Too long: keep the last pad_length tokens
        padded_list = data_list[-pad_length:]
    if len(data_list)< pad_length:
        # Too short: fill with the padding id 1 on the left
        padded_list = [1]*(pad_length-len(data_list))+data_list
    return padded_list

def text_to_token(text_file,token_file):
    with open(text_file,"r",encoding = 'utf-8') as fin,\
      open(token_file,"w",encoding = 'utf-8') as fout:
        for line in fin:
            label,text = line.split("\t")
            cleaned_text = clean_text(text)
            # Words outside the vocabulary get the unknown id 0
            word_token_list = [word_id_dict.get(word, 0) for word in cleaned_text.split(" ")]
            pad_list = pad(word_token_list,MAX_LEN)
            out_line = label+"\t"+" ".join([str(x) for x in pad_list])
            fout.write(out_line+"\n")

text_to_token(train_data_path,train_token_path)
text_to_token(test_data_path,test_token_path)
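
The behavior of the padding step is easy to check on short lists. The sketch below restates pad in a compact form (the pad_id parameter is an addition made for illustration) and shows both branches: short lists are filled with the padding id on the left, long lists keep only their last pad_length tokens.

```python
def pad(data_list, pad_length, pad_id=1):
    # Too long: keep the last pad_length tokens
    if len(data_list) > pad_length:
        return data_list[-pad_length:]
    # Too short (or exact): fill with pad_id on the left
    return [pad_id] * (pad_length - len(data_list)) + data_list


short = pad([5, 6, 7], 5)
long = pad([5, 6, 7, 8, 9, 10], 4)
exact = pad([5, 6, 7], 3)

print(short)  # [1, 1, 5, 6, 7]
print(long)   # [7, 8, 9, 10]
print(exact)  # [5, 6, 7]
```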

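Next, split the token text by sample, so that each file holds the data of one sample.

# Split the samples
import os

if not os.path.exists(train_samples_path):
    os.mkdir(train_samples_path)

if not os.path.exists(test_samples_path):
    os.mkdir(test_samples_path)

def split_samples(token_path,samples_dir):
    with open(token_path,"r",encoding = 'utf-8') as fin:
        i = 0
        for line in fin:
            # Write sample i to its own file: 0.txt, 1.txt, ...
            with open(samples_dir+"%d.txt"%i,"w",encoding = "utf-8") as fout:
                fout.write(line)
            i = i+1

split_samples(train_token_path,train_samples_path)
split_samples(test_token_path,test_samples_path)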
import os
import torch
from torch.utils.data import Dataset,DataLoader

BATCH_SIZE = 20 # assumed value; the batch size is not defined in the text above

class imdbDataset(Dataset):
    def __init__(self,samples_dir):
        self.samples_dir = samples_dir
        self.samples_paths = os.listdir(samples_dir)

    def __len__(self):
        return len(self.samples_paths)

    def __getitem__(self,index):
        # Read the one-line sample file: label, a tab, then space-separated token ids
        path = self.samples_dir + self.samples_paths[index]
        with open(path,"r",encoding = "utf-8") as f:
            line = f.readline()
            label,tokens = line.split("\t")
            label = torch.tensor([float(label)],dtype = torch.float)
            feature = torch.tensor([int(x) for x in tokens.split(" ")],dtype = torch.long)
            return (feature,label)

ds_train = imdbDataset(train_samples_path)
ds_test = imdbDataset(test_samples_path)

dl_train = DataLoader(ds_train,batch_size = BATCH_SIZE,shuffle = True,num_workers=4)
dl_test = DataLoader(ds_test,batch_size = BATCH_SIZE,num_workers=4)

# Print the first batch only
for features,labels in dl_train:
    print(features)
    print(labels)
    break
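
The one-file-per-sample pattern can be tried end to end without the imdb data. The sketch below writes three hypothetical sample files into a temporary directory and reads them back through a Dataset. The class name FileSampleDataset and the toy token lines are assumptions, and the file names are sorted so that the sample order is deterministic.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class FileSampleDataset(Dataset):
    """Hypothetical dataset: one sample per file, 'label<TAB>token ids'."""

    def __init__(self, samples_dir):
        self.samples_dir = samples_dir
        self.sample_names = sorted(os.listdir(samples_dir))

    def __len__(self):
        return len(self.sample_names)

    def __getitem__(self, index):
        path = os.path.join(self.samples_dir, self.sample_names[index])
        with open(path, "r", encoding="utf-8") as f:
            label, tokens = f.readline().rstrip("\n").split("\t")
        feature = torch.tensor([int(x) for x in tokens.split(" ")], dtype=torch.long)
        return feature, torch.tensor([float(label)])


# Write three toy samples, each already padded to length 4
samples_dir = tempfile.mkdtemp()
lines = ["1\t1 1 4 7", "0\t1 2 3 5", "1\t9 8 7 6"]
for i, line in enumerate(lines):
    with open(os.path.join(samples_dir, "%d.txt" % i), "w", encoding="utf-8") as fout:
        fout.write(line + "\n")

ds = FileSampleDataset(samples_dir)
dl = DataLoader(ds, batch_size=2)
shapes = [(tuple(f.shape), tuple(l.shape)) for f, l in dl]
print(len(ds), shapes)
```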

Finally, build a model to test whether the dataset pipeline works.

import torch
from torch import nn
from torchkeras import Model,summary

class Net(Model):

    def __init__(self):
        super(Net, self).__init__()

        # With padding_idx set, the padding token is always mapped to a zero vector during training
        self.embedding = nn.Embedding(num_embeddings = MAX_WORDS,embedding_dim = 3,padding_idx = 1)
        self.conv = nn.Sequential()
        self.conv.add_module("conv_1",nn.Conv1d(in_channels = 3,out_channels = 16,kernel_size = 5))
        self.conv.add_module("pool_1",nn.MaxPool1d(kernel_size = 2))
        self.conv.add_module("conv_2",nn.Conv1d(in_channels = 16,out_channels = 128,kernel_size = 2))
        self.conv.add_module("pool_2",nn.MaxPool1d(kernel_size = 2))

        self.dense = nn.Sequential()
        # Assumed head (its layers are not shown in the text): flatten, one linear unit
        # and a sigmoid, as BCELoss expects; 128 channels * 48 positions = 6144 inputs
        self.dense.add_module("flatten",nn.Flatten())
        self.dense.add_module("linear",nn.Linear(6144,1))
        self.dense.add_module("sigmoid",nn.Sigmoid())

    def forward(self,x):
        x = self.embedding(x).transpose(1,2)
        x = self.conv(x)
        y = self.dense(x)
        return y

model = Net()
model.summary(input_shape = (200,),input_dtype = torch.LongTensor)
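
A quick shape check, written with plain torch.nn so it runs without torchkeras, shows what the embedding and convolution stack above produce for a batch of token ids. The batch of 4 random samples is an assumption made for illustration.

```python
import torch
from torch import nn

MAX_WORDS, MAX_LEN = 10000, 200

embedding = nn.Embedding(num_embeddings=MAX_WORDS, embedding_dim=3, padding_idx=1)
conv = nn.Sequential(
    nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5),    # length 200 -> 196
    nn.MaxPool1d(kernel_size=2),                                 # 196 -> 98
    nn.Conv1d(in_channels=16, out_channels=128, kernel_size=2),  # 98 -> 97
    nn.MaxPool1d(kernel_size=2),                                 # 97 -> 48
)

tokens = torch.randint(0, MAX_WORDS, (4, MAX_LEN))  # hypothetical batch of 4 samples
x = embedding(tokens).transpose(1, 2)  # (batch, embedding_dim, length)
out = conv(x)

print(x.shape, out.shape)
print(out.flatten(1).shape)  # features per sample after flattening
print(embedding.weight[1])   # the padding row starts out as zeros
```

Flattening the convolution output gives 128 * 48 = 6144 features per sample, which is the input size any dense layer placed after this stack has to accept.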



# Compile the model
def accuracy(y_pred,y_true):
    # Threshold the predicted probabilities at 0.5, then average the matches
    y_pred = torch.where(y_pred>0.5,torch.ones_like(y_pred,dtype = torch.float32),
                      torch.zeros_like(y_pred,dtype = torch.float32))
    acc = torch.mean(1-torch.abs(y_true-y_pred))
    return acc

# The metrics_dict argument is assumed here; the end of this call is cut off in the text
model.compile(loss_func = nn.BCELoss(),optimizer= torch.optim.Adagrad(model.parameters(),lr = 0.02),
             metrics_dict={"accuracy":accuracy})

# Train the model
dfhistory = model.fit(10,dl_train,dl_val=dl_test,log_step_freq= 200)
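
The accuracy metric can be checked on a few hand-made predictions. This sketch uses an equivalent, slightly shorter thresholding expression, and the toy tensors are assumptions made for illustration.

```python
import torch


def accuracy(y_pred, y_true):
    # Turn probabilities into hard 0/1 predictions at a threshold of 0.5
    y_hard = (y_pred > 0.5).float()
    # |y_true - y_hard| is 0 for a correct prediction and 1 for a wrong one
    return torch.mean(1 - torch.abs(y_true - y_hard))


y_pred = torch.tensor([[0.9], [0.2], [0.7], [0.4]])
y_true = torch.tensor([[1.0], [0.0], [0.0], [0.0]])

acc = accuracy(y_pred, y_true)
print(acc.item())  # 3 of the 4 predictions are correct
```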

Three, Loading datasets with DataLoader

DataLoader controls the batch size, how the elements of a batch are sampled, and how the sampled elements are assembled into the input form the model requires. It can also read data with multiple processes.

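The main parameters of the DataLoader constructor, with their default values in the PyTorch 1.x releases, are as follows.

DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    sampler=None,
    batch_sampler=None,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
    drop_last=False,
    timeout=0,
    worker_init_fn=None,
)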


In general, we only configure the five parameters dataset, batch_size, shuffle, num_workers and drop_last, and leave the other parameters at their default values.

Besides torch.utils.data.Dataset, DataLoader can also load another kind of dataset, torch.utils.data.IterableDataset.

Whereas a Dataset is equivalent to a list structure, an IterableDataset is equivalent to an iterator structure. It is more complicated and is used less often.
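
As a minimal sketch of the iterator-style alternative, the hypothetical CountStream below implements only __iter__ and has no length and no indexing. DataLoader then simply groups the yielded items into batches in order.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader


class CountStream(IterableDataset):
    """Hypothetical stream that yields the integers start..end-1 as tensors."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        for i in range(self.start, self.end):
            yield torch.tensor(i)


dl = DataLoader(CountStream(0, 10), batch_size=4)
batches = [batch.tolist() for batch in dl]
print(batches)
```

This sketch keeps num_workers at 0. With several worker processes, each worker would iterate over the whole stream unless the dataset shards the data itself, which is part of why IterableDataset is more involved to use.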

  • dataset: the dataset.
  • batch_size: the batch size.
  • shuffle: whether to shuffle the data.
  • sampler: the sampler for single samples. Generally no need to set.
  • batch_sampler: the sampler for batches. Generally no need to set.
  • num_workers: the number of processes used to read data in parallel.
  • collate_fn: the function that assembles the samples of one batch.
  • pin_memory: whether to use pinned (page-locked) memory. The default is False. Pinned memory is never swapped out to virtual memory (disk), and copying from pinned memory to the GPU is faster.
  • drop_last: whether to discard the last batch when it holds fewer than batch_size samples.
  • timeout: the maximum time to wait for loading one batch. Generally no need to set.
  • worker_init_fn: the initialization function run in each worker. Commonly used with IterableDataset. Generally not used.
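
One case where collate_fn does need to be set is when samples have different lengths, because the default collation cannot stack them. The sketch below pads each batch to its longest sequence with torch.nn.utils.rnn.pad_sequence. The toy data and the helper name pad_collate are assumptions, and the padding goes on the right here, unlike the left padding used by the pad function earlier in this article.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical samples: (token ids of varying length, label)
data = [
    (torch.tensor([3, 4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8]), 1),
    (torch.tensor([9, 2, 3, 4]), 0),
]


def pad_collate(samples):
    seqs, labels = zip(*samples)
    # Pad every sequence in the batch to the longest one, using id 1 as padding
    features = pad_sequence(list(seqs), batch_first=True, padding_value=1)
    return features, torch.tensor(labels)


# A plain list works as a dataset: it has a length and supports indexing
dl = DataLoader(data, batch_size=2, collate_fn=pad_collate)
batches = [(f.tolist(), l.tolist()) for f, l in dl]
for features, labels in batches:
    print(features, labels)
```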


ds = TensorDataset(torch.arange(1,50))
dl = DataLoader(ds,
                batch_size = 10,
                shuffle= True,
                drop_last = True)

# Iterate over the data
for batch, in dl:
    print(batch)

tensor([47, 34, 28, 37, 5, 27, 7, 43, 36, 31])
tensor([38, 26, 41, 20, 10, 14, 6, 39, 42, 15])
tensor([ 3, 1, 49, 4, 46, 24, 22, 13, 44, 35])
tensor([16, 21, 17, 29, 33, 2, 48, 23, 11, 8])
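
Only four batches are printed above because the dataset holds 49 samples and drop_last=True discards the incomplete fifth batch. A short sketch comparing the two settings on the same toy dataset, without shuffling:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.arange(1, 50))  # 49 samples

dl_drop = DataLoader(ds, batch_size=10, drop_last=True)
dl_keep = DataLoader(ds, batch_size=10, drop_last=False)

sizes_drop = [len(batch) for batch, in dl_drop]
sizes_keep = [len(batch) for batch, in dl_keep]

print(len(dl_drop), sizes_drop)  # the 9 leftover samples are discarded
print(len(dl_keep), sizes_keep)  # the last batch holds the 9 leftover samples
```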
This article was written by [The sky is full of stars_]. Please include a link to the original when reposting. Thank you.
