Today, while following an online deep-learning video, I was working with the CIFAR-10 dataset. I happily ran the code, only to hit an error:
# -*- coding: utf-8 -*-
import pickle as p
import numpy as np
import os
def load_CIFAR_batch(filename):
    """Load a single batch of the CIFAR dataset."""
    with open(filename, 'r') as f:
        datadict = p.load(f)
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y
def load_CIFAR10(ROOT):
    """Load the full CIFAR dataset."""
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte
Running it produced the following error:
'gbk' codec can't decode byte 0x80 in position 0: illegal multibyte sequence
So I started searching everywhere and asking people for help. The answers online were all much the same, but none of them solved the problem: the error remained. (I spent the whole afternoon searching and tried every one of those answers.) Just as I was getting desperate, I finally stumbled on a different suggestion and, with nothing to lose, gave it a try:
def load_CIFAR_batch(filename):
    """Load a single batch of the CIFAR dataset."""
    with open(filename, 'rb') as f:
        datadict = p.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y
It worked, no more error! Delighted but curious, I wondered: what exactly is encoding='latin1'? I had never seen it before. So I searched and learned:
Latin1 is another name for ISO-8859-1 (sometimes written Latin-1). ISO-8859-1 is a single-byte encoding, backward compatible with ASCII: its range is 0x00-0xFF, where 0x00-0x7F coincides exactly with ASCII, 0x80-0x9F are control characters, and 0xA0-0xFF are printable symbols.
Because ISO-8859-1 uses the entire single-byte space, a system that stores or transmits data as ISO-8859-1 will never reject a byte stream that was produced in any other encoding. In other words, treating any byte stream as ISO-8859-1 text always succeeds. This is an important property; MySQL's default Latin1 database encoding exploits exactly this. ASCII is a 7-bit container, while ISO-8859-1 is an 8-bit one.
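To see both halves of the fix in action, here is a small sketch of my own (not from the original post): latin1 assigns a character to every byte value 0x00-0xFF, so decoding arbitrary binary data never raises, while codecs like GBK (the Windows locale default that produced the error above) reject many byte sequences.

```python
# Sketch (my own, for illustration): why encoding='latin1' cannot fail.

# 1) latin1 covers every byte value 0x00-0xFF, so decoding any
#    binary data always succeeds and round-trips losslessly:
raw = bytes(range(256))              # every possible byte value
text = raw.decode('latin1')          # never raises
assert text.encode('latin1') == raw  # lossless round-trip

# 2) By contrast, GBK rejects many byte sequences, e.g. the 0x80
#    that begins a protocol-2 pickle file:
try:
    b'\x80\x02'.decode('gbk')
except UnicodeDecodeError as e:
    print(e)  # the same kind of error the loader raised
```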
But before I could celebrate, running it again surfaced another problem:
MemoryError
What now? A memory error! It turned out to be about the size of the data.
X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float")
This tells us that each batch is 10000 × 3 × 32 × 32, i.e. over 30 million numbers. The "float" dtype here is actually float64, meaning each number takes 8 bytes, so each batch occupies roughly 240 MB. Loading all six batches (5 training + 1 test) totals close to 1.4 GB of data.
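The arithmetic above can be checked directly (a quick sketch of my own):

```python
# Back-of-the-envelope check: one batch converted to float64
values_per_batch = 10000 * 3 * 32 * 32    # 30,720,000 numbers
bytes_per_batch = values_per_batch * 8    # float64 = 8 bytes each
print(bytes_per_batch / 2**20)            # ~234 MB per batch
print(6 * bytes_per_batch / 2**30)        # ~1.37 GB for 5 train + 1 test
```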
for b in range(1, 2):
    f = os.path.join(ROOT, 'data_batch_%d' % (b,))
    X, Y = load_CIFAR_batch(f)
    xs.append(X)
    ys.append(Y)
So, if memory is tight, change the loop as shown above to load only one batch at a time.
With that, the errors are basically fixed. Here is the corrected code:
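An alternative worth noting (my own suggestion, not from the original post): the blow-up comes from astype("float"), which means float64. Keeping the pixels as float32, or as the uint8 they are stored in on disk, shrinks the footprint considerably:

```python
import numpy as np

# Memory used by one CIFAR-10 batch under different dtypes (my own comparison):
shape = (10000, 32, 32, 3)
for dtype in (np.float64, np.float32, np.uint8):
    x = np.zeros(shape, dtype=dtype)
    print(dtype.__name__, x.nbytes // 2**20, 'MB')
# float64 234 MB, float32 117 MB, uint8 29 MB
```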
# -*- coding: utf-8 -*-
import pickle as p
import numpy as np
import os
def load_CIFAR_batch(filename):
    """Load a single batch of the CIFAR dataset."""
    with open(filename, 'rb') as f:
        datadict = p.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

def load_CIFAR10(ROOT):
    """Load the full CIFAR dataset."""
    xs = []
    ys = []
    for b in range(1, 2):  # one batch to save memory; use range(1, 6) for all five
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)  # collect the batches
        ys.append(Y)
    # Stack into one array: (10000, 32, 32, 3) with one batch,
    # (50000, 32, 32, 3) when all five training batches are loaded
    Xtr = np.concatenate(xs)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte
import numpy as np
from julyedu.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Load the CIFAR-10 dataset
cifar10_dir = 'julyedu/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# Look at the dataset: print the shapes of the loaded arrays
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Incidentally, here is the composition of the CIFAR-10 dataset: