End-to-End Deep Learning Project: Part 1

Building the Training Dataset Directories

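As you will notice in the code later on, we do not load the entire dataset into memory during training. Instead, we rely on Keras's .flow_from_directory() function, which reads the images from disk in batches. However, this function expects the data to be organized into directories in the layout shown in Figure 1.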

To arrange the image files in that layout, we use the following short snippet:

# Build the dataset directory structure
import os
import shutil
from tqdm import tqdm

splits = [(trainX, trainY), (testX, testY), (valX, valY)]
dirnames = ['training', 'evaluation', 'validation']

for i, (data, label) in enumerate(splits):
    outside_dir = dirnames[i]
    for j in tqdm(range(0, len(label)), desc="Iterating over images in sub folder"):
        # the class label doubles as the sub-directory name
        class_dir = label[j]
        # construct the path to the sub-directory
        dirPath = os.path.join(config.BASE_PATH, outside_dir, class_dir)
        # if the output directory does not exist, create it
        if not os.path.exists(dirPath):
            os.makedirs(dirPath)
        # copy the image to this new directory
        src_img = os.path.join(config.ORIG_INPUT_DATASET, data[j])
        shutil.copy(src_img, dirPath)

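While the snippet runs, the tqdm module shows its progress. Once it finishes, you will find three new subdirectories: dataset/training, dataset/evaluation, and dataset/validation. Each of them contains two subdirectories, one for modern houses and one for old houses.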
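To make the resulting layout concrete, here is a minimal, self-contained sketch of the same copy step run against a throwaway temporary directory. The helper name copy_into_class_dirs, the placeholder file names, and the class labels "modern" and "old" are hypothetical and only illustrate the idea; they are not taken from the original project.

```python
import os
import shutil
import tempfile


def copy_into_class_dirs(src_root, base_path, split_name, filenames, labels):
    """Copy each file into base_path/split_name/<label>/ and return the new paths."""
    copied = []
    for filename, label in zip(filenames, labels):
        dest_dir = os.path.join(base_path, split_name, label)
        os.makedirs(dest_dir, exist_ok=True)  # no-op if the directory already exists
        copied.append(shutil.copy(os.path.join(src_root, filename), dest_dir))
    return copied


# Tiny demo: empty placeholder files stand in for real images (hypothetical names).
src_root = tempfile.mkdtemp()
base_path = tempfile.mkdtemp()
demo_files = ["a.jpg", "b.jpg", "c.jpg"]
demo_labels = ["modern", "old", "modern"]
for name in demo_files:
    open(os.path.join(src_root, name), "w").close()

copied = copy_into_class_dirs(src_root, base_path, "training", demo_files, demo_labels)

# Collect the resulting layout as {class_dir: [files]}.
training_dir = os.path.join(base_path, "training")
layout = {
    class_dir: sorted(os.listdir(os.path.join(training_dir, class_dir)))
    for class_dir in sorted(os.listdir(training_dir))
}
print(layout)  # {'modern': ['a.jpg', 'c.jpg'], 'old': ['b.jpg']}
```

Using os.makedirs(..., exist_ok=True) removes the need for a separate existence check, which is a small simplification over the snippet above.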

As a sanity check, let's see how many images we have in each subdirectory.

from imutils import paths  # assumption: list_images comes from the imutils package

trainPath = os.path.join(BASE_PATH, TRAIN)
valPath = os.path.join(BASE_PATH, VAL)
testPath = os.path.join(BASE_PATH, TEST)

totalTrain = len(list(paths.list_images(trainPath)))
totalVal = len(list(paths.list_images(valPath)))
totalTest = len(list(paths.list_images(testPath)))

# note the order: training, test, validation
print(totalTrain, totalTest, totalVal)

********** OUTPUT *******
344 46 61
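The totals above do not show how the images are distributed between the two classes. As a rough sketch using only the standard library (the function name count_images_per_class and the extension list are assumptions, not part of the original code), a per-class count could look like this:

```python
import os
import tempfile

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def count_images_per_class(split_path):
    """Return {class_name: image_count} for each sub-directory of split_path."""
    counts = {}
    for class_name in sorted(os.listdir(split_path)):
        class_path = os.path.join(split_path, class_name)
        if not os.path.isdir(class_path):
            continue  # skip stray files next to the class folders
        counts[class_name] = sum(
            1
            for name in os.listdir(class_path)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


# Demo on a temporary directory with hypothetical class names.
split_path = tempfile.mkdtemp()
for class_name, n_images in (("modern", 3), ("old", 2)):
    os.makedirs(os.path.join(split_path, class_name))
    for k in range(n_images):
        open(os.path.join(split_path, class_name, f"img{k}.jpg"), "w").close()
# A non-image file that should be ignored by the count.
open(os.path.join(split_path, "modern", "notes.txt"), "w").close()

class_counts = count_images_per_class(split_path)
print(class_counts)  # {'modern': 3, 'old': 2}
```

A strongly unbalanced result here would be worth knowing about before training.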

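Note: if your own data is organized in the structure shown below, there is a Python package called split_folders that can rearrange it into the directory structure defined in Figure 1.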

dataset/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        img3.jpg
        ...
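To illustrate what such a split involves, the following is a minimal standard-library sketch. It is not the split_folders package and does not reproduce its API; the function name split_class_folders, the 80/10/10 ratios, and the seed are assumptions chosen for the example.

```python
import os
import random
import shutil
import tempfile


def split_class_folders(input_dir, output_dir, ratios=(0.8, 0.1, 0.1), seed=42):
    """Copy input_dir/<class>/* into output_dir/<split>/<class>/ for three splits."""
    split_names = ("training", "validation", "evaluation")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for class_name in sorted(os.listdir(input_dir)):
        class_path = os.path.join(input_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        files = sorted(os.listdir(class_path))
        rng.shuffle(files)
        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        chunks = (
            files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:],  # whatever remains goes to evaluation
        )
        for split_name, chunk in zip(split_names, chunks):
            dest = os.path.join(output_dir, split_name, class_name)
            os.makedirs(dest, exist_ok=True)
            for name in chunk:
                shutil.copy(os.path.join(class_path, name), dest)


# Demo: two classes with ten empty placeholder files each.
input_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
for class_name in ("class1", "class2"):
    os.makedirs(os.path.join(input_dir, class_name))
    for k in range(10):
        open(os.path.join(input_dir, class_name, f"img{k}.jpg"), "w").close()

split_class_folders(input_dir, output_dir)

split_sizes = {
    split: len(os.listdir(os.path.join(output_dir, split, "class1")))
    for split in ("training", "validation", "evaluation")
}
print(split_sizes)  # {'training': 8, 'validation': 1, 'evaluation': 1}
```

Splitting each class separately keeps the class proportions roughly the same across the three splits.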

Image Preprocessing

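Because we are working with a limited number of samples, it makes sense to use data augmentation, such as rotating or zooming the images.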

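Data augmentation increases the amount of training data available. What it actually does is take a training sample and apply a random transformation to it [source].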

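Keras's ImageDataGenerator supports random augmentation of brightness, rotation, zoom, shear, and more. Best of all, everything is done on the fly while the model is being fitted, so you do not need to compute the augmented images in advance.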

Training data augmentation:

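# assumption: the Keras bundled with TensorFlow is used
from tensorflow.keras.preprocessing.image import ImageDataGenerator

trainAug = ImageDataGenerator(
    rotation_range=90,
    zoom_range=[0.5, 1.0],
    width_shift_range=0.3,
    height_shift_range=0.25,
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest",
    brightness_range=[0.2, 1.0]
)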

Most of the parameters, such as width_shift_range, height_shift_range, zoom_range, and rotation_range, are self-explanatory (if not, check the official Keras documentation).

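One important note: zooming, shifting, or rotating an image can leave some blank areas or pixels in it. Setting fill_mode="nearest", as in the generator above, fills them with the values of the nearest existing pixels.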
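As a toy illustration of what nearest-pixel filling means, the NumPy sketch below shifts a tiny array to the right and fills the exposed columns with the edge values. This is not how Keras implements fill_mode; the function shift_right_nearest is hypothetical and only mimics the visible effect on a horizontal shift.

```python
import numpy as np


def shift_right_nearest(image, shift):
    """Shift a 2-D image right by `shift` pixels, filling the gap with edge values."""
    # Pad `shift` columns on the left by repeating the left-most column,
    # then crop back to the original width.
    padded = np.pad(image, ((0, 0), (shift, 0)), mode="edge")
    return padded[:, :image.shape[1]]


image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]])
shifted = shift_right_nearest(image, 2)
print(shifted)
# [[1 1 1 2]
#  [5 5 5 6]]
```

The two left-most columns, which would otherwise be empty after the shift, repeat the nearest original pixel in each row.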

Validation data augmentation:

valAug = ImageDataGenerator()

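As you can see, no arguments are passed when creating the augmentation object for the validation data. Every parameter therefore keeps its default value, and the defaults disable all transformations. In other words, we apply no augmentation.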

Test data augmentation:

testAug = ImageDataGenerator()

Same as for the validation data.

Creating the Data Generators

The data generators keep feeding augmented images to the model during training. To create them, we can use the flow_from_directory() function.

# Create training batches whilst creating augmented images on the fly
trainGen = trainAug.flow_from_directory(
    directory=trainPath,
    target_size=(224, 224),
    save_to_dir='dataset/augmented/train',
    save_prefix='train',
    shuffle=True
)

# Create val batches
valGen = valAug.flow_from_directory(
    directory=valPath,
    target_size=(224, 224),
    shuffle=True
)

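A few important things to consider:

In each case, directory is set to the path of the training (or validation) images.

target_size is set to (224, 224), which ensures that every image is resized to 224x224 pixels.

For the training generator, we also set save_to_dir to the path of a directory where the augmented images will be saved. This provides a nice sanity check to see whether the images are being randomly transformed the way they should be.

Finally, shuffle is set to True because we want the samples to be shuffled inside the batch generator, so that model.fit() receives random samples whenever it requests a batch. This ensures that batches look different from one epoch to the next, which ultimately makes the model more robust.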
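The effect of the shuffle flag can be sketched with a small pure-Python batch generator. This is only an illustration of the idea, not Keras's internal implementation; the function batch_indices and its parameters are hypothetical.

```python
import random


def batch_indices(num_samples, batch_size, shuffle, rng):
    """Yield lists of sample indices covering one epoch."""
    order = list(range(num_samples))
    if shuffle:
        rng.shuffle(order)  # a new random order each time an epoch starts
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)
epoch_1 = list(batch_indices(10, 4, shuffle=True, rng=rng))
epoch_2 = list(batch_indices(10, 4, shuffle=True, rng=rng))
ordered = list(batch_indices(10, 4, shuffle=False, rng=rng))

print(epoch_1)
print(epoch_2)
print(ordered)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With shuffle=True, each epoch still visits every sample exactly once, but the grouping into batches typically changes from epoch to epoch. With shuffle=False, the batches always follow the original sample order.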

# Create test batches
testGen = testAug.flow_from_directory(
    directory=testPath,
    target_size=(224, 224),
    shuffle=False
)

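Besides setting the correct directory path for testGen, there is one major thing to consider:

shuffle must be set to False.

Why, you ask?

Because this time we do not want the samples to be shuffled inside the test batch generator. Only when shuffle is set to False are the batches created in the order of the file names. During model evaluation, this is what lets us match the file names (and their true labels, accessible through testGen.classes) with the predicted labels.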
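The matching step can be sketched as follows. The three lists are hypothetical stand-ins for testGen.filenames, testGen.classes, and the output of the model's predictions; the values are made up purely to show why a stable order matters.

```python
# Hypothetical stand-ins: in a real run these would come from
# testGen.filenames, testGen.classes, and the model's predictions on testGen.
filenames = ["modern/img1.jpg", "modern/img2.jpg", "old/img3.jpg", "old/img4.jpg"]
true_classes = [0, 0, 1, 1]
predicted_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]


def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


predicted_classes = [argmax(probs) for probs in predicted_probs]

# Because the order is not shuffled, row i of the predictions belongs to file i.
misclassified = [
    name
    for name, truth, pred in zip(filenames, true_classes, predicted_classes)
    if truth != pred
]
accuracy = sum(
    truth == pred for truth, pred in zip(true_classes, predicted_classes)
) / len(true_classes)

print(misclassified)  # ['modern/img2.jpg']
print(accuracy)       # 0.75
```

If the generator had shuffled the samples, the rows of the predictions would no longer line up with the file names and labels, and both results above would be meaningless.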
