Image Recognition With Deep Learning: Convolutional Neural Network

In a previous article, I used a neural network with two hidden layers to train an image recognition model on the CIFAR-10 dataset. That model is able to classify color images into 10 categories, but with significantly low accuracy. In fact, the validation accuracy during training was less than 48%, as shown in the following graphs:

Why does the training accuracy matter?

Training accuracy is an initial indicator of how well a model is learning from the training data. It is the ratio of correctly classified data points to the total number of data points in the training set, usually expressed as a percentage.
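As a minimal illustration (the arrays below are hypothetical, not taken from the CIFAR-10 model), training accuracy is simply the fraction of predictions that match the true labels:

import numpy as np

# Hypothetical example: 10 training samples with true and predicted class labels
y_true = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
y_pred = np.array([3, 1, 4, 0, 5, 9, 2, 6, 4, 3])

# Accuracy = correctly classified samples / total samples
accuracy = np.mean(y_pred == y_true)
print(f"Training accuracy: {accuracy:.0%}")  # 80%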

A high training accuracy suggests that the model is capturing the patterns and relationships in the data effectively. However, a very high training accuracy may be due to overfitting, which is one of the most common machine learning issues. In the case of overfitting, the model memorizes the training data rather than learning the underlying patterns. This results in very poor predictions on new, unseen data.

That's not the case in our model. In fact, it is suffering from another machine learning issue known as underfitting. In this case, the model struggles to achieve high training accuracy because it is too simple to capture the underlying patterns in the data, so it performs poorly on both the training and test datasets.

A convolutional neural network will be used to increase model complexity and improve accuracy. Additionally, a convolutional neural network encodes image-specific features, which helps address the underfitting issue we are currently experiencing.

What Is a Convolutional Neural Network?

A convolutional neural network is composed of neurons that self-optimize through learning. In many respects, it is similar to a typical artificial neural network, but with one distinctive feature: it encodes image-specific features, making it more suitable and more accurate for image recognition. Some of these image-specific features are analogous to tools found in typical image-editing software, such as blurring, sharpening, or changing the brightness of an image.
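To make that analogy concrete, here is a small sketch (assuming SciPy is available and using a random array as a stand-in for a grayscale image) showing how fixed 3x3 kernels blur or sharpen an image; a convolutional layer learns kernels like these from data instead of relying on hand-crafted values:

import numpy as np
from scipy.ndimage import convolve

# Stand-in grayscale image (in practice, load a real 2D pixel array)
image = np.random.rand(32, 32)

# Hand-crafted 3x3 kernels: a box blur and a sharpening filter
blur_kernel = np.full((3, 3), 1 / 9)
sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]])

blurred = convolve(image, blur_kernel)       # averages neighboring pixels
sharpened = convolve(image, sharpen_kernel)  # emphasizes edges

# A Conv2D layer learns kernel values like these during training
# rather than using fixed, hand-crafted ones.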

The convolutional neural network model that we will create has three convolutional layers and three pooling layers, followed by fully connected layers.

The convolution layer (Conv2D) will learn feature representations from the input, while the other layers will compute various features of the image to generate a feature map that serves as the input to the subsequent layer.

The complete model architecture is displayed below:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPool2D, BatchNormalization,
                                     Dropout, Flatten, Dense)

model = Sequential()

# First convolutional block: 75 filters over 32x32x3 CIFAR-10 images
model.add(Conv2D(75, (3,3), strides=1, padding="same", activation="relu",
                 input_shape=(32,32,3)))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2), strides=2, padding="same"))

# Second convolutional block, with dropout for regularization
model.add(Conv2D(50, (3,3), strides=1, padding="same", activation="relu"))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2), strides=2, padding="same"))

# Third convolutional block
model.add(Conv2D(25, (3,3), strides=1, padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2), strides=2, padding="same"))

# Classification head: flatten the feature maps and apply fully connected layers
model.add(Flatten())
model.add(Dense(units=512, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(units=num_classes, activation="softmax"))

Notice that in addition to Conv2D, other layers such as BatchNormalization, MaxPool2D, Dropout, Flatten, and Dense are also used in the model architecture. Each has a specific purpose, as described below:

  1. BatchNormalization - It normalizes the input data, aiming for a mean close to zero and a standard deviation near one. This process stabilizes training, enabling the use of higher learning rates and promoting faster convergence. Additionally, it serves as a form of regularization, thereby reducing the risk of overfitting.
  2. MaxPool2D - It is a form of pooling operation that helps reduce the spatial dimensions (width and height) of the input, effectively reducing the computational complexity and capturing the most salient features of the input data.
  3. Dropout - It is a regularization technique used to prevent overfitting by randomly deactivating a fraction of neurons during each training iteration (epoch). In our model, approximately 20% of the neurons will be deactivated during each training iteration. Additionally, the Dropout layer introduces variability and uncertainty into the learning process, thereby preventing the model from relying too heavily on any particular neuron and making the learning more robust and generalizable to unseen data.
  4. Flatten - It reshapes the input data from a multidimensional tensor into a one-dimensional vector. This is necessary to transition from convolutional layers to fully connected layers, which require one-dimensional input. Our data is multi-dimensional, with spatial features such as width, height, and channels for colored images. The Flatten layer reduces these spatial dimensions into a single dimension while preserving the data's order.
  5. Dense - It is a fully connected layer, as each neuron in this layer is connected to every neuron in the preceding layer. We are using Dense layers in combination with the convolutional layers for the multi-class classification task required by our image recognition model. The sketch after this list traces how these layers transform the input shape.
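For the architecture above, assuming 32x32x3 CIFAR-10 inputs, the expected shape progression can be checked with model.summary(); a rough sketch (batch dimension omitted):

# Input                -> (32, 32, 3)
# Conv2D(75, same)     -> (32, 32, 75)
# MaxPool2D(2x2)       -> (16, 16, 75)
# Conv2D(50, same)     -> (16, 16, 50)
# MaxPool2D(2x2)       -> (8, 8, 50)
# Conv2D(25, same)     -> (8, 8, 25)
# MaxPool2D(2x2)       -> (4, 4, 25)
# Flatten              -> (400,)
# Dense(512)           -> (512,)
# Dense(num_classes)   -> (10,)
model.summary()  # prints the layer-by-layer output shapes and parameter counts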

After defining the model architecture, I compiled the model with categorical cross-entropy and ran it for 20 iterations (epochs).

# Compile with categorical cross-entropy (the optimizer defaults to RMSprop when not specified)
model.compile(loss="categorical_crossentropy", metrics=["accuracy"])

# Train for 20 epochs, validating on the test set after each epoch
history_cnn = model.fit(x_train, y_train, epochs=20,
                        verbose=1, validation_data=(x_test, y_test))
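For reference, x_train, y_train, x_test, and y_test are assumed to be prepared as in the previous article; a minimal sketch of that preprocessing (pixel scaling and one-hot encoded labels):

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

num_classes = 10  # CIFAR-10 has 10 categories

# Load CIFAR-10 and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# One-hot encode the labels for categorical cross-entropy
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)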

This significantly improved the training accuracy.

However, there is a significant discrepancy between the training and validation accuracies, as seen in the graph on the right. This could be attributed to several factors, including:

  1. Overfitting.
  2. Suboptimal hyperparameter tuning, such as learning rate, batch size, or architecture choice.
  3. Data leakage between the training and validation datasets, which inadvertently influences the training process.
  4. An overly complex model might capture noise in the training data, leading to overfitting.

Diagnosing the issue requires more experimentation and fine-tuning of the model while monitoring both training and validation accuracy, in addition to loss. In the right graph above, we can see that the validation loss increases at certain epochs while the training loss continues to decrease, indicating a potential overfitting problem.
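A minimal sketch of how both curves can be plotted from the Keras history object (assuming matplotlib is available):

import matplotlib.pyplot as plt

# Plot training vs. validation accuracy and loss recorded by fit()
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))

ax_acc.plot(history_cnn.history["accuracy"], label="training")
ax_acc.plot(history_cnn.history["val_accuracy"], label="validation")
ax_acc.set_title("Accuracy")
ax_acc.set_xlabel("Epoch")
ax_acc.legend()

ax_loss.plot(history_cnn.history["loss"], label="training")
ax_loss.plot(history_cnn.history["val_loss"], label="validation")
ax_loss.set_title("Loss")
ax_loss.set_xlabel("Epoch")
ax_loss.legend()

plt.show()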

To combat the overfitting issue, I will use the following image augmentation to enhance the dataset, along with early stopping, to reduce the gap and thereby improve the overall model performance on the validation dataset.

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping

# Randomly rotate, shift, zoom, and flip the training images
datagen = ImageDataGenerator(
      rotation_range=20,
      width_shift_range=0.2,
      height_shift_range=0.2,
      zoom_range=0.2,
      horizontal_flip=True,
  )

# Generate augmented images in batches during training
batch_size = 32
train_generator = datagen.flow(x_train, y_train, batch_size=batch_size)

# Stop training if validation loss does not improve for 10 consecutive epochs
early_stopping = EarlyStopping(monitor="val_loss", patience=10,
                               restore_best_weights=True)

history_cnn_img_aug = model.fit(
    train_generator,
    steps_per_epoch=len(x_train) // batch_size,
    epochs=50,
    validation_data=(x_test, y_test),
    callbacks=[early_stopping],
    verbose=1,
)

The model trained for 18 epochs before early stopping kicked in. This training session exhibited less variability in training and validation accuracies; however, it resulted in an overall lower accuracy, as shown in the graphs below:

There are several other things that can be done to improve model performance. It requires further training and experimentation to apply and fine-tune some of the following approaches:

  1. Increase Training Data and Diverse Examples: Incorporating more training data and a wider range of examples can enhance model generalization, especially considering that CIFAR-10 is relatively small.
  2. Consider Different Model Architectures: Exploring alternative model architectures, such as deeper networks or utilizing pre-trained models like ResNet, may yield better results.
  3. Implement a Learning Rate Scheduler: Employing a learning rate scheduler that gradually reduces the learning rate during training can assist the model in converging more effectively (see the sketch after this list).
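As a sketch of the third point, Keras provides the ReduceLROnPlateau callback, which lowers the learning rate when the validation loss stops improving (the values below are illustrative, not tuned):

from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever validation loss plateaus
# for 3 consecutive epochs, down to a minimum of 1e-5
lr_scheduler = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                 patience=3, min_lr=1e-5, verbose=1)

# Pass it alongside early stopping when fitting the model:
# model.fit(..., callbacks=[early_stopping, lr_scheduler])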

There may also be additional methods to enhance model accuracy and convergence. I would greatly appreciate hearing your suggestions on the next steps for this project.

Source Code On GitHub