Defect Detection on Ion Sources

Introduction

I started this project purely out of curiosity, and it turned out to be quite fruitful: it is now being used in the manufacturing process. The goal of the project was to eliminate operator intervention and human error during the inspection process by using a convolutional neural network (CNN). The figure below shows an example of a CNN.

Training a CNN from scratch can be both expensive and time-consuming. Fortunately, transfer learning can significantly reduce training costs. Here is how it works: starting from a pre-trained model, one that has already been trained on more than a million images, we fine-tune it on our specific data. The downside of this approach is that it does not always yield perfect results, and these pre-trained models tend to be very deep!
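For illustration, here is a minimal sketch of this starting point in Keras (the library used later in this post). The ImageNet weights and the 224×224 input size are standard VGG-16 defaults and are assumptions here, not details confirmed above.

```python
import tensorflow as tf

# Load VGG-16 pre-trained on ImageNet (over a million images),
# dropping its original 1000-class classifier head.
base = tf.keras.applications.VGG16(
    weights="imagenet",
    include_top=False,          # drop the original classifier head
    input_shape=(224, 224, 3),  # standard VGG-16 input size (an assumption)
)

# Freeze the base so its learned features are reused, not retrained.
base.trainable = False
```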

Data Labeling

The first mandatory and tedious step is to label and clean up the data. There’s a common saying in data science and machine learning: ‘Garbage in, garbage out.’ If you feed the model images of cats but label half of them as horses, the model may end up classifying the cats as elephants. Therefore, correctly labeling the data is crucial. Since labeling is currently a 1.5-person job, I decided to label the data simply as pass or fail. To streamline the labeling process, I found and modified a Python UI tool for the task, which is available on GitHub.
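With images labeled as pass or fail, one convenient layout (a hypothetical one, the paths and image size below are assumptions) is one folder per class, which Keras can read directly:

```python
import tensorflow as tf

# Hypothetical layout: one folder per class, e.g. data/train/pass and
# data/train/fail. Keras infers each label from its folder name;
# label_mode="binary" yields 0/1 labels for the two classes.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    label_mode="binary",
    image_size=(224, 224),  # resized to the VGG-16 input size assumed earlier
    batch_size=32,
    shuffle=True,
    seed=42,
)
```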

Data Augmentation

CNNs are data-hungry, and insufficient data can significantly reduce their performance. At the time of writing, there are only 1,300 training samples, 380 validation samples, and 200 test samples, equally distributed across classes. To increase the dataset size, we can apply data augmentation techniques such as random zoom, height shift, contrast adjustment, and/or brightness modification. Examples are shown below.
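Here is a minimal Keras sketch of an augmentation pipeline for the transforms listed above; the factor values are illustrative assumptions, not tuned settings.

```python
import tensorflow as tf

# Augmentation pipeline matching the transforms listed above.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomZoom(0.2),                      # random zoom
    tf.keras.layers.RandomTranslation(height_factor=0.1,
                                      width_factor=0.0),  # height shift
    tf.keras.layers.RandomContrast(0.2),                  # contrast adjustment
    tf.keras.layers.RandomBrightness(0.2),                # brightness (TF >= 2.9)
])

# These layers are only active during training; at inference time
# they pass images through unchanged.
```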

Network Architecture

As mentioned previously, transfer learning is used to reduce training time and resource consumption. The CNN architecture is shown in the figure below. The input and augmentation layers are custom-designed to work with the dataset. The base layers are taken from a pretrained model (VGG-16), excluding the input layer. The experimental layers are custom layers added during experimentation to improve model performance. Finally, the output layer consists of a single neuron that outputs a value between 0 and 1. Since there are only two classes, the output is binary: values greater than or equal to 0.5 are rounded to 1, and values below 0.5 are rounded to 0.
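Putting the pieces together, here is a minimal sketch of the full model, reusing the frozen `base` and the `augmentation` pipeline from the sketches above. The single 4096-unit head is a placeholder, since the experimental layers vary between the trial models described below.

```python
import tensorflow as tf

# Assemble the network: custom input/augmentation layers, the pretrained
# VGG-16 base, placeholder experimental layers, and a single sigmoid output.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = augmentation(inputs)                                # augmentation layers
x = tf.keras.applications.vgg16.preprocess_input(x)     # VGG-16 preprocessing
x = base(x, training=False)                             # frozen pretrained base
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(4096, activation="relu")(x)   # experimental layer(s)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # value in [0, 1]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Predictions >= 0.5 round to 1 (one class); predictions < 0.5 round to 0.
```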

Design of Experiments

Multiple experiments were conducted for this study, involving hyperparameter tuning, layer modification, and generalization. All experiments followed these two steps:


  1. Feature Extraction: During this phase, all of the base layers’ weights are frozen, and only the experimental and output layers are trained on the dataset.
  2. Fine-Tuning: In this phase, the top layers of the base model are unfrozen for further training. Which layers, and how many, to unfreeze is purely experimental and based on trial and error. In some models, regularization techniques are applied to mitigate overfitting. A sketch of both phases is shown after this list.
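Below is a minimal sketch of the two phases in Keras, reusing the `model` and `base` defined earlier and assuming a `val_ds` built the same way as `train_ds`. The "block5" cutoff, learning rates, and epoch counts are illustrative assumptions.

```python
import tensorflow as tf

# Phase 1 - Feature extraction: base frozen, only the new head trains.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
history_fe = model.fit(train_ds, validation_data=val_ds, epochs=400)

# Phase 2 - Fine-tuning: unfreeze the top of the base. Which layers to
# unfreeze is trial and error; "block5" here is an illustrative choice.
base.trainable = True
for layer in base.layers:
    if not layer.name.startswith("block5"):
        layer.trainable = False

# Recompile with a learning rate at least one order of magnitude lower.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
history_ft = model.fit(train_ds, validation_data=val_ds, epochs=250)
```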

In neural network training, one of the most critical parameters is the learning rate. It dictates how quickly the network descends toward a minimum of the training loss. If the learning rate is too high, the loss can oscillate or the gradients can explode. Conversely, if it is too low, the model may take an excessive amount of time to converge or struggle to escape local minima in the loss surface.


Solution: The optimal initial learning rate can be determined using a scheduler. Start with a very low learning rate (around 1e-8) and gradually increase it after each epoch until it reaches approximately 0.1. Once the loss-versus-epoch curve is generated, identify the point where the loss decreases fastest (the steepest part of the curve), and map that epoch back to its learning rate to estimate the optimal initial value. Refer to the results section for examples.
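A minimal sketch of such a scheduler in Keras: the learning rate grows exponentially each epoch from 1e-8 toward 0.1. The exact growth factor (one decade every 20 epochs) is an illustrative assumption.

```python
import tensorflow as tf

# Learning-rate finder: grow the LR exponentially each epoch, from ~1e-8
# up to ~0.1 by epoch 140 (1e-8 * 10**(140/20) = 0.1).
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10 ** (epoch / 20)
)

history = model.fit(train_ds, epochs=140, callbacks=[lr_schedule])

# Pair each epoch's loss with the LR used that epoch, then plot loss vs. LR
# and pick the LR where the loss drops most steeply.
lrs = [1e-8 * 10 ** (e / 20) for e in range(140)]
losses = history.history["loss"]
```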

Trials

Model 1: No experimental layers added.
Model 2: Experimental layers: two fully connected 4096-unit layers with ReLU activation.
Model 3: Experimental layers: two fully connected 4096-unit layers with ReLU activation, plus three dropout layers and weight regularizers.
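As an illustration, the experimental head for Model 3 might look like the following sketch; the dropout rate, dropout placement, and L2 factor are assumptions (Model 2 would be the same without the dropout layers and regularizers).

```python
import tensorflow as tf

# Experimental head for Model 3: two 4096-unit ReLU layers with three
# dropout layers and L2 weight regularization (rates/factors assumed).
def experimental_head(x):
    reg = tf.keras.regularizers.l2(1e-4)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(4096, activation="relu", kernel_regularizer=reg)(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(4096, activation="relu", kernel_regularizer=reg)(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    return x
```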

During feature extraction, every model is trained until it overfits the data, and the best epochs are saved using the checkpoint callback from Keras. This saved state serves as the starting point for the fine-tuning process, where the learning rate is reduced by at least one order of magnitude from the learning rate used in feature extraction.
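A minimal sketch of this workflow with Keras's ModelCheckpoint callback; the filename, epoch count, and exact learning rates are illustrative.

```python
import tensorflow as tf

# Save the weights from the best validation epoch during feature extraction.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_feature_extraction.keras",
    monitor="val_loss",
    save_best_only=True,
)
model.fit(train_ds, validation_data=val_ds, epochs=400, callbacks=[checkpoint])

# Restore the best state and recompile with a learning rate at least one
# order of magnitude lower before fine-tuning.
model = tf.keras.models.load_model("best_feature_extraction.keras")
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
```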

Results

The train/validation/test split was 70/20/10 over 1,851 images. The results are summarized below. Model 1 began overfitting after approximately 400 epochs. During the fine-tuning phase, the validation loss began increasing around epoch 100 but plateaued near epoch 250, at which point the learning rate was reduced by a factor of 10, to 1e-5. Longer training runs are currently underway to observe the loss trends.

The model was further tuned by unfreezing all of the base layers from b6 onward.

So far, Model 1 with b6 fine-tuning has yielded the best metrics on the new dataset, with an accuracy of 88.71% and an F1-score of 0.8868. Additionally, the test-set metrics are more informative than previous results: in earlier experiments, the models produced similar results for both the feature-extraction and fine-tuning phases, likely due to the small size of the test set.

                      Accuracy   Precision   Recall   F1-Score
Model 1
  Feature Extraction    0.882      0.886      0.882     0.881
  Fine-Tuning           0.871      0.873      0.871     0.871
  Fine-Tuning b6        0.887      0.891      0.887     0.887