Recreation, U-net: Convolutional networks for biomedical image segmentation

by Jacob Stachowicz, Max Joel Söderberg and Anton Ivarsson

Figure 1: Example U-net structure

Figure 2: Sample from the HeLa data set.

## Abstract

The subject of this project was biomedical image segmentation, specifically reproducing the results of the paper “U-net: Convolutional networks for biomedical image
segmentation” by Olaf Ronneberger, Philipp Fischer and Thomas Brox. After
the reproduction, the goal was to implement another segmentation model and
compare how the strategies used by Ronneberger et al. (2015) performed on that
model. Due to time constraints, the other model was not implemented. An
experiment was conducted to validate the data augmentation strategy used
by Ronneberger et al. (2015), and we conclude that the strategy is
sound. Without the strategy, our implemented U-net reached an IoU score of 0.332
and an F1 score of 0.605. With the strategy, it reached an
IoU score of 0.770 and an F1 score of 0.891. An algorithm was implemented to
support a second experiment on the custom loss function used by Ronneberger
et al. (2015). That experiment was not completed because the chosen deep learning
framework is opinionated, which made the custom loss difficult to add. However, studying the computed weights indicates
that they could help the model learn better borders between cells.

## 1 Introduction

In this project we work with biomedical image segmentation of cell images. Segmentation of
biomedical images is useful for a number of different purposes. For instance, Punitha et al. (2018)
used a feed-forward neural network to segment images of benign and malignant breast cancer.
This project aims to reproduce the results of the paper “U-net: Convolutional
networks for biomedical image segmentation” by Olaf Ronneberger, Philipp Fischer and Thomas
Brox. The authors of this paper mention that there is a broad consensus that successful training of
deep learning networks requires many thousands of annotated training samples. However, they
present a strategy that relies heavily on data augmentation instead of a large sample
size, using a network architecture called “U-net”. The purpose of the data augmentation approach
is to use the available annotated training samples more efficiently (Ronneberger et al. 2015). To
reproduce the results we first implement the network. We then validate that the strategies of
Ronneberger et al. (2015), apart from the architecture itself, give
the improvements claimed. In addition to reproducing the results, the project was intended
to compare the segmentation network with another method of segmenting images,
that is, to train another network architecture for image segmentation with the same surrounding
strategies as Ronneberger et al. (2015).

Our initial assessment of the workload required to reproduce the results achieved by Ronneberger
et al. (2015) was incorrect. This flawed assessment left us without enough time to compare
the results to a network with another architecture. Because the chosen deep learning framework is
opinionated, we had trouble adding the custom loss function that Ronneberger et al. (2015) used
for instance segmentation. The custom loss function uses precomputed weights for each training
sample. Even though we could not experiment with this custom loss function, we did implement an
algorithm for calculating the weights. Studying the computed weights indicates that they could
help the model learn better borders between cells that are stuck together. The lack of precomputed
border weights in training forced us to focus on semantic segmentation, while Ronneberger et al.
(2015) focused on instance segmentation.

We conclude that the data augmentation strategy used by Ronneberger et al. (2015) is
sound. Trained without data augmentation, our U-net achieved an IoU score of 0.332
and an F1 score of 0.605. With the data augmentation strategy, it achieved an IoU score of 0.770 and an
F1 score of 0.891. These scores show that cell images can be segmented accurately
without large amounts of data when this data augmentation strategy is used. All scores were calculated on
test data, using the model with the lowest validation loss.

## 2 Related work

As mentioned in the introduction, this project was partly an attempt to reproduce the results of
Ronneberger et al. (2015). Because of this, the main source of reference was that
paper. Ronneberger et al. (2015) implemented a convolutional neural network with the architecture
shown in figure 1.

The authors implemented data augmentation that was mostly based on elastic deformations, as
well as Gaussian noise. The authors achieved well-defined borders, which
can seemingly be attributed to the custom loss function they implemented. This custom loss
function was a weighted pixel-wise binary cross entropy that took advantage of precomputed weights
for each pixel in the data set. The initial weights of the network were drawn from a Gaussian
distribution with standard deviation $\sqrt{2/N}$, where $N$ denotes the number of incoming nodes of one
neuron (Ronneberger et al. 2015). The authors' network was implemented using Caffe with MATLAB,
as well as some parts in C++.
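
The initialisation rule above can be sketched in a few lines of NumPy. This is an illustration rather than code from either project: the function name `gaussian_init`, the weight layout and the fixed seed are assumptions. For a 3x3 convolution with 64 input channels, N = 9 · 64 = 576.

```python
import numpy as np


def gaussian_init(kernel_size, in_channels, out_channels, seed=0):
    """Draw conv weights from a Gaussian with std sqrt(2 / N)."""
    # N: number of incoming nodes of one neuron (kernel area * input channels).
    fan_in = kernel_size * kernel_size * in_channels
    std = np.sqrt(2.0 / fan_in)
    rng = np.random.default_rng(seed)  # fixed seed only for reproducibility
    shape = (kernel_size, kernel_size, in_channels, out_channels)
    return rng.normal(loc=0.0, scale=std, size=shape)


weights = gaussian_init(kernel_size=3, in_channels=64, out_channels=64)
print(weights.shape, weights.std())
```

The empirical standard deviation of the sampled weights should land close to the target value of sqrt(2/576).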

## 3 Data

The data used in this project was the DIC-C2DH-HeLa data set (Cell Tracking Challenge, 2D+Time
Datasets n.d.). This data set consists of 168 differential interference contrast (DIC) microscopy images
of HeLa cells on flat glass, as seen in figure 2. The images have a resolution of 512x512 pixels.
Of the 168 images in the data set, 20 percent were used as test data and the rest were used for training
and validation. We created augmentations of all the data intended for training and validation.
From each image 12 augmented versions were created, which led to a training and
validation set 13 times larger.

## 4 Methods
This section explains how the network was built and trained, and how the data was preprocessed.

### 4.1 Structure of the network
In this project, the convolutional network presented in Ronneberger et al. (2015) was implemented
using the Keras framework. The network consists of 23 convolutional layers, each with a ReLU activation function.
Due to memory constraints in the GPU, the batch size was set to 1 to be able to use large images. The
model used a Nesterov SGD optimizer with a momentum of 0.99, so that previously seen training samples
heavily influence each update in the optimization step.
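
To make the effect of the high momentum concrete, here is a minimal sketch of one Nesterov update for a single scalar parameter, written in the form documented for Keras' SGD optimizer with `nesterov=True`. The learning rate and the toy objective are assumptions for illustration; the learning rate used in the project is not stated here.

```python
def nesterov_sgd_step(w, velocity, grad, learning_rate, momentum=0.99):
    """One Nesterov-momentum update for a single scalar parameter."""
    # The velocity keeps 99% of its previous value, so gradients from earlier
    # samples keep steering the update for many steps.
    velocity = momentum * velocity - learning_rate * grad
    w = w + momentum * velocity - learning_rate * grad
    return w, velocity


# Toy objective f(w) = w**2 with gradient 2*w (illustration only).
w, velocity = 1.0, 0.0
for _ in range(3):
    w, velocity = nesterov_sgd_step(w, velocity, grad=2.0 * w, learning_rate=0.1)
print(w, velocity)
```

With a batch size of 1, each gradient comes from a single image, which is one reason a long-memory velocity term can be useful.
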
In Ronneberger et al. (2015), accuracy was measured using IoU (1), which compares the intersection of the predicted and ground-truth cell regions with their union. In addition to that metric, we also used the F1 score
(2), which measures accuracy in terms of precision and recall.
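
Both metrics can be computed from the pixel counts of true positives (TP), false positives (FP) and false negatives (FN): IoU = TP / (TP + FP + FN) and F1 = 2·TP / (2·TP + FP + FN). The sketch below is an illustration, not the project's evaluation code. The function name and the 0.5 threshold applied to the predicted probabilities are assumptions.

```python
import numpy as np


def iou_and_f1(y_true, y_pred, threshold=0.5):
    """Return (IoU, F1) for the foreground class of one binary mask pair."""
    truth = np.asarray(y_true) > 0.5
    pred = np.asarray(y_pred) >= threshold  # assumed decision threshold

    tp = np.logical_and(truth, pred).sum()   # cell pixels found
    fp = np.logical_and(~truth, pred).sum()  # background predicted as cell
    fn = np.logical_and(truth, ~pred).sum()  # cell pixels missed

    union = tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return float(iou), float(f1)


truth = [[1, 0, 0], [0, 1, 1]]
pred = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.3]]
iou, f1 = iou_and_f1(truth, pred)
print(iou, f1)
```

In this toy example TP = 2, FP = 1 and FN = 1, so the IoU is 2/4 and the F1 score is 4/6.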

In image segmentation, it is possible to use binary cross entropy (3) as the loss function, since segmentation is a pixel-wise classification
task.
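
As a minimal sketch of what this loss computes, the snippet below evaluates the standard per-pixel term $-[y \log p + (1-y)\log(1-p)]$ and averages it over the image. The function name and the clipping constant `eps` are assumptions. Keras provides its own built-in implementation, which is presumably what the project used.

```python
import numpy as np


def pixelwise_bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over all pixels of a mask."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs (eps is an assumed value).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_pixel = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_pixel.mean())


good = pixelwise_bce([[1, 0]], [[0.9, 0.1]])  # confident and correct
bad = pixelwise_bce([[1, 0]], [[0.1, 0.9]])   # confident and wrong
print(good, bad)
```

Confident wrong predictions are penalised far more heavily than confident correct ones, which is what drives each pixel towards its label.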

Training a large neural network on a small amount of data introduces the risk of overfitting the
network. To prevent this, dropout was added at each layer with rates ranging between 0.1 and 0.3.
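
For readers unfamiliar with dropout, the following NumPy sketch shows the usual "inverted" formulation, in which a random fraction of activations is zeroed during training and the remaining ones are rescaled by 1 / (1 - rate). It is an illustration with assumed names and shapes, not the layer implementation used in the project.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
features = np.ones((64, 64))
dropped = dropout(features, rate=0.2, rng=rng)
print((dropped == 0).mean())
```

Because a different subset of units is silenced at every step, the network cannot rely on any single activation, which counteracts overfitting.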

### 4.2 Custom loss function
The loss function is defined in equation 4, and equation 5 shows how the
weight map is calculated. Here $\Omega$ is the set of pixels in a given image, $w_c(x)$ is a precomputed weight
map that counteracts class imbalance, and $d_1(x)$ and $d_2(x)$ are the distances from pixel $x$ to the closest
and second closest cell, respectively. $\ell$ is the true label of each pixel.
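
For reference, the two formulas can be written as follows, in the notation of Ronneberger et al. (2015) with the pixel position written as $x$. This is a reconstruction from the original paper, not from our own code. $p_{\ell(x)}(x)$ is the soft-max probability assigned to the true class of pixel $x$, and $w_0$ and $\sigma$ are constants, which the original paper sets to $w_0 = 10$ and $\sigma \approx 5$ pixels. The energy is written here with a leading minus sign so that lower values are better.

$$
E = -\sum_{x \in \Omega} w(x)\, \log\!\big(p_{\ell(x)}(x)\big) \tag{4}
$$

$$
w(x) = w_c(x) + w_0 \cdot \exp\!\left(-\frac{\big(d_1(x) + d_2(x)\big)^2}{2\sigma^2}\right) \tag{5}
$$

The exponential term is large only where both $d_1(x)$ and $d_2(x)$ are small, that is, in narrow gaps between two neighbouring cells.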

### 4.3 Preprocessing of data

Initially the label data consisted of matrices holding the ground truth of where each cell was
located and which class label it had (Figure 3a). Before training, these matrices were processed so
that every cell was represented by ones and the background by zeros (Figure 3b). This was necessary
since the class label assigned to each cell was arbitrary, and it also enabled the use of binary cross
entropy as a loss function.
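
The label conversion can be sketched as a single thresholding step. The sketch assumes that the background is encoded as 0 in the raw label matrices and that every cell has a positive integer id. The function name is hypothetical.

```python
import numpy as np


def binarize_labels(label_image):
    """Map every cell id (> 0) to 1 and keep background (0) as 0."""
    return (np.asarray(label_image) > 0).astype(np.uint8)


raw = np.array([[0, 0, 3],
                [7, 7, 3],
                [0, 0, 0]])
binary = binarize_labels(raw)
print(binary)
```

Note that this step discards the information about which pixels belong to which cell, so two touching cells become one connected foreground region.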

Figure 3: (a) Unprocessed label (b) Color processed label.

One key feature of the U-net paper was the augmentation of the training and validation
images and their corresponding labels. The main advantage this offers is an increase in training data.
The augmentation techniques used on the images in our project were rotation, zooming, changing
width or height, and horizontal flip (Figure 4). Our augmentations allowed us to increase our training data 12-fold, from 134 to 1447 training images, before being limited by the 16 GB of memory
in our training environment.
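
The essential constraint in augmenting segmentation data is that an image and its label must receive exactly the same geometric transform. The sketch below illustrates this with a deliberately simplified set of transforms, a random horizontal flip and a rotation by a multiple of 90 degrees, which need no interpolation. The project itself used arbitrary rotations, zooming and shifts. The function name `augment_pair` is hypothetical.

```python
import numpy as np


def augment_pair(image, mask, rng):
    """Apply one random flip/rotation identically to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    k = int(rng.integers(0, 4))  # number of quarter turns (simplification)
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    return image.copy(), mask.copy()


rng = np.random.default_rng(0)
image = np.random.default_rng(1).random((8, 8))
mask = (image > 0.5).astype(np.uint8)  # toy mask derived from the image
aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```

Because the toy mask is derived from the image, the pair stays consistent after augmentation only if both arrays were transformed in the same way.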

Figure 4: (a) A sample picture from the HeLa data set without augmentation (b) Augmentation of the
first image with a rotation of the y-axis

### 4.4 Training

The model was trained for at most 100 epochs. An early stopping strategy was
used in which training stopped if the validation loss did not improve for 5 consecutive epochs.
If early stopping occurred, the best weights according to the validation loss were restored. The time
required for training was substantial, which limited the number of runs. Training with data
augmentation took an entire night on a system with 16 GB of RAM and an Nvidia
GTX 1070 graphics card with 8 GB of dedicated memory.
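
The stopping rule can be sketched independently of any framework. The function below takes a sequence of per-epoch validation losses and reports at which epoch training would stop and which epoch's weights would be restored. The function name and the example losses are made up for illustration. Keras offers this behaviour through its `EarlyStopping` callback (`monitor`, `patience`, `restore_best_weights`).

```python
def early_stopping(val_losses, patience=5, max_epochs=100):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # consecutive epochs without improvement
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return last_epoch, best_epoch


# Hypothetical validation losses: the best value is reached at epoch 2.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.70, 0.71, 0.80, 0.50]
stop_epoch, best_epoch = early_stopping(losses)
print(stop_epoch, best_epoch)
```

In this example training halts before the final, lower loss is ever seen, which shows the trade-off of a short patience.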

### 4.5 Experiment strategy

To validate the data augmentation strategy of Ronneberger et al. (2015), our model was trained
with and without said strategy, and performance was measured in each case. The same approach
was planned for the custom loss function and the other model. However, due to time constraints the
other model was not implemented and consequently not tested. The weights that were supposed to be
used in the custom loss function were studied, but not tested in actual training.

## 5 Experiments
In this section the results of our experiments are presented.

### 5.1 Effects of data augmentation

Figure 5 shows the loss over the training epochs for the model with and
without augmented images. Figure 6 shows the F1 score over the training
epochs for the model with and without data augmentation.

Figure 5: Binary cross entropy loss for each epoch through the training of the model (a) with
augmentation and (b) without augmentation.

Figure 6: F1-score for each epoch through the training of the model (a) with augmentation and (b)
without augmentation.

As shown in figure 6b, for the model trained without augmented data the training score increases
with some instability, while the validation score drops and stays around zero after 5 epochs. This
behavior indicates overfitting, which is common when there are too few samples in
the training data set (Russell & Norvig 2010). As a result, the model is unable to generalize
well to the unseen images in the validation data set. Increasing the training data using augmentation
reduces overfitting, which is visible in figure 6a. Figure 7c shows that the model
trained with the added augmented images makes confident predictions that are close to the ground
truth (figure 7d), while the predictions of the model trained without augmentations (figure 7b) are indecisive
and unclear.

Figure 7: (a) Sample from the test data set. (b) The prediction made by the model without augmented
training images. (c) The prediction made by the model with augmented training data. (d) The
ground truth for the sample picture. The bar to the right of the images shows the probability scale.

### 5.2 Custom loss function and precomputed weights

The energy function is computed by a pixel-wise soft-max over the final feature map combined with
the cross entropy loss function. Figure 8c indicates how the energy function penalizes the deviation
of each pixel from the ground truth. As mentioned in section 4.2, the weights used in the custom loss
function consist of two parts. The first part accounts for class imbalance and is referred to as $w_c$ (figure
8b). The second part takes into account the distance from each pixel to the cell borders (figure 8c). The
combined weights, shown in figure 8d, indicate that using the custom loss function could improve the
prediction around the cell borders.
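
A weight map of this kind can be sketched with SciPy's Euclidean distance transform. The snippet is an illustration under stated assumptions and not the algorithm implemented in the project: it expects an instance-label image (0 for background, a distinct positive id per cell), uses a simple frequency-based choice for the class-balance term, applies the border term to background pixels only, and takes the constants `w0=10` and `sigma=5` reported by Ronneberger et al. (2015).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def unet_weight_map(labels, w0=10.0, sigma=5.0):
    """Per-pixel weights: class-balance term plus cell-border term."""
    labels = np.asarray(labels)
    foreground = labels > 0

    # Class-balance term (assumed scheme): each class is weighted by the
    # frequency of the other class, so the rarer class gets the larger weight.
    fg_fraction = foreground.mean()
    w_c = np.where(foreground, 1.0 - fg_fraction, fg_fraction)

    cell_ids = np.unique(labels[foreground])
    if len(cell_ids) < 2:
        return w_c  # fewer than two cells: no separation border to emphasise

    # distances[i, y, x]: distance from pixel (y, x) to the nearest pixel of cell i.
    distances = np.stack([distance_transform_edt(labels != cid) for cid in cell_ids])
    distances.sort(axis=0)
    d1, d2 = distances[0], distances[1]  # closest and second closest cell

    border = w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
    return w_c + border * ~foreground  # border term on background pixels only


labels = np.zeros((9, 20), dtype=int)
labels[2:7, 2:8] = 1    # first cell
labels[2:7, 10:16] = 2  # second cell, separated by a two-pixel gap
weights = unet_weight_map(labels)
print(weights[4, 8], weights[0, 19])
```

In the toy example the background pixels in the narrow gap between the two cells receive a much larger weight than background pixels far from both cells, which is the behaviour figure 8 illustrates.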

Figure 8: (a) Mask image, (b) Class imbalance mask, (c) Border mask (d) Image weights

## 6 Conclusion

This project shows that the accuracy of a U-net model trained on a small training set can be
increased by enlarging the training data with augmentations. Even though the experiment
with the custom loss function could not be completed as intended, studying the computed weights
indicates that they could enhance the model’s prediction performance around and between cells.
These two experiments partially confirm the findings of the original paper by Ronneberger
et al. (2015). Our implementation can be found at: https://gits-15.sys.kth.se/aivarss/DD2424Project.

## References

Cell Tracking Challenge, 2D+Time Datasets (n.d.), http://celltrackingchallenge.net/2d-datasets/. Accessed: 2020-05-01.
Punitha, S., Amuthan, A. & Joseph, K. S. (2018), ‘Benign and malignant breast cancer segmentation using optimized region growing technique’, Future Computing and Informatics Journal 3(2), 348–358.
Ronneberger, O., Fischer, P. & Brox, T. (2015), U-net: Convolutional networks for biomedical image segmentation, in ‘International Conference on Medical image computing and computer-assisted intervention’, Springer, pp. 234–241.
Russell, S. & Norvig, P. (2010), Artificial Intelligence: A Modern Approach, 3 edn, Prentice Hall.