Summary - 95.28% on the test set
My best network has the following architecture (a code sketch follows the list):
- 8 layers of 3x3 convolutions with zero-padding with 20, 20, 50, 50, 50, 50, 50, 50 feature maps
- 4 max pooling layers with subsampling, one after every second convolutional layer
- a fully-connected layer of 500 units followed by a softmax
- mean-only batch normalization before rectifiers in convolutional layers
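For concreteness, here is a minimal PyTorch sketch of this architecture, including a mean-only batch normalization layer (subtract the per-channel minibatch mean, add a learned bias, no variance scaling). This is not the original code: the class names, the `num_classes` argument, the running-mean momentum, and the rectifier after the 500-unit layer are my assumptions.

```python
import torch
import torch.nn as nn


class MeanOnlyBatchNorm2d(nn.Module):
    """Mean-only batch norm: subtract the per-channel minibatch mean and add a
    learned bias; no variance scaling. A running mean is kept for evaluation."""

    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.momentum = momentum
        self.register_buffer("running_mean", torch.zeros(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=(0, 2, 3))
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean.detach())
        else:
            mean = self.running_mean
        return x - mean.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)


def conv_block(in_ch, out_ch):
    # 3x3 convolution with zero padding, mean-only batch norm before the rectifier
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),  # BN bias replaces the conv bias
        MeanOnlyBatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class ConvNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        maps = [20, 20, 50, 50, 50, 50, 50, 50]
        layers, in_ch = [], 3
        for i, out_ch in enumerate(maps):
            layers.append(conv_block(in_ch, out_ch))
            if i % 2 == 1:                       # max pooling after every second conv layer
                layers.append(nn.MaxPool2d(2))
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # 128x128 input -> 8x8 feature maps after four 2x2 poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 8 * 8, 500),
            nn.ReLU(inplace=True),               # assumption: rectifier after the 500-unit layer
            nn.Linear(500, num_classes),         # softmax is applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```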
I trained as follows:
- all pixel intensities were normalized to a common range
- a random crop of 128x128 was taken from an image prescaled to have a minimum dimension of 150
- all layers were initialized using the method proposed by Glorot et al.
- batch size 100, RMSProp with learning rate 0.0003 and decay rate 0.999
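A rough sketch of this training setup, again in PyTorch rather than the original code: the torchvision transforms approximate the preprocessing, `ConvNet` refers to the sketch above, the `"train/"` dataset path is a placeholder, and the [0, 1] intensity range implied by `ToTensor` is an assumption.

```python
import torch.nn as nn
from torch.optim import RMSprop
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Preprocessing: rescale so the shorter side is 150 pixels, then take a random
# 128x128 crop; ToTensor maps pixel intensities into [0, 1] (assumed range).
train_transform = transforms.Compose([
    transforms.Resize(150),
    transforms.RandomCrop(128),
    transforms.ToTensor(),
])

# "train/" is a placeholder path; any ImageFolder-style dataset works here.
train_set = datasets.ImageFolder("train/", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)

def glorot_init(module):
    # Glorot (Xavier) uniform initialization for convolutional and linear weights
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = ConvNet(num_classes=len(train_set.classes))
model.apply(glorot_init)

# RMSProp with learning rate 3e-4; `alpha` is RMSProp's moving-average decay rate
optimizer = RMSprop(model.parameters(), lr=3e-4, alpha=0.999)
criterion = nn.CrossEntropyLoss()

# one training epoch shown
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```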
I evaluated as follows:
- the network's predictions were averaged over 5 windows of 128x128: 4 in the corners and one in the center of the image
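A minimal sketch of this five-window averaging (PyTorch-style, not the original code; prescaling the shorter side to 150 and averaging softmax probabilities rather than logits are my assumptions):

```python
import torch
from torchvision import transforms

# Five 128x128 windows: four corners and the center of the prescaled image.
to_tensor = transforms.ToTensor()
five_crop = transforms.Compose([
    transforms.Resize(150),
    transforms.FiveCrop(128),
    transforms.Lambda(lambda crops: torch.stack([to_tensor(c) for c in crops])),
])

@torch.no_grad()
def predict(model, pil_image):
    crops = five_crop(pil_image)                  # tensor of shape (5, 3, 128, 128)
    probs = torch.softmax(model(crops), dim=1)    # per-window class probabilities
    return probs.mean(dim=0)                      # average over the five windows
```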
Main take-home messages:
- Data augmentation is extremely important.
- RMSProp is a very robust training method that makes it easy to try different architectures.
- Similarly, the initialization proposed by Glorot et al. worked robustly for a wide range of architectures.
- Depth is great, but a lot of regularization (or batch norm) might be required to benefit from it. On the other hand, shallow networks seem to hit a performance bound that they cannot surpass.
- Batch normalization brings an overwhelming improvement to both the speed of the training and the final performance.
I think I could improve my results by incorporating more rotation, translation, scale and reflection invariance in the network architecture and/or using these transformations to obtain even more training examples. It would be especially interesting to add a spatial transformer module to the network.
P.S.
The network with full batch normalization converged much more slowly, but eventually performed better: 96.4% on the validation set. I did not measure its test performance.