Object Localization using Global Average Pooling Layers
In mid-2016, researchers at MIT demonstrated that CNNs with GAP layers (a.k.a. GAP-CNNs) that have been trained for a classification task can also be used for object localization. That is, a GAP-CNN not only tells us what object is contained in the image — it also tells us where the object is in the image, with no additional work on our part! The localization is expressed as a heat map (referred to as a class activation map), where the color-coding scheme identifies regions that are relatively important for the GAP-CNN to perform the object identification task.
The main idea is that each of the activation maps in the final layer preceding the GAP layer acts as a detector for a different pattern in the image, localized in space. To get the class activation map corresponding to an image, we need only transform these detected patterns into detected objects.
This transformation is done by noticing that each node in the GAP layer corresponds to a different activation map, and that the weights connecting the GAP layer to the final dense layer encode each activation map's contribution to the predicted object class. To obtain the class activation map, we sum the contributions of each of the detected patterns in the activation maps, where detected patterns that are more important to the predicted object class are given more weight.
How the Code Operates:
Let's examine the ResNet-50 architecture by executing the following line of code in the terminal: `python -c 'from keras.applications.resnet50 import ResNet50; ResNet50().summary()'`
The final few lines of output should appear as follows. (Notice that, unlike the VGG-16 model, the majority of the trainable parameters are not located in the fully connected layers at the top of the network!)

The Activation, AveragePooling2D, and Dense layers towards the end of the network are of the most interest to us. Note that the AveragePooling2D layer is in fact a GAP layer. We'll begin with the Activation layer. This layer contains 2048 activation maps, each with dimensions 7×7. Let fk represent the k-th activation map, where k ∈ {1, …, 2048}.
The following AveragePooling2D GAP layer reduces the size of the preceding layer to (1, 1, 2048) by taking the average of each activation map. The next Flatten layer merely flattens the input, without resulting in any change to the information contained in the previous GAP layer.
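To make the shapes concrete, here is a minimal NumPy sketch of the GAP step, using a random array as a stand-in for the real (7, 7, 2048) activation volume:

```python
import numpy as np

# Stand-in for the (7, 7, 2048) activation volume produced by the
# final Activation layer (random values, for illustration only).
feats = np.random.rand(7, 7, 2048)

# Global average pooling: collapse each 7x7 activation map to its mean,
# leaving one number per map -- a vector of length 2048.
gap = feats.mean(axis=(0, 1))

print(gap.shape)  # (2048,)
```

Flattening the (1, 1, 2048) Keras output then yields exactly this length-2048 vector.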
The object category predicted by ResNet-50 corresponds to a single node in the final Dense layer, and this single node is connected to every node in the preceding Flatten layer. Let wk represent the weight connecting the k-th node in the Flatten layer to the output node corresponding to the predicted image category.

Then, in order to obtain the class activation map, we need only compute the sum: w1⋅f1+w2⋅f2+…+w2048⋅f2048.
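This weighted sum is a one-liner in NumPy. The sketch below uses random stand-in values for the activation maps and the dense-layer weights; in practice these would come from the trained model:

```python
import numpy as np

feats = np.random.rand(7, 7, 2048)  # stand-in for the activation maps fk
w = np.random.rand(2048)            # stand-in for the weights wk of the predicted class

# Class activation map: w1*f1 + w2*f2 + ... + w2048*f2048,
# i.e. a weighted sum over the channel axis.
cam = np.tensordot(feats, w, axes=([2], [0]))

print(cam.shape)  # (7, 7)
```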
We can plot these class activation maps for any image of our choosing, to explore the localization ability of ResNet-50. In order to permit comparison to the original image, bilinear upsampling is used to resize each activation map to 224×224. (This results in a class activation map with size 224×224.)
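Because the weighted sum is linear, upsampling each map and then summing gives the same result as summing first and upsampling the 7×7 class activation map. As one illustration, here is a small self-contained bilinear upsampling routine in plain NumPy (corner-aligned interpolation; a library call such as `scipy.ndimage.zoom` would work just as well):

```python
import numpy as np

def bilinear_resize(cam, out_h, out_w):
    """Bilinearly upsample a 2-D class activation map to (out_h, out_w)."""
    in_h, in_w = cam.shape
    # Fractional coordinates in the input grid for each output pixel.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = cam[np.ix_(y0, x0)] * (1 - wx) + cam[np.ix_(y0, x1)] * wx
    bot = cam[np.ix_(y1, x0)] * (1 - wx) + cam[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

cam = np.arange(49, dtype=float).reshape(7, 7)  # stand-in 7x7 class activation map
big = bilinear_resize(cam, 224, 224)
print(big.shape)  # (224, 224)
```

The resulting 224×224 map can then be overlaid on the input image (e.g. with matplotlib's `imshow` and an `alpha` value) to visualize the localization.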
