Dilated Convolutions (Deep Learning)

Arinjoy Banerjee
3 min read · Jun 3, 2022

Dilated convolution explores the input data points over a wider area, i.e., it increases the receptive field of the convolution operation.

In simple terms, dilated convolution is just a convolution applied to the input with defined gaps between the kernel elements. Given a 2D image as input, a dilation rate of k=1 is normal convolution, k=2 skips one pixel between sampled inputs, and k=4 skips three pixels. This is easiest to see in the figures below, drawn with the same k values.

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter, which is 3×3 in this example, and the green area is the receptive field captured by each of these inputs. The receptive field is the implicit area on the initial input that is captured by each unit passed to the next layer.

Figure 1
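As a minimal illustration (assuming PyTorch; the 32×32 single-channel input and layer sizes are made up for this sketch), the snippet below applies a 3×3 convolution with dilation rates 1, 2 and 4 and prints how the effective kernel size grows while the number of weights stays at nine:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # dummy single-channel 32x32 "image"

for k in (1, 2, 4):
    conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, dilation=k)
    effective = 3 + (3 - 1) * (k - 1)  # effective kernel size grows with k
    y = conv(x)
    print(f"dilation={k}: weights={conv.weight.numel()}, "
          f"effective kernel={effective}x{effective}, output={tuple(y.shape)}")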

Dilated convolution works much like normal convolution, but it helps capture progressively more global context from the input pixels without increasing the number of parameters. It can also help keep the spatial size of the output large. The key point is that stacking dilated convolutions increases the receptive field size exponentially with the number of layers. The same idea is very common in the signal processing field.
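To see the exponential growth concretely, here is a small sketch that applies the standard receptive-field recurrence to a stack of stride-1 3×3 convolutions; the doubling schedule 1, 2, 4, 8 is illustrative, not taken from the post.

def receptive_field(dilations, kernel_size=3):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each stride-1 layer adds (k-1)*dilation
    return rf

print(receptive_field([1]))           # 3
print(receptive_field([1, 2]))        # 7
print(receptive_field([1, 2, 4]))     # 15
print(receptive_field([1, 2, 4, 8]))  # 31 -> roughly doubles with each layer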

A dilated convolution is largely the same as an ordinary convolution, except that it introduces gaps into its kernel: whereas a standard kernel slides over contiguous sections of the input, its dilated counterpart can, for instance, "encircle" a larger section of the image while still having only as many weights/inputs as the standard form.

Dilated convolution is a way of increasing the receptive field (global view) of the network exponentially while the number of parameters grows only linearly. For this reason, it finds use in applications that care about integrating knowledge of a wider context at a lower cost.

One general use is image segmentation, where each pixel is labelled with its corresponding class. In this case, the network output needs to be the same size as the input image. A straightforward way to achieve this is to apply convolutions and then add deconvolution layers to upsample; however, that introduces many more parameters to learn. Instead, dilated convolutions are applied to keep the output resolution high, which avoids the need for upsampling.
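A minimal sketch of this idea, assuming PyTorch (the channel counts and the number of classes are made up for illustration): a few dilated 3×3 convolutions with matching padding produce per-pixel class scores at the full input resolution, with no upsampling layers.

import torch
import torch.nn as nn

num_classes = 5  # hypothetical number of segmentation classes

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=1),  # per-pixel class scores
)

x = torch.randn(1, 3, 64, 64)
print(model(x).shape)  # torch.Size([1, 5, 64, 64]) -- same spatial size as input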

Dilated convolution is also applied in domains beyond vision. Good examples are the WaveNet text-to-speech model and ByteNet, which translates text in linear time. Both use dilated convolutions to capture a global view of the input with fewer parameters.
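The sketch below is not the actual WaveNet or ByteNet code, but it shows the basic building block they rely on (assuming PyTorch): 1D causal convolutions whose dilation doubles at every layer, so a short stack covers a long context while the sequence length is preserved.

import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation  # left-pad so each output only sees past samples
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        x = nn.functional.pad(x, (self.pad, 0))  # pad on the left only (causal)
        return self.conv(x)

layers = nn.Sequential(*[CausalDilatedConv1d(16, 2 ** i) for i in range(6)])
x = torch.randn(1, 16, 1000)  # dummy 16-channel sequence of length 1000
print(layers(x).shape)        # torch.Size([1, 16, 1000]) -- length preserved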

Advantages of Dilated Convolution:

Larger receptive field (i.e., no loss of coverage)

Computationally efficient (as it provides larger coverage at the same computational cost)

Lower memory consumption (as it skips the pooling step)

No loss of resolution of the output image (as we dilate instead of performing pooling)

The structure of this convolution helps maintain the order of the data.

Example: just move each red block k-1 units farther away from the centre (if the dilation rate = k) and fill the empty slots with zeros.
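This zero-filling construction can be checked numerically; the sketch below (assuming PyTorch) compares a dilation-2 convolution against an ordinary convolution whose 5×5 kernel is the original 3×3 kernel with zeros inserted between the weights.

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)

# Build a 5x5 kernel by spreading the 3x3 weights out and filling gaps with 0
w_dilated = torch.zeros(1, 1, 5, 5)
w_dilated[:, :, ::2, ::2] = w  # for dilation rate k=2, weights sit 2 apart

out_a = F.conv2d(x, w, dilation=2)          # dilated convolution
out_b = F.conv2d(x, w_dilated, dilation=1)  # plain convolution, zero-filled kernel
print(torch.allclose(out_a, out_b))         # True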

Max pooling, by comparison, is intended as a form of downsampling and dimensionality reduction, and it accounts for a major part of a network's receptive-field growth.

The major difference is that max pooling reduces the spatial dimensions, which dilated convolutions do not necessarily do.
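A quick shape comparison (assuming PyTorch, with an arbitrary 8-channel 32×32 input) makes the difference visible:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

pooled  = nn.MaxPool2d(kernel_size=2)(x)
dilated = nn.Conv2d(8, 8, kernel_size=3, dilation=2, padding=2)(x)

print(pooled.shape)   # torch.Size([1, 8, 16, 16]) -- downsampled
print(dilated.shape)  # torch.Size([1, 8, 32, 32]) -- resolution preserved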

Dilated convolutions are somewhat constrained in their design, since the range of oscillation wavelengths they can realistically capture is limited across the spectrum.
