[Figure: network architecture diagram. Labels recovered from the figure:
Input Image: 416 * 416 * 3
Backbone layer types: Conv 3*3/1, Conv 1*1/1, Maxpool 2*2/2, Maxpool 2*2/1
Other operations: Upsample, concat
Feature-map sizes: 416 * 416 * 16, 208 * 208 * 16, 208 * 208 * 32, 104 * 104 * 32, 104 * 104 * 64, 52 * 52 * 64, 52 * 52 * 128, 26 * 26 * 128, 26 * 26 * 256, 26 * 26 * 384, 26 * 26 * 512, 13 * 13 * 128, 13 * 13 * 256, 13 * 13 * 512, 13 * 13 * 1024
Predict1-Output: 13 * 13 * (num_classes+1+4) * num_anchors
Predict2-Output: 26 * 26 * (num_classes+1+4) * num_anchors]
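The shape arithmetic behind the figure labels can be checked with a short sketch. It is only an illustration under stated assumptions: the helper names `pool_out` and `head_channels` are hypothetical, "same"-style padding is assumed (so that Maxpool 2*2/1 keeps the 13 * 13 size shown in the figure), the Upsample step is assumed to double the spatial size, and the values `num_classes=80` and `num_anchors=3` are example inputs that do not come from the figure.

```python
def pool_out(size, stride):
    """Spatial size after a max-pool, assuming 'same'-style padding.

    With that assumption only the stride changes the size (ceiling division).
    """
    return -(-size // stride)


def head_channels(num_classes, num_anchors):
    """Per-cell output depth: (num_classes + 1 + 4) * num_anchors, as labelled."""
    return (num_classes + 1 + 4) * num_anchors


# Five stride-2 max-pools take the 416-pixel input down to the 13 * 13 grid.
size = 416
sizes = [size]
for _ in range(5):
    size = pool_out(size, 2)
    sizes.append(size)

# The stride-1 max-pool (Maxpool 2*2/1) leaves the 13 * 13 grid unchanged.
size_after_stride1 = pool_out(13, 1)

# Assumed 2x upsample: 13 * 13 becomes 26 * 26, matching the second output grid.
upsampled = 13 * 2

# Presumably the concat joins the 128-channel and 256-channel 26 * 26 maps.
concat_channels = 128 + 256

# Example values only (not taken from the figure).
example_depth = head_channels(num_classes=80, num_anchors=3)
```

The intermediate sizes produced this way (208, 104, 52, 26, 13) match the spatial sizes labelled in the figure, and the concatenated depth matches the 26 * 26 * 384 label.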