Why we chose the EfficientDet family of models and which compound coefficient ϕ you should
choose
The aim of our object detector is to be as accurate as possible while still running in real time.
Traditionally, these two goals trade off against each other: gains in accuracy tend to cost speed,
and vice versa. The EfficientDet family aims to build a scalable detection architecture with high
accuracy and a reasonable computational footprint across a wide spectrum of resource constraints,
ranging from 3B to 300B FLOPs.
The EfficientDet paper makes three major contributions:
BiFPN: A weighted bidirectional feature network for easy and fast multi-scale feature fusion.
Compound scaling: A new method that jointly scales up the backbone, feature network, box/class
network, and resolution.
EfficientDet: A new family of detectors with significantly better accuracy and efficiency
across a wide spectrum of resource constraints.
The model addresses the problem of efficient multi-scale feature fusion. Feature Pyramid
Networks (FPNs) are the usual tool for this, but they sum their input features without
distinction. Because input features at different resolutions do not contribute equally to the
output features, EfficientDet detectors propose a new fusion strategy that learns a weight for
each input.
Additionally, model scaling is a well-known strategy for improving accuracy in object detection
models, typically by increasing the size of the backbone. Similar to compound scaling in
EfficientNets, EfficientDets propose a compound scaling coefficient that jointly scales up the
resolution, depth, and width of the backbone, feature network, and box/class prediction network.
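To give a feel for what choosing ϕ means in practice, the sketch below evaluates the scaling rules as stated in the EfficientDet paper: BiFPN width 64 · 1.35^ϕ, BiFPN depth 3 + ϕ, box/class depth 3 + ⌊ϕ/3⌋, and input resolution 512 + 128 · ϕ. The function name and dictionary keys are made up for this illustration, and the values are nominal: the configurations published in the paper round the widths and do not follow the formulas exactly for every model, so treat the output as a guide rather than the released settings.

```python
def efficientdet_scaling(phi):
    """Nominal EfficientDet dimensions for a compound coefficient phi.

    Hypothetical helper for illustration only; published model configs
    round these values and deviate for some of the larger models.
    """
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return {
        "bifpn_width": 64 * (1.35 ** phi),    # BiFPN channels, before rounding
        "bifpn_depth": 3 + phi,               # number of BiFPN layers
        "head_depth": 3 + phi // 3,           # layers in the box/class networks
        "input_resolution": 512 + 128 * phi,  # side of the square input, pixels
    }


if __name__ == "__main__":
    for phi in range(4):
        print(phi, efficientdet_scaling(phi))
```

A larger ϕ grows every dimension at once, so compute cost rises quickly; for a real-time budget it is usually worth starting from a small ϕ and increasing it only while the frame rate stays acceptable.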
FPNs were introduced to detect objects at multiple scales, but they are computationally
expensive. CNNs, on the other hand, form an inherent hierarchical pyramid structure but lack
representational capacity because their high-resolution maps carry only low-level semantic
features. FPNs overcome this by using a bottom-up and a top-down pathway. High-level features are
upsampled first and then combined with low-level features through a lateral connection.
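The top-down step described above can be sketched on toy single-channel feature maps stored as nested lists. This is a minimal illustration, not an FPN implementation: the function names are hypothetical, nearest-neighbour upsampling is assumed, and the 1x1 convolution that a real lateral connection applies to the low-level map is left out.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D list by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat along width
        out.append(wide)
        out.append(list(wide))                     # repeat along height
    return out


def top_down_merge(low_level, high_level):
    """Upsample the coarse map, then add it element-wise to the finer map.

    The lateral connection is simplified to an identity here (assumption).
    """
    up = upsample_nearest_2x(high_level)
    return [[a + b for a, b in zip(r_low, r_up)]
            for r_low, r_up in zip(low_level, up)]


# Toy maps: a 1x1 high-level map and a 2x2 low-level map.
p5 = [[10.0]]
c4 = [[1.0, 2.0], [3.0, 4.0]]
p4 = top_down_merge(c4, p5)
print(p4)
```

Every position in the finer map receives the same coarse value, which is how semantically strong but spatially coarse information reaches the high-resolution levels.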
The BiFPN architecture learns weights while fusing feature maps of different scales, using one of
three schemes: unbounded fusion, softmax-based fusion, or fast normalized fusion. The backbone
networks are ImageNet-pretrained EfficientNets. The authors proposed a new compound scaling
method for object detection, which uses a simple compound coefficient ϕ to jointly scale up all
dimensions of the backbone network, BiFPN network, class/box network, and resolution.
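To make the three fusion schemes named above concrete, here is a small sketch that fuses scalar stand-ins for feature maps that have already been resized to a common resolution. The function names are hypothetical; the formulas follow the paper's description, with fast normalized fusion applying a ReLU to keep each weight non-negative and dividing by the sum of the weights plus a small ε (0.0001 in the paper). In a real model the weights would be learned parameters and the inputs would be tensors.

```python
import math


def unbounded_fusion(inputs, weights):
    """Plain weighted sum; the weights are not constrained in any way."""
    return sum(w * x for w, x in zip(weights, inputs))


def softmax_fusion(inputs, weights):
    """Weights pass through a softmax, so they are positive and sum to 1."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """ReLU keeps weights non-negative; dividing by their sum normalizes them."""
    relu = [max(w, 0.0) for w in weights]
    total = sum(relu) + eps  # eps avoids division by zero
    return sum(w / total * x for w, x in zip(relu, inputs))


if __name__ == "__main__":
    feats = [2.0, 4.0]
    print(unbounded_fusion(feats, [0.5, 2.0]))
    print(softmax_fusion(feats, [0.0, 0.0]))
    print(fast_normalized_fusion(feats, [1.0, 1.0]))
```

The fast normalized variant gives weights in roughly the same 0 to 1 range as the softmax version while avoiding the exponentials, which is the efficiency argument the paper makes for it.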
Click here to view an excellent blog post with more details on the paper, and click here to view
the paper itself.