
EfficientNet is a convolutional neural network architecture designed to achieve state-of-the-art performance while minimizing computational resources. The paper introduces a compound scaling method that balances model depth, width, and resolution, enabling efficient scaling of convolutional networks. By optimizing the trade-off between model size and accuracy, the network is built from a mobile-sized baseline yet achieves higher performance with significantly fewer parameters.

EfficientNet marked a significant breakthrough in computer vision and deep learning, attaining state-of-the-art performance across various image classification benchmarks while outperforming its predecessors in computational efficiency, which amplified its impact on the field. Since its introduction, EfficientNet has been widely adopted in numerous real-world applications.

Literature Review

Authors and Publication Date:

Context in the Timeline:

Figure: Timeline of CNN architectures [4]

Scaling Strategies:

Grid Search for Efficient Scaling:

Advantages of Efficient Scaling:

Biography

| Mingxing Tan | Quoc V. Le |
| --- | --- |
| Staff Software Engineer at Google Brain [1] | Research Scientist at Google Brain |
| Postdoc at Cornell University | PhD at Stanford |
| PhD at Peking University | Bachelor’s degree in Computer Science at The Australian National University |
|  | Was a researcher at NICTA and the Max Planck Institute |

The paper’s two authors were at Google Brain at the time of publication.

Mingxing Tan received his PhD from Peking University and was a postdoctoral researcher at Cornell University. Quoc V. Le finished his Bachelor’s in Computer Science at The Australian National University and his PhD at Stanford [2]. Le also previously worked as a researcher at NICTA and the Max Planck Institute.

Visual Explanation (Diagrammer Section)

In this section, we look at a graph plotting accuracy against parameter count for various models compared with the EfficientNet models B0-B7.

Figure: Model size vs. ImageNet accuracy [0]

The graph above shows the results of running the models on the ImageNet dataset, where EfficientNet outperforms all the other models despite having significantly fewer parameters. EfficientNet-B7 achieved a new state of the art with 84.4% top-1 accuracy, outperforming the previous SOTA, GPipe, while being 8.4 times smaller and 6.1 times faster.

How does this happen?

It happens through the two techniques explored in this paper: compound scaling and neural architecture search.

Rethinking Model Scaling for Convolutional Neural Networks

Compound Scaling

Before EfficientNets, ConvNets were typically scaled up by increasing only one dimension - depth, width or image resolution. EfficientNets introduced a compound scaling approach that uniformly scales all three dimensions - depth, width and input resolution - using fixed ratios to balance between them. This allows the network to capture more fine-grained patterns as the input image gets bigger. Compound scaling sets EfficientNets apart from prior works and performs better than arbitrary scaling, as seen in results on MobileNets and ResNets in the figure below. Intuitively, bigger input size needs more layers and channels to capture details. So compound scaling works by scaling up depth, width and resolution together in a principled way.

Figure: Compound scaling compared with single-dimension scaling on MobileNets and ResNets [0]

The experiments with compound scaling reveal an important insight: balancing all dimensions is key to maximizing accuracy under computational constraints. As the figure below shows, the highest gains come from increasing depth, input resolution, and width together. When resolution is higher at 299x299 (r=1.3) versus 224x224 (r=1.0), scaling width leads to a much greater accuracy improvement on a deeper base network (d=2.0), for the same FLOPS cost. Simply put, pushing only one dimension hits a scaling limit. With a bigger input image, more layers are needed to process the additional detail, and more channels to capture the richer patterns. The authors succinctly state it as: “In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.”

Figure: Scaling network width for base networks with different depth (d) and resolution (r) [0]
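Before moving on, here is a small sketch of what scaling depth and width actually does to the baseline network. The helper names and the round-to-multiple-of-8 convention follow common EfficientNet implementations rather than the paper itself:

```python
import math

def round_repeats(repeats, depth_multiplier):
    # Scale a block's layer count by the depth multiplier, rounding up.
    return int(math.ceil(depth_multiplier * repeats))

def round_filters(filters, width_multiplier, divisor=8):
    # Scale a block's channel count by the width multiplier, rounding to a
    # multiple of `divisor` (hardware-friendly) without dropping below ~90%
    # of the scaled value.
    filters *= width_multiplier
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:
        new_filters += divisor
    return int(new_filters)

# Example: scaling a block that has 3 layers and 40 channels.
print(round_repeats(3, depth_multiplier=1.2))    # 4 layers
print(round_filters(40, width_multiplier=1.1))   # 48 channels
```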

Neural Architecture Search (NAS)

The authors went beyond simply applying compound scaling to existing architectures. They recognized that developing a new baseline model, optimized specifically for mobile applications, would better showcase the power of compound scaling. This led them to create the EfficientNet architecture using neural architecture search - identifying an efficient baseline model tailored for mobile devices rather than arbitrarily picking an off-the-shelf ConvNet. Starting from this specialized foundation allowed them to scale up EfficientNet most effectively. Neural Architecture Search (NAS) is an automated process to design neural network architectures. It iterates over the space of possible architectures, training child models and using their performance as a signal to guide the search towards increasingly better designs specialized for the problem at hand.
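To make the search loop concrete, here is a deliberately simplified sketch. It stands in for the real controller (an RNN trained with reinforcement learning) with uniform random sampling, and both the search space and the `train_and_evaluate` helper are hypothetical:

```python
import random

# Hypothetical search space: each knob lists the choices for a network block.
SEARCH_SPACE = {
    "kernel_size": [3, 5],
    "expansion_ratio": [1, 3, 6],
    "num_layers": [1, 2, 3, 4],
}

def train_and_evaluate(arch):
    """Placeholder: train a child model built from `arch` and return its
    reward (e.g. accuracy traded off against latency or FLOPS)."""
    raise NotImplementedError

def naive_architecture_search(iterations=100):
    """Sample architectures, evaluate them, and keep the best one. A real
    NAS controller would instead use each reward as a signal to update its
    sampling policy, steering the search toward better designs."""
    best_arch, best_reward = None, float("-inf")
    for _ in range(iterations):
        arch = {knob: random.choice(choices)
                for knob, choices in SEARCH_SPACE.items()}
        reward = train_and_evaluate(arch)
        if reward > best_reward:
            best_arch, best_reward = arch, reward
    return best_arch
```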

MnasNet Approach

Figure: Overview of the MnasNet search pipeline [3]

MnasNet uses both model accuracy and latency in its objective function, penalizing overly large latency. The pipeline allows the controller to define multiple blocks of the neural network, each containing different hyper-parameters. A reinforcement-learning-based method is then employed to discover model architectures that effectively optimize the given objective function. In each iteration, the controller generates a set of models, sampling them through its RNN by predicting a sequence of tokens.

$$\text{maximize} \quad ACC(m) \times \left[\frac{LAT(m)}{T}\right]^{w}$$

where $LAT(m)$ is the model’s measured latency and $T$ is the target latency [3].

Neural Architecture Search for EfficientNets

A similar approach to MnasNet was used to create EfficientNet-B0. While MnasNet uses actual latency measured on a mobile device, EfficientNet is not bound to a single hardware target and therefore uses FLOPS instead. The pipeline follows the same process as MnasNet, with the objective of maximizing $ACC(m) \times [FLOPS(m)/T]^{w}$.
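As a minimal illustration of this objective, the sketch below ranks candidate models by that reward. The function name is hypothetical, `accuracy` and `flops` are assumed to be measured during the search, and T = 400M FLOPS with w = -0.07 are the values reported in the paper [0]:

```python
def search_reward(accuracy, flops, target_flops=400e6, w=-0.07):
    """Multi-objective reward ACC(m) * (FLOPS(m) / T)^w used to rank
    candidate architectures: since w is negative, models above the FLOPS
    target are penalized and models below it are mildly rewarded."""
    return accuracy * (flops / target_flops) ** w

# Example: a more accurate but heavier model vs. a lighter one.
print(search_reward(0.82, 800e6))  # ≈ 0.781
print(search_reward(0.80, 400e6))  # 0.8
```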

$$\max_{d,\,w,\,r} \;\; \text{Accuracy}\big(\mathcal{N}(d, w, r)\big)$$

$$\text{s.t.} \quad \mathcal{N}(d, w, r) = \bigodot_{i=1 \dots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\Big(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\Big)$$

$$\text{Memory}(\mathcal{N}) \le \text{target\_memory}, \qquad \text{FLOPS}(\mathcal{N}) \le \text{target\_flops}$$

where $w, d, r$ are coefficients for scaling network width, depth, and resolution, and $\hat{\mathcal{F}}_i$, $\hat{L}_i$, $\hat{H}_i$, $\hat{W}_i$, $\hat{C}_i$ are predefined parameters of the baseline network [0].

Scaling EfficientNet-B0 to get B1-B7

Now that we understand how the EfficientNet architecture makes use of compound scaling and NAS, in this section we look at how the authors scaled EfficientNet-B0 to get B1-B7.

$$\text{depth: } d = \alpha^{\phi} \qquad \text{width: } w = \beta^{\phi} \qquad \text{resolution: } r = \gamma^{\phi}$$

$$\text{s.t.} \quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1 \qquad [0]$$

The authors define the values above, where φ is a user-defined coefficient that determines how much extra resources are available, and α, β, γ can be determined using a small grid search. The network’s depth, width, and input resolution can then be scaled accordingly to obtain a bigger network.
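As a minimal sketch of how these multipliers grow with φ, using the α, β, γ values the paper reports for EfficientNet-B0 (found via the grid search described next) [0]:

```python
def compound_multipliers(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width and resolution multipliers for resource coefficient phi,
    per d = alpha**phi, w = beta**phi, r = gamma**phi. Since
    alpha * beta**2 * gamma**2 ~= 2, total FLOPS grows roughly by 2**phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# Example: phi = 1 corresponds to roughly twice the baseline FLOPS.
d, w, r = compound_multipliers(1)
print(d, w, r)  # 1.2 1.1 1.15
```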

As mentioned in the paper, starting from the baseline EfficientNet-B0, the compound scaling method scales it up in two steps [0]:

- STEP 1: first fix φ = 1, assuming twice more resources are available, and do a small grid search of α, β, γ based on the equations above. The paper finds the best values for EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15, under the constraint α · β² · γ² ≈ 2 (see the sketch below).
- STEP 2: then fix α, β, γ as constants and scale up the baseline network with different φ, to obtain EfficientNet-B1 through B7.
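Here is a minimal sketch of STEP 1, assuming a hypothetical `evaluate(alpha, beta, gamma)` that trains the scaled model at φ = 1 and returns its validation accuracy:

```python
import itertools

def grid_search_coefficients(evaluate, step=0.05, tolerance=0.1):
    """STEP 1 sketch: with phi fixed to 1 (about twice the baseline FLOPS),
    grid-search alpha, beta, gamma subject to alpha * beta**2 * gamma**2 ~= 2."""
    candidates = [round(1.0 + step * i, 2) for i in range(11)]  # 1.00 .. 1.50
    best, best_acc = None, float("-inf")
    for alpha, beta, gamma in itertools.product(candidates, repeat=3):
        if abs(alpha * beta**2 * gamma**2 - 2.0) > tolerance:
            continue  # enforce the FLOPS constraint
        acc = evaluate(alpha, beta, gamma)
        if acc > best_acc:
            best, best_acc = (alpha, beta, gamma), acc
    return best  # the paper reports (1.2, 1.1, 1.15) for EfficientNet-B0
```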

Social Impact

EfficientNet is considered a small network suitable for embedding in mobile devices. In the current smartphone era, mobile computer vision has become pervasive, fundamentally reshaping the mobile phone market with a strong emphasis on AI and ML. EfficientNet, like all mobile computer vision technologies, brings two-sided impacts.

On the positive side, the mobile-sized baseline requirements and streamlined parameters in network construction enhance widespread accessibility and resource efficiency. It facilitates the deployment of AI applications across a diverse array of mobile devices, fostering an upsurge in the prevalence and accessibility of mobile-based AI applications.

The negative impacts of these advancements are manifold. The first is the concern of inaccuracy, especially when humans rely heavily on the outcomes of these models in sensitive fields like healthcare. Furthermore, legal consent becomes a critical issue, extending beyond widely used public datasets like ImageNet; addressing these legal intricacies seems imperative before such applications are widely deployed. Additionally, concerns surrounding fraud, bias, and ethical consent further underscore the intricate moral landscape accompanying the rapid evolution of mobile computer vision.


Industrial Applications

Figure: AR monocle [5]

Follow-on Research

EfficientNet’s main drawback is its training speed. Several points have been raised suggesting that EfficientNet could be even more efficient.

Finding new ways to search for and scale the network could make EfficientNet both faster and more efficient.

Review by Pichaya Saittagaroon

Overall:

8: Strong Accept. High-impact paper, with no major concerns with respect to evaluation, resources, reproducibility, or ethical considerations.

Strengths and Weaknesses:

Review by Ashwin Sharan

Overall:

Technically strong paper, with novel ideas, excellent impact on the field of computer vision in terms of classification, excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations. (Score 8/10)

Strengths and Weaknesses:

References

[0] Tan, M. and Le, Q. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. https://arxiv.org/abs/1905.11946

[1] APPM Department Colloquium - Mingxing Tan. https://www.colorado.edu/amath/2021/02/26/appm-department-colloquium-mingxing-tan

[2] Quoc V. Le. https://cs.stanford.edu/~quocle/

[3] MnasNet: Towards Automating the Design of Mobile Machine Learning Models. https://blog.research.google/2018/08/mnasnet-towards-automating-design-of.html

[4] Wrapping Up CNN Models: Shifting Focus to Attention-Based Architectures. https://blog.gopenai.com/wrapping-up-cnn-models-shifting-focus-to-attention-based-architectures-5029b87034d9

[5] Image of AR monocle. https://i.redd.it/h4xwqzlwemo91.png

Team Member