A Shallow Object Detection Introduction
This article started as a record of what I have read and surveyed about Object Detection. Later I found that it would be much better as a beginner-oriented introduction, so here it is.
I will try to stay as shallow, friendly and organized as I can, so that it won't scare beginners off or leave them lost in terminology and formulas.
Btw, I'm not a native English speaker.
Generally speaking, it's the task of locating certain objects and telling what each object is, e.g., Where's Wally. Humans can do this so effortlessly that you may not notice how hard it actually is, not to mention designing a program / algorithm to do it.
Therefore, we use Machine Learning, especially Neural Networks nowadays, to solve this task. Rather than designing an algorithm that detects objects directly, we design a program that can learn how to detect objects.
e.g., using a Neural Network to find out Where's Wally / Waldo, like this and this.
If you want to go through the history of Object Detection from the very beginning, check out Lecture 1 of Stanford CS231n. It's a very nice course for those who want to study computer vision. However, this course is not designed for beginners, so if you are pretty new to this topic, taking a Machine Learning course on a MOOC platform first may be much more helpful.
Usually, you will see Object Detection alongside keywords like Machine Learning, Computer Vision, Image Classification, Neural Network, CNN and so on. To clarify this, I will list the ones I've heard of and explain the relationships between them.
Keep in mind that my shallow explanations may not be comprehensive, because every term I'm going to mention could be a topic that takes you months to study.
If you still want more detail, try a Google search.
I split this section into 2 parts: Before Neural Network and Neural Network. In the first part, I will briefly introduce some methods that were either common or state-of-the-art in their time. In the next part, I will introduce Neural Networks, which are actually the reason I started writing this.
As far as I know, most traditional methods from around 2006 - 2010 fit into a three-step flow. See the figure below.
To put it simply, this is what each step actually does:
Here are some keywords for traditional methods. I only list a few because I know very little about them.
Generally, NN-based methods can be classified into 2 classes. One is often called Two-Stage Detectors because of the way they approach the task: they first find regions of the image that are likely to contain an object, and then try to tell what kind of object each one is. The other is One-Stage Detectors, which attempt to solve both problems together.
Some also say that the major difference between the two kinds is how they frame the problem: Two-Stage Detectors treat Object Detection as a classification problem, while One-Stage Detectors treat it as a regression problem. They also make different trade-offs between accuracy and speed: Two-Stage Detectors are usually more accurate and One-Stage Detectors are usually faster.
Here is the paper list for both kinds of detectors. I only list those I've heard of, so if I missed something important, please remind me.
Also, I found this tree graph pretty useful for understanding the state of this research domain. However, it shows that the last update was on 12/31/17, so it might be a little outdated.
Besides the Object Detection framework, each method mentioned above must also be combined with a Backbone Network (some call it a Feature Extractor, which is the same role as in the three-step flow I mentioned above) to function. Different Backbone Networks imply different structures, different achievable performance and different computing power requirements. There have been a lot of classic neural network structures over the years, and here are the most common ones.
I only link the original paper because each structure I mention has many variants. You may see something like ResNet-101, VGG-16, MobileNet v1 and so on. Some suffixes just mean that a different parameter is used (for example, the number of layers), while others mark a significant breakthrough. Just be aware that network structure is a very popular research topic, so these structures are improved all the time.
Generally, COCO is the most difficult one because of its large number of small targets.
To compare detection performance between methods, two metrics are most commonly used. One is Intersection over Union (IoU), which measures how well a predicted box overlaps the ground-truth box, and the other is mean Average Precision (mAP).
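To make IoU a bit more concrete, here is a minimal sketch in Python. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max), which is my own choice for illustration; datasets and libraries use different box formats, so treat the function name and layout as assumptions rather than a standard API.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Assumption: each box is (x_min, y_min, x_max, y_max) with
    x_min <= x_max and y_min <= y_max.
    """
    # Corners of the overlapping rectangle (if there is one).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to 0 when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    # Two zero-area boxes have no meaningful overlap; return 0 by convention.
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predicted, ground_truth))  # overlap area 1, union area 7
```

An IoU of 1 means the two boxes are identical and 0 means they do not overlap at all. In practice a prediction is usually counted as correct only when its IoU with a ground-truth box is above some threshold (0.5 is a common choice).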
Besides the two mentioned above, inference time / fps is also a very important metric, because in many cases we would like to deploy the object detector to mobile devices, where the computing power is much lower than on a normal PC.
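As a rough idea of how inference time and fps can be measured, here is a small sketch. The `detect` argument is a hypothetical stand-in for whatever detector you are testing (it is not a real library call), and `measure_fps` is a name I made up. Real benchmarks also need to fix the hardware, input size and batch size, so take this as an illustration only.

```python
import time


def measure_fps(detect, images, warmup=2):
    """Return (seconds per image, frames per second) for `detect`.

    `detect` is any callable that takes one image; it is a placeholder
    for a real detector. `images` is a non-empty list of inputs.
    """
    if not images:
        raise ValueError("need at least one image to measure")

    # Warm-up runs are not timed, since the first calls are often slower.
    for image in images[:warmup]:
        detect(image)

    start = time.perf_counter()
    for image in images:
        detect(image)
    elapsed = time.perf_counter() - start

    seconds_per_image = elapsed / len(images)
    # Guard against a timer reading of exactly zero on very fast calls.
    fps = 1.0 / seconds_per_image if seconds_per_image > 0 else float("inf")
    return seconds_per_image, fps


def fake_detector(image):
    # Dummy workload standing in for a real model's forward pass.
    return sum(range(10_000))


latency, fps = measure_fps(fake_detector, images=list(range(50)))
print(f"{latency * 1000:.3f} ms per image, {fps:.1f} fps")
```

The two numbers say the same thing in different units: fps is just the inverse of the average time per image.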