Object detection using YOLO (You Only Look Once) is a popular and efficient deep learning algorithm. YOLO is a real-time object detection system that can detect multiple objects in an image or video stream. The main advantage of YOLO is that it performs object detection in a single pass through the neural network, making it much faster compared to other methods that use multiple region proposals.
- Architecture: YOLO uses a deep convolutional neural network (CNN) to perform object detection. YOLO versions are named YOLOv1, YOLOv2, and so on up to YOLOv8. Each version has improvements over the previous one in terms of accuracy and speed.
- Grid Cell: YOLO divides the input image into an S x S grid. Each cell in the grid predicts multiple bounding boxes and their associated class probabilities.
- Bounding Boxes Prediction: For each grid cell, YOLO predicts bounding boxes (usually 2 or 3) along with their confidence scores. The confidence score represents how confident the model is that the bounding box contains an object.
- Class Prediction: In addition to predicting bounding boxes, YOLO also predicts the probability of each class (object category) within the bounding boxes. Each box is associated with a class probability for each class.
- Non-Maximum Suppression: After predictions are made, a post-processing step called non-maximum suppression (NMS) is applied. NMS eliminates duplicate detections by selecting the most confident bounding box among overlapping boxes with the same class.
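To make the idea behind NMS concrete, here is a minimal, illustrative sketch in plain Python/NumPy. It is not the code used later in this article (the implementation below relies on OpenCV's built-in cv2.dnn.NMSBoxes), and the helper names iou and simple_nms are purely for illustration.

import numpy as np

def iou(box_a, box_b):
    # Boxes are (x, y, w, h); compute the Intersection over Union of two boxes.
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def simple_nms(boxes, scores, iou_threshold=0.3):
    # Repeatedly keep the most confident remaining box and discard any box
    # that overlaps it by more than the IoU threshold.
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep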
Steps to implement in Python using Colab
In this article, we’ll concentrate on using YOLO with OpenCV. This is the easiest way for beginners to get the algorithm working quickly, without any complicated installation.
Step 1: Import the library
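The import cell is not reproduced in the article text; a minimal version, assuming only OpenCV and NumPy are needed for the steps below, could look like this:

import cv2          # OpenCV, including the cv2.dnn module used to run YOLO
import numpy as np  # array handling (random colors, box scaling, argmax)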
Step 2: Download the files
To run the algorithm, we need three files (download them from here):
- yolov3.weights: The weight file is the trained model and the core of the algorithm used to recognize objects.
- yolov3.cfg: The configuration (cfg) file contains all of the settings for the algorithm.
- coco.names: The names file contains the names of the objects that the algorithm can detect.
Upload these three files to the Colab folder.
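As an alternative to manual uploading, the three files can also be fetched directly inside a Colab cell. The URLs below point at the official Darknet sources; they are an assumption of this sketch and may change over time.

# Fetch the weights, config, and class-name files into the Colab workspace.
!wget https://pjreddie.com/media/files/yolov3.weights
!wget https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg
!wget https://raw.githubusercontent.com/pjreddie/darknet/master/data/coco.names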
Step 3: Mount the Drive
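The mount itself is the standard Colab helper call:

from google.colab import drive

# Mount Google Drive at /content/drive so files stored there can be read.
drive.mount('/content/drive')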
Step 4: Read the class labels and assign a different color to each label
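This cell is not shown in the text; a minimal sketch, assuming coco.names is in the working directory and using the LABELS and COLORS names referenced later in the article:

# Read the class labels (one per line) from coco.names.
LABELS = open("coco.names").read().strip().split("\n")

# Assign a random color to each label so every class is drawn in a distinct color.
np.random.seed(42)  # seed is optional, only for reproducible colors
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3), dtype="uint8")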
Step 5: Load the YOLO
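A minimal sketch of this step, assuming the cfg and weights files from Step 2 are in the working directory, uses OpenCV's DNN module:

# Load the YOLOv3 network from the Darknet config and pre-trained weights.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")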
Step 6: Get the output layer names of the neural network
We use the Darknet framework here. Darknet is a popular open-source neural network framework used for training and running deep neural networks, and it is best known for its support of YOLO (You Only Look Once).
Let’s break down the code:
- ln = net.getLayerNames(): This line retrieves the names of all layers in the neural network and stores them in the variable ln.
- ln = [ln[i - 1] for i in net.getUnconnectedOutLayers()]: This line gets the indices of the unconnected output layers and uses them to build a list containing the names of those output layers. The unconnected output layers are the layers from which the final predictions are obtained; the net.getUnconnectedOutLayers() function returns their indices.
Step 7: Load the input image
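A minimal sketch of this step; the filename test.jpg is just a placeholder for whatever image you upload. The width W and height H are recorded here because Step 10 uses them to rescale the predicted boxes:

# Load the input image from disk and record its dimensions.
image = cv2.imread("test.jpg")  # placeholder filename
(H, W) = image.shape[:2]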
Step 8: Construct a blob and perform a forward pass
The term “blob” is often used as a shorthand for “binary large object,” but in the context of deep learning frameworks like Darknet or other libraries, it does not specifically refer to binary data.
A blob is used to extract features from the image and resize it. YOLO accepts three input sizes:
- 320×320: smaller, so lower accuracy but better speed
- 608×608: bigger, so higher accuracy but slower speed
- 416×416: in the middle, giving a bit of both
1. cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False): This function preprocesses the input image and creates a blob suitable for input into a neural network. It takes the following arguments:
- image: The input image that you want to preprocess and feed into the neural network.
- 1 / 255.0: A scale factor used to normalize the pixel values of the image. In this case, each pixel value is divided by 255 to bring it into the range [0, 1].
- (416, 416): The size to which the image will be resized. This is typically the input size required by the neural network.
- swapRB=True: This flag indicates whether to swap the Red and Blue channels. In OpenCV, images are loaded in BGR format by default, but some neural networks expect RGB format. Setting this flag to True performs the channel swap.
- crop=False: This flag determines whether to crop the image or resize it to the specified size while maintaining the original aspect ratio. Here, crop is set to False, which means the image is resized without cropping.
2. net.setInput(blob): Once the blob is created from the image, this line sets it as the input to the neural network (net). The network can then process the preprocessed image and perform inference (a forward pass) to generate predictions.
3. layerOutputs = net.forward(ln): This line executes the forward pass of the neural network with the provided input (blob) and obtains the output of the specified layers (ln).
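Putting those three lines together, Step 8 as a single cell looks like this:

# Preprocess the image into a blob, feed it to the network, and run a forward pass.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
layerOutputs = net.forward(ln)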
Step 9: Initialize the variables
Initialize our lists of detected bounding boxes, confidences, and class IDs, respectively
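In code, this is simply three empty lists:

boxes = []        # bounding boxes as [x, y, width, height]
confidences = []  # confidence score of each detection
classIDs = []     # predicted class index of each detection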
Step 10: Identify the objects in the image
This step processes the detections obtained from each output layer and filters them by confidence, retaining only detections with a confidence greater than a specified threshold (in this case, 0.5). It extracts the bounding box coordinates, confidence scores, and class IDs for the detected objects and organizes them into separate lists (boxes, confidences, and classIDs). A consolidated version of the loop appears after the breakdown below.
Let’s go through the code step-by-step:
- for output in layerOutputs: This loop iterates over each output obtained from the neural network's output layers. Each output represents the detections made by a specific output layer.
- for detection in output: This nested loop iterates over each detection within the output. Each detection contains information about a single detected object, including bounding box coordinates and class probabilities.
- scores = detection[5:]: This extracts the class probabilities (confidence scores) from the detection. The first five elements of detection correspond to the bounding box coordinates and the objectness score.
- classID = np.argmax(scores): This line finds the index of the class with the highest confidence score (the class with the highest probability).
- confidence = scores[classID]: This retrieves the confidence score for the predicted class.
- if confidence > 0.5: This line checks whether the confidence score of the detected object is greater than the specified threshold (0.5 in this case). If the confidence is above the threshold, the detection is considered valid and its information is extracted.
- box = detection[0:4] * np.array([W, H, W, H]): This line extracts the bounding box coordinates (centerX, centerY, width, height) from the detection and scales them back to the original image dimensions (W and H are the width and height of the original image, respectively).
- (centerX, centerY, width, height) = box.astype("int"): The bounding box coordinates are converted to integers.
- x = int(centerX - (width / 2)): Calculate the top-left x-coordinate of the bounding box.
- y = int(centerY - (height / 2)): Calculate the top-left y-coordinate of the bounding box.
- boxes.append([x, y, int(width), int(height)]): Add the extracted bounding box coordinates to the boxes list.
- confidences.append(float(confidence)): Add the confidence score to the confidences list.
- classIDs.append(classID): Add the class ID to the classIDs list.
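For reference, here is the whole Step 10 loop described above, assembled into one block:

# Loop over every output layer and every detection, keeping only confident ones.
for output in layerOutputs:
    for detection in output:
        scores = detection[5:]
        classID = np.argmax(scores)
        confidence = scores[classID]
        if confidence > 0.5:
            # Scale the box back to the original image size and convert the
            # center coordinates to the top-left corner format used later.
            box = detection[0:4] * np.array([W, H, W, H])
            (centerX, centerY, width, height) = box.astype("int")
            x = int(centerX - (width / 2))
            y = int(centerY - (height / 2))
            boxes.append([x, y, int(width), int(height)])
            confidences.append(float(confidence))
            classIDs.append(classID)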
Step 11: Eliminate overlapping or redundant detections.
This step applies Non-Maximum Suppression (NMS) to the detected bounding boxes to eliminate overlapping or redundant detections. NMS is commonly used in object detection tasks to remove duplicate bounding boxes and retain only the most confident, non-overlapping detections.
idxs = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.3): This line applies Non-Maximum Suppression to the detected bounding boxes and their corresponding confidence scores. The cv2.dnn.NMSBoxes function takes the following arguments:
- boxes: The list of bounding box coordinates obtained from the previous code snippet.
- confidences: The list of confidence scores corresponding to each bounding box.
- score_threshold: The threshold used to filter out low-confidence detections. In this case, detections with a confidence score lower than 0.5 will be discarded.
- nms_threshold: The threshold for NMS. This value controls how much overlap between bounding boxes is allowed. If the Intersection over Union (IoU) between two boxes is higher than this threshold, the box with the lower confidence will be suppressed. A value of 0.3 means that if two boxes have an IoU of 0.3 or higher, one of them will be suppressed.
Step 12: Visualize and display the results
The visualization process draws bounding boxes and class labels on the input image for the detected objects.
- (x, y) = (boxes[i][0], boxes[i][1]): Extract the top-left coordinates (x, y) of the bounding box.
- (w, h) = (boxes[i][2], boxes[i][3]): Extract the width and height (w, h) of the bounding box.
- color = [int(c) for c in COLORS[classIDs[i]]]: Obtain the color for drawing the bounding box and label text. The COLORS list contains the predefined set of colors corresponding to the different classes (created in Step 4).
- cv2.rectangle(image, (x, y), (x + w, y + h), color, 2): Draw a rectangle on the image around the detected object using the top-left and bottom-right coordinates. The color determines the color of the bounding box and 2 is the thickness of the rectangle.
- text = "{}: {:.4f}".format(LABELS[classIDs[i]], confidences[i]): Create the text label to display the class name along with the confidence score of the detection.
- cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2): Draw the label text on the image, positioned just above the top-left corner of the bounding box. cv2.FONT_HERSHEY_SIMPLEX specifies the font type, 0.5 the font scale, and 2 the thickness of the text.
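Assembled into one block, the visualization step might look like the sketch below. Displaying the result with cv2_imshow from google.colab.patches is an assumption of this sketch, since the plain cv2.imshow does not work inside Colab notebooks.

from google.colab.patches import cv2_imshow  # Colab-friendly replacement for cv2.imshow

# Draw a box and label for every detection that survived NMS, then show the image.
if len(idxs) > 0:
    for i in idxs.flatten():
        (x, y) = (boxes[i][0], boxes[i][1])
        (w, h) = (boxes[i][2], boxes[i][3])
        color = [int(c) for c in COLORS[classIDs[i]]]
        cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
        text = "{}: {:.4f}".format(LABELS[classIDs[i]], confidences[i])
        cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
cv2_imshow(image)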
Object detection using video
For video, the following code needs to be written instead of Step 7. It uses OpenCV's VideoCapture module to capture frames from a webcam feed: it initializes a video capture object and then enters a loop that continuously reads frames from the webcam.
video_capture = cv2.VideoCapture(0)
while True:
    # Capture frame-by-frame
    re, img = video_capture.read()
    img = cv2.resize(img, None, fx=0.4, fy=0.4)
    height, width, channels = img.shape
Steps 8 to 12 need to be written inside the while loop, but from line 9 of Step 12 onwards, replace the code with the following:
cv2.imshow("Image",cv2.resize(img, (800,600)))
if cv2.waitKey(1) & 0xFF == ord('q'):
break
video_capture.release()
cv2.destroyAllWindows()
This code sets up a loop to continuously read frames from the webcam. It will keep capturing frames until the user presses the ‘q’ key, at which point it will release the video capture resources and close any OpenCV windows that might be open.
Free Datasets
https://paperswithcode.com/datasets?task=object-detection