Yes, Gemini can do object detection

Introduction

5 min read · Oct 26, 2024

We’re familiar with Gemini’s impressive multi-modal abilities, especially when it comes to reasoning with image data — whether that involves captioning, OCR, classification, or identifying specific content within an image.

Unlike its open model counterpart, PaliGemma, Gemini models aren’t explicitly trained for object detection tasks. This fact led me to conduct some experiments and write this blog.

Note: Here, when we talk about object detection, we mean identifying and localizing objects by drawing bounding boxes, much like models such as YOLO, DETR, EfficientDet, Florence-2, and PaliGemma do.

So, without further ado, let’s find out if Gemini can perform object detection and localization. If yes, to what extent?

Prerequisites

We’ll only need the Gemini API key — nothing else. I’m assuming you’re already familiar with the Gemini API. If you’re not, check out this blog to learn how to create your Gemini API key on Google AI Studio.

Open the Colab notebook available in the repository: https://github.com/NSTiwari/Object-Detection-using-Gemini

Step 1: Install the necessary libraries and dependencies

# Install Generative AI SDK.
!pip install -q -U google-generativeai

# Import libraries
from google.colab import userdata
import google.generativeai as genai
import re
from PIL import Image
import cv2
import numpy as np

Step 2: Configure API key and model

Feel free to pick either Gemini 1.5 Flash or Gemini 1.5 Pro, whichever you prefer.

API_KEY = userdata.get('gemini')
genai.configure(api_key=API_KEY)

model = genai.GenerativeModel(model_name='gemini-1.5-pro')

Step 3: Pass input image and text prompt

Keep the text prompt clear and simple, and show the model the exact output format you expect. For this case, we asked Gemini to return the bounding box coordinates like this: [ymin, xmin, ymax, xmax, object_name].

input_image = "image.jpg" # @param {type : 'string'}
img = Image.open(input_image)

response = model.generate_content([
    img,
    (
        "Return bounding boxes for all objects in the image in the following format as"
        " a list. \n [ymin, xmin, ymax, xmax, object_name]. If there are more than one object, return separate lists for each object"
    ),
])

result = response.text
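The experiments later in this post reuse the same prompt with small changes: a different target ("person", "all the characters") and a different label field (object_name, character_name). As a minimal sketch, those variations could be generated with a small helper. The function name build_detection_prompt and its parameters are hypothetical and not part of the Gemini SDK or the original notebook.

```python
def build_detection_prompt(target="all objects", label_field="object_name"):
    """Build a bounding-box prompt in the [ymin, xmin, ymax, xmax, label] style.

    Hypothetical helper: it only assembles a string and makes no API call.
    """
    return (
        f"Return bounding boxes for {target} in the image in the following "
        f"format as a list. \n [ymin, xmin, ymax, xmax, {label_field}]. "
        "If there are more than one object, return separate lists for each object"
    )


# Example: wording similar to the painting experiment later in the post.
prompt = build_detection_prompt(
    target="all the characters", label_field="character_name"
)
print(prompt)
```

The resulting string can be passed to model.generate_content alongside the image, exactly like the literal prompt above.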

Step 4: Parse the model response

def parse_bounding_box(response):
    bounding_boxes = re.findall(r'\[(\d+,\s*\d+,\s*\d+,\s*\d+,\s*[\w\s]+)\]', response)

    # Convert each group into a list of integers and labels.
    parsed_boxes = []
    for box in bounding_boxes:
        parts = box.split(',')
        numbers = list(map(int, parts[:-1]))
        label = parts[-1].strip()
        parsed_boxes.append((numbers, label))

    # Return the list of bounding boxes with their labels.
    return parsed_boxes

bounding_box = parse_bounding_box(result)
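One caveat with the regular expression above: it only accepts labels made of word characters and spaces, so an entry whose label is quoted ("dog") or hyphenated (mountain-bike) would be skipped. The following is a sketch of a slightly more tolerant parser, assuming the model still returns one bracketed list per object. The names BOX_PATTERN and parse_boxes_tolerant and the sample response string are made up for illustration and are not real Gemini output.

```python
import re

# Matches [n, n, n, n, label], where the label may be wrapped in quotes
# and may contain characters such as hyphens.
BOX_PATTERN = re.compile(
    r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*"
    r"[\"']?([^\]\"']+?)[\"']?\s*\]"
)


def parse_boxes_tolerant(text):
    """Return a list of ([ymin, xmin, ymax, xmax], label) tuples."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        *coords, label = match.groups()
        boxes.append(([int(c) for c in coords], label.strip()))
    return boxes


# Hypothetical response text, used only to exercise the parser.
sample = '[120, 85, 640, 410, "dog"]\n[300, 390, 890, 960, mountain-bike]'
print(parse_boxes_tolerant(sample))
```

The return shape matches parse_bounding_box, so the result can be handed to the drawing step unchanged.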

Step 5: Draw bounding boxes

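The bounding box coordinates returned by the model are normalized to a 0-1000 scale, so each one must be converted to pixels: divide it by 1000, then multiply by the image’s width (for x values) or height (for y values).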

label_colors = {}

def draw_bounding_boxes(image, bounding_boxes_with_labels):
    if image.mode != 'RGB':
        image = image.convert('RGB')

    image = np.array(image)

    for bounding_box, label in bounding_boxes_with_labels:

        # Convert the 0-1000 normalized coordinates to pixel coordinates.
        width, height = image.shape[1], image.shape[0]
        ymin, xmin, ymax, xmax = bounding_box
        x1 = int(xmin / 1000 * width)
        y1 = int(ymin / 1000 * height)
        x2 = int(xmax / 1000 * width)
        y2 = int(ymax / 1000 * height)

        if label not in label_colors:
            color = np.random.randint(0, 256, (3,)).tolist()
            label_colors[label] = color
        else:
            color = label_colors[label]

        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 0.5
        font_thickness = 1
        box_thickness = 2
        text_size = cv2.getTextSize(label, font, font_scale, font_thickness)[0]

        text_bg_x1 = x1
        text_bg_y1 = y1 - text_size[1] - 5
        text_bg_x2 = x1 + text_size[0] + 8
        text_bg_y2 = y1


        cv2.rectangle(image, (text_bg_x1, text_bg_y1), (text_bg_x2, text_bg_y2), color, -1)
        cv2.putText(image, label, (x1 + 2, y1 - 5), font, font_scale, (255, 255, 255), font_thickness)
        cv2.rectangle(image, (x1, y1), (x2, y2), color, box_thickness)

    image = Image.fromarray(image)
    return image

output = draw_bounding_boxes(img, bounding_box)
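The coordinate conversion inside draw_bounding_boxes can be checked in isolation. The sketch below applies the same 0-1000 rescaling and additionally clamps the result to the image bounds, which is an extra safeguard that is not in the original code. The helper name to_pixel_box and the example numbers are hypothetical.

```python
def to_pixel_box(box, width, height, scale=1000):
    """Convert [ymin, xmin, ymax, xmax] on a 0-`scale` grid to pixel (x1, y1, x2, y2)."""
    ymin, xmin, ymax, xmax = box

    def clamp(value, upper):
        # Keep the coordinate inside [0, upper] so it always lands on the image.
        return max(0, min(int(value), upper))

    x1 = clamp(xmin * width / scale, width - 1)
    y1 = clamp(ymin * height / scale, height - 1)
    x2 = clamp(xmax * width / scale, width - 1)
    y2 = clamp(ymax * height / scale, height - 1)
    return x1, y1, x2, y2


# A box covering the middle of a 640x480 image.
print(to_pixel_box([250, 100, 750, 900], width=640, height=480))
```

For a 640x480 image, a normalized xmin of 100 maps to pixel column 64 (100 / 1000 * 640) and a ymin of 250 maps to row 120 (250 / 1000 * 480).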

That’s all for the code. Now it’s time to see how it performs on some images. Let’s begin with a simple example.

Image with single object
This is a photo of me. The only object here is the person.
Prompt: Return bounding boxes for person in the image in the following format as a list. [ymin, xmin, ymax, xmax, object_name].

Good start, let’s now try with multiple objects.

Image with multiple objects
An image of a dog and a bike.
Prompt: Return bounding boxes for all the objects in the image in the following format as a list. [ymin, xmin, ymax, xmax, object_name]. If there are more than one object, return separate lists for each object.

Not bad at all. It managed to detect the objects accurately, but these are common ones, right? Let’s challenge Gemini further.

I had an image of the famous painting “Ram Darbar” from the Ramayan. Let’s see if Gemini can identify and detect all the characters.

Prompt: This is a painting of “Ram Darbar” from Ramayan. Return bounding boxes for all the characters in the image in the following format as a list. [ymin, xmin, ymax, xmax, character_name].

Painting of Ram Darbar, Ramayan [Image source: Google Images]

That was fantastic. I was impressed that it not only drew the bounding boxes but also accurately identified each character, especially since I specifically asked for their names.

It’s time to test some unconventional images. I drew Albert Einstein (apologies, that’s the best I could do). Let’s give this a try.

A picture of a drawing
Prompt: Return name and bounding boxes of a famous personality in the image in the following format as a list [ymin, xmin, ymax, xmax, object_name].

Drawing of Albert Einstein [Image by author]

Whoa, guess what? I’m not bad at drawing after all. Or maybe Gemini is just smart enough to recognize this as Einstein. 😜

Feel free to try out the Hugging Face 🤗 Spaces directly.

After a series of tests on different images, from recognizing people and objects to identifying characters in paintings and drawings and localizing them accurately with bounding boxes, Gemini met my expectations for object detection.

I wouldn’t personally compare Gemini to models specifically designed for object detection, as its strengths lie in different areas. However, this experiment satisfied my curiosity: yes, it can handle detection tasks quite well, and in these tests it localized everything I asked for, from everyday objects to characters in a painting and a hand-drawn portrait.

With that, we conclude this blog. I hope you found it interesting. Go ahead and check out the Colab notebook or the HF Spaces, and I’d love to hear your feedback. As always, I’ll return with more interesting blogs for you, but in the meantime, keep learning and growing.