Real-Time Face Detection in Google Colab Using Python & OpenCV

InspirationHighlightMomentum

A step-by-step guide to building a face counter that works directly in your browser — no local setup needed.

📖 Introduction

Building a face detection app sounds like it needs a powerful local machine, a webcam driver, and a full OpenCV setup. But what if you could do it entirely inside Google Colab — for free, in the browser?

In this blog, we'll walk through a Python program that:

Accesses your browser webcam using JavaScript
Captures a real-time frame and passes it to Python
Detects faces using a deep learning model (ResNet-SSD)
Displays the result with bounding boxes and a face count

🧱 Why Not Just Use cv2.VideoCapture(0)?

If you've used OpenCV locally, your first instinct is:

python

cap = cv2.VideoCapture(0) # Open webcam

This works on your local machine — but in Google Colab, it fails silently.

The solution: Use JavaScript to access the browser's webcam, capture a frame, and send it to Python as a base64-encoded image.

📦 Packages Used

1. opencv-python-headless

1pip install opencv-python-headless

OpenCV (Open Source Computer Vision Library) is the backbone of this project. We install the headless variant because:

The regular opencv-python includes GUI dependencies (cv2.imshow, etc.)
Colab has no display server, so those dependencies cause errors
headless gives us all image processing features without the GUI bloat

Used for:

Reading and decoding images (cv2.imdecode)
Drawing bounding boxes (cv2.rectangle, cv2.putText)
Running the DNN face detector (cv2.dnn)
Color space conversion (cv2.cvtColor)

2. IPython.display (built into Colab)

from IPython.display import display, Javascript, Image

This is IPython's display system — it lets us render rich content inside Colab cells.

Class | What it does

Javascript | Executes JS code inside the browser
Image | Renders a raw image (bytes) inline in the notebook
display() | Outputs any of the above into the cell

3. google.colab.output.eval_js (built into Colab)

from google.colab.output import eval_js

This is the bridge between Python and JavaScript in Colab. It:

Runs a JavaScript expression in the browser
Waits for it to resolve (supports Promise)
Returns the result back to Python as a string

This is how we get the webcam image from the browser into Python memory.

4. base64 (Python standard library)

1from base64 import b64decode

When the browser sends an image via JavaScript's canvas.toDataURL(), it returns a base64-encoded string like:

data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ...

We strip the data:image/jpeg;base64, prefix and decode the rest into raw bytes using b64decode.

5. numpy (built into Colab)

1import numpy as np

NumPy is used to convert the raw image bytes into an array that OpenCV can process:

img_array = np.frombuffer(img_bytes, dtype=np.uint8)
frame = cv2.imdecode(img_array, cv2.IMREAD_COLOR)

OpenCV works with NumPy arrays internally — every image is a 3D array of shape (height, width, channels).

🔬 The DNN Face Detector — Why Not Haar Cascade?

We went through two approaches in this project:

❌ Approach 1: Haar Cascade (failed)

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

Haar Cascade is fast and built into OpenCV, but it has serious limitations:

Requires a fully visible, frontal face
Fails with partial faces (cropped at edges)
Struggles with head tilts beyond ~15°
Less accurate with beards or facial hair
Inconsistent with varied skin tones

✅ Approach 2: ResNet-SSD DNN (what we use)

# Download the model files
wget https://raw.githubusercontent.com/opencv/opencv/master/.../deploy.prototxt
wget https://github.com/opencv/opencv_3rdparty/.../res10_300x300_ssd_iter_140000.caffemodel

The ResNet-10 + SSD (Single Shot MultiBox Detector) model is a lightweight deep learning face detector trained by OpenCV. It's stored as two files:

File | Purpose

deploy.prototxt | Defines the neural network architecture

res10_300x300_ssd_iter_140000.caffemodel | Pre-trained weights (140,000 iterations)

Comparison:

Feature | Haar Cascade | ResNet-SSD DNN

Partial faces | ❌ | ✅
Head tilt | ❌ | ✅
Beards | ❌ Sometimes | ✅
Varied lighting | ❌ | ✅
Confidence score | ❌ | ✅

Speed | ⚡ Fast | 🐢 Slightly slower

🧩 Step-by-Step Code Breakdown

Step 1 — Camera Warmup via JavaScript

data = eval_js(f'''
    new Promise((resolve, reject) => {{
        const video = document.createElement('video');
        navigator.mediaDevices.getUserMedia({{ video: true }})
        .then(stream => {{
            video.srcObject = stream;
            video.play();

            const waitReady = setInterval(() => {{
                if (video.videoWidth > 0) {{
                    clearInterval(waitReady);

                    // Warmup: capture N frames before saving
                    let count = 0;
                    const canvas = document.createElement('canvas');
                    const warmup = setInterval(() => {{
                        canvas.getContext('2d').drawImage(video, 0, 0);
                        if (++count >= {warmup_frames}) {{
                            clearInterval(warmup);
                            stream.getTracks().forEach(t => t.stop());
                            resolve(canvas.toDataURL('image/jpeg', 0.95));
                        }}
                    }}, {interval_ms});
                }}
            }}, 100);
        }});
    }})
''')

Why the warmup? Webcams use auto-exposure — the first frame is often underexposed (very dark) because the sensor hasn't adjusted yet. By capturing 15 frames at 150ms intervals (~2.2 seconds total), we allow the camera's auto-exposure to stabilize before taking the actual image.

Step 2 — Decode the Image

blob = cv2.dnn.blobFromImage(
    cv2.resize(frame, (300, 300)),  # model expects 300x300
    scalefactor=1.0,
    size=(300, 300),
    mean=(104.0, 177.0, 123.0)     # BGR mean subtraction (ImageNet values)
)

blobFromImage does three things:

Resizes the image to 300×300 (required input size)
Subtracts mean values per channel — this normalizes pixel values and improves accuracy
Returns a 4D blob of shape (1, 3, 300, 300) — batch size, channels, height, width

Step 4 — Run the Detector

net.setInput(blob)
detections = net.forward()  # shape: (1, 1, 200, 7)

The output tensor has shape (1, 1, N, 7) where each detection is:

[batch, _, detection_id, class, confidence, x1, y1, x2, y2]

We filter by confidence threshold:

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:   # only keep detections > 50% confident
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])

The bounding box coordinates are normalized (0.0 to 1.0), so we multiply by the original image dimensions to get pixel coordinates.

Step 5 — Draw Results and Display

cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 200, 0), 2)
cv2.putText(frame, f"{confidence*100:.0f}%", (x1, y1-8),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 200, 0), 2)

# Display in Colab
_, buffer = cv2.imencode('.jpg', frame)
display(Image(data=buffer.tobytes()))

Since Colab has no window system, we encode the frame as a JPEG byte buffer and display it using IPython's Image class directly in the cell output.

🎛️ Tuning Parameters

Parameter | Default | Effect

warmup_frames | 15 |More frames = better exposure, but slower

interval_ms | 150 | Delay between warmup frames

confidence_threshold | 0.5

Lower = more detections, higher = fewer false positives

If faces are being missed:

python
result, count = detect_faces_dnn(frame, confidence_threshold=0.3)

If too many false positives:

python
result, count = detect_faces_dnn(frame, confidence_threshold=0.7)

💡 Possible Improvements

Idea | How

Continuous video stream |Loop capture_frame() in a cell with clear_output()
Face recognition | Use face_recognition lib or DeepFace to identify who each face is
Save output image | cv2.imwrite('result.jpg', result_frame) then download
Count faces in uploaded image | Replace webcam capture with a file upload widget
Use a better model | Try YOLOv8-face or MediaPipe Face Detection for higher accuracy

✅ Final Thoughts

This project demonstrates a common real-world pattern: bridging two runtimes (browser JS + server-side Python) to solve a problem that neither can solve alone. Colab's eval_js is the key that makes it work.

The switch from Haar Cascade to ResNet-SSD was also an important lesson — always match your detector to your real-world conditions. A model that works perfectly in demos can completely fail on real webcam footage.

Built with Python, OpenCV, and Google Colab. No local setup required.