RTSP to Web for Edge

Mike Ensor
7 min read · Feb 13, 2024

TL;DR —

  • Methods to expose RTSP feeds are complex, immature, and/or legacy; very few work for PWAs or modern browser-based applications
  • WebRTC and HTTP Live Streaming (HLS) are complex or too slow; MJPEG is a simple, elegant solution that can run in an edge location
  • Proxying RTSP to MJPEG can be done with OpenCV + Python (or Rust)
  • All code & Docker images are available in this repo: https://gitlab.com/mike-ensor/rtsp-to-mjpeg

It’s 2024 and we still don’t have a simple answer for how to expose an RTSP feed and consume it in a browser. Over the last 5 years, I have spent a lot of time on the “edge”, where workloads run and communicate with LAN-based services such as video cameras, SCADAs, Mini PCs, and more. For many reasons, the cameras cannot be exposed to the cloud for AI/ML or other purposes. This is one of the key reasons why edge compute has become a hot-ticket item over the last few years. The obvious reason is latency, but a close second is internet bandwidth or intermittent internet connectivity. Additionally, more governments are restricting the movement of data out of a location if it contains PII (Personally Identifiable Information, like the gait of a human or a face). The rules get even stricter for humans 13 and under. For these purposes, edge compute is a good answer, until you realize the industry tooling is complex, legacy, and often incomplete or immature.

One of those incomplete cases is an easy way to expose an RTSP feed for use by a PWA (Progressive Web App) or any other browser-based application. Our goal is to take an RTSP feed and consume it within a PWA or modern browser-based application. Furthermore, the application should run in a Kubernetes cluster or in any Docker-based system.

WebRTC

We’ll start with the quick 2024 answer: use WebRTC to establish a peer-to-peer connection with the camera resource. On paper this looks simple, until you dig deeper and find STUN and ICE servers, the mechanisms used to look up IPs for P2P connections. If you are in an internet-challenged location, a public STUN server (e.g., stun.l.google.com) may not be reachable at the time of the query, or may not be reachable at all if you have an air-gapped system. It is possible to set up P2P connections inside a LAN, but there is a lot of complexity and it feels like you need a PhD in media technologies to decipher it. Between the public-access requirement and the complexity, WebRTC is not an easy (or even possible) solution here.

HTTP Live Streaming (HLS)

The next option is HTTP Live Streaming (HLS), where an application like ffmpeg or gstreamer reads the RTSP feed and produces small chunks of video that are then consumed by the browser. This is relatively straightforward and there are many examples to be found on Google (one is sketched below). The challenge is that even with 1s chunks, the client-side HLS feed is at least 1 second behind real-time. Adding to that challenge, the default HTML5 video element will start streaming from the first chunk, which could be many seconds to minutes behind live if the .ts files are not cleaned up. In some cases this is not an issue, so if your use case can withstand 5–60 seconds of latency behind live (e.g., streaming movies or playback of videos), you should use any of the many examples. If you need near real-time (within a second of “live”), you need a third option.
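To make the HLS option concrete, here is a minimal sketch that drives ffmpeg from Python. The camera URL, chunk length, and playlist size are illustrative assumptions; ffmpeg must be installed, and -c:v copy assumes the camera already emits a browser-friendly codec such as H.264.

# A minimal HLS sketch (assumptions: ffmpeg on PATH, hypothetical camera URL,
# camera already produces H.264 so no re-encode is needed)
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "rtsp://camera.local/stream",   # hypothetical RTSP source
    "-c:v", "copy",                       # no re-encode, keeps CPU and latency down
    "-f", "hls",
    "-hls_time", "1",                     # 1-second .ts chunks
    "-hls_list_size", "5",                # keep only the last 5 chunks in the playlist
    "-hls_flags", "delete_segments",      # clean up old .ts files (the latency trap above)
    "playlist.m3u8",
])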

M(otion)JPEG

Motion JPEG is a loosely defined technique where the browser consumes an image SRC but is handed a new image over and over. As in motion pictures, the sequence of images produces a video feed. The challenge then becomes taking the feed frame by frame (or a subset of frames) and producing a stream of images back to the endpoint.

Approach

OpenCV provides a mostly simple interface to take in an RTSP feed, extract frames, then push those to the endpoint. While this can be done in C or Rust, Python is a relatively easy language and has a good community behind it. Our approach will be to set up a small API to start or stop the feed, create a thread that handles the RTSP frame extraction, and use message passing between the main thread and the frame thread. The reason for multiple threads is to let the frame thread focus on just frames, while the main thread responds to incoming API requests and pushes the images back to the browser. A sketch of this wiring follows.
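Before diving into the snippets from the repo, here is a minimal sketch (not the repo’s exact code) of how such a class could be wired up. The VideoStream name and the pipe-end indices are assumptions for illustration; vcap, pipe, t, and CAPTURE match the attributes used in the snippets below.

# A minimal sketch, not the repo's exact code: wiring the capture,
# the pipe, and the frame-pushing thread used in the snippets below
import multiprocessing
import threading

import cv2

# Assumed pipe-end indices; the repo may define these differently
FRAME_READING_PIPE = 0    # receive end, read by the main/API thread
FRAME_PRODUCING_PIPE = 1  # send end, written by the frame thread

class VideoStream:
    def __init__(self, rtsp_url):
        self.CAPTURE = False
        # OpenCV handles the RTSP negotiation and frame decoding
        self.vcap = cv2.VideoCapture(rtsp_url)
        # One-way pipe: index 1 sends frames, index 0 receives them
        self.pipe = multiprocessing.Pipe(duplex=False)
        # Daemon thread so it terminates with the main process;
        # push_frame_thread is the method shown in the next snippet
        self.t = threading.Thread(target=self.push_frame_thread, daemon=True)

    def is_initialized(self):
        return self.vcap is not None and self.vcap.isOpened()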

First, let’s address the API. We will want endpoints for “Start” and “Stop” of the feed so we’re not consuming resources unless we need to (a sketch of these routes follows). The way Python threads work, there is a “start” but no way to “stop” a thread from the outside; the thread has to terminate on its own (naturally or via a signal it checks), or end when the spawning process completes.
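The repo’s start/stop routes aren’t shown in this post, so here is a hedged sketch of what they could look like, reusing the bp blueprint and VIDEO_FEED global that appear in the /stream route later. The paths, methods, and JSON responses are assumptions, not the repo’s exact code.

# A hedged sketch of start/stop endpoints; the paths, methods, and JSON
# responses are assumptions, not the repo's exact code
@bp.route("/start", methods=["POST"])
def start_stream():
    global VIDEO_FEED
    VIDEO_FEED.start()  # sets CAPTURE = True and starts the frame thread
    return {"status": "started"}

@bp.route("/stop", methods=["POST"])
def stop_stream():
    global VIDEO_FEED
    VIDEO_FEED.stop()  # sets CAPTURE = False; the frame loop exits on its own
    return {"status": "stopped"}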

Object/Class to Contain Video Capture

To control the thread, I recommend using a variable on the class/object that controls the video capture, then exiting the “read images from RTSP feed” loop when the variable is no longer true; see the line while self.CAPTURE is True: below. When stop() flips the variable, the loop exits. When start() is called, the variable is set to True and the thread begins pulling frames from the RTSP feed.

# https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/main/stream.py?ref_type=heads#L84

# NOTE: This is the function being run by the thread, not called directly
def push_frame_thread(self):
    if self.is_initialized() is not True:
        log.warning("[RTSP Pushing Frames]: RTSP stream is not ready")
    else:
        log.info("[RTSP Pushing Frames]: Starting RTSP Frame Pushing")
        while self.CAPTURE is True:
            grabbed, frame = self.vcap.read()
            if grabbed is False:
                log.debug("[RTSP Pushing Frames]: Video frame not found")
                continue
            try:
                # Hand the raw frame to the sending end of the pipe
                self.pipe[FRAME_PRODUCING_PIPE].send(frame)
            except IOError as e:
                # Reader end went away (e.g., client disconnected); drop the frame
                if e.errno == errno.EPIPE:
                    pass

        log.info("[RTSP Pushing Frames]: Closing RTSP stream")
        self.vcap.release()

# https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/main/stream.py?ref_type=heads#L113
def start(self):
    self.CAPTURE = True
    self.t.start()  # Start the thread (now the pipe will start receiving frames)

# https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/main/stream.py?ref_type=heads#L117
def stop(self):
    self.CAPTURE = False
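One caveat: a Python threading.Thread object can only be started once, so if the feed has to survive a stop/start cycle, start() needs to create a fresh thread on each call. A hedged variant (not the repo’s code):

# A hedged restartable variant of start(); a threading.Thread cannot be
# .start()ed twice, so a new one is created on every call
def start(self):
    self.CAPTURE = True
    self.t = threading.Thread(target=self.push_frame_thread, daemon=True)
    self.t.start()  # the pipe begins receiving frames again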

In this example, each frame is sent into self.pipe[FRAME_PRODUCING_PIPE]. The other end of the pipe is read by the function below, which receives from the pipe and yields an image back to the endpoint function.

# https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/main/stream.py?ref_type=heads#L66

# Called from the API frontend function GET /stream
def render_frame_from_pipe(self):
    while self.is_full_ready():
        frame = self.pipe[FRAME_READING_PIPE].recv()

        # Encode the raw OpenCV frame as a JPEG
        (_, frame_jpg) = cv2.imencode(".jpg", frame)

        out_frame = frame_jpg.tobytes()
        # Yield the frame back to the caller with the `image/jpeg` mime-type
        yield b"--frame\r\n" b"Content-type: image/jpeg\r\n\r\n" + out_frame + b"\r\n"

To round out the server side, we need to take render_frame_from_pipe and wrap it in an API endpoint, /stream. In this example, the stream() function backs the URL /stream and produces a stream of images with the multipart/x-mixed-replace; boundary=frame mimetype. This is important; it tells the browser to expect a stream of images. The function also sends back a “no-feed” image to indicate that the stream has not been started or has an error.

# https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/main/routes.py?ref_type=heads#L84

@bp.route("/stream", methods=["GET"])
@cross_origin()
def stream():
    global VIDEO_FEED

    if not VIDEO_FEED.is_initialized():
        log.warning("/stream called and stream has not been initialized, sending static no-feed.jpg")
        return send_file("static/no-feed.jpg", mimetype="image/jpeg")
    elif not VIDEO_FEED.is_stream_ready():
        log.debug("/stream called and stream HAS been initialized, but is NOT ready")
        return send_file("static/no-feed.jpg", mimetype="image/jpeg")
    else:
        return Response(
            VIDEO_FEED.render_frame_from_pipe(), mimetype="multipart/x-mixed-replace; boundary=frame"
        )
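Before wiring up the browser, it can be handy to sanity-check the endpoint from a script. A minimal sketch, assuming the Flask app listens on localhost:5000 (the host and port are assumptions; adjust for your deployment):

# A minimal sanity-check client for the MJPEG stream; host/port are assumptions
import requests

resp = requests.get("http://localhost:5000/stream", stream=True, timeout=10)
buffer = b""
for chunk in resp.iter_content(chunk_size=4096):
    buffer += chunk
    # Each part looks like: --frame\r\nContent-type: image/jpeg\r\n\r\n<jpeg>\r\n
    start = buffer.find(b"\xff\xd8")  # JPEG start-of-image marker
    end = buffer.find(b"\xff\xd9")    # JPEG end-of-image marker
    if start != -1 and end != -1 and end > start:
        jpg = buffer[start:end + 2]
        buffer = buffer[end + 2:]
        print(f"received frame: {len(jpg)} bytes")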

Browser Requests

After we have the RTSP frames being pushed into a pipe, and that pipe yielding the images to the caller, let’s focus on how to use the image stream with browser components. We have a few options for displaying the images, each with pros and cons depending on the use case.

Our first approach is a canvas wrapped by a div, using the CSS property background-image: url(/stream) on the div. This allows JavaScript to interact with the canvas and draw bounding boxes around the objects in the stream. This was the reason for writing this blog: I wanted to draw bounding boxes from an inference server for identified classifiers (let me know if you would like me to write a blog on this). Below is an example of the div + canvas method.

<!-- https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/templates/index.html?ref_type=heads#L25 -->
<style>
  .overlay-image {
    position: absolute;
    top: 0px; /* Relative to the `.image-container` div */
    left: 0px; /* Relative to the `.image-container` div */

    z-index: 100; /* Ensure overlay is on top */
  }

  .image-container {
    position: relative;
    width: 640px; /* Set to the canvas size (could use JS) */
    height: 480px; /* Set to the canvas size (could use JS) */
    background-image: url("/stream");
    background-position: left top;
    background-size: 640px 480px;
    background-repeat: no-repeat;
  }
</style>

<!-- https://gitlab.com/mike-ensor/rtsp-to-mjpeg/-/blob/main/app/templates/index.html?ref_type=heads#L74 -->
<div class="image-container">
  <canvas id="canvas" width="640" height="480" class="overlay-image"></canvas>
</div>

For simplicity, I will not demonstrate an img with a canvas overlay, as it is very similar in concept. The only difference is that instead of a background-image, the image SRC is used: <img src="/stream"...>

With a canvas, it is not too complex to get the 2d context and draw rectangles over areas of the canvas to display bounding boxes or “nerd statistics” (e.g., FPS and data throughput).

Conclusion

Networks, browsers, and streaming protocols are complex, but there is usually a way to solve your challenge if you can break the problem down into smaller components and address each one. In our example, we took an RTSP feed, broke it down into individual frames, and used those frames to re-create a stream that a browser could consume.

Feel free to use and extend the working example in the repository https://gitlab.com/mike-ensor/rtsp-to-mjpeg and the Docker images hosted at https://gitlab.com/mike-ensor/rtsp-to-mjpeg/container_registry . Note, there is no warranty or support provided for these containers or code. This is OSS software that can be copied and re-used, but there is no implicit or explicit support or any liability associated with its use. Evaluate the software and use it at your own risk.
