Open‑YOLO 3D uses 2D object detection instead of heavy SAM/CLIP for open‑vocabulary 3D segmentation, achieving SOTA results with up to 16× faster inference.

No SAM, No CLIP, No Problem: How Open‑YOLO 3D Segments Faster

2025/08/26 16:10

:::info Authors:

(1) Mohamed El Amine Boudjoghra, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (mohamed.boudjoghra@mbzuai.ac.ae);

(2) Angela Dai, Technical University of Munich (TUM) (angela.dai@tum.de);

(3) Jean Lahoud, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (jean.lahoud@mbzuai.ac.ae);

(4) Hisham Cholakkal, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (hisham.cholakkal@mbzuai.ac.ae);

(5) Rao Muhammad Anwer, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Aalto University (rao.anwer@mbzuai.ac.ae);

(6) Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (salman.khan@mbzuai.ac.ae);

(7) Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (fahad.khan@mbzuai.ac.ae).

:::

Abstract and 1 Introduction

  2. Related works
  3. Preliminaries
  4. Method: Open-YOLO 3D
  5. Experiments
  6. Conclusion and References

A. Appendix

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that text prompts can be matched to 3D masks both more accurately and faster with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to ∼16× speedup compared to the best existing method in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.

1 Introduction

3D instance segmentation is a computer vision task that involves the prediction of masks for individual objects in a 3D point cloud scene. It holds significant importance in fields like robotics and augmented reality. Due to its diverse applications, this task has garnered increasing attention in recent years. Researchers have long focused on methods that typically operate within a closed-set framework, limiting their ability to recognize objects not present in the training data. This constraint poses challenges, particularly when novel objects must be identified or categorized in unfamiliar environments. Recent methods [34, 42] address the problem of novel class segmentation, but they suffer from slow inference, ranging from 5 minutes for small scenes to 10 minutes for large scenes, due to their reliance on computationally heavy foundation models like SAM [23] and CLIP [55], along with the heavy computation required to lift 2D CLIP features to 3D.

Figure 1: Open-vocabulary 3D instance segmentation with our Open-YOLO 3D. The proposed Open-YOLO 3D is capable of segmenting objects in a zero-shot manner. Here, we show the output for a ScanNet200 [38] scene with various prompts, where our model yields improved performance compared to the recent Open3DIS [34]. We show zoomed-in images of hidden predicted instances in the colored boxes. Additional results are in Figure 4 and the supplementary material.

Open-vocabulary 3D instance segmentation is important for robotics tasks such as material handling, where the robot is expected to perform operations from text-based instructions, like moving specific products, loading and unloading goods, and managing inventory, while making decisions quickly. Although state-of-the-art open-vocabulary 3D instance segmentation methods show high promise in terms of generalizability to novel objects, their inference still takes minutes due to their reliance on heavy foundation models such as SAM. Motivated by recent advances in 2D object detection [7], we look into an alternative approach that leverages fast object detectors instead of computationally expensive foundation models.

This paper proposes a novel open-vocabulary 3D instance segmentation method, named Open-YOLO 3D, that utilizes efficient, joint 2D-3D reasoning, using 2D bounding box predictions to replace computationally heavy segmentation models. We employ an open-vocabulary 2D object detector to generate bounding boxes with their class labels for all frames corresponding to the 3D scene. In parallel, we utilize a 3D instance segmentation network to generate class-agnostic 3D instance masks for the point clouds, which proves to be much faster than methods that generate 3D proposals from 2D instances [34, 32]. Unlike recent methods [42, 34], which use SAM and CLIP to lift 2D CLIP features to 3D for prompting the 3D mask proposals, we propose an alternative approach that relies on the bounding box predictions from 2D object detectors, which prove to be significantly faster than CLIP-based methods. We utilize the predicted bounding boxes in all RGB frames corresponding to the point cloud scene to construct a Low Granularity (LG) label map for every frame. An LG label map is a two-dimensional array with the same height and width as the RGB frame, with the bounding box areas replaced by their predicted class labels. Next, we use the camera intrinsic and extrinsic parameters to project the point cloud scene onto the respective LG label maps, with top-k visibility, for final class prediction. We present an example output of our method in Figure 1. Our contributions are as follows:

• We introduce a 2D object detection-based approach for open-vocabulary labeling of 3D instances, which greatly improves the efficiency compared to 2D segmentation approaches.

• We propose a novel approach to scoring 3D mask proposals using only bounding boxes from 2D object detectors.

• Our Open-YOLO 3D achieves superior performance on two benchmarks, while being considerably faster than existing methods in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves an absolute gain of 2.3% at mAP50 while being ∼16× faster compared to the recent Open3DIS [34].
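
To make the Low Granularity label map and the projection step from the introduction more concrete, the following is a minimal sketch and not the authors' implementation. The function names, the background value of -1, the pixel box format `(x_min, y_min, x_max, y_max)`, the choice to paint larger boxes first so that smaller boxes win on overlap, and the simple in-frame point count used as a stand-in for top-k visibility (no occlusion test) are all assumptions made for illustration.

```python
import numpy as np

def build_label_map(height, width, boxes, labels, background=-1):
    """Fill each box region of an (H, W) array with its class id."""
    label_map = np.full((height, width), background, dtype=np.int32)
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    # Assumption: paint large boxes first so smaller ones stay on top.
    for i in sorted(range(len(boxes)), key=lambda j: -areas[j]):
        x0, y0, x1, y1 = boxes[i]
        label_map[y0:y1, x0:x1] = labels[i]
    return label_map

def project_points(points_world, intrinsics, world_to_cam):
    """Pinhole projection of (N, 3) points to pixel coordinates and depth."""
    ones = np.ones((len(points_world), 1))
    cam = (world_to_cam @ np.hstack([points_world, ones]).T).T[:, :3]
    depth = cam[:, 2]
    safe = np.where(depth > 0, depth, 1.0)  # avoid dividing by zero
    uv = (intrinsics @ cam.T).T
    u = np.floor(uv[:, 0] / safe).astype(int)
    v = np.floor(uv[:, 1] / safe).astype(int)
    return u, v, depth

def vote_label(instance_points, frames, k=5, background=-1):
    """Majority vote over the k frames in which the instance is most visible.

    frames: list of (label_map, intrinsics, world_to_cam) tuples.
    """
    per_frame = []
    for label_map, intrinsics, world_to_cam in frames:
        h, w = label_map.shape
        u, v, depth = project_points(instance_points, intrinsics, world_to_cam)
        inside = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # Visibility here is just the number of points landing in the frame.
        per_frame.append((int(inside.sum()), label_map[v[inside], u[inside]]))
    if not per_frame:
        return background
    per_frame.sort(key=lambda item: item[0], reverse=True)
    votes = np.concatenate([seen for _, seen in per_frame[:k]])
    votes = votes[votes != background]
    if votes.size == 0:
        return background
    ids, counts = np.unique(votes, return_counts=True)
    return int(ids[np.argmax(counts)])
```

In this sketch, each class-agnostic 3D mask would be passed to `vote_label` as an `(N, 3)` array of its points, and the returned class id would index into the list of text prompts given to the 2D detector. No 2D segmentation model or CLIP feature is involved at any step, which is the source of the efficiency gain the text describes.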

:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::
