Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

2025/11/20 00:00

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising L = 12 layers. In contrast to typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as input. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP-2 [22]. After projection, they serve as visual prompt embeddings for the LLM input.
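To make the extraction step concrete, the sketch below shows how a small set of learnable queries can pull information out of a larger set of visual patch embeddings via cross-attention. This is a minimal single-head NumPy illustration, not the actual BLIP-2 implementation; the dimensions (`d`, `d_v`, `d_h`, the patch count `N`) and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, visual, Wq, Wk, Wv):
    # queries: (R, d)  learnable query embeddings
    # visual:  (N, d_v) patch embeddings from a frozen image encoder
    Q = queries @ Wq                          # (R, d_h)
    K = visual @ Wk                           # (N, d_h)
    V = visual @ Wv                           # (N, d_h)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (R, N) query-to-patch affinities
    attn = softmax(scores, axis=-1)           # each query distributes weight over patches
    return attn @ V                           # (R, d_h): one visual summary per query

rng = np.random.default_rng(0)
R, d, N, d_v, d_h = 32, 768, 257, 1408, 768   # hypothetical sizes
queries = rng.standard_normal((R, d)) * 0.02
visual = rng.standard_normal((N, d_v))
Wq = rng.standard_normal((d, d_h)) * 0.02
Wk = rng.standard_normal((d_v, d_h)) * 0.02
Wv = rng.standard_normal((d_v, d_h)) * 0.02
out = cross_attention(queries, visual, Wq, Wk, Wv)
print(out.shape)  # (32, 768)
```

The key point is the shape contract: however many patches come in, exactly R = 32 vectors come out, which is what makes the queries usable as a fixed-length visual prompt.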

Inside the QFormer, each layer contains a self-attention module composed of a multi-head attention component and a feed-forward module (linear projections with LayerNorm and residual connections). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and feed-forward modules into self-(cross-)attention modules, and we illustrated only the modifications MIVPG makes to the cross-attention module, since the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
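The layer stacking described above can be sketched as follows. This is a toy NumPy sketch of the control flow only, assuming visual embeddings already projected to the hidden size `d`; the attention and feed-forward bodies are simplified single-head stand-ins for the real multi-head modules, and G = 2 is the insertion period used in BLIP-2.

```python
import numpy as np

d = 768  # hypothetical hidden size

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attend(q, kv):
    # single-head attention stand-in for the multi-head modules
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(-1, keepdims=True)
    return attn @ kv

def qformer_forward(queries, visual, L=12, G=2):
    x = queries
    for i in range(L):
        x = layer_norm(x + attend(x, x))          # self-attention + residual + LayerNorm
        if i % G == 0:                            # cross-attention inserted every G layers
            x = layer_norm(x + attend(x, visual)) # queries interact with visual embeddings
        x = layer_norm(x + np.tanh(x))            # toy stand-in for the feed-forward module
    return x                                      # last layer's query embeddings

rng = np.random.default_rng(1)
out = qformer_forward(rng.standard_normal((32, d)), rng.standard_normal((100, d)))
print(out.shape)  # (32, 768)
```

Note that only the cross-attention sublayers touch the visual input; the self-attention and feed-forward sublayers operate on the queries alone, which is why MIVPG's modifications are confined to the cross-attention module.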

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

