{{Short description|Type of artificial intelligence system}} {{Machine learning bar}} A '''vision–language model (VLM)''' is a type of artificial intelligence system that can jointly interpret and generate information from both images and text, extending the capabilities of large language models (LLMs), which are limited to text. It is an example of multimodal learning.
Many widely used commercial applications now rely on this ability. OpenAI introduced computer vision capabilities to its GPT-4V variant of the GPT-4 model, enabling users to incorporate uploaded photographs or diagrams into their discussions with ChatGPT. It has since become an integral part of ChatGPT's standard offering. Similar capabilities were added to Google’s Gemini, Anthropic’s Claude 3 Opus,<ref name="Vision">{{Cite web |title=Vision |url=https://docs.claude.com/en/docs/build-with-claude/vision |access-date=2025-10-15 |website=Claude Docs |language=en}}</ref> and Microsoft’s Copilot with Vision.<ref name="support.microsoft.com">{{Cite web |title=Using Copilot Vision with Microsoft Copilot - Microsoft Support |url=https://support.microsoft.com/en-au/topic/using-copilot-vision-with-microsoft-copilot-3c67686f-fa97-40f6-8a3e-0e45265d425f |access-date=2025-10-15 |website=support.microsoft.com}}</ref> Alongside these models, several open-source vision–language models—such as LLaVA,<ref name=":0">{{cite arXiv |last1=Liu |first1=Haotian |title=Visual Instruction Tuning |date=2023-12-11 |eprint=2304.08485 |last2=Li |first2=Chunyuan |last3=Wu |first3=Qingyang |last4=Lee |first4=Yong Jae |class=cs.CV }}</ref> InstructBLIP,<ref name=":1">{{cite arXiv |last1=Dai |first1=Wenliang |title=InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning |date=2023-06-15 |eprint=2305.06500 |last2=Li |first2=Junnan |last3=Li |first3=Dongxu |last4=Tiong |first4=Anthony Meng Huat |last5=Zhao |first5=Junqi |last6=Wang |first6=Weisheng |last7=Li |first7=Boyang |last8=Fung |first8=Pascale |last9=Hoi |first9=Steven |class=cs.CV }}</ref> and MiniGPT-4<ref name=":2">{{cite arXiv |last1=Zhu |first1=Deyao |title=MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models |date=2023-10-02 |eprint=2304.10592 |last2=Chen |first2=Jun |last3=Shen |first3=Xiaoqian |last4=Li |first4=Xiang |last5=Elhoseiny |first5=Mohamed |class=cs.CV 
}}</ref>—have been released by the research community, offering smaller-scale alternatives for experimentation and academic study.
== History == Vision language models evolved from ''image captioning'' systems. Such systems were designed to take images alone (without accompanying instructions), and produce descriptions.
Most image captioning systems used an encoder-decoder architecture, where an ''encoder'' summarized images into feature vectors, which were fed to a ''decoder'' to generate the associated description. Early methods (early 2010s) combined handcrafted visual features to encode images with n-gram or rule-based text templates to generate descriptions.<ref>{{Cite book |last1=Kulkarni |first1=Girish |last2=Premraj |first2=Visruth |last3=Dhar |first3=Sagnik |last4=Li |first4=Siming |last5=Choi |first5=Yejin |last6=Berg |first6=Alexander C |last7=Berg |first7=Tamara L |chapter=Baby talk: Understanding and generating simple image descriptions |date=25 June 2011 |title=CVPR 2011 |pages=1601–1608 |doi=10.1109/CVPR.2011.5995466 |isbn=978-1-4577-0394-2 }}</ref><ref>{{Cite book |last1=Paragios |first1=Nikos |title=Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV |last2=Daniilidis |first2=Kostas |last3=Maragos |first3=Petros |date=2010 |publisher=Springer Berlin Heidelberg Springer e-books |isbn=978-3-642-15561-1 |series=Lecture Notes in Computer Science |location=Berlin, Heidelberg}}</ref>
With the rise of deep learning, neural networks became dominant in image captioning. In 2015, methods emerged that used variations of convolutional neural networks (CNN) to encode images, and recurrent neural networks (RNN) to generate the captions.<ref>{{Cite journal |last1=Vinyals |first1=Oriol |last2=Toshev |first2=Alexander |last3=Bengio |first3=Samy |last4=Erhan |first4=Dumitru |date=2015 |title=Show and Tell: A Neural Image Caption Generator |url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html |journal=CVPR 2015 |pages=3156–3164}}</ref><ref>{{Cite journal |last1=Xu |first1=Kelvin |last2=Ba |first2=Jimmy Lei |last3=Kiros |first3=Ryan |last4=Cho |first4=Kyunghyun |last5=Courville |first5=Aaron |last6=Salakhutdinov |first6=Ruslan |last7=Zemel |first7=Richard S. |last8=Bengio |first8=Yoshua |date=2015-07-06 |title=Show, attend and tell: neural image caption generation with visual attention |url=https://dl.acm.org/doi/10.5555/3045118.3045336 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning |series=ICML'15 |location=Lille, France |publisher=JMLR.org |pages=2048–2057}}</ref> By 2018, transformer networks replaced RNNs in the role of language decoders.<ref>{{cite arXiv |last1=Lu |first1=Jiasen |title=ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks |date=2019-08-06 |eprint=1908.02265 |last2=Batra |first2=Dhruv |last3=Parikh |first3=Devi |last4=Lee |first4=Stefan |class=cs.CV }}</ref> Importantly, training of network parameters was based on datasets of image-text pairs, like MS COCO.<ref name=":3">{{Cite web |title=COCO - Common Objects in Context |url=https://cocodataset.org/ |access-date=2025-10-20 |website=cocodataset.org}}</ref> The scope of applications was also broadened, to include ''visual question answering'' (VQA),<ref>{{Cite web |title=Vision-Language Models (VLM) vs Visual Question Answering 
(VQA) in 2025? |url=https://www.gravio.com/en-blog/vision-language-models-vlm-vs-visual-question-answering-vqa-in-2025#:~:text=Visual%20Question%20Answering%20(VQA)%20is,answering%20based%20on%20visual%20content. |access-date=2025-10-20 |website=www.gravio.com}}</ref> ''phrase grounding''<ref>{{Cite web |date=2024-11-13 |title=What is Phrase Grounding? |url=https://blog.roboflow.com/what-is-phrase-grounding/ |access-date=2025-10-21 |website=Roboflow Blog |language=en}}</ref> and others.
In 2021, OpenAI's release of CLIP (Contrastive Language–Image Pretraining) was a major step towards the later evolution of VLMs. Rather than focus on a specific task like image captioning, CLIP is a general-purpose foundation model which can be extended to a broad range of downstream tasks. Importantly, CLIP's components were trained on a vast dataset of 400 million image-text pairs, producing powerful models. CLIP's general-purpose structure also places this powerful capability at the disposal of systems with far smaller computational budgets.
Starting in 2022, many VLM architectures were proposed, based on similar design philosophies (elaborated below). These included Google DeepMind's proprietary Flamingo<ref name=":4">{{cite arXiv |last1=Alayrac |first1=Jean-Baptiste |title=Flamingo: a Visual Language Model for Few-Shot Learning |date=2022-11-15 |eprint=2204.14198 |last2=Donahue |first2=Jeff |last3=Luc |first3=Pauline |last4=Miech |first4=Antoine |last5=Barr |first5=Iain |last6=Hasson |first6=Yana |last7=Lenc |first7=Karel |last8=Mensch |first8=Arthur |last9=Millican |first9=Katie |class=cs.CV }}</ref> and an open-source variant,<ref>{{cite arXiv |last1=Awadalla |first1=Anas |title=OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models |date=2023-08-07 |eprint=2308.01390 |last2=Gao |first2=Irena |last3=Gardner |first3=Josh |last4=Hessel |first4=Jack |last5=Hanafy |first5=Yusuf |last6=Zhu |first6=Wanrong |last7=Marathe |first7=Kalyani |last8=Bitton |first8=Yonatan |last9=Gadre |first9=Samir |class=cs.CV }}</ref> LLaVA,<ref name=":0" /> Salesforce's InstructBLIP,<ref name=":1" /> Microsoft's Kosmos,<ref>{{cite arXiv |last1=Huang |first1=Shaohan |title=Language Is Not All You Need: Aligning Perception with Language Models |date=2023-03-01 |eprint=2302.14045 |last2=Dong |first2=Li |last3=Wang |first3=Wenhui |last4=Hao |first4=Yaru |last5=Singhal |first5=Saksham |last6=Ma |first6=Shuming |last7=Lv |first7=Tengchao |last8=Cui |first8=Lei |last9=Mohammed |first9=Owais Khan |class=cs.CL }}</ref> KAUST's MiniGPT-4<ref name=":2" /> and others. All of these combined a separately trained CLIP-like image encoder with an off-the-shelf large language model (LLM) for text encoding, stitched together using specialized components. The resulting joint system was trained on curated datasets.
The release of GPT-4V in 2023 marked the emergence of highly impactful commercial applications. It was quickly followed by the other systems mentioned above (including Google’s Gemini, Anthropic’s Claude 3 Opus,<ref name="Vision"/> and Microsoft’s Copilot with Vision<ref name="support.microsoft.com"/>). These applications are substantially more capable on general-purpose tasks: they typically contain far more parameters, are trained on massive datasets, and require enormous compute power. Their architectures have not been disclosed.
== Architecture == The input to VLMs consists of vision elements (images and videos) and text. The output is typically corresponding text. Generative models, which also generate vision elements (e.g., DALL-E), are beyond the scope of this article.
Below is a description of a few representative models whose architecture is known. Commercial VLMs like GPT-4V, whose designs have not been publicly disclosed, are likely based on similar concepts.
=== LLaVA 1.0 === LLaVA (Large Language and Vision Assistant)<ref name=":0" /> 1.0 is a simple model which captures some of the main concepts of open-source VLMs. The input to the model is an image and an accompanying textual language instruction.
(Figure: Architecture of LLaVA 1.0)
==== Language model backbone ==== Conceptually, the design is built around an off-the-shelf foundation LLM (a fine-tuned variant of Llama<ref>{{Cite web |title=Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality {{!}} LMSYS Org |url=https://lmsys.org/blog/2023-03-30-vicuna |access-date=2025-10-25 |website=lmsys.org |language=en}}</ref> called Vicuna), with components patched on to support the image inputs.
LLaVA borrows the tokenizer and the transformer modules (including their weights) from Vicuna, and uses them to handle the accompanying text. Recall that in a legacy (non-VLM) application of Vicuna, the tokenizer converts text into a stream of tokens, which are passed to the transformer module, which in turn produces a stream of response tokens. These are then converted back to text using the tokenizer.
==== Vision encoding ==== To this, LLaVA adds two components, to support image inputs:
* '''Vision Encoder:''' This is constructed from an off-the-shelf, separately trained CLIP model (specifically, the ViT-L/14 variant) from OpenAI. The vision encoder converts the image into an array of embedding vectors (more on this below), which encode useful information about the image. This information cannot be used directly by the LLM, because the LLM is designed to receive tokens, which have different dimensions. Furthermore, being an off-the-shelf LLM, Vicuna was not trained to recognize and respond to such information.
* '''Projection''' (known elsewhere as a '''bridge''')''':''' This module links the vision encoder with the LLM. It is a simple matrix of trainable parameters that converts the dimensions of the vision encoder outputs and can be trained (see below) to make them useful to the LLM. Its outputs are called '''image tokens'''.
The image tokens are prepended to the text tokens and processed by the LLM exactly as ordinary text tokens, yielding the final response.
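The data flow described above can be illustrated with a minimal NumPy sketch. All sizes and random arrays here are assumptions chosen for readability, and the names (`W_proj`, `project`, `llm_input`) are hypothetical; the snippet only shows how a single matrix changes the width of the vision vectors so that they can be prepended to the text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, far smaller than in a real model).
num_patches = 16      # number of vectors produced by the vision encoder
d_vision = 8          # width of each vision-encoder output vector
d_model = 12          # embedding width expected by the language model
num_text_tokens = 5   # length of the tokenized instruction

# Random stand-ins for the frozen vision encoder output and the text embeddings.
vision_features = rng.normal(size=(num_patches, d_vision))
text_embeddings = rng.normal(size=(num_text_tokens, d_model))

# The projection: one trainable matrix mapping d_vision -> d_model.
W_proj = rng.normal(size=(d_vision, d_model)) * 0.02


def project(features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Map vision-encoder vectors into the language model's embedding space."""
    return features @ weight


image_tokens = project(vision_features, W_proj)

# Image tokens are prepended to the text tokens; the LLM sees one sequence.
llm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (21, 12)
```

Because the projection is just a matrix product, it is the cheapest possible bridge between the two modules, which is consistent with the model being described as simple.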
A simple modification to the way the CLIP ViT-L/14 vision encoder is used yields more effective encoded vectors. As that module is a vision transformer, a straightforward application would have used the class token at the output of its last transformer layer (see vision transformer class) as a single vector output. LLaVA 1.0, however, uses the grid (non-class) tokens at the output of the previous (penultimate) layer, to produce multiple vector outputs. The grid tokens correspond to spatial patches in the input image and thus capture finer-grained information.
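The difference between the two choices of encoder output can be sketched as follows. The layer outputs are random stand-ins, the layer count and grid size are hypothetical, and the sketch assumes the common vision transformer convention that the class token sits at index 0 of each layer's output.

```python
import numpy as np

rng = np.random.default_rng(0)

num_layers = 4    # hypothetical; a real ViT has many more layers
grid_side = 3     # hypothetical 3x3 grid of image patches
d_vision = 8      # hypothetical embedding width

# hidden_states[k] stands in for the output of layer k:
# one class token (assumed at index 0) followed by the grid tokens.
hidden_states = [
    rng.normal(size=(1 + grid_side * grid_side, d_vision))
    for _ in range(num_layers)
]


def class_token_feature(states):
    """Straightforward choice: the class token of the last layer (one vector)."""
    return states[-1][0]


def grid_token_features(states):
    """Choice described for LLaVA 1.0: grid tokens of the penultimate layer."""
    return states[-2][1:]


print(class_token_feature(hidden_states).shape)  # (8,)
print(grid_token_features(hidden_states).shape)  # (9, 8)
```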
==== Training ==== Training was required to align the modules so that they could be combined into a single model. In VLM terminology, this step is referred to as '''instruction tuning'''. LLaVA 1.0 achieved this in two stages.
'''Stage 1''' focused on preliminary alignment of the projection layer. Only the weights of that module were trained, with those of the other modules being frozen. The dataset was a subset of the CC3M<ref>{{Cite journal |last1=Sharma |first1=Piyush |last2=Ding |first2=Nan |last3=Goodman |first3=Sebastian |last4=Soricut |first4=Radu |date=2018 |editor-last=Gurevych |editor-first=Iryna |editor2-last=Miyao |editor2-first=Yusuke |title=Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning |url=https://aclanthology.org/P18-1238/ |journal=Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Melbourne, Australia |publisher=Association for Computational Linguistics |pages=2556–2565 |doi=10.18653/v1/P18-1238}}</ref> dataset of image-captioning pairs. This dataset was small (595,000 pairs), and limited in its scope, containing only simple image-caption pairs.
'''Stage 2''' focused on a more elaborate training of both the projection layer and the LLM. The vision encoder remained frozen. A rich training dataset (LLaVA-Instruct-158K<ref>{{Cite web |date=2025-06-06 |title=liuhaotian/LLaVA-Instruct-150K · Datasets at Hugging Face |url=https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K |access-date=2025-10-25 |website=huggingface.co}}</ref>) of image-text pairs was produced for this stage, by harnessing a text-only LLM (GPT-4) to convert the simple captions of image-caption pairs (from the COCO dataset<ref name=":3" />) into elaborate conversation-style prompts.
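The two-stage schedule amounts to a choice of which modules receive gradient updates. The following sketch records that choice as plain data; the module names and the `frozen_modules` helper are hypothetical and are not taken from the LLaVA code base.

```python
# Hypothetical module names for the three parts of the model.
MODULES = ("vision_encoder", "projection", "llm")

# Which modules are trained in each stage, as described in the text.
TRAINABLE = {
    "stage_1": {"projection"},
    "stage_2": {"projection", "llm"},
}


def frozen_modules(stage: str) -> list[str]:
    """Return the modules whose weights stay fixed during the given stage."""
    return [name for name in MODULES if name not in TRAINABLE[stage]]


for stage in ("stage_1", "stage_2"):
    print(stage, "frozen:", frozen_modules(stage))
```

In a deep-learning framework, the same table would typically drive which parameters are passed to the optimizer.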
Subsequent versions of LLaVA introduced several improvements over LLaVA 1.0. Some notable conceptual improvements include the replacement in LLaVA 1.5<ref>{{cite arXiv |last1=Liu |first1=Haotian |title=Improved Baselines with Visual Instruction Tuning |date=2024-05-15 |eprint=2310.03744 |last2=Li |first2=Chunyuan |last3=Li |first3=Yuheng |last4=Lee |first4=Yong Jae |class=cs.CV }}</ref> of the simple projection module with a more elaborate MLP. LLaVA-NeXT<ref>{{cite arXiv |last1=Li |first1=Feng |title=LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models |date=2024-07-28 |eprint=2407.07895 |last2=Zhang |first2=Renrui |last3=Zhang |first3=Hao |last4=Zhang |first4=Yuanhan |last5=Li |first5=Bo |last6=Li |first6=Wei |last7=Ma |first7=Zejun |last8=Li |first8=Chunyuan |class=cs.CV }}</ref> added support for multiple image aspect ratios, beyond LLaVA 1.0's 224x224.
=== Flamingo === Predating LLaVA 1.0 by a year, Flamingo<ref name=":4" /> (DeepMind, 2022) involves a more elaborate design than LLaVA. Among its benefits are support for multiple images in a single conversation, and support for video.
(Figure: Architecture of the Flamingo VLM)
Architecturally, the design involves a more tightly coupled integration between the language and vision modules, and a ''perceiver-resampler'' module (described below).
==== LLM and Vision Backbones ==== Like LLaVA, Flamingo begins with an independently designed LLM and vision encoder, for text analysis and image embedding, respectively. Both are pre-trained for their narrow purposes, independently of their final utility as components of Flamingo. Furthermore, as components, their weights remain frozen in the course of joint training (see below).
Flamingo uses DeepMind's off-the-shelf Chinchilla as its LLM backbone. For the vision encoder, its designers opted for a non-transformer design (the ResNet-based NFNet-F6<ref>{{cite arXiv |last1=Brock |first1=Andrew |title=High-Performance Large-Scale Image Recognition Without Normalization |date=2021-02-11 |eprint=2102.06171 |last2=De |first2=Soham |last3=Smith |first3=Samuel L. |last4=Simonyan |first4=Karen |class=cs.CV }}</ref>). They trained this using a CLIP-style contrastive loss on image-caption pairs from the ALIGN<ref>{{cite arXiv |last1=Jia |first1=Chao |title=Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |date=2021-06-11 |eprint=2102.05918 |last2=Yang |first2=Yinfei |last3=Xia |first3=Ye |last4=Chen |first4=Yi-Ting |last5=Parekh |first5=Zarana |last6=Pham |first6=Hieu |last7=Le |first7=Quoc V. |last8=Sung |first8=Yunhsuan |last9=Li |first9=Zhen |class=cs.CV }}</ref> dataset and a specially curated dataset called LTIP (Long Text & Image Pairs).
The vision encoder takes single images as inputs (more on videos below) and produces a two-dimensional grid of feature vectors.
==== Perceiver-Resampler ==== The perceiver-resampler<ref>{{cite arXiv |last1=Jaegle |first1=Andrew |title=Perceiver: General Perception with Iterative Attention |date=2021-06-23 |eprint=2103.03206 |last2=Gimeno |first2=Felix |last3=Brock |first3=Andrew |last4=Zisserman |first4=Andrew |last5=Vinyals |first5=Oriol |last6=Carreira |first6=Joao |class=cs.CV }}</ref> component plays a key role in supporting video and a variable number of images at the Flamingo input.
Multiple consecutive images (one or more) are first fed one-by-one into the vision encoder, producing a three-dimensional grid of feature vectors. Videos are converted into a sequence of images by sampling at a rate of 1 frame per second. The resulting grid is flattened into a long, variable-size array of feature vectors.
The perceiver-resampler converts this into a short, ''fixed-length'' array of tokens. It uses a design that is based on cross-attention between a fixed number of artificial, predetermined query vectors (whose values are determined by training), and (key,value) pairs derived from the array of feature vectors.
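The core idea, a fixed set of learned queries cross-attending to a variable-length array of features, can be sketched in a few lines of NumPy. This is a single-layer simplification under stated assumptions: the weights are random rather than trained, all sizes are hypothetical, and the names (`latent_queries`, `resample`) are illustrative. The actual module stacks several such layers with additional components.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


d = 8             # feature width (hypothetical)
num_queries = 4   # fixed output length (hypothetical)

# These would be learned during training; random stand-ins here.
latent_queries = rng.normal(size=(num_queries, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))


def resample(features: np.ndarray) -> np.ndarray:
    """Cross-attend the fixed queries to a variable-length feature array."""
    keys = features @ W_k
    values = features @ W_v
    scores = latent_queries @ keys.T / np.sqrt(d)   # (num_queries, n)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ values                         # (num_queries, d)


# Flattened features from 2 frames and from 7 frames of a 3x3 grid.
short_clip = rng.normal(size=(2 * 9, d))
long_clip = rng.normal(size=(7 * 9, d))
print(resample(short_clip).shape, resample(long_clip).shape)  # (4, 8) (4, 8)
```

The output length depends only on the number of queries, which is why inputs of any length map to the same fixed-size array.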
Note that in this context, the consecutive images are assumed to be contiguous, without intervening text. The general case is discussed below.
==== Gated cross-attention/dense blocks ==== These are multiple blocks (see the figure above) that play a role parallel to LLaVA's Projection module, serving as an interface between the vision and text processing modules. Their design, however, is more entangled with the language model.
(Figure: Gated cross-attention and dense block)
Specifically, between select transformer blocks of the language model, Flamingo inserts these cross-attention-and-dense blocks. These blocks resemble the decoder blocks of encoder-decoder transformer architectures. That is, their queries are obtained from the preceding legacy self-attention transformer block of the backbone LLM. Their keys and values are derived from the vision feature vectors. Their outputs are forwarded to the subsequent backbone LLM block. They also include skip connections.
One important modification to the added blocks, relative to the blocks of encoder-decoder transformers, is the inclusion of ''tanh gating''. These small modules multiply their inputs by a trainable scalar weight in the interval (-1, 1), specific to each such block. These weights modulate the impact of the cross-attention-dense block on the text generation process. They are initialized at zero at the beginning of training, when the weights of the other modules of the block are still untrained and random. As training progresses, their values gradually increase. These gates have a crucial role in ensuring training stability.
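The effect of the gate can be shown with a short sketch. It assumes, as the name suggests, that the scalar in (-1, 1) is obtained by applying tanh to a trainable parameter (called `alpha` here, a hypothetical name) and that the gated output is added to the skip connection.

```python
import numpy as np


def gated_residual(x: np.ndarray, block_output: np.ndarray, alpha: float) -> np.ndarray:
    """Skip connection plus the block's output scaled by tanh(alpha)."""
    return x + np.tanh(alpha) * block_output


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))             # text-side activations (stand-in)
block_output = rng.normal(size=(5, 8))  # output of an untrained added block

# At initialization alpha = 0, so tanh(alpha) = 0 and the block has no effect:
# the frozen language model behaves exactly as it did before the blocks were added.
at_init = gated_residual(x, block_output, alpha=0.0)
print(np.allclose(at_init, x))  # True

# As alpha moves away from zero, the visual signal is mixed in gradually.
later = gated_residual(x, block_output, alpha=0.5)
```

Starting from an exact identity is what protects the pretrained language model from the random outputs of the new blocks early in training.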
==== Chunking ==== To support interleaved images and text sequences, Flamingo introduced a simple adaptation which arguably increases performance for in-context (few-shot) learning. Specifically, they break the input stream into chunks, each of which contains a single vision input (image, contiguous sequence of images, or video). When applying the above-mentioned cross-attention between text and visual features, text tokens are only allowed to attend to the vision input within their chunk. Other vision inputs are masked out.
Note that text tokens are still indirectly influenced by ''all'' vision inputs, via the intra-text self-attention.
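The masking rule can be expressed as a boolean matrix. In this sketch, each text token and each visual token carries the index of the chunk it belongs to; the chunk assignment and the `chunk_mask` helper are hypothetical and serve only to illustrate the rule.

```python
import numpy as np


def chunk_mask(text_chunk_ids, vision_chunk_ids) -> np.ndarray:
    """Boolean mask of shape (num_text_tokens, num_visual_tokens).

    Entry [i, j] is True when text token i may cross-attend to visual
    token j, that is, when both belong to the same chunk.
    """
    text = np.asarray(text_chunk_ids)[:, None]
    vision = np.asarray(vision_chunk_ids)[None, :]
    return text == vision


# Two chunks: chunk 0 holds three text tokens, chunk 1 holds two.
# Each chunk's vision input contributes two visual tokens.
mask = chunk_mask([0, 0, 0, 1, 1], [0, 0, 1, 1])
print(mask.astype(int))
```

In an attention implementation, scores at positions where the mask is False would be set to a large negative value before the softmax.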
==== Training ==== During training, the language and vision backbones are frozen (as noted above). Training used three datasets: the LTIP dataset mentioned above, a curated dataset of video-text pairs (called VTP), and a massive dataset of interleaved text-image sequences, derived from HTML documents (MultiModal MassiveWeb - M3W).
=== Qwen2-VL === Qwen2-VL<ref>{{cite arXiv |last1=Wang |first1=Peng |title=Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |date=2024-10-03 |eprint=2409.12191 |last2=Bai |first2=Shuai |last3=Tan |first3=Sinan |last4=Wang |first4=Shijie |last5=Fan |first5=Zhihao |last6=Bai |first6=Jinze |last7=Chen |first7=Keqin |last8=Liu |first8=Xuejing |last9=Wang |first9=Jialin |class=cs.CV }}</ref> by Alibaba (2024) has a conceptually simple architecture that also provides functionality and flexibility exceeding Flamingo.
Like LLaVA 1.0 and Flamingo, it begins with a backbone language model (Qwen2) and a vision encoder (DFN<ref>{{cite arXiv |last1=Fang |first1=Alex |title=Data Filtering Networks |date=2023-11-06 |eprint=2309.17425 |last2=Jose |first2=Albin Madappally |last3=Jain |first3=Amit |last4=Schmidt |first4=Ludwig |last5=Toshev |first5=Alexander |last6=Shankar |first6=Vaishaal |class=cs.AI }}</ref> vision transformer).
Like LLaVA and unlike Flamingo, Qwen2-VL uses ''unified'' processing of vision and text data, feeding all tokens into the ''input'' of the language model, rather than injecting them into internal blocks. Tokens are then treated equally by the language model, using self-attention rather than cross-attention. To support interleaved text and vision data, and delimit streams of tokens from the latter, special tokens (vision_start and vision_end) are used.
==== Naive dynamic resolution ==== A key difference from LLaVA 1.0 and Flamingo is that the vision encoder supports ''arbitrary'' image resolutions, without first reshaping the image to a fixed shape. The number of encoded tokens is ''variable'' and depends on the image shape.
Videos are sampled at 2 frames per second to produce a stream of images, which are each encoded separately. This too contributes to the variable length of the token array.
An MLP aligns the embedding dimension of the ViT with that of the language model. It also reduces the number of image tokens by merging the vector embeddings of 2x2 adjacent patches (see ViT). Video encoding also benefits from a [https://en.wikipedia.org/wiki/Convolutional_layer#Convolution 3D convolution] that also operates on the temporal dimension.
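The following sketch illustrates how merging 2x2 neighbouring patches reduces the token count by a factor of four while the token count still varies with image shape. It is a simplified illustration: a single random matrix `W` stands in for the MLP, the widths are hypothetical, and the grid dimensions are assumed to be even.

```python
import numpy as np

rng = np.random.default_rng(0)


def merge_2x2(grid: np.ndarray) -> np.ndarray:
    """Concatenate each 2x2 block of neighbouring patch embeddings into one vector."""
    h, w, d = grid.shape
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even grid dimensions"
    blocks = grid.reshape(h // 2, 2, w // 2, 2, d)
    blocks = blocks.transpose(0, 2, 1, 3, 4)         # (h/2, w/2, 2, 2, d)
    return blocks.reshape(h // 2, w // 2, 4 * d)


d_vit, d_model = 4, 6                                # hypothetical widths
W = rng.normal(size=(4 * d_vit, d_model))            # stand-in for the MLP


def encode(grid: np.ndarray) -> np.ndarray:
    """Merge patches, then map to the language model's embedding width."""
    merged = merge_2x2(grid)
    return merged.reshape(-1, merged.shape[-1]) @ W


small = encode(rng.normal(size=(4, 6, d_vit)))       # 24 patches -> 6 tokens
large = encode(rng.normal(size=(8, 10, d_vit)))      # 80 patches -> 20 tokens
print(small.shape, large.shape)  # (6, 6) (20, 6)
```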
==== Multimodal rotary positional encoding (M-RoPE) ==== Standard positional encoding is poorly suited for vision data, especially when the data is encoded into variable-length token embeddings. Specifically, its 1-dimensional representation loses the spatial layout of images and temporal continuity of video.
Qwen2-VL uses a multidimensional variant for vision data, which it calls multimodal rotary positional encoding (M-RoPE). With this implementation, each token is assigned a triplet of indices (''i'',''x'',''y''), defined as follows. For images, ''x'' and ''y'' represent the spatial coordinates of the token in the image. ''i'' is constant for all tokens of the image, and equals the sequence number of the image within the unified input stream to the model. M-RoPE positional encoding is then constructed by interleaving separate 1-D RoPE encodings for the three indices.<ref>{{Cite web |last=tangbasky |date=2025-09-08 |title=Qwen2-VL's RoPE Variant— M-RoPE |url=https://medium.com/everyday-ai/qwen2-vls-rope-variant-m-rope-8cfcc4672ea9 |access-date=2025-10-30 |website=Everyday AI |language=en}}</ref>
Video positional encoding uses similar triplets, except that ''i'' is not constant and progresses with each image in the video stream. It thus encodes temporal location.
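The assignment of index triplets can be sketched as follows. The sketch covers only the indices, not the rotary encoding itself; the row-major token order, the zero-based coordinates, and the function names are assumptions made for illustration.

```python
def image_position_ids(index: int, height: int, width: int) -> list[tuple[int, int, int]]:
    """Triplets (i, x, y) for an image whose token grid is height x width.

    i is the same for every token of the image; x and y are the token's
    column and row in the grid (row-major order is an assumption).
    """
    return [(index, x, y) for y in range(height) for x in range(width)]


def video_position_ids(start_index: int, num_frames: int,
                       height: int, width: int) -> list[tuple[int, int, int]]:
    """Triplets for a video: i advances by one with every frame."""
    ids = []
    for frame in range(num_frames):
        ids.extend(image_position_ids(start_index + frame, height, width))
    return ids


print(image_position_ids(3, 2, 2))
# [(3, 0, 0), (3, 1, 0), (3, 0, 1), (3, 1, 1)]

video_ids = video_position_ids(0, num_frames=3, height=2, width=2)
print(sorted({i for i, _, _ in video_ids}))  # [0, 1, 2]
```

Two tokens at the same grid position in different frames share ''x'' and ''y'' and differ only in ''i'', which is how the scheme keeps spatial and temporal information separate.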
==== Training ==== Qwen2-VL is trained in a three-stage process that progressively integrates visual and linguistic understanding. In Stage 1, the vision encoder is trained, keeping the other modules frozen. In Stage 2, the entire architecture is unfrozen, and in Stage 3, the vision encoder is frozen while the language model is fine-tuned. The training dataset includes a diverse range of modalities, ultimately amounting to 1.4 trillion tokens (including encoded vision tokens). The training loss is computed over text tokens at the output of the language model.
==== Visual grounding ==== A key functionality that is enabled by multimodal positional encoding is ''visual grounding;'' namely, the ability to reason about specific objects within an image. M-RoPE's preservation of the spatial location of image tokens is essential for this.
To support grounding, much of the training data includes information on objects in images, including captions and bounding box coordinates. Training dataset preparation involves formatting this data into a standard structure, which includes special tokens (object_ref_start, object_ref_end, box_start, box_end).
== See also ==
* Vision-language-action model
* Multimodal learning
* Large language model (LLM)
* Foundation model
== References == {{reflist}}
Category:Language modeling Category:Natural language processing Category:Computer vision