Project TOYOTA_SD-VLM – SD-VLM - Structured Decoding VLMs for Visual Agent Intelligence

Basic data

Acronym:
TOYOTA_SD-VLM
Title:
SD-VLM - Structured Decoding VLMs for Visual Agent Intelligence
Duration:
01/01/2026 to 31/12/2026
Abstract / short description:
Vision-Language Models (VLMs) have gained significant interest in recent years. While they were
initially targeted mainly at captioning, they have since evolved considerably and nowadays play
an indispensable role in, e.g., visual question answering (VQA) [1], summarization [2], and many more
downstream tasks [3]. They have further shown remarkable capabilities in, e.g., detection, referential
expression localization, few-shot learning, as well as other computer vision tasks. In this context,
one characteristic that has recently gained more traction is their ability to produce not only natural
language responses but also structured output [4, 5], known as structured or constrained decoding, e.g. in the form of
bounding boxes. In this case the answer to a question would not be a regular sentence, but a structured
decoding, e.g. a bounding box denoted by “ul x, ul y, lr x, lr y”, i.e. the upper-left and lower-right
coordinates of the bounding box [6].
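As a minimal illustrative sketch (not part of the project plan itself), such a structured decoding could be consumed downstream as follows; the example answer string and the helper name are assumptions for illustration:

```python
import re

def parse_bbox(decoding: str):
    """Parse a structured decoding of the form "ul_x, ul_y, lr_x, lr_y"
    (upper-left and lower-right corner coordinates) into a 4-tuple of floats.

    Raises ValueError if the answer does not follow the expected format,
    which is exactly the adherence problem discussed further below.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", decoding)
    if len(numbers) != 4:
        raise ValueError(f"Expected 4 coordinates, got: {decoding!r}")
    ul_x, ul_y, lr_x, lr_y = map(float, numbers)
    return ul_x, ul_y, lr_x, lr_y

# Hypothetical VLM answer to "Where is the cat? Answer as ul_x, ul_y, lr_x, lr_y."
print(parse_bbox("12, 34, 240, 198"))  # -> (12.0, 34.0, 240.0, 198.0)
```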
Such output can be useful as it allows the training of such a model to be directly supervised for regular vision
tasks such as localization, tracking, etc. More importantly, it allows structured
decodings to be generated through natural language interaction, which sets this scenario apart from classical computer
vision pipelines. This behavior is of particular interest as it allows VLMs to act as a form of generalized
vision model, i.e. a model that can handle multiple different vision tasks, depending on the prompt [7].
This differs from classical computer vision models, which were usually designed and
implemented to fulfill one specific task and to generate one specific form of output. In contrast, a
VLM can output results for a plethora of tasks and in any desired format (CSV, JSON, etc.), depending on
how it is prompted. This is especially relevant for the upcoming generation of reasoning and AI
agent-based systems [8], and in particular in the context of LLM code generation: in such a setting, an agent
model would not only prompt for executable code, but would also need the respective data in machine-readable
form to run the code on. A VLM able to produce structured output based on natural
language prompts could fill this gap in a visual AI agent pipeline, as it could produce tailored output to
be further processed by other systems.
The problem with enforcing such output in standardized scenarios is that it is not consistent, or that it can
diminish performance. Thus, while VLMs tend to follow instructions, they still often deviate from
the requested answer format, making it hard to evaluate them in a format-constrained way [9, 10], with smaller models
struggling more than larger ones to adhere to structured outputs consistently. Even more importantly,
it is, to the best of our knowledge, not yet extensively explored how the enforcement of a given format, e.g.
a JSON schema, impacts the overall performance of the model, and whether VLMs might, for example, identify more
objects or provide more robust answers if no specific answer format is imposed.
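To make the format-adherence question concrete, the following is a minimal sketch of how conformance to a JSON schema could be checked; the schema, the field names, and the use of the third-party jsonschema package are assumptions chosen for illustration, not a prescription of the project's evaluation methodology:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical schema the VLM is asked to follow: a list of detected objects,
# each with a label and a bounding box [ul_x, ul_y, lr_x, lr_y].
DETECTION_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "label": {"type": "string"},
            "bbox": {
                "type": "array",
                "items": {"type": "number"},
                "minItems": 4,
                "maxItems": 4,
            },
        },
        "required": ["label", "bbox"],
    },
}

def check_format_adherence(raw_answer: str) -> bool:
    """Return True iff the raw model answer is valid JSON satisfying the schema."""
    try:
        validate(instance=json.loads(raw_answer), schema=DETECTION_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# A schema-conforming answer vs. a free-form answer to the same question.
print(check_format_adherence('[{"label": "car", "bbox": [10, 20, 110, 95]}]'))  # True
print(check_format_adherence("There is a car in the lower left corner."))       # False
```

Comparing model performance with and without such a constraint (e.g., number of objects found, localization quality) is one way to probe whether format enforcement diminishes answer quality, as hypothesized above.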
Keywords:
machine learning
Artificial Intelligence (AI)

Involved staff

Managers

Department of Informatics
Faculty of Science

Other staff

Tübingen AI Center
Central cross-faculty facilities
Tübingen AI Center
Central cross-faculty facilities

Local organizational units

Tübingen AI Center
Central cross-faculty facilities
University of Tübingen

Funders

Brussels, Belgium

Cooperations

Brussels, Belgium