Study shows vision-language models can’t handle queries with negation words

Imagine a radiologist examining a chest X-ray from a new patient. She notices the patient has swelling in the tissue but no enlarged heart. Looking to speed up the diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.

But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: if a patient has tissue swelling and an enlarged heart, the condition is very likely cardiac related, but with no enlarged heart there could be several underlying causes.

In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they do not understand negation: words like “no” and “doesn’t” that specify what is false or absent.

“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of this study.

The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed no better than a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.

They show that retraining a vision-language model with this dataset leads to performance improvements when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.

But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in the high-stakes settings where these models are currently used, from determining which patients receive certain treatments to identifying product defects in manufacturing.

“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now, without intensive evaluation,” says Marzyeh Ghassemi, an associate professor of electrical engineering and computer science (EECS) at MIT and senior author of the study.

Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, an MIT graduate student; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Neglecting negation

Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers called vector representations. The models use these vectors to distinguish between different images.

A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
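The idea can be made concrete with a short sketch. The snippet below, which assumes the Hugging Face transformers library and the publicly available CLIP checkpoint (not necessarily the models evaluated in the study), encodes one image and two candidate captions and scores how well each caption matches.

```python
# Minimal sketch of a dual-encoder VLM (CLIP) scoring captions against an image.
# Assumes the Hugging Face `transformers` library and the public
# "openai/clip-vit-base-patch32" checkpoint; image path and captions are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # placeholder image path
captions = [
    "tissue swelling and an enlarged heart",
    "tissue swelling but no enlarged heart",  # negated caption
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Each caption and the image are encoded into vectors; a higher score means the
# model considers the caption a better match for the image.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```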

“The captions express what is in the images; they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” says Ghassemi.

Because the image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.

To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.

For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects that are not in an image and write them into the caption. Then they tested models by prompting them with negation words to retrieve images that contain certain objects, but not others.
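As a rough illustration of that retrieval setup, the sketch below ranks a tiny, hypothetical image collection against a negated query and checks the top result against hand-made object labels; the public CLIP checkpoint stands in for the models in the study.

```python
# Sketch of negated retrieval: rank images for a query that excludes an object,
# then check (against assumed ground-truth labels) whether the top result really
# lacks that object. The collection, labels, and query are all hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical labeled collection: image path -> set of objects present.
collection = {
    "park_dog.png": {"dog", "tree"},
    "park_empty.png": {"tree", "bench"},
    "street_dog.png": {"dog", "car"},
}

query = "a park with trees but no dog"  # the retrieved image should NOT contain a dog
images = [Image.open(path) for path in collection]

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_text[0]  # similarity of the query to each image

ranking = sorted(zip(collection, sims.tolist()), key=lambda pair: -pair[1])
top_path = ranking[0][0]
print("top result:", top_path, "correct:", "dog" not in collection[top_path])
```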

For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that doesn’t appear in the image or negating an object that does appear in the image.
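A stripped-down version of that multiple-choice setup might look like the following, again using the public CLIP checkpoint and made-up captions rather than the study’s actual benchmark.

```python
# Sketch of the multiple-choice task: the model must pick the single caption that
# matches the image from closely related distractors that add or negate an object.
# The image path and captions are illustrative placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kitchen.png")  # placeholder path
options = [
    "a kitchen counter with a kettle and a toaster",   # adds an object that is absent
    "a kitchen counter with a kettle and no toaster",  # the correct, negated caption
    "a kitchen counter with no kettle",                # negates an object that is present
]

inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image[0]  # similarity of the image to each caption
print("model chose:", options[scores.argmax().item()])
```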

The models often failed at both tasks, with image retrieval performance dropping by nearly 25 percent with negated captions. When it came to answering multiple-choice questions, the best models only achieved about 39 percent accuracy, with several models performing at or even below random chance.

One reason for this failure is a shortcut the researchers call affirmation bias: VLMs ignore negation words and focus on the objects in the images instead.

“This does not just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud says.

This was consistent across every VLM they tested.

“A solvable problem”

Since VLMs aren’t typically trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.

Using a dataset of 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.
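That recaptioning step could be sketched roughly as follows; the prompt wording and model choice here are illustrative assumptions, not the ones used in the study, and the snippet assumes the OpenAI Python client.

```python
# Sketch of the data-augmentation step: ask an LLM to turn a positive caption into
# one that also names a plausible absent object, using a negation word.
# The prompt and model name are hypothetical, not the study's actual recipe.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def add_negation(caption: str) -> str:
    prompt = (
        f"Here is an image caption: '{caption}'. Name one related object that is "
        "plausibly NOT in the image, and rewrite the caption to mention its absence "
        "naturally (e.g. using 'no', 'without', or 'does not'). "
        "Return only the new caption."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of LLM
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(add_negation("a dog jumping over a fence"))
# e.g. "a dog jumping over a fence, with no people around"
```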

They had to be especially careful that these synthetic captions still read naturally, or a VLM could fail in the real world when faced with more complex captions written by humans.

They found that finetuning VLMs with their dataset led to performance gains across the board. It improved models’ image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.
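A minimal sketch of that finetuning step, assuming Hugging Face’s CLIP implementation and a hypothetical data pipeline rather than the study’s actual training recipe:

```python
# Minimal finetuning sketch: continue contrastive training of CLIP on image-caption
# pairs that include the new negated captions. Data loading, batching, and
# hyperparameters are placeholders, not the study's configuration.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def train_step(images, captions):
    # `captions` mixes ordinary captions with negated ones such as
    # "a dog jumping over a fence, with no people around".
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    loss = model(**inputs, return_loss=True).loss  # CLIP's contrastive loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```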

“But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem and others can take our solution and improve it,” Alhamoud says.

At the same time, he hopes their work encourages more users to think carefully about the problem they want to use a VLM to solve and to design some examples to test it before deployment.

In the future, the researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop further datasets that include image-caption pairs for specific applications, such as health care.
