Numerous faked images and a string of startlingly inaccurate responses from Gemini and Grok are part of a tidal wave of AI slop engulfing coverage of the Iran war
There are different encoders for each modality type, such as text encoder or audio encoder.
Each modality encoder is designed to extract specific information from its corresponding input data. For example, the text encoder might extract semantic representations from text, while the image encoder extracts visual features from images.
Because the LLMs can’t even see the picture. They’re text-only. Those images are first processed by a separate vision model, and whatever question the LLM is answering is based on the description the vision model gave it - not by actually looking at the picture itself. Just another example of an uneducated user misusing the system and then complaining that it doesn’t work for a task it was never intended for.
That’s not how multimodal models work. The image is transformed into a vector embedding, just like the text is.
https://www.ionio.ai/blog/a-comprehensive-guide-to-multimodal-llms-and-how-they-work
They “see” it just like they “understand” the text. Both Grok and Gemini are cited in the article and both are multimodal in this way. They get it wrong because AIs are prone to hallucinate when they’re at the limits of their knowledge.
That’s what I was saying.
The image gets processed by a separate image encoder (or vision model), which turns it into a vector embedding. The LLM itself never sees the raw pixels - it only works with that embedding. So when it judges whether an image is real or AI-generated, it’s reasoning based on a compressed representation, not by directly “looking” at the picture.
From your link:
So the language model that’s answering back to you never saw the image - the image encoder did and it then described it back to the language model.
That’s clearly not what you were saying, and I’m still pretty sure you don’t understand. You said an image model gave the text model a description of the image. That is false. The image encoder, just like the text encoder, gives the LLM a vector embedding.
Emphasis added.
An image encoder is in no way an image model that turns an image into text like you were saying:
https://en.wikipedia.org/wiki/Vision_transformer
You’re right on the terminology - it’s an image encoder that produces a vector embedding, not a literal text description. I should have been more precise with the wording.
However, my main point still stands: the LLM itself never sees the raw image. It only receives a compressed embedding created by the vision encoder.
So what? It never sees the text you type either. You called it “text-only”, but that’s not true in any way for multimodal models. In the strictest sense, transformer models are vector embedding only. The model itself can only work with a series of vectors, so everything you give it needs to be turned into vectors, even text.
https://en.wikipedia.org/wiki/Transformer_(deep_learning)
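To make the "everything becomes vectors" point concrete, here is a minimal sketch in plain Python. Everything in it is a toy assumption for illustration: the five-word vocabulary, the random embedding table, the 2x2 patch size, and the random patch projection are all hypothetical and not taken from Grok, Gemini, or any real model. It only shows that text tokens and image patches end up in one sequence of same-width vectors.

```python
import random

random.seed(0)
DIM = 4  # toy embedding width; real models use far wider vectors

# Hypothetical toy vocabulary and embedding table (one vector per token id).
VOCAB = {"is": 0, "this": 1, "image": 2, "real": 3, "?": 4}
TOKEN_TABLE = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]

# Hypothetical linear projection from a flattened 2x2 patch (4 pixels) to DIM.
PATCH_PROJ = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(DIM)]


def embed_text(words):
    """Look up one vector per token."""
    return [TOKEN_TABLE[VOCAB[w]] for w in words]


def embed_image(pixels, patch=2):
    """Cut a grayscale image into patch x patch tiles and project each tile."""
    vectors = []
    for r in range(0, len(pixels), patch):
        for c in range(0, len(pixels[0]), patch):
            flat = [pixels[r + i][c + j] for i in range(patch) for j in range(patch)]
            vectors.append([sum(w * x for w, x in zip(row, flat)) for row in PATCH_PROJ])
    return vectors


# Made-up 4x4 grayscale image with values in [0, 1].
image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.2, 0.8, 0.9],
    [0.5, 0.5, 0.3, 0.2],
    [0.5, 0.6, 0.2, 0.1],
]

# The transformer receives one list of vectors; nothing marks which came from pixels.
sequence = embed_image(image) + embed_text(["is", "this", "image", "real", "?"])
print(len(sequence), "vectors of width", len(sequence[0]))
```

The four patch vectors and five token vectors have the same width, so downstream layers treat them as one sequence.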
What vectors you give it is entirely dependent on the application and how it was trained, and pretty much anything can be turned into a vector embedding. Anything with semantic meaning at least (otherwise you can’t make an embedding). The attention mechanism can work on literally any series of vectors with semantic meaning.
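As a rough illustration of attention not caring where its vectors came from, here is a stripped-down single-head self-attention in plain Python. It is a sketch under a stated assumption: the learned query, key, and value projections that real transformers use are left out, so the inputs serve as all three. The function names and the three sample vectors are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = the inputs.

    Assumption: no learned projections, to keep the sketch short.
    """
    dim = len(vectors[0])
    output = []
    for q in vectors:
        # Similarity of this vector to every vector in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        output.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return output


# These could be text tokens, image patches, or audio frames; the math is the same.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
attended = self_attention(mixed)
print(len(attended), len(attended[0]))
```

The function never asks what modality a vector represents, which is the point being made above.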
Saying that the model isn’t “seeing” the image just because the image was transformed into a compressed set of vectors is akin to saying a human isn’t “seeing” an image when they’re looking at a JPEG (since a JPEG is stored as a compressed set of quantized cosine-transform coefficients rather than raw pixels).
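For anyone curious about the JPEG comparison, here is a small sketch of the idea in plain Python. JPEG applies a 2D discrete cosine transform (DCT) to 8x8 blocks and quantizes the coefficients with a per-frequency table; the sketch below simplifies that to a 1D transform on one row of eight made-up pixel values with a single hypothetical step size, so treat it as an illustration of lossy-but-recognizable compression rather than the actual codec.

```python
import math

N = 8


def _scale(k):
    """Orthonormal scaling factor for coefficient k."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct(samples):
    """Orthonormal DCT-II of N samples."""
    return [
        _scale(k) * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                        for n, x in enumerate(samples))
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct() (a DCT-III with matching scaling)."""
    return [
        sum(_scale(k) * c * math.cos(math.pi * (n + 0.5) * k / N)
            for k, c in enumerate(coeffs))
        for n in range(N)
    ]


row = [52, 55, 61, 66, 70, 61, 64, 73]  # made-up pixel values
STEP = 10  # hypothetical uniform step; JPEG uses a per-frequency quantization table

# Compress: keep only coarse integer multiples of STEP for each coefficient.
quantized = [round(c / STEP) for c in dct(row)]
# Decompress: rescale and invert the transform.
restored = [round(v) for v in idct([q * STEP for q in quantized])]

print("coefficients:", quantized)
print("original:    ", row)
print("restored:    ", restored)
```

The restored row is close to the original but not identical, which is the same trade-off the comparison relies on: a compressed representation can lose detail and still carry most of the picture.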
Why not just admit that you were wrong?
Because I’m not convinced that I am. My layman’s explanation might not get every technical detail exactly right, but I think the core point is factual.
Since you really don’t believe me, here is an experiment to show how accurately a vector embedding represents the real image:
Original:
Output from the LLM:
Clearly, what the LLM is seeing is an incredibly detailed and accurate representation of the real image. Practically the same quality as a JPEG. (Not quite, as you can see in the garbled text on the tag, but nearly the same quality.)
Okay, after spending way too much time researching this, I’ve come to the conclusion that my original statement was wrong - and I finally also got my chosen AI assistant to agree with me.
My original claim: “The AI said the picture was fake because it can’t even see the picture. It only got a description from a vision model.”
My revised explanation: “It got it wrong because the image embedding it received didn’t contain the relevant information to make that distinction.”
So yes. While it’s true that the LLM can’t see the picture in the human sense of vision, that’s not what I was talking about. It can “see” the image well enough to tell the difference. It simply doesn’t know what to look for. It’s not really the LLM’s fault - it’s because the image encoder wasn’t trained on enough pictures labeled as AI-generated.