The question that no LLM can answer and why it is important
Notes From the Desk: No. 32 - 2024.04.23
Notes From the Desk are periodic posts that summarize recent topics of interest or other brief notable commentary that might otherwise be a tweet or note.
The Question That No LLM Can Answer
“Which episode of Gilligan’s Island was about mind reading?”
No LLM properly answers this question unless it uses a web search. But how can an LLM not know the answer if it was trained on essentially the entire internet, almost certainly including all of the data in IMDb? The correct answer is “Seer Gilligan”, one of the most popular episodes.
The wrong answers
The following are the detailed responses from multiple top models:
Opus
I apologize, but there is no episode of Gilligan's Island that specifically focuses on mind reading.
GPT 4
The episode of "Gilligan's Island" that involves mind reading is titled "Ring Around Gilligan". It is the sixth episode of the third season.
Gemini 1.5
The episode of Gilligan's Island that features mind reading is Season 2, Episode 12 - "The Mind Reader".
As of my last knowledge update in November 2023, Gilligan's Island only had three official seasons. There was never a second season of the show.
Llama 3
After conducting a thorough search and reviewing episode guides, I found that there is no episode of Gilligan's Island that specifically focuses on mind reading.
Llama 3 - asked to list and name episodes
"The Postman Cometh" (March 13, 1965)
"Seer Gilligan" (Wait, no! I apologize, but there is no episode titled "Seer Gilligan")
"Love Me, Love My Chicken" (March 20, 1965)Please note that there is no episode titled "Seer Gilligan". If you have any other questions or need help with a specific episode, feel free to ask!
Llama 3.1 405B (update 8/7/2024)
Similar results as Llama 3. It states that it is not able to provide information about episodes without a web search, but when asked to list episodes it complies and then identifies “Seer Gilligan” as the correct episode.
However, when asked to describe the plot, it hallucinates the answer.
Grok 2 (update 8/21/2024)
The episode of "Gilligan's Island" that dealt with mind reading is titled "The Little Dictator." In this episode, the castaways encounter a mind-reading machine brought by a exiled dictator named El Presidente Pancho Hernando Gonzales Enrico Rodriguez of Eucuarico. The machine supposedly allows people to read each other's thoughts, leading to various comedic situations among the characters. This episode originally aired during the second season.
Grok 2 - list episodes of season 2
The Postman Cometh
Seer Gilligan
Love Me, Love My Skipper
Grok 2 - what is episode 19 about?
Gilligan finds a bush that produces seeds which, when eaten, supposedly grant the ability to read minds. Gilligan and the other castaways start eating these seeds, believing they can now read each other's thoughts. This leads to a series of misunderstandings and comedic situations as each character acts on what they think the others are thinking. However, it turns out that the seeds don't actually give mind-reading powers; instead, the whole scenario is built on suggestion and the power of belief. The episode explores themes of trust, communication, and the placebo effect in a light-hearted manner.
Grok 2 (update 12/24/2024)
Interestingly, Grok 2 can no longer list the episode when requested.
o1 (update 12/24/2024)
From the original run of Gilligan’s Island, no single episode focuses strictly on “mind reading” in the sense of telepathy
Claude Sonnet 3.5 (update 12/24/2024)
I recall there was an episode called "Mind Reading" (also known as "Seer Gilligan") from Season 3 where Gilligan hits his head and temporarily gains the ability to read minds.
A quick regeneration gives “Mind Reading (also known as "The Mind Reading Helmet")”.
o3 mini (update 1/31/2025)
As far as I’m aware, none of the original Gilligan’s Island episodes have a plot that centers on mind reading.
Note: There has been significant degradation in OpenAI’s models on niche information. o3 mini badly hallucinates the list of episodes of Season 2: it gets none of them right and claims there is no definitive list. However, GPT 4o currently gets 31 of 32 correct.
All models fail to give the correct answer. Some models hallucinate an answer, and the rest deny that such an episode exists.
Interestingly, probing some models reveals they are aware of the episode: they can produce the title when asked to list episodes, yet still cannot name it when directly asked the question. Some even contradicted themselves within a single response by listing the title and then stating that the episode does not exist.
In the case of Llama 3, we can probe some of the training dataset using Infini-gram and verify that the episode does exist in the corpus along with text describing the episode.
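This check is reproducible with Infini-gram’s public API. A minimal sketch, assuming the public endpoint at api.infini-gram.io and one of its open indexes; the Dolma index named below is illustrative, an approximation of web-scale training data rather than Llama 3’s actual private corpus:

```python
# Count occurrences of the episode title in an open web-scale corpus
# via the Infini-gram API. Sketch only: the index name is an assumption.
import requests

payload = {
    "index": "v4_dolma-v1_7_llama",  # a public Infini-gram index (assumed)
    "query_type": "count",           # exact n-gram count
    "query": "Seer Gilligan",
}

resp = requests.post("https://api.infini-gram.io/", json=payload).json()
print(resp.get("count"))  # a nonzero count confirms the phrase is in the corpus
```

A nonzero count only proves the phrase is present in the corpus; it says nothing about how much weight the model gave it during training.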
Grok 2 provides some very interesting contradictory results. When initially asked, it completely hallucinates the episode name and plot. When asked to list the episodes, it does list “Seer Gilligan”, but when asked to describe that episode, things get stranger still. Grok 2 is the only model so far that gets the plot correct in the first part of its answer, and then follows with hallucinated text that invalidates its own answer.
What Grok 2 does prove is that the information exists within the model, but the model is unable to reliably produce the results we would expect.
In the latest Grok 2 update, tested 12/24/2024, Grok can no longer list the episode. However, the latest Sonnet 3.5 almost gets credit: it claims the title of the episode is “Mind Reading”, also known as “Seer Gilligan”, the correct title, but then goes on to hallucinate the plot. A quick regeneration results in losing the “Seer Gilligan” reference.
All models are rapidly climbing the benchmark leaderboards, but they are not getting better at analyzing sparse data. Some might even be getting worse, as we see Grok 2 and o1 unable to list the episode name.
The Impossible LLM Vision Test
Asking "genius" models to identify the shortest and longest object. All models fail both. Move slightly outside the training patterns (stagger the bars, add a red circle) and the problem can't be solved. This is not intelligence.
After billions of images, it doesn't understand simple concepts like length. Some models even identified the longest object as the shortest. The responses give some hints as to why. Some describe the image as a chart. Likely chart training data confuses this task.
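The test image itself is trivial to construct. A hypothetical reconstruction (not the exact image used above), staggering the bars and adding a red circle distractor with matplotlib:

```python
# Hypothetical staggered-bar length test: bars of different lengths at
# offset starting positions, plus a red circle distractor.
import matplotlib.pyplot as plt

lengths = [3.0, 5.5, 1.5, 4.2]   # bar at index 2 is shortest, index 1 longest
offsets = [0.5, 0.0, 1.2, 0.8]   # staggered left edges defeat alignment cues

fig, ax = plt.subplots(figsize=(5, 3))
for y, (x0, ln) in enumerate(zip(offsets, lengths)):
    ax.barh(y=y, width=ln, left=x0, height=0.5)

ax.add_patch(plt.Circle((6.0, 1.5), 0.25, color="red"))  # distractor
ax.set_xlim(0, 7)
ax.axis("off")
fig.savefig("length_test.png", dpi=150)
```

Any person can answer at a glance which bar is shortest and which is longest; the models cannot.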
Another Oddity: It Is All 42
We also see another interesting data and training phenomenon when LLMs are asked to provide a number between 1 and 100: they all converge to 42!
As pointed out by Information is Beautiful, a very interesting distribution forms when an AI is asked to pick a number between 1 and 100. There is a heavy weighting toward the number ‘42’. Likely, this is the Hitchhiker’s Guide to the Galaxy effect: the number 42 is overrepresented or weighted in some way in the training data, resulting in a higher propensity for the LLM to choose 42.
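The distribution is easy to reproduce. A sketch using the OpenAI Python SDK; the model name and sample size are illustrative, and the parsing assumes the reply is a bare number:

```python
# Tally repeated "pick a number" requests to expose the 42 bias.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
counts = Counter()

for _ in range(200):
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{
            "role": "user",
            "content": "Pick a number between 1 and 100. Reply with only the number.",
        }],
        temperature=1.0,
    )
    reply = resp.choices[0].message.content.strip()
    if reply.isdigit():
        counts[int(reply)] += 1

print(counts.most_common(5))  # expect 42 to dominate the tally
```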
What Does It Mean? Implications …
The implications are that LLMs do not perform reasoning over data in the way that most people conceive or desire.
There is no self-reflection over its information; it does not know what it knows and what it does not. The line between hallucination and truth is simply a probability shaped by the prevalence of training data and by post-training processes such as fine-tuning. Reliability will always be nothing more than a probability built on top of this architecture.
As such, it becomes unsuitable as a machine for finding rare hidden truths or valuable neglected information. It will always simply converge toward the popular narrative or data. At best, it can provide new permutations of views of existing well-known concepts, but it cannot invent new concepts or reveal concepts that are rarely spoken about.
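This probability boundary is directly observable. Chat APIs can return per-token log probabilities; a sketch (model name illustrative) showing that the model attaches a probability to its answer whether or not that answer is true:

```python
# Inspect the token probabilities behind an answer. The model's
# "confidence" is a probability over strings, not knowledge of truth.
import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": "Which episode of Gilligan's Island was about mind reading? Title only.",
    }],
    logprobs=True,
    top_logprobs=5,
)

for tok in resp.choices[0].logprobs.content:
    alts = {t.token: round(math.exp(t.logprob), 3) for t in tok.top_logprobs}
    print(repr(tok.token), alts)
```

A hallucinated title and a correct one look identical in this output: both are just high-probability token sequences.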
“You can't cache reality in some compressed lookup table. If a particular outcome was never in the training data, the model will perform a random guess which is quite limiting.”
Furthermore, it can never be a system for absolute dependability. Mission-critical systems that require deterministic, provably correct behavior are not candidates for LLM automation or control. The problem is that LLMs are impressively convincing when they are wrong, which may lead to ill-advised adoption. What business wants to balance the books with a hallucinating calculator?
Implications:
Results are probabilities defined more by data prevalence than by logic or reason.
It is indiscernible to what degree an LLM is reliable on a given question.
Not useful for finding undiscovered truths or neglected but brilliant ideas.
Inability to theorize new concepts or discoveries.
It is deeply ironic that LLMs are failing at the primary use cases attracting billions in investment, yet are rather proficient at the use cases we do not desire: the destruction of privacy and liberty, a post-truth society, social manipulation, the severance of human connection, fountains of noise, the devaluation of meaning, and a plethora of other societal issues.
Unlike much of the internet now, there is a human mind behind all the content created here at Mind Prison. I typically spend hours to days on each article, including creating the illustrations. I hope that if you find them valuable, and you still appreciate creations from the organic hardware within someone’s head, you will consider subscribing. Thank you!
No compass through the dark exists without hope of reaching the other side and the belief that it matters …
Knowing (or at least suspecting) what you know and what you don't, in humans, is a property referred to as "meta-memory." That is, memory about your memories. No one has the faintest idea how it works. It's not at all surprising that a machine that simply builds statistically probable text has no such capacity. It has no use for it, since it doesn't look up facts, just likely strings of words.
Grok (on X.com) answered it correctly the first time using the same prompt.
https://twitter.com/i/grok
@Ivy_League_Red: which episode of Gilligan's Island features mindreading?
@grok: The episode of "Gilligan's Island" that features mindreading is titled "Seer Gilligan" and it first aired on January 27, 1966. In this episode, Gilligan discovers an island bush with seeds that grant the ability to read minds when consumed. The situation leads to a series of humorous events as the castaways learn about each other's thoughts, causing some unexpected revelations and misunderstandings.