Knowing (or at least suspecting) what you know and what you don't, in humans, is a property referred to as "meta-memory." That is, memory about your memories. No one has the faintest idea how it works. It's not at all surprising that a machine that simply builds statistically probable text has no such capacity. It has no use for it, since it doesn't look up facts, just likely strings of words.
Grok (on X.com) got it right the first time using the same prompt.
https://twitter.com/i/grok
@Ivy_League_Red: which episode of Gilligan's Island features mindreading?
@grok: The episode of "Gilligan's Island" that features mindreading is titled "Seer Gilligan" and it first aired on January 27, 1966. In this episode, Gilligan discovers an island bush with seeds that grant the ability to read minds when consumed. The situation leads to a series of humorous events as the castaways learn about each other's thoughts, causing some unexpected revelations and misunderstandings.
Interesting: Grok seems to use web search. The evidence is that if you ask about recent TV shows that likely were not in the training data, it finds those as well.
As a test, I tried "Which episode of Murdoch Mysteries is about a drunken brawl at the Starbright Lounge". It correctly found S17.E23, which aired only 3 weeks ago.
Pwned him there
1. https://chat.lmsys.org/
2. Q: Which episode of Gilligan’s Island was about mind reading?
3. A: In the Gilligan's Island episode "Seer Gilligan" (season 3, episode 15), Gilligan gains the ability to read minds after being hit on the head with a coconut. He starts hearing the private thoughts of the other castaways, which leads to various humorous situations and misunderstandings. However, his newfound ability also causes problems as the others become annoyed that he is invading their privacy. In the end, Gilligan gets hit on the head again and loses his mind-reading powers, much to everyone's relief.
It gets most other details wrong, but nailed the episode name.
Model A: claude-3-opus-20240229
¯\_(ツ)_/¯
Interesting, I get different results from lmsys than elsewhere. GPT-4 on lmsys gave me the same answer you posted here from Claude, and Llama 3 on lmsys made up an episode, whereas the Llama 3 hosted on grok.com told me there is no such episode when I ran the same prompt.
Of course, results are always a bit random.
Thank you for being for humans. Have you thought of joining the PauseAI Discord, so we can better organize for resistance?
Thank you for the invitation. My writing here at Mind Prison is about all I can manage presently. My focus is primarily one of "informed consent of the future": hoping to bring up topics and perspectives to think about, for greater awareness of where our decisions may lead.
You are welcome to share them anywhere you might find helpful.
I am pleased to share your work with them, but I hope that you might take a moment to join our Discord and see if there is anything we can coordinate on.
Resistance is futile :D
I think the math test is profoundly illustrative of the phenomenon that we're creating these things to be *less* like calculators and *more* like humans. Ask a human to estimate the product of two numbers, and their accuracy will look a lot like the chart in this article. Of course the holy grail is the convergence of the two: the AI can recognize when it is being asked for an empirical/factual answer vs. when it is asked to explain or speculate. In the former case, it should resort to using an actual calculator, a classic search algorithm, etc. rather than the LLM. And when a prompt *is* suitable for the LLM, we need to train it to be a lot better about saying, "I don't know," when it doesn't, rather than trying too hard to please us by desperately providing an answer even if it's fully invented.
I go into more detail in the linked post https://www.mindprison.cc/p/why-llms-dont-ask-for-calculators
LLMs don't naturally understand when they should use tools. We can attempt to train them to do so, but we still have to deal with the hallucinations. LLMs aren't guaranteed to use the tools properly.
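To make that concrete, here is a minimal sketch of the scaffolding that has to sit around an LLM before a "calculator tool" exists for it at all. The model call is a stub and the CALC: convention is invented purely for illustration; the point is that the harness, not the model, has to detect and validate the tool call, because nothing guarantees the model will emit one instead of a confidently invented number.

```python
import re

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (an assumption; swap in any chat API).
    A real model may ignore the instructions and answer directly -
    possibly hallucinating - instead of emitting the expected format."""
    return "CALC: 48281 * 31847"

def answer_with_calculator(question: str) -> str:
    reply = fake_llm(
        "If the question needs exact arithmetic, reply exactly "
        "'CALC: <a> * <b>'. Otherwise answer normally.\n\n" + question
    )
    # The harness, not the model, decides whether a usable tool call was made.
    match = re.fullmatch(r"CALC:\s*(\d+)\s*\*\s*(\d+)", reply.strip())
    if match:
        a, b = map(int, match.groups())
        return str(a * b)  # exact result from the "calculator"
    return reply           # model answered on its own; may be hallucinated

print(answer_with_calculator("What is 48281 * 31847?"))
```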
The hallucination problem is inescapable. I cover that in my most recent post. Also covers the problem of LLMs not being able to state "I don't know". - https://www.mindprison.cc/p/ai-hallucinations-provably-unsolvable
I have experimented with a few OpenAI models now, and while most did fail, one arguably "did better than you". And that's o3, the model built specifically to win benchmarks - so I guess this was its exact core use case. It also consumed a TON of "reasoning tokens".
As far as I could work out (but I only went by online summaries), its description of "Ring around Gilligan" is mainly correct except that it mixed up the role of the ring; the Mind Control Machine, as described, logically *has* to allow mind-reading. On a "what other..." question it did remember Seer Gilligan, but the episode number, date, and story are somewhat off.
>>> Which episode of Gilligan’s Island was about mind reading?
[MODEL RESPONSE] (67.12s)
The mind-reading story is Season 3, Episode 9 of Gilligan’s Island, titled “Ring Around Gilligan” (originally aired 24 October 1966). In it the mad scientist Dr. Boris Balinkoff turns up with a special ring that can pick up and transmit brain waves, letting its wearer read (and even control) other people’s thoughts.
>>> Which other episode of Gilligan's Island was about mind reading?
[MODEL RESPONSE] (66.88s)
Besides “Ring Around Gilligan,” the series used the mind-reading idea once before in:
• “Seer Gilligan”
– Season 2, Episode 18 (first aired 13 January 1966)
After a minor accident in a cave, Gilligan suddenly begins hearing the other castaways’ thoughts and can even predict events before they happen. The islanders first try to exploit his new talent, but the power soon causes more trouble than help—and, of course, wears off by the end of the episode.
Possible theory: the makers of o3 saw this article, someone was fond of the show and thought they could one-up you, but instead of posting a comment here, they fine-tuned the upcoming "showboating" model, now known as o3.
Thanks for your experiment.
Yes, other than getting the title "Seer Gilligan" right, the details in the episode summaries are hallucinated.
Nonetheless, it still demonstrates how models struggle with sparse data.
Thinking you know what a black-box system 'can never' do is peak hubris, and it should be obvious to all thinking people that it's illogical.
That was not the implication or relevance of the phenomenon.
Note, the inability of LLMs to recognize what they do not know is a current topic of research. It is an acknowledged problem in the space - https://aclanthology.org/2023.findings-acl.551.pdf
Yet this is how you understand them:
https://www.strangeloopcanon.com/p/what-can-llms-never-do
“Dig a tunnel, but don’t use any tools except a spoon.” 🤡
I think the math test is profoundly illustrative of the phenomenon that we're creating these things to be *less* like calculators and *more* like humans. Ask a human to estimate the product of two numbers, and their accuracy will look a lot like the chart in this article. Of course the holy grail is the convergence of the two: the AI can recognize when it is being asked for an empirical/factual answer vs. when it is asked to explain or speculate. In the former case, it should resort to using an actual calculator, a classic search algorithm, etc. rather than the LLM. And when a prompt *is* suitable for the LLM, we need to train it to be a lot better about saying, "I don't know," when it doesn't, rather than trying too hard to please us by desperately providing an answer even if it's fully invented.
You mean like how a human would use a tool? Yeah, AI can do that too, by searching the web or using an MCP (for example), but he's expressly forbidding them from doing that to get the right answer.
“Build me a skyscraper, but use your bare hands.”
They are not forbidden to use tools. Using tools requires explicit training.
Further elaboration of the math test here - https://www.mindprison.cc/p/why-llms-dont-ask-for-calculators
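For anyone who wants to see the shape of that accuracy curve for themselves, below is a rough sketch of one way to measure it. This is an illustration, not the methodology behind the chart in the article: the model call is a stub that fakes the typical "right magnitude, wrong low digits" failure, and the prompt format and scoring are assumptions. Replace the stub with a real API call to test an actual model.

```python
import random

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (an assumption; swap in an actual API).
    Here it fakes a common failure mode: roughly the right magnitude,
    wrong low-order digits."""
    a, b = map(int, prompt.split("*"))
    exact = a * b
    keep = 3  # pretend only the leading digits are reliable
    return str(round(exact, -max(0, len(str(exact)) - keep)))

def mean_relative_error(digits: int, trials: int = 200) -> float:
    """Ask for products of random n-digit numbers and score the replies."""
    errors = []
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        try:
            guess = int(fake_llm(f"{a} * {b}"))
        except ValueError:
            guess = 0  # a non-numeric reply counts as completely wrong
        errors.append(abs(guess - a * b) / (a * b))
    return sum(errors) / trials

for d in range(2, 8):
    print(f"{d}-digit operands: mean relative error {mean_relative_error(d):.4f}")
```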
Why no web search for the Gilligan’s Island question?
As for calculators, some LLMs will write code to calculate things, often even getting the right answer.
The Gilligan's Island question is a demonstration of how LLMs struggle to surface all of their training data. That is why web search is excluded here.
The calculator example points out how LLMs are unaware of their own capabilities. The point is that the LLM doesn't write the code, or ask for a tool, even when it has the capability to do so.
Many argue that LLMs have true intelligence and reason the same way humans do. These examples demonstrate that this is not correct.
The Gilligan’s Island question is a ridiculous question. Why would anyone fine-tune a model to prioritize random 60s TV trivia? This could easily be done, but virtually all available LLMs are tuned to prioritize useful information.
LLMs do write code and solve problems and are very aware of tools at their disposal. Are they perfect? No, but you are badly misrepresenting the facts.
What is “true intelligence”?
LLMs clearly have “some kind of intelligence”, exactly what kind is debatable. LLMs are clearly better at some things than humans and worse at others. Who is claiming LLMs are exactly like people?
You are asking questions here that I have already answered in great detail across many articles. Clearly you have not read what I've pointed you to, or you wish to misrepresent it. It seems you have your opinion, and further conversation is unlikely to be useful.