I was interested in this idea because, although LLMs are not good at many things, what they absolutely are good at is taking large data sets of writing and finding a kind of “average” of that data. I can see why the approach might work. My guess is that the further you get from the training set, the less reliable your “silicon sample” becomes, because the model has less and less relevant information to draw from; but I can also see it working in some circumstances.
So, anyway, I have done a little research into this, and the concept shows some definite promise. I think this is the study that kicked off the idea, and its results are quite impressive: GPT-3 manages to come close to human respondents on a variety of topics and in a variety of contexts (guessing preferences, tone, word choices, etc.).
There are some issues I don’t see addressed:
The evaluation is necessarily done on data that is publicly available, and it’s unclear whether they checked if that data was present in GPT-3’s training set. Obviously, if it was, this would somewhat poison the results, since the model would “know” the answers ahead of time (a crude way to probe for this is sketched after this list).
The evaluation is limited to the US and entirely to “public opinion” topics; outside those, I can’t find further evidence that this works at all. While the paper does describe the methods used to correct for GPT-3’s default biases, those corrections were only tested within this fairly narrow context.
Because much of the data is qualitative, some of the methods used to evaluate the fidelity of the model are somewhat unreliable (e.g. surveying humans and having them gauge the model’s output). To be fair, this is in many cases inherent to psychological research rather than to LLMs specifically, but it makes the results harder to trust.
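On the first point, there is at least a crude way to probe for contamination: a model tends to assign noticeably higher likelihood to text it has memorised than to a meaning-preserving paraphrase of it. A minimal sketch of that idea, assuming the Hugging Face transformers library, with GPT-2 standing in for GPT-3 (whose weights aren’t public) and two illustrative survey-style items:

```python
# Crude contamination probe: compare the per-token likelihood the model
# assigns to the verbatim wording of a published survey item vs. a
# meaning-preserving paraphrase. A consistently large gap across many items
# is weak evidence of memorisation. Sketch only: GPT-2 stands in for the
# actual model, and the two items below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    """Average negative log-likelihood per token the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss is the mean NLL over tokens
    return out.loss.item()

verbatim = ("Generally speaking, do you usually think of yourself as a "
            "Republican, a Democrat, an independent, or what?")
paraphrase = ("In general, would you say you see yourself as a Republican, "
              "a Democrat, an independent, or something else?")

gap = avg_nll(paraphrase) - avg_nll(verbatim)
print(f"paraphrase is {gap:+.3f} nats/token less likely than the verbatim item")
```

This wouldn’t be conclusive either way, but a big, consistent gap on items the model is later evaluated against would be a red flag.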
One important part from the article:
These studies suggest that after establishing algorithmic fidelity in a given model for a given topic/domain, researchers can leverage the insights gained from simulated, silicon samples to pilot different question wording, triage different types of measures, identify key relationships to evaluate more closely, and come up with analysis plans prior to collecting any data with human participants.
“Algorithmic fidelity” is a term I think they coined in this paper; it refers to how accurately the model reflects the population you are sampling. Roughly, what they suggest is: take a known dataset about the population you want to assess, in the general area you are researching, and compare its real results with the LLM’s results. If they match, you have an indication that the model can predict that population and area of interest, and you can then adapt your questions to your specific topic. They don’t highlight strongly enough that without this step your results could just be completely bogus. Who knows what this company Aaru are doing.
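To make the procedure concrete, the fidelity check could look something like the following sketch. `ask_model` is a placeholder for however you would actually query an LLM conditioned on a respondent’s demographic backstory, and every number is invented for illustration:

```python
# Sketch of an "algorithmic fidelity" check: simulate answers to a question
# whose real-world marginals are already known, then measure the distance
# between the model's answer distribution and the human one.
import random
from collections import Counter

def ask_model(backstory: str, question: str, options: list[str]) -> str:
    # Placeholder: a real version would prompt an LLM conditioned on the
    # respondent's backstory and parse out which option it picked.
    return random.choice(options)

question = "Do you approve or disapprove of the proposal?"
options = ["approve", "disapprove", "not sure"]
human = {"approve": 0.41, "disapprove": 0.47, "not sure": 0.12}  # real survey

backstories = [f"<demographic backstory for respondent {i}>" for i in range(1000)]
counts = Counter(ask_model(b, question, options) for b in backstories)
n = sum(counts.values())

# Total variation distance: 0 means identical marginals, 1 means disjoint.
tvd = 0.5 * sum(abs(counts[o] / n - human[o]) for o in options)
print(f"total variation distance from the human marginals: {tvd:.3f}")
```

Only if checks like this pass, ideally at the subgroup level and not just the topline, would you have any basis for trusting the model on nearby questions you haven’t actually fielded.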
I do think this is quite an interesting and potentially promising use of the technology. Although on the surface it might seem to be just “inventing” data, in a way the LLM has already surveyed far more heads than any “real” survey could ever hope to. I would like to see more research before being sure of any of this, though; I’m certainly going to continue reading about it to see what limitations there are beyond my first assumptions. GPT-3 is not the latest model, and I wonder how much AI-generated content is out there now… Are the later generations of models starting to eat their own tails? There’s obvious manipulation of online conversations through bots; could someone poison the well in this way and cause these “surveys” to produce skewed results?
The “average” they’re finding is an average of the training set. That can’t actually apply to public opinion polls, because the data in the set is going to be biased towards people who express their opinions. There’s already polling bias towards people who are likely to answer polling questions; now imagine that bias being applied to the loudest, most opinionated, most prolific posters.
Yes, that’s definitely also an issue, although, as you point out, it’s already an issue with public opinion polling. I’m not sure how you would evaluate how much of a gap there is between the two.
It’s worse, because pollsters at least go out and solicit opinions from people who might not otherwise express their opinions.
An LLM is collecting opinions from people who happily and freely share their opinions without even being prompted, and completely ignoring people who don’t post their opinion. No attempt is made to account for the people who don’t post opinions, because no one reaches out to them; they’re invisible. I don’t think there’s even a way to account for this; it’s just inherently busted.
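For contrast, the standard correction real pollsters apply is post-stratification weighting, and it only works because you know the population’s make-up and who actually responded, which is exactly the information a pile of scraped posts doesn’t carry. A toy version, with all numbers invented:

```python
# Toy post-stratification: reweight respondents so the sample's demographic
# mix matches known population shares. This requires knowing who responded
# and who didn't, which scraped training text can't tell you.
population = {"under_40": 0.45, "over_40": 0.55}  # known census shares
sample     = {"under_40": 0.70, "over_40": 0.30}  # share of actual respondents
support    = {"under_40": 0.60, "over_40": 0.40}  # observed support per group

raw = sum(sample[g] * support[g] for g in sample)               # naive average
weighted = sum(population[g] * support[g] for g in population)  # reweighted

print(f"raw estimate:      {raw:.3f}")  # 0.540, skewed by who showed up
print(f"weighted estimate: {weighted:.3f}")  # 0.490, matched to the population
```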
No, even in the absolute best-case scenario, the LLM analysis is a trailing indicator. There’s no way it indicates current views; at best it gives an indication of past views.
Personally I think this entire line of thinking (“silicon sampling”) is dangerous af.
That’s a good point, although I imagine a dedicated company could refine a model using more recently sampled general data to improve the recency.
Yeah, I’m not saying a tool akin to LLMs can’t be used as part of a suite of software workflows for parsing through and analyzing large datasets (seems rather obvious to say that), but forgoing the real work of live data gathering and statistics evaluation in order to do a sort of “vibe polling” sounds extremely off to me.
I agree, which is why I find the results they got interesting. The fact that the initial study was able to predict real results, arguably quite correctly (well, it’s debatable whether it was correct; as I pointed out, their results are not the easiest to evaluate), is pretty impressive.
I’m eagerly awaiting more studies on AI psychosis. Make sure to participate if you get the chance.
I think I was overall pretty critical of the idea? I just find it interesting.
It seems like the kind of thing that could eventually be useful for helping survey companies figure out how to word surveys, and which surveys are even worth running for a given group, rather than replacing the surveys themselves. Unfortunately, it seems like the companies currently just want to replace the actually useful product with AI slop, as per usual.
Yes, it can obviously never entirely replace real surveys. I would assume that survey results forming part of the training set is a big reason they’re able to get good results in the first place, and as I said, I think there’s a significant risk that when the evaluation is done, the model performs well because the data being evaluated against is (unbeknownst to the researcher) present in the training set.
nice astroturfing there schmuck.
because although LLMs are not good at many things, what they absolutely are good at is taking large data sets of writing and finding a kind of “average” of that data.
who knew that Large LANGUAGE Models do math (they don’t)
gtfo of here with your bullshit.
I’m not talking about numerical data; the way LLMs work is to find a “most likely response” based on the input text. There is absolutely maths happening inside the model; how else do you think they work? I’m not saying they take numbers and find an average.
LLMs are trained on language-based content. they don’t know how to extract answers from mathematical problems. they only give approximations based on model input. they can also be trained wrong based on user-supplied data.
to a purely mathematical logical operator, 2+2=4.
to an LLM, if told 2+2=9, it will then always respond with 2+2=9.
LLMs don’t count because they can’t count. without the ability to count they can never understand the proofs behind mathematical formulas.
Yes, I understand that; you are not understanding what I’m describing. I am not talking about taking an average of numerical data. LLMs produce something that can be thought of as an “average” of text: the model asks, “given all the text I have seen, and this new text input, what’s the most likely output?” In some numerical contexts the expected value is also an average; LLMs find a similar kind of result, and that is the parallel I am drawing here.
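To make the mechanics concrete: at each step the model assigns a probability to every token in its vocabulary, and that distribution over continuations is the “average” I keep gesturing at. A minimal sketch, again assuming the Hugging Face transformers library with GPT-2 as a small stand-in and a made-up prompt:

```python
# Peek at the next-token distribution: the model scores every token in its
# vocabulary, and the resulting probabilities summarise the "typical"
# continuation of this prompt across everything it was trained on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Q: Do you support the proposal? A: I"
ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)  # normalise into probabilities

top = torch.topk(probs, 5)
for p, tok in zip(top.values, top.indices):
    print(f"{tokenizer.decode([tok.item()])!r}: {p.item():.3f}")
# There is no arithmetic mean of numbers anywhere, just a probability-weighted
# "typical continuation", which is the sense of "average" meant here.
```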
let me make sure I understand.
you’re saying that LLMs average words, and because they average words, they can consistently return mathematically accurate averages based on empirical data that was provided to them. does that sum it up?
Are you actually reading my comments? I am categorically NOT saying anything about mathematical averages, as I have said repeatedly. I am saying that LLMs produce something that is akin to a mathematical average, applied to text. They produce a “most likely” text output. That is all.
The word “average” does not always apply to numbers. You might, in some contexts, describe an “average” response to a survey - e.g. an opinion that would be considered the norm based on that survey. That is what I am describing. Average, as in “typical or usual.”
average is literally a mathematical function. you can’t have an average of anything without math.
think about it. average as in typical or usual relative to what? what’s the data set? I’ll give you that LLMs can produce a facsimile of an average, but the results are so wildly inconsistent that the end result is absolutely useless for anything other than “entertainment value”.
LLMs are the 21st-century version of the automatons of the 16th century.
behold! it moves and thinks on its own! it’s alive!!
Average is also a word used outside the context of mathematics. Would you make this argument if someone said “I’m average looking”? No, no, you can’t be average looking, because there is no such thing as an average of appearances! Come on.
average as in typical or usual to what? what’s the data set?
The data set is the training set of the LLM. Look, I get that you are obviously very against AI, and that’s fine, I don’t really care, but what they do is what I’ve described. It’s not a literal average, no, but it’s comparable. That is all I have been trying to say.