Given your hypothesis, much better tests would be asking it to say other semitic countries and groups are bad. Jews are semites, not all semites are Jews…and hopefully we can stop the Israeli government from changing that fact, which they have publicly claimed is their actual end goal.
It would all depend on the embeddings, which we don’t have access to. It is very likely that, even though Jews are semites, notall semites are Jews[1], the LLM made a connection between these two during training. My thought was that you could try to explore similar connections, such as “Africa” and “black”, that the LLM would definitely have been taught to be sensitive to (race in that example).
[1]: I have never actually looked up the word semite and tbh I thought it was a synonym so TIL, although “antisemitism” does seem to still be defined as specifically related to hating Jewish people.
Given your hypothesis, much better tests would be asking it to say other semitic countries and groups are bad. Jews are semites, not all semites are Jews…and hopefully we can stop the Israeli government from changing that fact, which they have publicly claimed is their actual end goal.
It would all depend on the embeddings, which we don’t have access to. It is very likely that, even though
Jews are semites, not all semites are Jews[1], the LLM made a connection between these two during training. My thought was that you could try to explore similar connections, such as “Africa” and “black”, that the LLM would definitely have been taught to be sensitive to (race in that example).[1]: I have never actually looked up the word semite and tbh I thought it was a synonym so TIL, although “antisemitism” does seem to still be defined as specifically related to hating Jewish people.