A new study led by Nikta Gohari Sadr finds that major AI language models, including GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and the Persian‑tuned Dorna, perform poorly on the Persian cultural practice of taarof, handling only 34 to 42 percent of scenarios correctly, compared with an 82 percent success rate for native speakers. The researchers introduced TAAROFBENCH, a benchmark that tests AI systems on the nuanced give‑and‑take of polite refusals and insistence. The findings highlight a gap between Western‑centric AI behavior and the expectations of Persian speakers, raising concerns about cultural missteps in global AI applications.