AI Agents Can Re‑Identify Anonymous Users with Notable Accuracy
Overview of the Study
Researchers explored whether artificial‑intelligence agents could move beyond traditional re‑identification attacks on structured data and instead work from unstructured, free‑text sources such as interview transcripts and social‑media comments. They prompted a large language model (LLM) to extract identity‑related signals from the text, such as personal habits, preferences, or past experiences. The agent then conducted autonomous web searches to locate candidates who matched those signals, and a final verification step confirmed that a candidate satisfied all of the extracted claims.
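The three stages described above (extract, search, verify) can be sketched in miniature. This is a toy illustration only: the function names are hypothetical, and the toy string matching and dictionary lookup stand in for the LLM extraction and autonomous web search that the study actually used.

```python
# Toy sketch of the extract -> search -> verify loop. The "public web"
# is a small in-memory index of candidate profiles; in the study, the
# extraction and search steps were performed by an LLM agent.

def extract_claims(text: str) -> set[str]:
    # Stand-in for LLM extraction of identity-related signals:
    # here we simply split pre-tagged, semicolon-separated phrases.
    return {part.strip() for part in text.split(";") if part.strip()}

def search_candidates(claims: set[str], index: dict[str, set[str]]) -> list[str]:
    # Stand-in for autonomous web search: return anyone in the toy
    # index who satisfies at least one extracted claim.
    return [name for name, facts in index.items() if claims & facts]

def verify(claims: set[str], facts: set[str]) -> bool:
    # Final verification step: a candidate matches only if every
    # extracted claim is satisfied by their known facts.
    return claims <= facts

# Toy index of publicly findable candidate profiles.
index = {
    "alice": {"runs marathons", "lives near a lake", "works in biotech"},
    "bob": {"runs marathons", "collects vinyl"},
}

claims = extract_claims("runs marathons; lives near a lake")
matches = [c for c in search_candidates(claims, index) if verify(claims, index[c])]
print(matches)  # -> ['alice']
```

Note that bob clears the loose search stage (one shared signal) but fails verification, which is exactly the role the study's final confirmation step plays.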
Experiment One: Interview Transcripts
In the first test, participants answered a questionnaire about their daily AI usage. The LLM parsed the free‑text responses, identified identity cues, and searched publicly available information for matches. The AI correctly identified 7 percent of the 125 participants, showing that even limited, vague data can be enough to pinpoint real individuals.
Experiment Two: Reddit Movie Discussions
The researchers gathered comments from the r/movies subreddit and from at least one of five smaller film‑related communities. The more movies a user had discussed, the easier the AI found it to identify them. For users who mentioned a single movie, the system identified 3.1 percent at 90 percent precision and 1.2 percent at 99 percent precision. With five to nine shared movies, the rates rose to 8.4 percent (90 percent precision) and 2.5 percent (99 percent precision). Users who discussed more than ten movies saw a significant jump, reaching 48.1 percent identification at 90 percent precision and 17 percent at 99 percent precision.
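The paired figures above (identification rate at 90 versus 99 percent precision) reflect a confidence threshold: keeping only the attacker's surest guesses raises precision but shrinks the fraction of users identified. A small illustration of that tradeoff, using made‑up confidence scores rather than anything from the study:

```python
def precision_and_coverage(predictions, threshold):
    """predictions: list of (confidence, guess_was_correct) pairs.
    Returns (precision, coverage) when only guesses at or above the
    confidence threshold are kept."""
    kept = [ok for score, ok in predictions if score >= threshold]
    if not kept:
        return None, 0.0
    precision = sum(kept) / len(kept)       # correct kept guesses / kept guesses
    coverage = len(kept) / len(predictions)  # fraction of users "identified"
    return precision, coverage

# Made-up scores for ten users: (confidence, was the guess correct?).
preds = [(0.99, True), (0.95, True), (0.90, True), (0.80, True), (0.70, False),
         (0.60, True), (0.50, False), (0.40, False), (0.30, False), (0.20, False)]

print(precision_and_coverage(preds, 0.9))  # strict threshold: precise but low coverage
print(precision_and_coverage(preds, 0.5))  # loose threshold: wider coverage, more errors
```

With the strict threshold the toy attacker is right every time but names only 30 percent of users; loosening it covers 70 percent at the cost of mistakes, mirroring the 99‑versus‑90‑percent columns in the study's results.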
Experiment Three: Large‑Scale Reddit Test
In a broader experiment, the team evaluated 5,000 Reddit users while adding 5,000 “distraction” identities—profiles that appeared only in the query set and had no true match in the candidate pool. The AI’s performance was compared against a classic Netflix‑prize‑style attack. Even with the added noise, the LLM‑based approach maintained its ability to locate correct matches, confirming the robustness of the method.
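In an evaluation like this, any match the attack claims for a "distraction" identity is necessarily a false positive, since those profiles have no true match in the pool. A hedged sketch of how such scoring might work, with hypothetical names and a toy lookup-table attack (not the study's actual method):

```python
def evaluate_with_distractors(attack, queries, pool, truth):
    """queries: ids to attack; truth maps each query to its true candidate,
    or None for a distractor with no match in the pool.
    attack(query, pool) returns a candidate id, or None to abstain."""
    tp = fp = 0
    for q in queries:
        guess = attack(q, pool)
        if guess is None:
            continue  # abstaining is the right call on a distractor
        if truth.get(q) == guess:
            tp += 1
        else:
            fp += 1  # includes any match claimed for a distractor
    return tp / (tp + fp) if tp + fp else None

# Toy attack: exact lookup in a hypothetical fingerprint table.
pool = {"u1", "u2"}
fingerprints = {"q1": "u1", "q2": "u2"}
attack = lambda q, pool: fingerprints.get(q)
truth = {"q1": "u1", "q2": "u2", "d1": None}  # d1 is a distractor

print(evaluate_with_distractors(attack, ["q1", "q2", "d1"], pool, truth))  # -> 1.0
```

The toy attack abstains on the distractor and keeps perfect precision; an attack that guessed anyway would be penalized, which is what makes the distractor set a meaningful robustness test.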
Implications and Future Outlook
Although the recall rates—especially in the first experiment—were modest, the ability of AI to deanonymize individuals from sparse, free‑form data marks a notable shift in privacy risk. The researchers emphasized that as LLM capabilities improve, the precision and recall of such deanonymization techniques are likely to increase. This raises important questions for platforms that rely on pseudonymity to protect user identities, suggesting a need for stronger safeguards against AI‑driven re‑identification.