Introduction
Imagine asking an AI assistant for a restaurant recommendation. A truly smart assistant wouldn’t just suggest a nearby spot—it would know you prefer vegan food, cozy vibes, and short wait times. But how does AI actually learn these preferences—and how well does it understand them?
Personalization is tricky. AI models don’t inherently know your job, favorite travel spots, or coffee order—they have to learn from your conversations and app usage. But measuring how well they do this is a challenge. Most research and commercial AI systems rely on retrieval-augmented generation (RAG) techniques, where an AI retrieves relevant user data before responding. However, the effectiveness of these systems remains unclear due to the lack of standardized evaluation benchmarks.
In this blog, we’ll dive into the challenges of AI personalization, why current systems fall short, and how PersonaBench helps bridge the gap—paving the way for smarter, more reliable AI assistants.
The complexity of personalization
Noisy Data
One of the primary challenges in AI personalization is the nature of user data, which is often noisy: it contains irrelevant or misleading information. For example, a significant portion of users' daily conversations may consist of casual chitchat, discussions about current events, or work-related topics that are not directly relevant to personal information. Since these types of exchanges often make up the majority of conversation logs, they increase the difficulty of accurately retrieving meaningful personal details.
Fragmented User Information
Valuable personal details may be fragmented across multiple sessions in user documents. For instance, a user might mention their preferred movie genre in different conversation sessions with different people or reveal it through their movie ticket purchase history. This fragmentation makes it challenging for AI models to piece together a coherent understanding of the user’s preferences and attributes.
Dynamic User Attributes
Personal attributes can change over time. For example, a user’s favorite restaurant might change after trying a new place, or their job might change, affecting their daily routine and preferences. AI models need to be able to adapt to these changes and update their understanding of the user accordingly.
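One simple way to picture this requirement is to store every extracted attribute with a timestamp and resolve each attribute to its most recent value. The sketch below is a minimal illustration of that idea; the attribute names and values are hypothetical and not part of PersonaBench.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttributeObservation:
    """One piece of personal information extracted from a user document."""
    attribute: str         # e.g., "favorite_restaurant"
    value: str
    observed_at: datetime  # timestamp of the source session

def resolve_profile(observations: list[AttributeObservation]) -> dict[str, str]:
    """Resolve each attribute to its most recently observed value."""
    latest: dict[str, AttributeObservation] = {}
    for obs in observations:
        current = latest.get(obs.attribute)
        if current is None or obs.observed_at > current.observed_at:
            latest[obs.attribute] = obs
    return {attr: obs.value for attr, obs in latest.items()}

# The user's favorite restaurant changed after trying a new place.
history = [
    AttributeObservation("favorite_restaurant", "Old Town Diner", datetime(2023, 3, 1)),
    AttributeObservation("favorite_restaurant", "Green Leaf Bistro", datetime(2024, 6, 15)),
]
print(resolve_profile(history))  # {'favorite_restaurant': 'Green Leaf Bistro'}
```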
The role of retrieval-augmented generation
Retrieval-augmented generation (RAG) has emerged as a popular solution to these challenges. In RAG, a retriever identifies user data relevant to the query, which is then combined with the query and passed to the language model for response generation (Figure 1). This approach aims to enhance the model's ability to generate personalized responses by providing it with context-specific information.
However, the effectiveness of RAG systems is difficult to measure. Most user data is private, and creating a large, diverse dataset with ground-truth personal information is challenging due to privacy concerns.
Figure 1. Example of an AI assistant generating personalized responses
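As a rough sketch of the pipeline in Figure 1: retrieve the user documents most relevant to the query, prepend them to the prompt, and let the LLM answer. The word-overlap scorer, prompt format, and `llm` callable below are placeholders, not the retrievers or models evaluated in PersonaBench.

```python
def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Score each document by word overlap with the query (a crude stand-in
    for a dense retriever) and return the top-k documents."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def personalized_answer(query: str, documents: list[str], llm) -> str:
    """Combine the retrieved user context with the query and ask the LLM."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"User context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)  # `llm` is any callable that maps a prompt to text
```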
However, retrieving and interpreting personal information is inherently complex. As discussed above, several factors contribute to this complexity, including the following:
- User data is often noisy.
- Valuable personal details may be fragmented.
- Personal attributes can change over time.
When such systems are deployed, their true effectiveness remains unclear. This uncertainty primarily stems from the lack of publicly available user documents paired with ground-truth personal information, largely due to privacy concerns. As a result, there are no standardized evaluation resources to objectively assess and improve these personalized AI assistants.
Introducing PersonaBench
To address these gaps, we introduce PersonaBench, a new benchmark designed to rigorously test AI models’ ability to understand and use personal information. PersonaBench includes several key components:
- Multi-step data generation pipeline: We synthesize diverse user profiles using a multi-step process.
- Synthetic user documents: Building on these profiles, we generate various types of private user documents (e.g., conversations, user–AI interactions, purchase history) that simulate real-world user activities.
- Automated evaluation pipeline: We assess retrieval-augmented generation (RAG) models’ ability to answer personal questions by using the generated user documents as context.
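Conceptually, each evaluation instance pairs a personal question and its ground-truth answer with the user's document pool. The record layout below is a simplified illustration of that idea, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PersonalQA:
    """One evaluation instance: a personal question grounded in a user's documents."""
    question: str         # e.g., "Which music artist does the user like most?"
    gold_answer: str      # derived from the synthesized (hidden) user profile
    documents: list[str]  # the user's private documents used as context
    relevant_doc_ids: list[int] = field(default_factory=list)  # documents containing the evidence

example = PersonalQA(
    question="Which music artist does the user like most?",
    gold_answer="Hans Zimmer",
    documents=["...conversation session...", "...user-AI interaction...", "...purchase history..."],
    relevant_doc_ids=[0],
)
```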
Data Generation Pipeline
First Stage – Creating User Profiles
In the first stage of our data generation pipeline, we focus on creating user profiles—each representing a unique “character.” We follow these steps to build a diverse and comprehensive set of user profiles:
- Profile Template Definition
- Persona Sampling and Social Graph Creation
- Profile Completion
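A highly simplified version of these three steps might look like the sketch below: define a profile template, sample personas and connect them in a social graph, then complete the remaining fields. The field names, random sampling, and completion values are illustrative assumptions; the actual procedure is described in the technical paper.

```python
import random

# Step 1: profile template definition (field names are illustrative only).
PROFILE_TEMPLATE = {"name": None, "age": None, "music_artist": None, "connections": None}

def sample_personas(n: int) -> list[dict]:
    """Step 2a: sample seed personas, one per character."""
    return [{**PROFILE_TEMPLATE, "name": f"Person_{i}", "connections": []} for i in range(n)]

def build_social_graph(profiles: list[dict], p: float = 0.3) -> None:
    """Step 2b: randomly connect characters to form a social graph."""
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if random.random() < p:
                profiles[i]["connections"].append(profiles[j]["name"])
                profiles[j]["connections"].append(profiles[i]["name"])

def complete_profile(profile: dict) -> dict:
    """Step 3: fill in the remaining attributes (random placeholders here;
    the real pipeline produces coherent values for every template field)."""
    profile["age"] = random.randint(18, 70)
    profile["music_artist"] = random.choice(["Hans Zimmer", "Adele", "BTS"])
    return profile

profiles = sample_personas(5)
build_social_graph(profiles)
profiles = [complete_profile(p) for p in profiles]
```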
Second Stage – Generating Synthetic Private Data
In the second stage, we create synthetic private data that reflects the realistic behaviors and daily activities of each character. This data is generated in three document types:
- Conversation Data
- User–AI Interaction Data
- User Purchase History
The entire process is illustrated in Figure 2. Please refer to the technical paper for further details.
Figure 2. An illustration of the data generation pipeline and an example of a personal question used in the evaluation. The synthesized user profiles remain inaccessible to the models, which must rely solely on the user’s private documents to answer the questions.
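As a rough sketch of the second stage, the snippet below shows how one timestamped conversation session between two socially connected characters could be produced by prompting an LLM with their profiles. The prompt wording and the `llm` callable are hypothetical; the actual generation prompts and document formats are described in the technical paper.

```python
from datetime import datetime

def generate_conversation_session(profile_a: dict, profile_b: dict,
                                  timestamp: datetime, llm) -> dict:
    """Generate one synthetic conversation session between two characters.
    `llm` is any callable that maps a prompt string to generated text."""
    prompt = (
        f"Write a short, casual conversation between {profile_a['name']} and "
        f"{profile_b['name']} taking place on {timestamp:%Y-%m-%d}.\n"
        f"{profile_a['name']}'s profile: {profile_a}\n"
        f"{profile_b['name']}'s profile: {profile_b}\n"
        "The dialogue may naturally reveal a small portion of their personal "
        "attributes, mixed with everyday chitchat."
    )
    return {
        "type": "conversation",
        "participants": [profile_a["name"], profile_b["name"]],
        "timestamp": timestamp.isoformat(),
        "content": llm(prompt),
    }
```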
Synthetic Dataset
User Profile Example
Figure 3 presents examples of user profiles for two characters. Each profile includes personal attributes categorized into three meta-categories: Demographic Information, Psychographic Information, and Social Information. We note that all names and other user information are artificial and do not represent or refer to any real person.
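The three meta-categories can be pictured as a nested structure like the one below. All values are illustrative placeholders (the name and music artist echo the example in Figure 4); the full attribute set is shown in the paper.

```python
profile_example = {
    "Demographic Information": {
        "name": "Nicholas Torres",          # artificial name, as noted above
        "age": 34,                          # placeholder value
        "occupation": "software engineer",  # placeholder value
    },
    "Psychographic Information": {
        "music_artist": "Hans Zimmer",      # the attribute revealed in Figure 4
        "favorite_restaurant": "Green Leaf Bistro",  # placeholder value
    },
    "Social Information": {
        "connections": ["a coworker", "a close friend"],  # placeholder values
    },
}
```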
Synthetic Document Example
User documents are generated on a session-by-session basis, where each session may contain a portion of a person’s personal information. To gain a comprehensive understanding of a person, all documents must be combined. Each document includes a sequence of sessions with timestamps, simulating the passage of time in real-world user data. Figure 4 illustrates a conversation session between two individuals, revealing one character’s preference for a music artist.
Figure 3. User profile examples
Figure 4. An example conversation session between two socially connected individuals. The session reveals one person's (Nicholas Torres) preference for the music_artist attribute: Hans Zimmer.
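Because the evidence for any single attribute may be scattered across documents, a consumer of this data typically merges everything into one time-ordered list of sessions before retrieval. A minimal sketch, assuming each document is a list of session dicts carrying a `timestamp` field (as in the generation sketch above):

```python
def flatten_documents(documents: list[list[dict]]) -> list[dict]:
    """Merge all documents (conversations, user-AI interactions, purchase
    history) into one list of sessions sorted by timestamp."""
    sessions = [session for doc in documents for session in doc]
    return sorted(sessions, key=lambda s: s["timestamp"])
```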
Evaluation Results
Another major contribution of PersonaBench is the evaluation pipeline. We designed multiple questions targeting basic user information, preferences, and social connections, each paired with a ground-truth answer. This setup enables an automated evaluation that assesses both the performance of retriever models in extracting the correct documents (i.e., retrieval evaluation) and the ability of the entire RAG pipeline to accurately answer personal questions (i.e., end-to-end evaluation). Tables 1 and 2 present the results of these two evaluations.
Table 1. Retrieval Evaluation
Table 2. End-to-End Evaluation
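At a high level, the two evaluations can be scored as in the sketch below: recall of the ground-truth relevant documents for the retriever, and answer correctness for the full RAG pipeline. The simple substring check stands in for the actual answer-matching procedure, which is described in the paper.

```python
def retrieval_recall(retrieved_ids: list[int], relevant_ids: list[int]) -> float:
    """Retrieval evaluation: fraction of ground-truth relevant documents retrieved."""
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def answer_accuracy(predictions: list[str], gold_answers: list[str]) -> float:
    """End-to-end evaluation: does each generated answer contain the gold answer?
    (A simplification; a stricter matcher or an LLM judge could be used instead.)"""
    correct = sum(gold.lower() in pred.lower() for pred, gold in zip(predictions, gold_answers))
    return correct / len(predictions)

print(retrieval_recall(retrieved_ids=[0, 4, 7], relevant_ids=[0, 2]))     # 0.5
print(answer_accuracy(["The user likes Hans Zimmer."], ["Hans Zimmer"]))  # 1.0
```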
Beyond numeric results, the evaluation pipeline also generates detailed feedback on various aspects of the evaluated models, as shown in Figure 5. This includes, but is not limited to, their robustness to noise (irrelevant information) in the dataset and their sensitivity to information updates.
Figure 5. Advanced analysis of different aspects of the evaluated models.
Conclusion and Future Work
This project makes two primary contributions:
- Synthetic Data Generation Pipeline: A multi-step process that produces realistic private user data, simulating real-world user activities.
- Standardized Evaluation Benchmark: A benchmark for evaluating how well AI models understand personal information derived from such data.
Although many recent systems rely on RAG to produce personalized responses, our findings reveal that current retrieval-augmented AI models struggle to answer personal questions by extracting the relevant information from user documents, underscoring the need for improved methodologies to enhance AI personalization. PersonaBench provides a first standardized evaluation benchmark for this task, facilitating further research in the field.
However, the current benchmark only includes a Q&A task that tests whether AI assistants can accurately extract user information. Future work must also consider which user data (if any) is necessary for effectively accomplishing a given task, as well as how to invoke the appropriate APIs to fulfill user requests. These considerations open the door to extending the benchmark with more advanced and realistic tasks and evaluations.
By addressing these challenges, we can move closer to developing AI assistants that truly understand and cater to individual users, enhancing user experiences and building trust in AI technology.