ProVox
Personalization and Proactive Planning for
Situated Human-Robot Collaboration

Stanford University, Toyota Research Institute

Abstract

Collaborative robots must quickly adapt to their partner's intent and preferences to proactively identify helpful actions. This is especially true in situated settings where human partners can continually teach robots new high-level behaviors, visual concepts, and physical skills (e.g., through demonstration), growing the robot's capabilities as the human-robot pair works together to accomplish diverse tasks. In this work, we argue that robots should be able to infer their partner's goals from early interactions and use this information to proactively plan behaviors ahead of explicit instructions from the user. Building from the strong commonsense priors and steerability of large language models, we introduce ProVox (Proactive Voice), a novel framework that enables robots to efficiently personalize and adapt to individual collaborators. We design a meta-prompting protocol that empowers users to communicate their distinct preferences, intent, and expected robot behaviors ahead of starting a physical interaction. ProVox then uses the personalized prompt to condition a proactive language model task planner that anticipates a user's intent from the current interaction context and robot capabilities to suggest helpful actions; in doing so, we alleviate user burden, minimizing the amount of time partners spend explicitly instructing and supervising the robot. We evaluate ProVox through user studies grounded in household manipulation tasks (e.g., assembling lunch bags) that measure the efficiency of the collaboration, as well as subjective measures such as perceived helpfulness, ease of use, and reliability. Our analysis suggests that both meta-prompting and proactivity are critical, resulting in 38.7% faster task completion times and 31.9% less user burden relative to non-proactive baselines.

Motivating Example

We present ProVox ("Proactive Voice"), a framework for personalization and proactive planning in the context of a situated human-robot collaboration. In the first phase of a collaboration [Top, Middle], a human partner communicates their goals and distinct preferences, enabling the robot to personalize to the individual. Throughout the rest of the collaboration [Right], the robot continually incorporates and anticipates its partner's intent to proactively suggest helpful actions (e.g., "Should I pack the Skittles next?") ahead of explicit instructions, reducing the user's burden and mental load while they assemble the sandwich.

Contributions: Meta-Prompting & Proactive Planning

ProVox develops a novel meta-prompting protocol to collect two critical pieces of information from an individual: their specific goal, as well as an API of useful behaviors. Crucially, each user has a distinct set of preferences, yielding different goals and behaviors. For example, the female user [Top-Left] wants her children's lunch to contain Skittles, Rice-Krispies, and hand sanitizer, and teaches the robot to pack objects, with full confidence in its ability to identify, grasp, and move objects.

In contrast, the male user [Bottom-Left] is more hesitant in trusting the robot; as a result, he separates pick-and-place into two parts: a motion to reach_for an object (moving above it without grasping), followed by a put behavior to complete the motion. The language model task planner then leverages this meta-prompted context to proactively suggest helpful behaviors over the course of the interaction [Right].
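
To make this contrast concrete, the sketch below shows one plausible way the meta-prompted goal and behavior API could be represented for each user; the schema (name / args / docstring fields) is our own illustration, not the protocol's exact output format:

# Hypothetical meta-prompted context for the two users above (illustrative schema)
USER_A_GOAL = "pack a lunch bag with Skittles, a Rice Krispie treat, and hand sanitizer"
USER_A_BEHAVIORS = [
    {"name": "pack", "args": ["object"],
     "docstring": "Identify, grasp, and move `object` into the lunch bag."},
]

USER_B_GOAL = "pack a lunch bag, one careful motion at a time"
USER_B_BEHAVIORS = [
    {"name": "reach_for", "args": ["object"],
     "docstring": "Move above `object` without grasping it."},
    {"name": "put", "args": ["object", "location"],
     "docstring": "Grasp `object` and place it at `location`, completing the motion."},
]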

Language Model Task Planner Prompts

In the following code blocks, we provide the actual GPT-4 Turbo (gpt-4-turbo-2024-04-09) prompts that we use for our lunch bag packing setting:

Base Language Model API
# Utility Function for "Python-izing" Arguments as Types
from typing import Dict, List

def pythonize_types(types: Dict[str, List[Dict[str, str]]]) -> str:
    py_str = "# Python Enums defining the various known objects in the scene\n\n"

    # Create Enums for each Type Class
    py_str += "# Enums for Various Object Types\n"
    for type_cls, element_list in types.items():
        py_str += f"class {type_cls}(Enum):\n"
        for element in element_list:
            py_str += f"    {element['name']} = auto()  # {element['docstring']}\n"
        py_str += "\n"

    return py_str.strip()

# Fully-Observable Environment State (Object-Oriented)
TYPE_DEFINITIONS = {
  "object": [
      {"name": "CARROTS", "docstring": "A bag of carrots."},
      {"name": "APPLE_SLICES", "docstring": "A bag of apple slices."},
      {"name": "GOLDFISH", "docstring": "A carton of goldfish snack."},
      {"name": "FORK", "docstring": "A fork."},
      {"name": "SPOON", "docstring": "A spoon."},
      {"name": "CHEERIOS", "docstring": "A bag of Cheerios cereal."},
      {"name": "MILK", "docstring": "A carton of milk."},

      {"name": "SKITTLES", "docstring": "A tube-shaped red container of Skittles candy."},
      {"name": "GUMMY_CANDY", "docstring": "A gummy candy shaped like a hamburger for children."},
      {"name": "RICE_KRISPIE", "docstring": "A rice krispie snack treat for children."},
      {"name": "HAND_SANITIZER", "docstring": "A small tube of Purell hand sanitizer for cleaning hands before a meal."},
      {"name": "LUNCHBAG", "docstring": "A bag or lunchbag or lunchbox for a child."},
  ]
}
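
# For reference, `pythonize_types(TYPE_DEFINITIONS)` renders the state above as
# Enum-style Python source that is spliced into the system prompt below, e.g.:
#
#   # Python Enums defining the various known objects in the scene
#
#   # Enums for Various Object Types
#   class object(Enum):
#       CARROTS = auto()  # A bag of carrots.
#       APPLE_SLICES = auto()  # A bag of apple slices.
#       ...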

# Base System Prompt (`USER_DEFINED_GOAL` is filled in by the meta-prompting protocol)
BASE_SYSTEM_PROMPT = (
  "You are a reliable code interface that will be representing a robot arm in a collaborative interaction "
  "with a user.\n\n"

  f"In today's session, the user and robot arm will be working together to {USER_DEFINED_GOAL}. "

  "You will have access to a Python API defining some objects and high-level functions for "
  "controlling the robot.\n\n"

  "```python\n"
  f"{pythonize_types(TYPE_DEFINITIONS)}\n"
  "```\n\n"

  "Given a spoken utterance from the user, your job is to identify the correct sequence of function calls and "
  "arguments from the API, returning the appropriate API call in JSON. Note that the speech-to-text engine is "
  "not perfect! Do your best to handle ambiguities, for example:\n"
  "\t- 'Put the carrots in the back' --> 'Put the carrots in the bag' (hard 'g')\n"
  "\t- 'Throw the popcorn in the in' --> 'Throw the popcorn in the bin' (soft 'b')\n\n"

  "If an object is not in the API, you should not fail. Instead, return a new object, which will be added to the API in the future. "
  "Even if you are not sure, respond as best you can to user inputs."
)

# In-Context Examples (note `USER_DEFINED_EXAMPLES`)
ICL_EXAMPLES = [
  {"role" : "system", "content": BASE_SYSTEM_PROMPT},
  make_example("release", "release", "{}", "1"),
  make_example("grasp", "grasp", "{}", "2"),
  make_example("go home", "go_home", "{}", "3"),
  make_example("go to the bag", "goto", "{'object': 'LUNCHBAG'}", "5"),
  make_example("go away!", "go_home", "{}", "6"),
  make_example("grab the gummy", "pickup", "{'object': 'GUMMY_CANDY'}", "7"),
  *[make_example(*ex) for ex in USER_DEFINED_EXAMPLES],  # assuming each `ex` unpacks to (utterance, function, args, id)
]

Note that both USER_DEFINED_GOAL and USER_DEFINED_EXAMPLES are a result of the meta-prompting protocol (the Gradio interface is shown in the overview video above). We encode each of the motion primitives as OpenAI function-calling tools (FUNCTIONS) identically to Grannen et al. (2024). We then invoke generation (without proactive planning) as follows:
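
The FUNCTIONS list itself is not reproduced here. The snippet below is a minimal sketch of how a single motion primitive might be encoded in the OpenAI function-calling tools format; the pickup schema is our illustration, not the paper's exact definition:

# Hypothetical sketch of one entry in FUNCTIONS (OpenAI tools format)
FUNCTIONS = [
    {
        "type": "function",
        "function": {
            "name": "pickup",
            "description": "Pick up the specified object with the robot gripper.",
            "parameters": {
                "type": "object",
                "properties": {
                    "object": {
                        "type": "string",
                        "description": "Object Enum member to grasp, e.g., 'GUMMY_CANDY'.",
                    },
                },
                "required": ["object"],
            },
        },
    },
    # ... one entry per motion primitive (goto, grasp, release, go_home, pack, ...)
]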


# OpenAI Chat Completion Invocation - All Responses are added to "ICL_EXAMPLES" as running memory
from openai import OpenAI

openai_client = OpenAI(api_key=openai_api_key, organization=organization_id)
llm_task_plan = openai_client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[*ICL_EXAMPLES, {"role": "user", "content": USER_UTTERANCE}],
    temperature=0.2,
    tools=FUNCTIONS,
    tool_choice="auto",
)
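
The returned completion carries the chosen primitive as a structured tool call. A minimal sketch of unpacking and dispatching it follows; the field access matches the OpenAI Python SDK, while execute_primitive is a hypothetical dispatch into the robot's motion primitives:

import json

# Unpack the structured tool call from the completion
tool_call = llm_task_plan.choices[0].message.tool_calls[0]
primitive_name = tool_call.function.name                    # e.g., "pickup"
primitive_args = json.loads(tool_call.function.arguments)   # e.g., {"object": "GUMMY_CANDY"}

# Hypothetical dispatch into the robot's motion primitives
execute_primitive(primitive_name, primitive_args)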


Proactive Planning Prompt

To enable proactive planning in ProVox, we augment the above system prompt with a turn-based "trigger prompt" that explicitly re-encodes the user-defined goal (USER_DEFINED_GOAL) and asks the planner to suggest a helpful next action:


# Query the Language Model Planner for Helpful Next Actions
ACTIVE_PROMPT = (
  f"Propose an action to perform next to {USER_DEFINED_GOAL}. "
  "Pay careful attention to the goal of the task, and the previous action history. "
  "Only return a single action. Try to use a more complex action if possible. "
  "You do not need to go home between tasks. "
  "If there are multiple possible actions, select the most useful one, and return the appropriate API call in JSON format."
)

# Auto-Prompt to Generate a Proactive Plan
llm_proactive_task_plan = openai_client.chat.completions.create(
  model="gpt-4-turbo-2024-04-09",
  messages=[*ICL_EXAMPLES, {"role": "user", "content": ACTIVE_PROMPT}],
  temperature=0.2,
  tools=FUNCTIONS,
  tool_choice="auto",
)
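
In the live system, this suggested action is surfaced to the user for confirmation before execution (e.g., "Should I pack the Skittles next?"). A minimal sketch of that loop, where verbalize, get_user_confirmation, and execute_primitive are hypothetical helpers:

import json

# Propose, ask, and only execute on user approval (helper functions are hypothetical)
tool_call = llm_proactive_task_plan.choices[0].message.tool_calls[0]
name, args = tool_call.function.name, json.loads(tool_call.function.arguments)

verbalize(f"Should I {name.replace('_', ' ')} the {args.get('object', '').lower()} next?")  # text-to-speech
if get_user_confirmation():  # e.g., parse a spoken "yes" / "no"
    execute_primitive(name, args)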

Citation
@article{grannen2025provox,
  title={ProVox: Personalization and Proactive Planning for Situated Human-Robot Collaboration},
  author={Grannen, Jennifer and Karamcheti, Siddharth and Wulfe, Blake and Sadigh, Dorsa},
  journal={IEEE Robotics and Automation Letters (RA-L)},
  year={2025}
}


Contact
If you have any questions, please feel free to contact Jennifer Grannen.