tw2/examples/agent/realtime_voice_agent/README.md

# Realtime Voice Agent Example

This example demonstrates how to build a **real-time voice conversation agent** using AgentScope's RealtimeAgent. The agent supports bidirectional voice streaming, enabling natural voice conversations with low latency and real-time audio transcription.

## Prerequisites

- Python 3.10 or higher
- Your DashScope API key in an environment variable `DASHSCOPE_API_KEY`

Install the required packages:

```bash
uv pip install agentscope fastapi uvicorn websockets
# or
# pip install agentscope
```

## Usage

### 1. Start the Server

Run the FastAPI server:

```bash
cd examples/agent/realtime_voice_agent
python run_server.py
```

The server will start on `http://localhost:8000` by default.

### 2. Open the Web Interface

Open your web browser and navigate to:

```
http://localhost:8000
```

You will see a web interface with:
- Configuration panel (instructions and user name)
- Voice control buttons (Start Recording, Stop Recording, Disconnect)
- Video recording button (Start Video Recording)
- Text input field
- Message display area
- Video preview area (when video recording is active)

### 3. Start Conversation

1. **Configure the Agent** (optional):
   - Modify the "Instructions" to customize the agent's behavior
   - Enter your name in the "User Name" field

2. **Start Voice Recording**:
   - Click the "🎤 Start Recording" button
   - Allow microphone access when prompted by your browser
   - Speak naturally to the agent
   - The agent will respond with voice and text

3. **Stop Recording**:
   - Click "⏹️ Stop Recording" to pause voice input

4. **Video Recording** (Optional):
   - Click the "📹 Start Video Recording" button to start video recording
   - Allow camera access when prompted by your browser
   - The system will automatically capture and send video frames to the server at 1 frame per second (1 fps)
   - A video preview will be displayed while recording
   - Click "🔴 Stop Video Recording" to stop recording
   - **Note**: Video recording requires an active voice chat session. Please start voice chat first before starting video recording.

## Switching Models

AgentScope supports multiple realtime voice models. By default, this example uses DashScope's `qwen3-omni-flash-realtime` model, but you can easily switch to other providers.

### Supported Models

- **GeminiRealtimeModel**
- **OpenAIRealtimeModel**

### How to Switch Models

Edit `run_server.py` and replace the model initialization code:

**For OpenAI:**

```python
from agentscope.realtime import OpenAIRealtimeModel

agent = RealtimeAgent(
    name="Friday",
    sys_prompt=sys_prompt,
    model=OpenAIRealtimeModel(
        model_name="gpt-4o-realtime-preview",
        api_key=os.getenv("OPENAI_API_KEY"),
        voice="alloy",  # Options: "alloy", "echo", "marin", "cedar"
    ),
)
```

**For Gemini:**

```python
from agentscope.realtime import GeminiRealtimeModel

agent = RealtimeAgent(
    name="Friday",
    sys_prompt=sys_prompt,
    model=GeminiRealtimeModel(
        model_name="gemini-2.5-flash-native-audio-preview-09-2025",
        api_key=os.getenv("GEMINI_API_KEY"),
        voice="Puck",  # Options: "Puck", "Charon", "Kore", "Fenrir"
    ),
)
```

Don't forget to set the corresponding API key environment variable before starting the server!