chore: initialize sandbox and overwrite remote content

codex-bot
2026-03-02 22:32:27 +08:00
commit a64378956a
584 changed files with 93604 additions and 0 deletions


@@ -0,0 +1,317 @@
# MemoryWithCompress
- [ ] TODO: The memory module with compression will be added to the agentscope library in the future.
## Overview
MemoryWithCompress is a memory management system designed for AgentScope's `ReActAgent`. It automatically compresses conversation history when the memory size exceeds a specified token limit, using a Large Language Model (LLM) to create concise summaries that preserve key information. This allows agents to maintain context over long conversations while staying within token constraints.
The system maintains two separate storage mechanisms:
- **`chat_history_storage`**: Stores the complete, unmodified conversation history (uses `MessageStorageBase` interface)
- **`memory_storage`**: Stores messages that may be compressed when token limits are exceeded (uses `MessageStorageBase` interface)
Both storage mechanisms are abstracted through the `MessageStorageBase` interface, allowing for flexible storage backends. By default, `InMemoryMessageStorage` is used for both.
## Core Features
### Automatic Memory Compression
- **Token-based Triggering**: Automatically compresses memory when the total token count exceeds `max_token`
- **LLM-Powered Summarization**: Uses an LLM to intelligently compress conversation history while preserving essential information
- **Structured Output**: Uses Pydantic schemas to ensure consistent compression format
### Dual Storage System
- **Complete History**: Maintains original, unmodified messages in `chat_history_storage` for reference
- **Compressed Memory**: Stores potentially compressed messages in `memory_storage` for efficient context management
### Flexible Memory Management
- **Filtering Support**: Provides `filter_func` parameter for custom memory filtering
- **Recent N Retrieval**: Supports retrieving only the most recent N messages
- **State Persistence**: Includes `state_dict()` and `load_state_dict()` methods for saving and loading memory state
- **Storage Abstraction**: Uses `MessageStorageBase` interface for flexible storage backends
- **Compression Triggers**: Supports both token-based and custom trigger functions for compression
- **Compression Timing Control**: Configurable compression on add (`compression_on_add`) and get (`compression_on_get`) operations
## File Structure
```
memory_with_compression/
├── README.md                 # This documentation file
├── main.py                   # Example demonstrating MemoryWithCompress usage
├── _memory_with_compress.py  # Core MemoryWithCompress implementation
├── _memory_storage.py        # Storage abstraction layer (MessageStorageBase, InMemoryMessageStorage)
└── _mc_utils.py              # Utility functions (formatting, token counting, compression schema)
```
## Prerequisites
### Clone the AgentScope Repository
This example depends on AgentScope. Please clone the full repository to your local machine.
### Install Dependencies
**Recommended**: Python 3.10+
Install the required dependencies:
```bash
pip install agentscope
```
### API Keys
This example uses DashScope APIs by default. You need to set your API key as an environment variable:
```bash
export DASHSCOPE_API_KEY='YOUR_API_KEY'
```
You can easily switch to other models by modifying the configuration in `main.py`.
## How It Works
### 1. Memory Addition Flow
1. **Message Input**: New messages are added via the async `add()` method
2. **Dual Storage**: Messages are deep-copied and added to both `chat_history_storage` and `memory_storage`
3. **Optional Compression on Add**: If `compression_on_add=True`, compression may be triggered immediately after adding messages
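
The addition flow above can be sketched with plain Python lists standing in for the two storages. `DualStore` and its dict-based messages are hypothetical stand-ins, not the real `MemoryWithCompress` or `Msg` classes; the sketch only illustrates why deep copies keep the two storages independent of each other and of the caller's objects.

```python
import asyncio
import copy


class DualStore:
    """Hypothetical stand-in for the dual-storage idea (not the real class)."""

    def __init__(self) -> None:
        self.chat_history: list[dict] = []  # complete history, never compressed
        self.memory: list[dict] = []        # may later be replaced by a summary

    async def add(self, msgs: list[dict]) -> None:
        for msg in msgs:
            # Each storage receives its own copy, so later edits to one
            # storage (or to the caller's object) cannot leak into the other.
            self.chat_history.append(copy.deepcopy(msg))
            self.memory.append(copy.deepcopy(msg))


async def demo() -> DualStore:
    store = DualStore()
    original = {"role": "user", "content": "hello"}
    await store.add([original])
    original["content"] = "mutated after add"  # does not affect either storage
    store.memory[0]["content"] = "summary"     # does not affect chat_history
    return store


store = asyncio.run(demo())
print(store.chat_history[0]["content"])
print(store.memory[0]["content"])
```
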
### 2. Memory Retrieval and Compression Flow
When `get_memory()` is called (if `compression_on_get=True`):
1. **Token Counting**: The system calculates the total token count of all messages in `memory_storage`
2. **Compression Check**:
- First checks if token count exceeds `max_token` (automatic compression)
- Then checks if `compression_trigger_func` returns `True` (custom trigger)
3. **LLM Compression**: If compression is needed, all messages in `memory_storage` are sent to the LLM with a compression prompt
4. **Structured Output**: The LLM returns a structured response containing the compressed summary
5. **Memory Replacement**: The entire `memory_storage` is updated with the compressed message(s)
6. **Filtering & Selection**: Optional filtering and recent_n selection are applied
7. **Return**: The processed memory is returned
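
The retrieval steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: messages are plain dicts, `count_tokens` uses a character count instead of a real tokenizer, and `stub_compress` replaces the LLM call with a placeholder summary. All function names here are hypothetical.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional

Messages = list[dict]


def count_tokens(msgs: Messages) -> int:
    # Assumption: the JSON character count stands in for a real tokenizer.
    return len(json.dumps(msgs, ensure_ascii=False))


async def stub_compress(msgs: Messages) -> Messages:
    # A real implementation would call an LLM; this stub only records a count.
    summary = f"<compressed_memory>{len(msgs)} earlier messages</compressed_memory>"
    return [{"role": "assistant", "content": summary}]


async def get_memory(
    memory: Messages,
    max_token: int,
    trigger: Optional[Callable[[Messages], Awaitable[bool]]] = None,
) -> Messages:
    # First check: has the token limit been exceeded?
    needs_compression = count_tokens(memory) > max_token
    # Second check: the custom trigger, consulted only when under the limit.
    if not needs_compression and trigger is not None:
        needs_compression = await trigger(memory)
    if needs_compression:
        memory[:] = await stub_compress(memory)  # replace the whole storage
    return list(memory)


long_memory = [{"role": "user", "content": "x" * 50} for _ in range(5)]
short_memory = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]

compressed = asyncio.run(get_memory(long_memory, max_token=100))
untouched = asyncio.run(get_memory(short_memory, max_token=100))
print(compressed)
print(untouched)
```

Filtering and `recent_n` selection are left out here so that the compression decision stays visible on its own.
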
### 3. Compression Process
The compression uses a structured output approach:
- **Prompt**: Instructs the LLM to summarize conversation history while preserving key information
- **Customizable Prompt**: Supports `customized_compression_prompt` parameter for custom prompt templates
- **Schema**: Uses `MemoryCompressionSchema` (Pydantic model) to ensure consistent output format
- **Output Format**: Returns a message with content wrapped in `<compressed_memory>` tags
- **Async Support**: All compression operations are asynchronous
## Usage Examples
### Running the Example
To see `MemoryWithCompress` in action, run the example script:
```bash
python ./main.py
```
### Basic Initialization
Here is a snippet from `main.py` showing how to set up the agent and memory:
```python
import os

from agentscope.agent import ReActAgent
from agentscope.model import DashScopeChatModel
from agentscope.formatter import DashScopeChatFormatter
from agentscope.token import OpenAITokenCounter
from agentscope.message import Msg
from _memory_with_compress import MemoryWithCompress

# 1. Create the model for agent and memory compression
model = DashScopeChatModel(
    api_key=os.environ.get("DASHSCOPE_API_KEY"),
    model_name="qwen-max",
    stream=False,
)


# 2. Optional: Define a custom compression trigger function
async def trigger_compression(msgs: list[Msg]) -> bool:
    # Trigger compression if the number of messages exceeds 2
    # and the last message is from the assistant
    return len(msgs) > 2 and msgs[-1].role == "assistant"


# 3. Initialize MemoryWithCompress
memory_with_compress = MemoryWithCompress(
    model=model,
    formatter=DashScopeChatFormatter(),
    max_token=3000,  # Compress when memory exceeds 3000 tokens
    token_counter=OpenAITokenCounter(model_name="qwen-max"),
    compression_trigger_func=trigger_compression,  # Optional custom trigger
    compression_on_add=False,  # Don't compress on add (default)
    compression_on_get=True,  # Compress on get (default)
)

# 4. Initialize ReActAgent with the memory instance
agent = ReActAgent(
    name="Friday",
    sys_prompt="You are a helpful assistant named Friday.",
    model=model,
    formatter=DashScopeChatFormatter(),
    memory=memory_with_compress,
)
```
### Custom Compression Function
You can provide a custom compression function:
```python
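from typing import List

from agentscope.message import Msg
from _memory_with_compress import MemoryWithCompress


async def custom_compress(messages: List[Msg]) -> List[Msg]:
    # Your custom compression logic
    # Must return a List[Msg], not a single Msg
    compressed_content = "..."
    return [Msg("assistant", compressed_content, "assistant")]


# `model` and `formatter` are created as in "Basic Initialization" above
memory_with_compress = MemoryWithCompress(
    model=model,
    formatter=formatter,
    max_token=300,
    compress_func=custom_compress,
)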
```
### Custom Storage Backend
You can provide custom storage backends by implementing the `MessageStorageBase` interface:
```python
from _memory_storage import MessageStorageBase


class CustomStorage(MessageStorageBase):
    # Implement required methods: start, stop, health, add, delete, clear, get, replace, __aenter__, __aexit__
    ...


memory_with_compress = MemoryWithCompress(
    model=model,
    formatter=formatter,
    max_token=300,
    chat_history_storage=CustomStorage(),
    memory_storage=CustomStorage(),
)
```
## API Reference
### MemoryWithCompress Class
#### `__init__(...)`
Initializes the memory system. Key parameters include:
- `model` (ChatModelBase): The LLM model to use for compression
- `formatter` (FormatterBase): The formatter to use for formatting messages
- `max_token` (int): The maximum token count for `memory_storage`. Default: 28000. Compression is triggered when exceeded
- `chat_history_storage` (MessageStorageBase): Storage backend for complete chat history. Default: `InMemoryMessageStorage()`
- `memory_storage` (MessageStorageBase): Storage backend for compressed memory. Default: `InMemoryMessageStorage()`
- `token_counter` (Optional[TokenCounterBase]): The token counter for counting tokens. Default: None. If None, it will return the character count of the JSON string representation of messages (i.e., len(json.dumps(messages, ensure_ascii=False))).
- `compress_func` (Callable[[List[Msg]], Awaitable[List[Msg]]] | None): Custom compression function. Must be async and return `List[Msg]`. If None, uses the default `_compress_memory` method
- `compression_trigger_func` (Callable[[List[Msg]], Awaitable[bool]] | None): Optional function to trigger compression when token count is below `max_token`. Must be async and return `bool`. If None, compression only occurs when token count exceeds `max_token`
- `compression_on_add` (bool): Whether to check and compress memory when adding messages. Default: False
- `compression_on_get` (bool): Whether to check and compress memory when getting messages. Default: True
- `customized_compression_prompt` (str | None): Optional customized compression prompt template. Should include placeholders: `{max_token}`, `{messages_list_json}`, `{schema_json}`. Default: None (uses default template)
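
As an illustration of the three placeholders expected in `customized_compression_prompt`, the sketch below fills a template with `str.format`. Only the placeholder names come from the parameter description above; the template wording, the example schema, and the use of `str.format` for substitution are assumptions made for the example.

```python
import json

# Hypothetical template; only the three placeholder names come from the text above.
CUSTOM_PROMPT = (
    "Summarize the conversation below in at most {max_token} tokens.\n"
    "Messages:\n{messages_list_json}\n"
    "Reply with JSON matching this schema:\n{schema_json}"
)

messages = [{"role": "user", "content": "What is AgentScope?"}]
# Hypothetical schema, standing in for the JSON schema of the Pydantic model.
schema = {"type": "object", "properties": {"compressed_text": {"type": "string"}}}

prompt = CUSTOM_PROMPT.format(
    max_token=3000,
    messages_list_json=json.dumps(messages, ensure_ascii=False),
    schema_json=json.dumps(schema),
)
print(prompt)
```

If substitution does work this way, any literal braces in a custom template would need to be doubled (`{{` and `}}`) to survive formatting.
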
#### Main Methods
**`async add(msgs: Union[Sequence[Msg], Msg, None], compress_func=None, compression_trigger_func=None)`**
- Adds new messages to both `chat_history_storage` and `memory_storage`
- Messages are deep-copied to avoid modifying originals
- Raises `TypeError` if non-Msg objects are provided
- Parameters:
- `msgs`: Messages to be added
- `compress_func` (Optional): Override the instance-level compression function for this call
- `compression_trigger_func` (Optional): Override the instance-level trigger function for this call
- If `compression_on_add=True`, may trigger compression after adding
**`async direct_update_memory(msgs: Union[Sequence[Msg], Msg, None])`**
- Directly updates the `memory_storage` with new messages (does not update `chat_history_storage`)
- Useful for replacing memory content directly
**`async get_memory(recent_n=None, filter_func=None, compress_func=None, compression_trigger_func=None)`**
- Retrieves memory content, automatically compressing if token limit is exceeded (if `compression_on_get=True`)
- Parameters:
- `recent_n` (Optional[int]): Return only the most recent N messages
- `filter_func` (Optional[Callable[[int, Msg], bool]]): Custom filter function that takes (index, message) and returns bool
- `compress_func` (Optional): Override the instance-level compression function for this call
- `compression_trigger_func` (Optional): Override the instance-level trigger function for this call
- Returns: `list[Msg]` - The memory content (potentially compressed)
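
A minimal sketch of how `filter_func` and `recent_n` could combine, using plain dicts in place of `Msg` objects. The helper `select` is hypothetical, and applying the filter before the `recent_n` cut is an assumption about ordering, not something the text above confirms.

```python
from typing import Callable, Optional

Message = dict


def select(
    memory: list[Message],
    recent_n: Optional[int] = None,
    filter_func: Optional[Callable[[int, Message], bool]] = None,
) -> list[Message]:
    selected = memory
    if filter_func is not None:
        # The filter receives (index, message) and keeps messages where it returns True.
        selected = [m for i, m in enumerate(selected) if filter_func(i, m)]
    if recent_n is not None and recent_n > 0:
        # Keep only the last N of the messages that survived the filter.
        selected = selected[-recent_n:]
    return selected


memory = [
    {"role": "user", "content": "q0"},
    {"role": "assistant", "content": "a0"},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
]

result = select(memory, recent_n=2, filter_func=lambda i, m: m["role"] == "user")
print([m["content"] for m in result])
```
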
**`async delete(indices: Union[Iterable[int], int])`**
- Deletes memory fragments from `memory_storage` (note: does not delete from `chat_history_storage`)
- Indices can be a single int or an iterable of ints
**`async size() -> int`**
- Returns the number of messages in `chat_history_storage`
**`async clear()`**
- Clears all memory from both `chat_history_storage` and `memory_storage`
**`state_dict() -> dict`**
- Returns a dictionary containing the serialized state:
- `chat_history_storage`: List of message dictionaries from chat history
- `memory_storage`: List of message dictionaries from memory
- `max_token`: The max_token setting
- Note: This method handles async operations internally, so it can be called from both sync and async contexts
**`load_state_dict(state_dict: dict, strict: bool = True)`**
- Loads memory state from a dictionary
- Restores `chat_history_storage`, `memory_storage`, and `max_token` settings
- Note: This method handles async operations internally, so it can be called from both sync and async contexts
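
The save and restore round trip can be sketched with a small stand-in class. `TinyMemory` is hypothetical and stores plain dicts; only the three state keys (`chat_history_storage`, `memory_storage`, `max_token`) are taken from the description above.

```python
import json


class TinyMemory:
    """Hypothetical stand-in exposing the three documented state keys."""

    def __init__(self, max_token: int = 28000) -> None:
        self.chat_history: list[dict] = []
        self.memory: list[dict] = []
        self.max_token = max_token

    def state_dict(self) -> dict:
        return {
            "chat_history_storage": list(self.chat_history),
            "memory_storage": list(self.memory),
            "max_token": self.max_token,
        }

    def load_state_dict(self, state: dict) -> None:
        self.chat_history = list(state["chat_history_storage"])
        self.memory = list(state["memory_storage"])
        self.max_token = state["max_token"]


source = TinyMemory(max_token=3000)
source.chat_history.append({"role": "user", "content": "hi"})
source.memory.append(
    {"role": "assistant", "content": "<compressed_memory>greeting</compressed_memory>"},
)

# The JSON string could be written to disk and read back in a later session.
serialized = json.dumps(source.state_dict())
restored = TinyMemory()
restored.load_state_dict(json.loads(serialized))
print(restored.state_dict())
```
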
**`async retrieve(*args, **kwargs)`**
- Not implemented. Use `get_memory()` instead.
- Raises `NotImplementedError`
## Internal Methods
**`async _compress_memory(msgs: List[Msg]) -> List[Msg]`**
- Internal method that compresses messages using the LLM
- Uses structured output with `MemoryCompressionSchema`
- Returns a `List[Msg]` containing the compressed summary (typically a single message)
- Supports both streaming and non-streaming models
**`async _check_length_and_compress(compress_func=None) -> bool`**
- Checks if memory token count exceeds `max_token` and compresses if needed
- Returns `True` if compression was triggered, `False` otherwise
**`async check_and_compress(compress_func=None, compression_trigger_func=None, memory=None) -> tuple[bool, List[Msg]]`**
- Checks if compression is needed based on `compression_trigger_func`
- Returns a tuple: (was_compressed: bool, compressed_memory: List[Msg])
- If `memory` is provided, checks that instead of `memory_storage`
## Utility Functions
The `_mc_utils.py` module provides:
- **`format_msgs(msgs)`**: Formats a list of `Msg` objects into a list of dictionaries
- **`async count_words(token_counter, text)`**: Counts tokens in text (supports both string and list[dict] formats). Must be awaited.
- **`MemoryCompressionSchema`**: Pydantic model for structured compression output
- **`DEFAULT_COMPRESSION_PROMPT_TEMPLATE`**: Default prompt template for compression (includes placeholders: `{max_token}`, `{messages_list_json}`, `{schema_json}`)
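
When no `token_counter` is supplied, the size estimate described earlier is the character count of the JSON dump. The sketch below reproduces that formula with a hypothetical helper, `fallback_count`, and shows why `ensure_ascii=False` matters for non-ASCII text. A character count is only a rough proxy for tokens.

```python
import json


def fallback_count(messages: list[dict]) -> int:
    # Character count of the JSON representation, non-ASCII text kept as-is.
    return len(json.dumps(messages, ensure_ascii=False))


msgs = [{"role": "user", "content": "你好"}]

as_text = fallback_count(msgs)
# With the default ensure_ascii=True, each CJK character is escaped to a
# six-character \uXXXX sequence, which inflates the count.
escaped = len(json.dumps(msgs))
print(as_text, escaped)
```
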
## Storage Abstraction
The `_memory_storage.py` module provides:
- **`MessageStorageBase`**: Abstract base class for message storage backends
- Required async methods: `start()`, `stop()`, `health()`, `add()`, `delete()`, `clear()`, `get()`, `replace()`, `__aenter__()`, `__aexit__()`
- **`InMemoryMessageStorage`**: Default in-memory implementation
- Stores messages in a simple list
- Suitable for most use cases
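
To make the shape of such a backend concrete, here is a reduced sketch built on `abc`. The class names `StorageSketch` and `ListStorage` are hypothetical, the method signatures are guesses rather than the real `MessageStorageBase` signatures, and `health()` and `delete()` are omitted for brevity.

```python
import asyncio
from abc import ABC, abstractmethod


class StorageSketch(ABC):
    """Hypothetical interface; method names follow the list above, signatures are guesses."""

    @abstractmethod
    async def add(self, msgs: list[dict]) -> None: ...

    @abstractmethod
    async def get(self) -> list[dict]: ...

    @abstractmethod
    async def replace(self, msgs: list[dict]) -> None: ...

    @abstractmethod
    async def clear(self) -> None: ...

    async def start(self) -> None:
        """Open connections or files; a no-op by default in this sketch."""

    async def stop(self) -> None:
        """Release resources; a no-op by default in this sketch."""

    async def __aenter__(self) -> "StorageSketch":
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.stop()


class ListStorage(StorageSketch):
    """Keeps messages in a plain list, similar in spirit to an in-memory backend."""

    def __init__(self) -> None:
        self._items: list[dict] = []

    async def add(self, msgs: list[dict]) -> None:
        self._items.extend(msgs)

    async def get(self) -> list[dict]:
        return list(self._items)

    async def replace(self, msgs: list[dict]) -> None:
        self._items = list(msgs)

    async def clear(self) -> None:
        self._items.clear()


async def demo() -> list[dict]:
    async with ListStorage() as storage:
        await storage.add([{"content": "a"}, {"content": "b"}])
        # Compression would swap the stored messages for a summary like this.
        await storage.replace([{"content": "summary of a and b"}])
        return await storage.get()


stored = asyncio.run(demo())
print(stored)
```

A persistent backend would put its connection handling in `start()` and `stop()`, so the `async with` block opens and closes it reliably.
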
## Best Practices
- **Token Limit Selection**: Choose `max_token` based on your model's context window and typical conversation length
- **Compression Timing**:
- Set `compression_on_get=True` (default) for compression during retrieval
- Set `compression_on_add=False` (default) to avoid compression during add operations, as it may not complete before `get_memory()` is called
- **Async Operations**: All main methods are async, so use `await` when calling them
- **State Persistence**: Use `state_dict()` and `load_state_dict()` to save/restore conversation state between sessions
- **Custom Compression**: For domain-specific compression needs, implement a custom `compress_func` (must be async and return `List[Msg]`)
- **Compression Triggers**: Use `compression_trigger_func` for custom compression logic beyond token limits (e.g., compress after N messages, compress on specific conditions)
- **Storage Backends**: Implement custom `MessageStorageBase` subclasses for persistent storage (e.g., database, file system)
## Troubleshooting
- **Compression Not Triggering**:
- Check that `compression_on_get=True` if you expect compression during retrieval
- Verify that `max_token` is set appropriately
- Ensure `get_memory()` is being called (and awaited)
- If using `compression_trigger_func`, verify it returns `True` when compression should occur
- **Structured Output Errors**: Ensure your model supports structured output (e.g., DashScope models with `structured_model` parameter)
- **Token Counting Issues**: Verify that your `token_counter` is compatible with your model and correctly configured
- **Async/Await Errors**: Remember that most methods are async - use `await` when calling them
- **Storage Issues**: If using custom storage backends, ensure all required methods are properly implemented and async
## Reference
- [AgentScope Documentation](https://github.com/agentscope-ai/agentscope)
- [Pydantic Documentation](https://docs.pydantic.dev/)


@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
"""The main entry point of the MemoryWithCompress example."""
import asyncio
import os
from agentscope.agent import ReActAgent, UserAgent
from agentscope.formatter import DashScopeChatFormatter
from agentscope.model import DashScopeChatModel
from agentscope.token import CharTokenCounter


async def main() -> None:
    """The main entry point of the MemoryWithCompress example."""
    # Create model for agent and memory compression
    agent = ReActAgent(
        name="Friday",
        sys_prompt="You are a helpful assistant named Friday.",
        model=DashScopeChatModel(
            api_key=os.getenv("DASHSCOPE_API_KEY"),
            model_name="qwen3-max",
        ),
        formatter=DashScopeChatFormatter(),
        compression_config=ReActAgent.CompressionConfig(
            enable=True,
            agent_token_counter=CharTokenCounter(),
            # We set a small trigger threshold for demonstration purposes.
            trigger_threshold=1000,
            keep_recent=3,
        ),
    )
    user = UserAgent("User")

    # Simulate a conversation to trigger memory compression
    msg = None
    while True:
        msg = await user(msg)
        if msg.get_text_content() == "exit":
            break
        msg = await agent(msg)

    print("The memory of the agent:")
    for msg in await agent.memory.get_memory():
        print(msg.to_dict(), end="\n")


asyncio.run(main())


@@ -0,0 +1,479 @@
# ReMe Short-Term Memory in AgentScope
This example demonstrates how to
- use ReMeShortTermMemory to provide automatic working memory management for AgentScope agents,
- handle long conversation contexts with intelligent summarization and compaction,
- integrate short-term memory with ReAct agents for efficient tool usage and context management, and
- configure DashScope models for memory operations.
## Why Short-Term Memory?
### The Challenge: From Prompt Engineering to Context Engineering
As AI agents evolved from simple chatbots to sophisticated autonomous systems, the focus shifted from "prompt engineering" to "context engineering". While prompt engineering focused on crafting effective instructions for language models, context engineering addresses a more fundamental challenge: **managing the ever-growing conversation and tool execution history that agents accumulate**.
### The Core Problem: Context Explosion
Agentic systems work by binding LLMs with tools and running them in a loop where the agent decides which tools to call and feeds results back into the message history. This creates a snowball effect:
- **Rapid Growth**: A seemingly simple task can trigger 50+ tool calls, with production agents often running hundreds of conversation turns
- **Large Outputs**: Each tool call can return substantial text, consuming massive amounts of tokens
- **Memory Pressure**: The context window quickly fills up as messages and tool results accumulate chronologically
### The Consequence: Context Rot
When context grows too large, model performance degrades significantly—a phenomenon known as **"context rot"**:
- **Repetitive Responses**: The model starts generating redundant or circular answers
- **Slower Reasoning**: Inference becomes noticeably slower as context length increases
- **Quality Degradation**: Overall response quality and coherence decline
- **Lost Focus**: The model struggles to identify relevant information in the bloated context
### The Fundamental Paradox
Agents face a critical tension:
- **Need Rich Context**: Agents require comprehensive historical information to make informed decisions
- **Suffer from Large Context**: Excessive context causes performance degradation and inefficiency
**Context management aims to keep "just enough" information in the window**—sufficient for effective decision-making while leaving room for retrieval and expansion, without overwhelming the model.
### Why Short-Term Memory Management Matters
Effective short-term memory management is essential for:
1. **Maintaining Performance**: Keeping context within optimal size prevents quality degradation
2. **Enabling Long-Running Tasks**: Agents can handle complex, multi-step workflows without hitting context limits
3. **Cost Efficiency**: Reducing token usage directly lowers API costs
4. **Preserving Reasoning Quality**: Clean, focused context helps models maintain coherent reasoning chains
5. **Scalability**: Proper memory management allows agents to scale to production workloads
### The Solution: Intelligent Context Management
ReMeShortTermMemory implements proven context management strategies:
- **Context Offloading**: Moving large tool outputs to external storage while keeping references
- **Context Reduction**: Compacting tool results into minimal representations and summarizing when necessary
- **Smart Retention**: Keeping recent messages intact to maintain continuity and provide usage examples
- **Automatic Triggering**: Monitoring token usage and applying strategies before performance degrades
By implementing these strategies, ReMeShortTermMemory enables agents to handle arbitrarily long conversations and complex tasks while maintaining optimal performance throughout.
## Prerequisites
- Python 3.10 or higher
- DashScope API key from Alibaba Cloud
## QuickStart
Install agentscope and ensure you have a valid DashScope API key in your environment variables.
> Note: The example is built with the DashScope chat model. To use OpenAI models instead,
> modify the model initialization in the example code accordingly.
```bash
# Install agentscope from source
cd {PATH_TO_AGENTSCOPE}
pip install -e .
# Install dependencies
pip install reme-ai python-dotenv
```
Set up your API key:
```bash
export DASHSCOPE_API_KEY='YOUR_API_KEY'
```
Or create a `.env` file:
```bash
DASHSCOPE_API_KEY=YOUR_API_KEY
```
Run the example:
```bash
python short_term_memory_example.py
```
The example will:
1. Initialize a ReMeShortTermMemory instance with DashScope models
2. Demonstrate automatic memory compaction for long tool responses
3. Show integration with ReActAgent for context-aware conversations
4. Use grep and read_file tools to search and retrieve information from files
## Key Features
- **Automatic Memory Management**: Intelligent summarization and compaction of working memory to handle long contexts
- **Tool Response Optimization**: Automatic truncation and summarization of large tool responses to stay within token limits
- **Flexible Configuration**: Configurable thresholds for compaction ratio, token limits, and recent message retention
- **ReAct Agent Integration**: Seamless integration with AgentScope's ReActAgent and tool system
- **Async Operations**: Full async support for non-blocking memory operations
## Basic Usage
This section provides a detailed walkthrough of the `short_term_memory_example.py` code, explaining how each component works together to create an agent with intelligent context management.
### Configuration Parameters
#### `ReMeShortTermMemory` Class Parameters
The `ReMeShortTermMemory` class accepts the following initialization parameters:
- **`model`** (`DashScopeChatModel | OpenAIChatModel | None`): Language model for compression operations. Must be either `DashScopeChatModel` or `OpenAIChatModel`. This model is used for LLM-based compression when generating compact state snapshots. **Required**.
- **`reme_config_path`** (`str | None`): Optional path to ReMe configuration file for custom settings. Use this to provide advanced ReMe configurations beyond the standard parameters. Default: `None`.
- **`working_summary_mode`** (`str`): Strategy for working memory management. Controls how the memory system handles context overflow:
- `"compact"`: Only compact verbose tool messages by storing full content externally and keeping short previews in the active context.
- `"compress"`: Only apply LLM-based compression to generate compact state snapshots of conversation history.
- `"auto"`: First run compaction, then optionally run compression if the compaction ratio exceeds `compact_ratio_threshold`. This is the recommended mode for most use cases.
Default: `"auto"`.
- **`compact_ratio_threshold`** (`float`): Threshold for compaction effectiveness in AUTO mode. If `(compacted_tokens / original_tokens) > compact_ratio_threshold`, compression is applied after compaction. This ensures compression only runs when compaction alone isn't sufficient. Valid range: 0.0 to 1.0. Default: `0.75`.
- **`max_total_tokens`** (`int`): Maximum token count threshold before compression is triggered. This limit does **not** include `keep_recent_count` messages or system messages, which are always preserved. Should be set to 20%-50% of your model's context window size to leave room for new tool calls and responses. Default: `20000`.
- **`max_tool_message_tokens`** (`int`): Maximum token count for individual tool messages before compaction. Tool messages exceeding this limit are stored externally in files, with only a short preview kept in the active context. This is the maximum tolerable length for a single tool response. Default: `2000`.
- **`group_token_threshold`** (`int | None`): Maximum token count per compression group when splitting messages for LLM compression. When set to a positive integer, long message sequences are split into smaller batches for compression. If `None` or `0`, all messages are compressed in a single group. Use this to control the granularity of compression operations. Default: `None`.
- **`keep_recent_count`** (`int`): Number of most recent messages to preserve without compression or compaction. These messages remain in full in the active context to maintain conversation continuity and provide usage examples for the agent. The example uses `1` for demonstration purposes; **in production, a value of `10` is recommended** to maintain better conversation flow. Default: `10`.
- **`store_dir`** (`str`): Directory path for storing offloaded message content and compressed history files. This is where external files containing full tool responses and compressed message history are saved. The directory will be created automatically if it doesn't exist. Default: `"inmemory"`.
- **`**kwargs`** (`Any`): Additional arguments passed to `ReMeApp` initialization. Use this to pass any extra configuration options supported by the underlying ReMe application.
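
The `"auto"` mode decision described for `compact_ratio_threshold` reduces to one comparison. The helper below is a hypothetical illustration of that rule, not ReMe's implementation; the zero-token guard is an added assumption.

```python
def needs_compression(
    original_tokens: int,
    compacted_tokens: int,
    compact_ratio_threshold: float = 0.75,
) -> bool:
    """Return True when compaction alone did not shrink the context enough."""
    if original_tokens <= 0:
        return False  # nothing to compress (guard added for this sketch)
    ratio = compacted_tokens / original_tokens
    return ratio > compact_ratio_threshold


# Compaction removed most of the tokens: ratio 0.3, so no LLM compression.
effective = needs_compression(original_tokens=20000, compacted_tokens=6000)
# Compaction barely helped: ratio 0.9, so compression runs afterwards.
ineffective = needs_compression(original_tokens=20000, compacted_tokens=18000)
print(effective, ineffective)
```

A lower threshold makes the LLM-based step run more often; a threshold near 1.0 leaves almost everything to compaction.
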
#### Parameter Relationships and Best Practices
- **Token Budget Strategy**: Set `max_total_tokens` to 20%-50% of your model's context window. For example, if your model has a 128K context window, set `max_total_tokens` between 25,600 and 64,000 tokens.
- **Compaction vs Compression**:
- Compaction is fast and lossless (full content is stored externally)
- Compression is slower but more aggressive (uses LLM to summarize)
- Use `"auto"` mode to benefit from both strategies
- **Recent Message Retention**: Higher `keep_recent_count` values (e.g., 10) provide better context continuity but consume more tokens. Lower values (e.g., 1) are more aggressive but may lose important recent context.
- **Tool Message Handling**: Adjust `max_tool_message_tokens` based on your typical tool response sizes. If your tools frequently return large outputs (e.g., file contents, API responses), consider a higher threshold or ensure compaction is enabled.
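
The 20%-50% budgeting rule can be checked with a few lines of arithmetic. `token_budget` is a hypothetical helper; it uses integer math so the bounds come out exact.

```python
def token_budget(
    context_window: int,
    low_pct: int = 20,
    high_pct: int = 50,
) -> tuple[int, int]:
    """Return the (lower, upper) bounds suggested for `max_total_tokens`."""
    return context_window * low_pct // 100, context_window * high_pct // 100


low, high = token_budget(128_000)
print(low, high)
```

For a 128,000-token window this gives the same 25,600 to 64,000 range as the example above.
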
### Code Flow Diagram
```mermaid
flowchart TD
A[Start: Load Environment] --> B[Create Toolkit]
B --> C[Register Tools: grep & read_file]
C --> D[Initialize LLM Model]
D --> E[Create ReMeShortTermMemory]
E --> F[Enter Async Context Manager]
F --> G[Add Initial Messages with Large Tool Response]
G --> H[Memory Auto-Compacts Large Content]
H --> I[Create ReActAgent with Memory]
I --> J[User Sends Query]
J --> K[Agent Uses Tools to Search/Read]
K --> L[Tool Responses Added to Memory]
L --> M{Memory Token Limit?}
M -->|Exceeded| N[Auto-Compact/Summarize]
M -->|OK| O[Agent Generates Response]
N --> O
O --> P[Return Response to User]
P --> Q[Exit Context Manager]
Q --> End[End]
style H fill:#e1f5ff
style N fill:#ffe1e1
style O fill:#e1ffe1
```
### Step-by-Step Code Walkthrough
The example demonstrates a complete workflow from tool registration to agent interaction. Here's a detailed breakdown:
#### 1. Environment Setup and Imports
```python
import asyncio
import os
from dotenv import load_dotenv
load_dotenv()
```
The code starts by loading environment variables (including the DashScope API key) from a `.env` file.
#### 2. Tool Registration
The example defines two custom tools that demonstrate how to integrate retrieval operations:
**`grep` Tool**: Searches for patterns in files using regular expressions
```python
from agentscope.message import TextBlock
from agentscope.tool import ToolResponse


async def grep(file_path: str, pattern: str, limit: str) -> ToolResponse:
    """A powerful search tool for finding patterns in files..."""
    from reme_ai.retrieve.working import GrepOp

    op = GrepOp()
    await op.async_call(file_path=file_path, pattern=pattern, limit=limit)
    # Wrap the operation's text output in a single text block
    return ToolResponse(
        content=[TextBlock(type="text", text=op.output)],
    )
```
**`read_file` Tool**: Reads specific line ranges from files
```python
async def read_file(file_path: str, offset: int, limit: int) -> ToolResponse:
    """Reads and returns the content of a specified file..."""
    from reme_ai.retrieve.working import ReadFileOp

    op = ReadFileOp()
    await op.async_call(file_path=file_path, offset=offset, limit=limit)
    return ToolResponse(
        content=[TextBlock(type="text", text=op.output)],
    )
```
> **Important Note on Tool Replaceability**:
> - The `grep` and `read_file` tools shown here are **example implementations** using ReMe's built-in operations
> - You can **replace them with your own retrieval tools**, such as:
> - Vector database embedding retrieval (e.g., ChromaDB, Pinecone, Weaviate)
> - Web search APIs (e.g., Google Search, Bing Search)
> - Database query tools (e.g., SQL queries, MongoDB queries)
> - Custom domain-specific search solutions
> - Similarly, the **offline write operations** (used internally by ReMeShortTermMemory to store compacted content) can be customized by modifying the `write_text_file` function in AgentScope's tool system
> - The key requirement is that your tools return `ToolResponse` objects with appropriate content blocks
#### 3. LLM Model Initialization
```python
llm = DashScopeChatModel(
    model_name="qwen3-coder-30b-a3b-instruct",
    api_key=os.environ.get("DASHSCOPE_API_KEY"),
    stream=False,
    generate_kwargs={
        "temperature": 0.001,
        "seed": 0,
    },
)
```
The model is configured with low temperature for consistent, deterministic responses. This same model will be used for both agent reasoning and memory summarization operations.
#### 4. Short-Term Memory Initialization
```python
short_term_memory = ReMeShortTermMemory(
    model=llm,
    working_summary_mode="auto",  # Automatic memory management
    compact_ratio_threshold=0.75,  # In auto mode, compress if compacted/original tokens > 0.75
    max_total_tokens=20000,  # Set to 20%-50% of model's context window
    max_tool_message_tokens=2000,  # Maximum tolerable tool response length
    group_token_threshold=None,  # Max tokens per LLM compression batch; None means no splitting
    keep_recent_count=1,  # Keep 1 recent message intact (set to 1 for demo; use 10 in production)
    store_dir="inmemory",  # Storage directory for offloaded content
)
```
This configuration enables automatic memory management that will:
- Monitor token usage
- Automatically compact large tool responses when they exceed `max_tool_message_tokens`
- Trigger summarization when total tokens exceed `max_total_tokens` and compaction ratio exceeds `compact_ratio_threshold`
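
The trigger conditions above can be sketched in plain Python. This is an illustrative reading of the documented parameters, not ReMe's actual implementation: the function name `should_compress` and the way the two conditions are combined are assumptions.

```python
def should_compress(
    original_tokens: int,
    compacted_tokens: int,
    max_total_tokens: int = 20000,
    compact_ratio_threshold: float = 0.75,
) -> bool:
    """Hypothetical helper: should LLM compression follow compaction?"""
    if original_tokens <= max_total_tokens:
        # The context is still within budget, so nothing needs summarizing.
        return False
    # If most tokens survived compaction, compaction alone was not enough.
    ratio = compacted_tokens / original_tokens
    return ratio > compact_ratio_threshold


# Compaction removed most of the context (ratio 0.1): no compression needed.
print(should_compress(original_tokens=50000, compacted_tokens=5000))  # False
# Compaction barely helped (ratio 0.9): fall back to LLM compression.
print(should_compress(original_tokens=50000, compacted_tokens=45000))  # True
```

In this reading, a low ratio means compaction already freed enough space, so the more expensive LLM summarization is skipped.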
#### 5. Async Context Manager Usage
```python
async with short_term_memory:
    # All memory operations happen here
```
The `async with` statement ensures proper initialization and cleanup of memory resources. This is the recommended approach for using `ReMeShortTermMemory`.
#### 6. Simulating Long Context
The example demonstrates memory compaction by adding a large tool response:
```python
# Read README content and multiply it 10 times to simulate a large response
with open("../../../../README.md", encoding="utf-8") as f:
    readme_content = f.read()

memories = [
    {
        "role": "user",
        "content": "Search for project information",
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [...],  # Tool call metadata
    },
    {
        "role": "tool",
        "content": readme_content * 10,  # Large tool response (10x README)
        "tool_call_id": "call_6596dafa2a6a46f7a217da",
    },
]
await short_term_memory.add(
    ReMeShortTermMemory.list_to_msg(memories),
    allow_duplicates=True,
)
```
`ReMeShortTermMemory` processes this content the next time `get_memory()` is called, which the agent does as part of its reasoning step. At that point it will:
1. Detect that the tool response exceeds `max_tool_message_tokens` (the maximum tolerable tool response length, set to 2000 in this example)
2. Compact it by storing the full content in an external file
3. Keep only a short preview in the active memory

All of this happens transparently, without manual intervention.
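
The compaction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not ReMe's implementation: `count_tokens`, `compact_tool_message`, the `preview_tokens` parameter, the file naming scheme, and the use of a plain dict as "external storage" are all hypothetical, and tokens are approximated by whitespace-separated words.

```python
from pathlib import PurePosixPath


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def compact_tool_message(
    message: dict,
    store: dict,
    max_tool_message_tokens: int = 2000,
    preview_tokens: int = 50,
    store_dir: str = "inmemory",
) -> dict:
    """Hypothetical helper: offload an oversized tool message, keep a preview."""
    content = message.get("content") or ""
    is_tool = message.get("role") == "tool"
    if not is_tool or count_tokens(content) <= max_tool_message_tokens:
        return message  # small enough (or not a tool message): keep as-is

    # Store the full content externally under a path derived from the call id.
    path = str(PurePosixPath(store_dir) / f"{message['tool_call_id']}.txt")
    store[path] = content
    # Keep only the first few words plus a pointer to the stored file.
    preview = " ".join(content.split()[:preview_tokens])
    return {**message, "content": f"{preview} ... [full content stored at {path}]"}


store: dict = {}
big = {"role": "tool", "tool_call_id": "call_1", "content": "word " * 5000}
compacted = compact_tool_message(big, store)
print(count_tokens(compacted["content"]))  # 56
print(list(store))  # ['inmemory/call_1.txt']
```

The pointer left in the preview is what lets a retrieval tool such as `grep` or `read_file` find the full content again later.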
#### 7. ReAct Agent Creation
```python
agent = ReActAgent(
    name="react",
    sys_prompt=(
        "You are a helpful assistant. "
        "Tool calls may be cached locally. "
        "You can first use `Grep` to match keywords or regular "
        "expressions to find line numbers, then use `ReadFile` "
        "to read the code near that location. "
        # ... more instructions
    ),
    model=llm,
    formatter=DashScopeChatFormatter(),
    toolkit=toolkit,
    memory=short_term_memory,  # Memory is integrated here
    max_iters=20,
)
```
The agent is configured with:
- The same LLM model used for memory operations
- The toolkit containing `grep` and `read_file` tools
- The `short_term_memory` instance for automatic context management
- A system prompt that guides the agent on tool usage patterns
#### 8. Agent Interaction
```python
msg = Msg(
    role="user",
    content=(
        "In the project documentation, who is the first author "
        "of the agentscope_v1 paper?"
    ),
    name="user",
)
msg = await agent(msg)
print(f"✓ Agent response: {msg.get_text_content()}\n")
```
When the agent processes this message:
1. It receives the user query
2. Decides to use tools (e.g., `grep` to search for "agentscope_v1")
3. Tool responses are automatically added to memory
4. If memory grows too large, automatic compaction occurs
5. The agent generates a response based on the managed context
6. The response is returned to the user
### Complete Example Code Structure
```python
async def main() -> None:
    # 1. Create toolkit and register tools
    toolkit = Toolkit()
    toolkit.register_tool_function(grep)
    toolkit.register_tool_function(read_file)

    # 2. Initialize LLM
    llm = DashScopeChatModel(...)

    # 3. Create short-term memory
    short_term_memory = ReMeShortTermMemory(...)

    # 4. Use async context manager
    async with short_term_memory:
        # 5. Add initial messages (with large content to demo compaction)
        await short_term_memory.add(messages, allow_duplicates=True)

        # 6. Create agent with memory
        agent = ReActAgent(..., memory=short_term_memory, ...)

        # 7. Interact with agent
        response = await agent(user_message)
```
### Key Takeaways
1. **Automatic Memory Management**: Memory compaction and summarization happen automatically when thresholds are exceeded
2. **Tool Integration**: Tools return `ToolResponse` objects that are seamlessly integrated into memory
3. **Async Context Manager**: Always use `async with short_term_memory:` to ensure proper resource management
4. **Flexible Tool System**: The `grep` and `read_file` tools are examples—you can replace them with any retrieval mechanism that fits your use case
5. **Transparent Operation**: Memory management is transparent to the agent—it just sees a clean, focused context
### Using Async Context Manager
`ReMeShortTermMemory` implements the async context manager protocol, which ensures proper initialization and cleanup of resources. There are two ways to use it:
#### Recommended: Using `async with` Statement
The recommended approach is to use the `async with` statement, which automatically handles resource management:
```python
async with short_term_memory:
    # Memory is initialized here
    await short_term_memory.add(messages)
    response = await agent(msg)
# Memory is automatically cleaned up when exiting the block
```
#### Alternative: Manual `__aenter__` and `__aexit__` Calls
You can also manually call `__aenter__` and `__aexit__` if you need more control:
```python
# Manually initialize
await short_term_memory.__aenter__()
try:
    # Use the memory
    await short_term_memory.add(messages)
    response = await agent(msg)
finally:
    # Manually cleanup
    await short_term_memory.__aexit__(None, None, None)
```
> **Note**: It's recommended to use the `async with` statement as it ensures proper resource cleanup even if an exception occurs.
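
To see why `async with` guarantees cleanup, here is a small self-contained sketch of the async context manager protocol. `DemoResource` is a hypothetical stand-in and has nothing to do with ReMe's internals. It only shows that `__aexit__` runs even when the block raises.

```python
import asyncio


class DemoResource:
    """Hypothetical stand-in for an object with async setup and teardown."""

    def __init__(self) -> None:
        self.started = False

    async def __aenter__(self) -> "DemoResource":
        self.started = True
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
        # Runs on normal exit and on exceptions; returning None re-raises.
        self.started = False


async def demo() -> bool:
    resource = DemoResource()
    try:
        async with resource:
            raise RuntimeError("simulated failure inside the block")
    except RuntimeError:
        pass
    return resource.started


print(asyncio.run(demo()))  # False: cleanup ran despite the exception
```

The manual `__aenter__`/`__aexit__` variant needs an explicit `try`/`finally` to get the same guarantee, which is why the `async with` form is usually the safer choice.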
## Advanced Configuration
You can customize the ReMe config by passing a config path:
```python
short_term_memory = ReMeShortTermMemory(
model=llm,
reme_config_path="path/to/your/config.yaml", # Pass your custom ReMe configuration
# ... other parameters
)
```
For more configuration options, refer to the [ReMe documentation](https://github.com/agentscope-ai/ReMe).
## What's in the Example
The `short_term_memory_example.py` file demonstrates:
1. **Tool Integration**: Registering `grep` and `read_file` tools for searching and reading files
2. **Memory Initialization**: Setting up ReMeShortTermMemory with appropriate parameters for handling long contexts
3. **Long Context Handling**: Adding a large tool response (README content × 10) to demonstrate automatic memory compaction
4. **ReAct Agent Usage**: Using the agent with short-term memory to answer questions based on retrieved information
## Example Workflow
The example shows a typical workflow:
1. User asks to search for project information
2. Agent uses `grep` tool to find relevant content
3. Agent uses `read_file` tool to read specific sections
4. Large tool responses are automatically compacted by the memory system
5. Agent answers the user's question based on the retrieved information


@@ -0,0 +1,349 @@
# -*- coding: utf-8 -*-
"""ReMe-based short-term memory implementation for AgentScope."""
import json
from pathlib import Path
from typing import Any, List
from uuid import uuid4
from agentscope import logger
from agentscope._utils._common import _json_loads_with_repair
from agentscope.formatter import DashScopeChatFormatter, OpenAIChatFormatter
from agentscope.memory import InMemoryMemory
from agentscope.message import Msg, TextBlock, ToolUseBlock, ToolResultBlock
from agentscope.model import DashScopeChatModel, OpenAIChatModel
from agentscope.tool import write_text_file


class ReMeShortTermMemory(InMemoryMemory):
    """Short-term memory implementation using ReMe for message management.

    This class provides automatic working-memory management through a
    multi-stage pipeline that reduces token usage while preserving
    essential information:

    1. **Compaction**: Truncates large tool messages by storing full
       content in external files and keeping only short previews in the
       active context.
    2. **Compression**: Uses LLM to generate dense summaries of older
       conversation history, creating a compact state snapshot.
    3. **Offload**: Orchestrates compaction and optional compression
       based on the configured working_summary_mode (COMPACT, COMPRESS,
       or AUTO).

    The memory management is triggered automatically when `get_memory()`
    is called, ensuring the agent's context stays within token limits
    while maintaining access to detailed historical information through
    external storage.
    """

    def __init__(
        self,
        model: DashScopeChatModel | OpenAIChatModel | None = None,
        reme_config_path: str | None = None,
        working_summary_mode: str = "auto",
        compact_ratio_threshold: float = 0.75,
        max_total_tokens: int = 20000,
        max_tool_message_tokens: int = 2000,
        group_token_threshold: int | None = None,
        keep_recent_count: int = 10,
        store_dir: str = "inmemory",
        **kwargs: Any,
    ) -> None:
        """Initialize ReMe-based short-term memory.

        Args:
            model: Language model for compression operations. Must be
                either DashScopeChatModel or OpenAIChatModel.
            reme_config_path: Optional path to ReMe configuration file
                for custom settings.
            working_summary_mode: Strategy for working memory management.
                - "compact": Only compact verbose tool messages by
                  storing full content externally and keeping short
                  previews.
                - "compress": Only apply LLM-based compression to
                  generate compact state snapshots.
                - "auto": First run compaction, then optionally run
                  compression if the compaction ratio exceeds
                  compact_ratio_threshold.
                Defaults to "auto".
            compact_ratio_threshold: Threshold for compaction
                effectiveness in AUTO mode. If (compacted_tokens /
                original_tokens) > this threshold, compression is
                applied. Defaults to 0.75.
            max_total_tokens: Maximum token count threshold before
                compression is triggered. Does not include
                keep_recent_count messages or system messages.
                Defaults to 20000.
            max_tool_message_tokens: Maximum token count for individual
                tool messages before compaction. Tool messages exceeding
                this are stored externally. Defaults to 2000.
            group_token_threshold: Maximum token count per compression
                group when splitting messages for LLM compression. If
                None or 0, all messages are compressed in a single
                group. Defaults to None.
            keep_recent_count: Number of most recent messages to
                preserve without compression or compaction. These
                messages remain in full in the active context.
                Defaults to 10.
            store_dir: Directory path for storing offloaded message
                content and compressed history files. Defaults to
                "inmemory".
            **kwargs: Additional arguments passed to ReMeApp
                initialization.

        Raises:
            ValueError: If model is not a DashScopeChatModel or
                OpenAIChatModel.
            ImportError: If reme_ai library is not installed.
        """
        super().__init__()

        # Store working memory parameters
        self.working_summary_mode = working_summary_mode
        self.compact_ratio_threshold = compact_ratio_threshold
        self.max_total_tokens = max_total_tokens
        self.max_tool_message_tokens = max_tool_message_tokens
        self.group_token_threshold = group_token_threshold
        self.keep_recent_count = keep_recent_count
        self.store_dir = store_dir

        # Derive API credentials and the matching formatter from the model
        config_args = []
        if isinstance(model, DashScopeChatModel):
            llm_api_base = "https://dashscope.aliyuncs.com/compatible-mode/v1"
            llm_api_key = model.api_key
            self.formatter = DashScopeChatFormatter()
        elif isinstance(model, OpenAIChatModel):
            llm_api_base = str(getattr(model.client, "base_url", None))
            llm_api_key = str(getattr(model.client, "api_key", None))
            self.formatter = OpenAIChatFormatter()
        else:
            raise ValueError(
                "model must be a DashScopeChatModel or "
                "OpenAIChatModel instance. "
                f"Got {type(model).__name__} instead.",
            )

        llm_model_name = model.model_name
        if llm_model_name:
            config_args.append(f"llm.default.model_name={llm_model_name}")

        try:
            from reme_ai import ReMeApp
        except ImportError as e:
            raise ImportError(
                "The 'reme_ai' library is required for ReMe-based "
                "short-term memory. Please try `pip install reme-ai`, "
                "and visit: https://github.com/agentscope-ai/ReMe for more "
                "information.",
            ) from e

        self.app = ReMeApp(
            *config_args,
            llm_api_key=llm_api_key,
            llm_api_base=llm_api_base,
            embedding_api_key=llm_api_key,  # placeholder (reuses LLM key)
            embedding_api_base=llm_api_base,  # placeholder (reuses LLM base)
            config_path=reme_config_path,
            **kwargs,
        )
        self._app_started = False

    async def __aenter__(self) -> "ReMeShortTermMemory":
        """Async context manager entry.

        Initializes the ReMe application for async operations.
        """
        if self.app is not None:
            await self.app.__aenter__()
            self._app_started = True
        return self

    async def __aexit__(
        self,
        exc_type: Any = None,
        exc_val: Any = None,
        exc_tb: Any = None,
    ) -> None:
        """Async context manager exit.

        Cleans up the ReMe application resources.
        """
        if self.app is not None:
            await self.app.__aexit__(exc_type, exc_val, exc_tb)
        self._app_started = False

    async def get_memory(self) -> list[Msg]:
        """Retrieve and manage working memory with automatic summarization.

        This method performs the core working-memory management pipeline:

        1. **Format messages**: Converts internal Msg objects to standard
           message format using the appropriate formatter (DashScope or
           OpenAI).
        2. **Execute offload pipeline**: Calls ReMe's
           summary_working_memory_for_as operation which orchestrates:
           - Message compaction: Large tool messages are truncated and
             stored externally with only previews kept in context.
           - Message compression: If needed (based on
             working_summary_mode), older messages are compressed using
             LLM into dense summaries.
           - File storage: Offloaded content is written to external
             files for potential retrieval.
        3. **Update content**: Replaces the internal message list with
           the managed version, ensuring subsequent operations work with
           the optimized context.

        The operation respects configuration parameters like
        max_total_tokens, keep_recent_count, and working_summary_mode to
        balance context size with information preservation.

        Returns:
            List of Msg objects representing the managed working memory,
            with large tool messages compacted and/or older history
            compressed as needed.

        Note:
            This method automatically writes offloaded content to files
            in the configured store_dir. The write_file_dict metadata
            contains paths and content for all externally stored
            messages.
        """
        messages: list[dict[str, Any]] = await self.formatter.format(
            msgs=self.content,  # type: ignore[has-type]
        )

        # List-type content is not passed on to ReMe: it is replaced with
        # an empty string (the message itself is kept).
        for message in messages:
            if isinstance(message.get("content"), list):
                msg_content = message.get("content")
                logger.warning(
                    "Skipping message with content as list. content=%s",
                    msg_content,
                )
                message["content"] = ""

        # Execute ReMe's working memory offload pipeline
        # This orchestrates compaction and/or compression based on
        # working_summary_mode
        result: dict = await self.app.async_execute(
            name="summary_working_memory_for_as",
            messages=messages,
            working_summary_mode=self.working_summary_mode,
            compact_ratio_threshold=self.compact_ratio_threshold,
            max_total_tokens=self.max_total_tokens,
            max_tool_message_tokens=self.max_tool_message_tokens,
            group_token_threshold=self.group_token_threshold,
            keep_recent_count=self.keep_recent_count,
            store_dir=self.store_dir,
            chat_id=uuid4().hex,
        )
        logger.info(
            "summary_working_memory_for_as.result=%s",
            json.dumps(result, ensure_ascii=False, indent=2),
        )

        # Extract managed messages and file write operations from result
        messages = result.get("answer", [])
        write_file_dict: dict = result.get("metadata", {}).get(
            "write_file_dict",
            {},
        )

        # Write offloaded content to external files
        # This includes full tool message content and compressed message
        # history
        if write_file_dict:
            for path, content_str in write_file_dict.items():
                file_dir = Path(path).parent
                if not file_dir.exists():
                    file_dir.mkdir(parents=True, exist_ok=True)
                await write_text_file(path, content_str)

        # Update internal content with managed messages
        self.content = self.list_to_msg(messages)
        return self.content

    @staticmethod
    def list_to_msg(messages: list[dict[str, Any]]) -> list[Msg]:
        """Convert a list of message dictionaries to Msg objects.

        This method handles the conversion from standard message format
        (used by ReMe and LLM APIs) back to AgentScope's Msg objects.
        It properly handles:
        - Text content for user, system, and assistant messages
        - Tool result blocks (converting role="tool" to role="system")
        - Tool use blocks from tool_calls in assistant messages

        Args:
            messages: List of message dictionaries with role, content,
                and optional tool_calls or tool-related fields.

        Returns:
            List of Msg objects with properly structured content blocks.
        """
        msg_list: list[Msg] = []
        for msg_dict in messages:
            role = msg_dict["role"]
            content_blocks: List[
                TextBlock | ToolUseBlock | ToolResultBlock
            ] = []
            content = msg_dict.get("content")

            # Convert text content to appropriate content blocks
            if content:
                if role in ["user", "system", "assistant"]:
                    content_blocks.append(TextBlock(type="text", text=content))
                elif role in ["tool"]:
                    # Tool messages are converted to system messages with
                    # ToolResultBlock
                    role = "system"
                    content_blocks.append(
                        ToolResultBlock(
                            type="tool_result",
                            name=msg_dict.get("name"),
                            id=msg_dict.get("tool_call_id"),
                            output=[TextBlock(type="text", text=content)],
                        ),
                    )

            # Convert tool_calls to ToolUseBlock content blocks
            if msg_dict.get("tool_calls"):
                for tool_call in msg_dict["tool_calls"]:
                    # Parse tool arguments with repair for malformed JSON
                    input_ = _json_loads_with_repair(
                        tool_call["function"].get(
                            "arguments",
                            "{}",
                        )
                        or "{}",
                    )
                    content_blocks.append(
                        ToolUseBlock(
                            type="tool_use",
                            name=tool_call["function"]["name"],
                            input=input_,
                            id=tool_call["id"],
                        ),
                    )

            msg_obj = Msg(
                name=role,
                content=content_blocks,
                role=role,
                metadata=msg_dict.get("metadata"),
            )
            msg_list.append(msg_obj)
        return msg_list

    async def retrieve(self, *args: Any, **kwargs: Any) -> None:
        """Retrieve operation is not implemented for ReMe short-term memory.

        ReMe focuses on working memory management (compaction and compression)
        rather than retrieval from long-term storage.

        Raises:
            NotImplementedError: This operation is not supported.
        """
        raise NotImplementedError


@@ -0,0 +1,188 @@
# -*- coding: utf-8 -*-
"""Example demonstrating ReMeShortTermMemory usage with ReActAgent."""
# noqa: E402
import asyncio
import os
from dotenv import load_dotenv
from agentscope.agent import ReActAgent
from agentscope.formatter import DashScopeChatFormatter
from agentscope.message import Msg, TextBlock
from agentscope.model import DashScopeChatModel
from agentscope.tool import ToolResponse, Toolkit, view_text_file
load_dotenv()


async def main() -> None:
    """Main function demonstrating ReMeShortTermMemory with tool usage."""
    from reme_short_term_memory import ReMeShortTermMemory

    toolkit = Toolkit()

    async def grep(file_path: str, pattern: str, limit: str) -> ToolResponse:
        """A powerful search tool for finding patterns in files using regular
        expressions.

        Supports full regex syntax (e.g., "log.*Error", "function\\s+\\w+"),
        glob pattern filtering, and result limiting. Ideal for searching code
        or text content across multiple files.

        Args:
            file_path (`str`):
                The path to the file to search in. Can be an absolute or
                relative path.
            pattern (`str`):
                The search pattern or regular expression to match. Supports
                full regex syntax for complex pattern matching.
            limit (`str`):
                The maximum number of matching results to return. Use this to
                control output size for large files. Should not exceed 50.
        """
        from reme_ai.retrieve.working import GrepOp

        op = GrepOp()
        await op.async_call(file_path=file_path, pattern=pattern, limit=limit)
        return ToolResponse(
            content=[
                TextBlock(
                    type="text",
                    text=op.output,
                ),
            ],
        )

    async def read_file(
        file_path: str,
        offset: int,
        limit: int,
    ) -> ToolResponse:
        """Reads and returns the content of a specified file.

        For text files, it can read specific line ranges using the 'offset' and
        'limit' parameters. Use offset and limit to paginate through large
        files.

        Note: It's recommended to use the `grep` tool first to locate the line
        numbers of interest before calling this function.

        Args:
            file_path (`str`):
                The path to the file to read. Can be an absolute or relative
                path.
            offset (`int`):
                The starting line number to read from (0-indexed). Use this to
                skip to a specific position in the file.
            limit (`int`):
                The maximum number of lines to read from the offset position.
                Helps control memory usage when reading large files. Should
                not exceed 100.
        """
        return await view_text_file(file_path, ranges=[offset, offset + limit])

    # These two tools are provided as examples. You can replace them with your
    # own retrieval tools, such as vector database embedding retrieval or other
    # search solutions that fit your use case.
    toolkit.register_tool_function(grep)
    toolkit.register_tool_function(read_file)

    llm = DashScopeChatModel(
        model_name="qwen3-max",
        # model_name="qwen3-coder-30b-a3b-instruct",
        api_key=os.environ.get("DASHSCOPE_API_KEY"),
        stream=False,
        generate_kwargs={
            "temperature": 0.001,
            "seed": 0,
        },
    )

    short_term_memory = ReMeShortTermMemory(
        model=llm,
        working_summary_mode="auto",
        compact_ratio_threshold=0.75,
        max_total_tokens=20000,
        max_tool_message_tokens=2000,
        group_token_threshold=None,  # Max tokens per compression batch
        keep_recent_count=1,  # Set to 1 for demo; use 10 in production
        store_dir="inmemory",
    )

    async with short_term_memory:
        # Simulate ultra long context
        with open("../../../../README.md", encoding="utf-8") as f:
            readme_content = f.read()

        memories = [
            {
                "role": "user",
                "content": "Search for project information",
            },
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "index": 0,
                        "id": "call_6596dafa2a6a46f7a217da",
                        "function": {
                            "arguments": "{}",
                            "name": "web_search",
                        },
                        "type": "function",
                    },
                ],
            },
            {
                "role": "tool",
                "content": readme_content * 10,
                "tool_call_id": "call_6596dafa2a6a46f7a217da",
            },
        ]
        await short_term_memory.add(
            ReMeShortTermMemory.list_to_msg(memories),
            allow_duplicates=True,
        )

        agent = ReActAgent(
            name="react",
            sys_prompt=(
                "You are a helpful assistant. "
                "Tool calls may be cached locally. "
                "You can first use `Grep` to match keywords or regular "
                "expressions to find line numbers, then use `ReadFile` "
                "to read the code near that location. "
                "If no matches are found, never give up trying - try "
                "other parameters or relax the matching conditions, such "
                "as searching for only partial keywords. "
                "After `Grep`, you can use the `ReadFile` command to "
                "view content starting from a specified offset position "
                "`offset` with length `limit`. "
                "The maximum limit is 100. "
                "If the current content is insufficient, the `ReadFile` "
                "command can continuously try different `offset` and "
                "`limit` parameters."
            ),
            model=llm,
            formatter=DashScopeChatFormatter(),
            toolkit=toolkit,
            memory=short_term_memory,
            max_iters=20,
        )

        msg = Msg(
            role="user",
            content=(
                "In the project documentation, who is the first author "
                "of the agentscope_v1 paper?"
            ),
            name="user",
        )
        msg = await agent(msg)
        print(f"✓ Agent response: {msg.get_text_content()}\n")


if __name__ == "__main__":
    asyncio.run(main())