chore: initialize sandbox and overwrite remote content
Some checks failed
Pre-commit / run (ubuntu-latest) (push) Has been cancelled
Deploy Sphinx documentation to Pages / build_en (ubuntu-latest, 3.10) (push) Has been cancelled
Deploy Sphinx documentation to Pages / build_zh (ubuntu-latest, 3.10) (push) Has been cancelled
Python Unittest Coverage / test (macos-15, 3.10) (push) Has been cancelled
Python Unittest Coverage / test (macos-15, 3.11) (push) Has been cancelled
Python Unittest Coverage / test (macos-15, 3.12) (push) Has been cancelled
Python Unittest Coverage / test (ubuntu-latest, 3.10) (push) Has been cancelled
Python Unittest Coverage / test (ubuntu-latest, 3.11) (push) Has been cancelled
Python Unittest Coverage / test (ubuntu-latest, 3.12) (push) Has been cancelled
Python Unittest Coverage / test (windows-latest, 3.10) (push) Has been cancelled
Python Unittest Coverage / test (windows-latest, 3.11) (push) Has been cancelled
Python Unittest Coverage / test (windows-latest, 3.12) (push) Has been cancelled
This commit is contained in:
@@ -0,0 +1,422 @@
|
||||
# AlibabaCloud MySQL Vector Store Example
|
||||
|
||||
This example demonstrates how to use the `AlibabaCloudMySQLStore` class in AgentScope's RAG system for vector storage and similarity search operations using AlibabaCloud MySQL (RDS) with native vector functions.
|
||||
|
||||
## Features
|
||||
|
||||
AlibabaCloudMySQLStore provides:
|
||||
- Vector storage using MySQL's native VECTOR data type
|
||||
- Automatic vector index creation (declared inline in the `CREATE TABLE` statement) based on the configured distance metric
|
||||
- Vector functions (VEC_FROMTEXT, VEC_DISTANCE_COSINE, VEC_DISTANCE_EUCLIDEAN)
|
||||
- Database-level distance calculation and sorting via ORDER BY
|
||||
- Two distance metrics: COSINE and EUCLIDEAN (supported by AlibabaCloud MySQL)
|
||||
- Metadata filtering support
|
||||
- CRUD operations (Create, Read, Update, Delete)
|
||||
- Support for chunked documents
|
||||
- Direct access to the underlying MySQL connection for advanced operations
|
||||
- Full integration with AlibabaCloud RDS MySQL instances
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### 1. AlibabaCloud RDS MySQL Instance
|
||||
|
||||
You need an AlibabaCloud RDS MySQL instance with vector support:
|
||||
|
||||
- **Version**: MySQL 8.0+
|
||||
- **Vector Plugin**: Ensure the vector search plugin is enabled (check that the `vidx_disabled` parameter is OFF)
|
||||
- **Network Access**: Configure security group and whitelist to allow access
|
||||
|
||||
#### Create RDS MySQL Instance on AlibabaCloud:
|
||||
|
||||
1. Go to [AlibabaCloud RDS Console](https://rdsnext.console.aliyun.com/)
|
||||
2. Click "Create Instance"
|
||||
3. Select MySQL 8.0 or higher
|
||||
4. Configure specifications based on your needs
|
||||
5. Set up network and security settings
|
||||
6. Note down the connection endpoint (e.g., `rm-xxxxx.mysql.rds.aliyuncs.com`)
|
||||
|
||||
#### Configure Database:
|
||||
|
||||
```sql
-- Connect to your RDS MySQL instance first. Run this from your shell, not inside the SQL session:
--   mysql -h rm-xxxxx.mysql.rds.aliyuncs.com -P 3306 -u your_username -p

-- Check if vector capability is enabled (vidx_disabled should be OFF)
SHOW VARIABLES LIKE 'vidx_disabled';
-- Expected result: vidx_disabled | OFF
-- If OFF, vector capability is enabled
-- If ON, contact AlibabaCloud support to enable vector search plugin

-- Create database
CREATE DATABASE agentscope_test;

-- Use the database
USE agentscope_test;

-- Verify vector functions are available
SELECT VEC_FROMTEXT('[1,2,3]');
```
|
||||
|
||||
### 2. Python Dependencies
|
||||
|
||||
```bash
|
||||
pip install mysql-connector-python agentscope
|
||||
```
|
||||
|
||||
### 3. Network Configuration
|
||||
|
||||
Ensure your local machine or server can access the RDS instance:
|
||||
- Add your IP to the RDS whitelist
|
||||
- Configure security group rules
|
||||
- Use SSL connection if required
|
||||
|
||||
## Configuration
|
||||
|
||||
Update the connection parameters in `main.py`:
|
||||
|
||||
```python
store = AlibabaCloudMySQLStore(
    host="rm-xxxxx.mysql.rds.aliyuncs.com",  # Your RDS endpoint
    port=3306,
    user="your_username",  # Your RDS username
    password="your_password",  # Your RDS password
    database="agentscope_test",
    table_name="test_vectors",
    dimensions=768,  # Set to your embedding dimension
    distance="COSINE",
    # Optional: SSL configuration
    # connection_kwargs={
    #     "ssl_ca": "/path/to/ca.pem",
    #     "ssl_verify_cert": True,
    # }
)
```
|
||||
|
||||
## Running the Example
|
||||
|
||||
```bash
|
||||
python main.py
|
||||
```
|
||||
|
||||
## Example Tests
|
||||
|
||||
The example includes three comprehensive tests:
|
||||
|
||||
### 1. Basic CRUD Operations
|
||||
- Initialize AlibabaCloudMySQLStore
|
||||
- Add documents with embeddings
|
||||
- Search for similar documents
|
||||
- Delete documents
|
||||
- Get the underlying MySQL connection
|
||||
|
||||
### 2. Search with Metadata Filtering
|
||||
- Add documents with different categories
|
||||
- Search with and without filters
|
||||
- Use SQL WHERE clauses for filtering
|
||||
|
||||
### 3. Different Distance Metrics
|
||||
- Test COSINE similarity (best for normalized vectors)
|
||||
- Test EUCLIDEAN distance (best for absolute distance)
|
||||
|
||||
## Key Features Explained
|
||||
|
||||
### Distance Metrics
|
||||
|
||||
AlibabaCloud MySQL supports two distance metrics:
|
||||
|
||||
- **COSINE**: Cosine distance, i.e. 1 minus the cosine of the angle between the vectors. Values range from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction). Best for text embeddings and normalized vectors.
- **EUCLIDEAN**: Measures the straight-line Euclidean distance between vectors. Lower values indicate greater similarity. Best when the absolute magnitude of the vectors matters.
|
||||
|
||||
**Note**: Unlike some other vector databases, AlibabaCloud MySQL currently only supports COSINE and EUCLIDEAN distance functions. Inner Product (IP) is not supported.
|
||||
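To make the two metrics concrete, here is a minimal pure-Python sketch of both distances. It assumes the database functions follow the usual definitions (cosine distance as 1 minus cosine similarity, consistent with the 0 to 2 range stated above); the function names `cosine_distance` and `euclidean_distance` are illustrative and not part of AgentScope or MySQL.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cos(angle between a and b): 0 = same direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Return the straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [0.1, 0.2, 0.3, 0.4]
print(cosine_distance(v, v))                       # approximately 0
print(cosine_distance(v, [-x for x in v]))         # approximately 2
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Note that cosine distance ignores vector length, while Euclidean distance does not, which is why the former is usually preferred for text embeddings.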
|
||||
### Metadata Filtering
|
||||
|
||||
Use SQL WHERE clauses to filter search results:
|
||||
|
||||
```python
results = await store.search(
    query_embedding=embedding,
    limit=10,
    filter='doc_id LIKE "ai%" AND chunk_id > 0',
)
```
|
||||
|
||||
### Table Structure
|
||||
|
||||
The implementation automatically creates a table with the following structure:
|
||||
|
||||
```sql
CREATE TABLE IF NOT EXISTS table_name (
    id VARCHAR(255) PRIMARY KEY,
    embedding VECTOR(dimensions) NOT NULL,
    doc_id VARCHAR(255) NOT NULL,
    chunk_id INT NOT NULL,
    content TEXT NOT NULL,
    total_chunks INT NOT NULL,
    INDEX idx_doc_id (doc_id),
    INDEX idx_chunk_id (chunk_id),
    VECTOR INDEX (embedding) M=16 DISTANCE=cosine -- or DISTANCE=euclidean
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
|
||||
|
||||
**Note**: The vector index is created directly within the `CREATE TABLE` statement, not as a separate SQL command. The `M` parameter controls the HNSW algorithm's graph connectivity (default: 16).
|
||||
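As an illustration of how such a statement could be assembled from the store's parameters, the sketch below builds the DDL string in Python. The helper name `build_create_table_sql` and its validation rules are assumptions made for this example; the actual `AlibabaCloudMySQLStore` implementation may build the statement differently.

```python
import re


def build_create_table_sql(
    table_name: str,
    dimensions: int,
    distance: str = "COSINE",
    m: int = 16,
) -> str:
    """Return DDL shaped like the statement above (hypothetical helper)."""
    # Identifiers cannot be bound as query parameters, so validate them.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    if dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    metric = distance.lower()
    if metric not in {"cosine", "euclidean"}:
        raise ValueError("distance must be COSINE or EUCLIDEAN")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n"
        "    id VARCHAR(255) PRIMARY KEY,\n"
        f"    embedding VECTOR({dimensions}) NOT NULL,\n"
        "    doc_id VARCHAR(255) NOT NULL,\n"
        "    chunk_id INT NOT NULL,\n"
        "    content TEXT NOT NULL,\n"
        "    total_chunks INT NOT NULL,\n"
        "    INDEX idx_doc_id (doc_id),\n"
        "    INDEX idx_chunk_id (chunk_id),\n"
        f"    VECTOR INDEX (embedding) M={m} DISTANCE={metric}\n"
        ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"
    )


print(build_create_table_sql("test_vectors", 4, "EUCLIDEAN"))
```

Validating the table name up front matters because identifiers are interpolated into the SQL text rather than passed as bound parameters.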
|
||||
### Performance Considerations
|
||||
|
||||
- **VECTOR Data Type**: Uses MySQL's native VECTOR type for efficient storage
|
||||
- **Vector Index**: Automatically creates a vector index with the specified distance metric for fast similarity search
|
||||
- **Database-Level Distance Calculation**: Vector distance calculations are performed at the database level using MySQL's native vector functions (VEC_DISTANCE_COSINE, VEC_DISTANCE_EUCLIDEAN), with sorting done via SQL ORDER BY
|
||||
- **Native Vector Support**: AlibabaCloud RDS MySQL 8.0+ with the vector plugin enabled provides built-in vector functions, so no client-side distance computation is needed
|
||||
- **Supported Distance Metrics**: Only COSINE and EUCLIDEAN are supported
|
||||
- **Small to Medium Datasets**: AlibabaCloudMySQLStore performs well for datasets up to 100K vectors
|
||||
- **Large Datasets**: For datasets with millions of vectors, consider using dedicated vector databases (e.g., Milvus, Qdrant) with specialized indexing (HNSW, IVF, etc.)
|
||||
- **RDS Performance**: Leverage AlibabaCloud RDS features like read replicas, backup, and monitoring
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### Direct Database Access
|
||||
|
||||
```python
# Get the MySQL connection for advanced operations
conn = store.get_client()
cursor = conn.cursor()

# Execute custom SQL queries
cursor.execute("SELECT COUNT(*) FROM test_vectors")
count = cursor.fetchone()
print(f"Total vectors: {count[0]}")
```
|
||||
|
||||
### Using MySQL Native Vector Functions
|
||||
|
||||
MySQL's native vector functions can be used directly in SQL queries:
|
||||
|
||||
```python
conn = store.get_client()
cursor = conn.cursor()

# Use MySQL native vector functions directly.
# `embedding` is the vector column defined in the table structure above.
query_vector = "[0.1,0.2,0.3,0.4]"
cursor.execute("""
    SELECT
        doc_id,
        VEC_DISTANCE_COSINE(embedding, VEC_FROMTEXT(%s)) AS distance
    FROM test_vectors
    ORDER BY distance ASC
    LIMIT 10
""", (query_vector,))

results = cursor.fetchall()

# Available MySQL vector functions in AlibabaCloud:
# - VEC_FROMTEXT(text) - Convert text "[1,2,3]" to vector
# - VEC_DISTANCE_COSINE(v1, v2) - Cosine distance
# - VEC_DISTANCE_EUCLIDEAN(v1, v2) - Euclidean distance
```
|
||||
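`VEC_FROMTEXT` expects the query vector as bracketed text such as `"[0.1,0.2,0.3,0.4]"`. A small sketch of producing that text from a Python list follows; the helper name `to_vector_text` is hypothetical and the finite-value check is an extra safeguard assumed for this example, not something the source prescribes.

```python
import math


def to_vector_text(embedding: list[float]) -> str:
    """Format floats as "[x1,x2,...]", the text form passed to VEC_FROMTEXT."""
    values = [float(x) for x in embedding]
    if not values:
        raise ValueError("embedding must not be empty")
    if not all(math.isfinite(x) for x in values):
        raise ValueError("embedding must contain only finite numbers")
    return "[" + ",".join(repr(x) for x in values) + "]"


print(to_vector_text([0.1, 0.2, 0.3, 0.4]))  # [0.1,0.2,0.3,0.4]
```

The resulting string can then be passed as the bound `%s` parameter, which keeps the vector out of the SQL text itself.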
|
||||
### SSL Connection
|
||||
|
||||
For secure connections to AlibabaCloud RDS:
|
||||
|
||||
```python
store = AlibabaCloudMySQLStore(
    host="rm-xxxxx.mysql.rds.aliyuncs.com",
    port=3306,
    user="your_username",
    password="your_password",
    database="agentscope_test",
    table_name="vectors",
    dimensions=768,
    distance="COSINE",
    connection_kwargs={
        "ssl_ca": "/path/to/ca.pem",
        "ssl_verify_cert": True,
        "ssl_verify_identity": True,
    },
)
```
|
||||
|
||||
### Batch Operations
|
||||
|
||||
```python
# Add large batches of documents
batch_size = 1000
for i in range(0, len(all_documents), batch_size):
    batch = all_documents[i:i + batch_size]
    await store.add(batch)
```
|
||||
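The slicing logic in the loop above can be factored into a small generator, which makes the batch boundaries easy to check in isolation. This is a sketch; `iter_batches` is an illustrative name, not an AgentScope function.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sizes = [len(batch) for batch in iter_batches(list(range(2500)), 1000)]
print(sizes)  # [1000, 1000, 500]
```

Each yielded batch could then be passed to `await store.add(batch)` exactly as in the loop above.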
|
||||
### Connection Pooling
|
||||
|
||||
```python
store = AlibabaCloudMySQLStore(
    host="rm-xxxxx.mysql.rds.aliyuncs.com",
    port=3306,
    user="your_username",
    password="your_password",
    database="agentscope_test",
    table_name="vectors",
    dimensions=768,
    distance="COSINE",
    connection_kwargs={
        "pool_name": "mypool",
        "pool_size": 10,
        "pool_reset_session": True,
    },
)
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### MySQL Version Check
|
||||
|
||||
Ensure your RDS MySQL version supports vector functions:
|
||||
|
||||
```sql
SELECT VERSION();
-- Should be MySQL 8.0 or higher

-- Check if vector capability is enabled (critical check)
SHOW VARIABLES LIKE 'vidx_disabled';
-- Expected result: vidx_disabled | OFF (vector capability enabled)

-- Test vector functions
SELECT VEC_FROMTEXT('[1,2,3]');
```
|
||||
|
||||
### Connection Errors
|
||||
|
||||
If you get connection errors:
|
||||
|
||||
1. **Check Whitelist**: Ensure your IP is in the RDS whitelist
|
||||
2. **Security Group**: Verify security group rules allow port 3306
|
||||
3. **Network Type**: Ensure you're using the correct endpoint (public/private)
|
||||
4. **Credentials**: Double-check username and password
|
||||
|
||||
```bash
# Test connection from command line
mysql -h rm-xxxxx.mysql.rds.aliyuncs.com -P 3306 -u your_username -p
```
|
||||
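When the `mysql` client is not installed, whitelist and security group problems can still be narrowed down by checking plain TCP reachability of the endpoint. The following sketch uses only the standard library; `can_reach` is a hypothetical helper name, and a successful TCP connection says nothing about credentials.

```python
import socket


def can_reach(host: str, port: int = 3306, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_reach("rm-xxxxx.mysql.rds.aliyuncs.com")` returning `False` points to a network-level issue (whitelist, security group, or wrong endpoint type) rather than a wrong username or password.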
|
||||
### Vector Function Errors
|
||||
|
||||
If you get errors about VEC_DISTANCE_COSINE or VECTOR type not being recognized:
|
||||
|
||||
1. **Check if vector capability is enabled**:

```sql
-- Check vidx_disabled parameter (must be OFF)
SHOW VARIABLES LIKE 'vidx_disabled';
-- Expected result: vidx_disabled | OFF
-- If ON, vector capability is disabled, contact AlibabaCloud support
```

2. **Verify MySQL version is 8.0 or higher**:

```sql
SELECT VERSION();
```

3. **Test vector functions availability**:

```sql
-- Check if vector functions are available
SELECT VEC_FROMTEXT('[1,2,3]');

-- Check if VECTOR type is supported
CREATE TABLE test_vector (v VECTOR(3));
DROP TABLE test_vector;

-- List vector indexes
SHOW INDEX FROM your_table_name WHERE Index_type = 'VECTOR';
```
|
||||
|
||||
If `vidx_disabled` is ON, contact AlibabaCloud support to enable the vector search plugin for your RDS instance.
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
For large datasets on AlibabaCloud RDS:
|
||||
|
||||
1. **Upgrade Instance**: Consider higher specifications (CPU, Memory)
|
||||
2. **Read Replicas**: Use read replicas for read-heavy workloads
|
||||
3. **Indexes**: Add indexes on frequently filtered columns
|
||||
4. **Connection Pool**: Use connection pooling for concurrent operations
|
||||
5. **Monitor**: Use AlibabaCloud CloudMonitor for performance insights
|
||||
|
||||
### Timeout Errors
|
||||
|
||||
If you experience timeout errors:
|
||||
|
||||
```python
store = AlibabaCloudMySQLStore(
    host="rm-xxxxx.mysql.rds.aliyuncs.com",
    port=3306,
    user="your_username",
    password="your_password",
    database="agentscope_test",
    table_name="vectors",
    dimensions=768,
    distance="COSINE",
    connection_kwargs={
        "connect_timeout": 30,
        "read_timeout": 60,
        "write_timeout": 60,
    },
)
```
|
||||
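Besides raising the timeouts, transient failures can be absorbed by retrying the operation with exponential backoff. The sketch below is one possible approach, not part of AgentScope: `with_retries` is a hypothetical helper, and catching the built-in `TimeoutError` and `ConnectionError` is an assumption; adapt the exception types to whatever your driver actually raises.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await operation(), retrying on timeout/connection errors with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # Out of retries: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

A call such as `await with_retries(lambda: store.search(query_embedding=embedding, limit=10))` would then retry the search up to three times before giving up.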
|
||||
## AlibabaCloud RDS Best Practices
|
||||
|
||||
1. **Backup**: Enable automatic backups in RDS console
|
||||
2. **Monitoring**: Set up alerts for CPU, memory, and connection usage
|
||||
3. **Security**: Use private network connections when possible
|
||||
4. **Scaling**: Consider read-only instances for read-heavy workloads
|
||||
5. **Cost Optimization**: Use reserved instances for long-term usage
|
||||
|
||||
## Related Resources
|
||||
|
||||
- [AlibabaCloud RDS Documentation](https://www.alibabacloud.com/help/en/apsaradb-for-rds)
|
||||
- [AlibabaCloud MySQL Vector Functions](https://www.alibabacloud.com/help/en/rds/apsaradb-rds-for-mysql/vector-storage-1)
|
||||
- [AgentScope RAG Tutorial](https://doc.agentscope.io/tutorial/task_rag.html)
|
||||
- [MySQL Connector Python](https://dev.mysql.com/doc/connector-python/en/)
|
||||
|
||||
## Example Use Cases
|
||||
|
||||
### RAG System with AlibabaCloud
|
||||
|
||||
```python
from agentscope.rag import AlibabaCloudMySQLStore, KnowledgeBase

# Initialize vector store with AlibabaCloud RDS
store = AlibabaCloudMySQLStore(
    host="rm-xxxxx.mysql.rds.aliyuncs.com",
    port=3306,
    user="your_username",
    password="your_password",
    database="rag_system",
    table_name="knowledge_vectors",
    dimensions=768,
    distance="COSINE",
)

# Create knowledge base
kb = KnowledgeBase(store=store)

# Add documents
await kb.add_documents(documents)

# Search
results = await kb.search("What is AI?", top_k=5)
```
|
||||
|
||||
## Support
|
||||
|
||||
For issues related to:
|
||||
- **AlibabaCloudMySQLStore**: Open an issue on AgentScope GitHub
|
||||
- **RDS MySQL**: Contact AlibabaCloud Support
|
||||
- **Vector Functions**: Check MySQL documentation or AlibabaCloud support
|
||||
|
||||
## License
|
||||
|
||||
This example is part of the AgentScope project and follows the same license.
|
||||
|
||||
@@ -0,0 +1,282 @@
|
||||
# -*- coding: utf-8 -*-
"""Example of using AlibabaCloudMySQLStore in AgentScope RAG system."""
import asyncio
from agentscope.rag import (
    AlibabaCloudMySQLStore,
    Document,
    DocMetadata,
)
from agentscope.message import TextBlock
|
||||
|
||||
|
||||
async def example_basic_operations() -> None:
    """The example of basic CRUD operations with AlibabaCloudMySQLStore."""
    print("\n" + "=" * 60)
    print("Test 1: Basic CRUD Operations")
    print("=" * 60)

    # Initialize AlibabaCloudMySQLStore
    # Replace with your AlibabaCloud MySQL connection details
    store = AlibabaCloudMySQLStore(
        host="rm-xxxxx.mysql.rds.aliyuncs.com",  # Your RDS endpoint
        port=3306,
        user="your_username",
        password="your_password",
        database="agentscope_test",
        table_name="test_vectors",
        dimensions=4,  # Small dimension for testing
        distance="COSINE",
    )

    print("✓ AlibabaCloudMySQLStore initialized")

    # Create test documents with embeddings
    test_docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    text="Artificial Intelligence is the future",
                ),
                doc_id="doc_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Machine Learning is a subset of AI"),
                doc_id="doc_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Deep Learning uses neural networks"),
                doc_id="doc_3",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    # Test add operation
    await store.add(test_docs)
    print(f"✓ Added {len(test_docs)} documents to the store")

    # Test search operation
    query_embedding = [0.15, 0.25, 0.35, 0.45]
    results = await store.search(
        query_embedding=query_embedding,
        limit=2,
    )

    print(f"\n✓ Search completed, found {len(results)} results:")
    for i, result in enumerate(results, 1):
        print(f" {i}. Score: {result.score:.4f}")
        print(f" Content: {result.metadata.content}")
        print(f" Doc ID: {result.metadata.doc_id}")

    # Test search with score threshold
    results_filtered = await store.search(
        query_embedding=query_embedding,
        limit=5,
        score_threshold=0.9,
    )
    print(f"\n✓ Search with threshold (>0.9): {len(results_filtered)} results")

    # Test delete operation
    await store.delete(filter='doc_id = "doc_2"')
    print("\n✓ Deleted document with doc_id='doc_2'")

    # Verify deletion
    results_after_delete = await store.search(
        query_embedding=query_embedding,
        limit=5,
    )
    print(f"✓ After deletion: {len(results_after_delete)} documents remain")

    # Get client for advanced operations
    client = store.get_client()
    print(f"\n✓ Got MySQL connection: {type(client).__name__}")

    # Close connection
    store.close()
    print("✓ Connection closed")
|
||||
|
||||
|
||||
async def example_filter_search() -> None:
    """The example of search with metadata filtering."""
    print("\n" + "=" * 60)
    print("Test 2: Search with Metadata Filtering")
    print("=" * 60)

    store = AlibabaCloudMySQLStore(
        host="rm-xxxxx.mysql.rds.aliyuncs.com",
        port=3306,
        user="your_username",
        password="your_password",
        database="agentscope_test",
        table_name="filter_vectors",
        dimensions=4,
        distance="COSINE",
    )

    # Create documents with different categories
    docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Python is a programming language"),
                doc_id="prog_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    text="Java is used for enterprise applications",
                ),
                doc_id="prog_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Neural networks are used in AI"),
                doc_id="ai_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Deep learning requires GPUs"),
                doc_id="ai_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.4, 0.5, 0.6, 0.7],
        ),
    ]

    await store.add(docs)
    print(f"✓ Added {len(docs)} documents with different doc_id prefixes")

    # Search without filter
    query_embedding = [0.25, 0.35, 0.45, 0.55]
    all_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
    )
    print(f"\n✓ Search without filter: {len(all_results)} results")
    for i, result in enumerate(all_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    # Search with filter for programming docs
    prog_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        filter='doc_id LIKE "prog%"',
    )
    filter_msg = 'doc_id LIKE "prog%"'
    print(f"\n✓ Search with filter ({filter_msg}): {len(prog_results)}")
    for i, result in enumerate(prog_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    # Search with filter for AI docs
    ai_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        filter='doc_id LIKE "ai%"',
    )
    filter_msg = 'doc_id LIKE "ai%"'
    print(f"\n✓ Search with filter ({filter_msg}): {len(ai_results)}")
    for i, result in enumerate(ai_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    store.close()
|
||||
|
||||
|
||||
async def example_distance_metrics() -> None:
    """The example of different distance metrics."""
    print("\n" + "=" * 60)
    print("Test 3: Different Distance Metrics")
    print("=" * 60)

    # Test with different metrics
    # Note: AlibabaCloud MySQL only supports COSINE and EUCLIDEAN
    metrics = ["COSINE", "EUCLIDEAN"]

    for metric in metrics:
        print(f"\n--- Testing {metric} metric ---")
        store = AlibabaCloudMySQLStore(
            host="rm-xxxxx.mysql.rds.aliyuncs.com",
            port=3306,
            user="your_username",
            password="your_password",
            database="agentscope_test",
            table_name=f"{metric.lower()}_vectors",
            dimensions=4,
            distance=metric,
        )

        docs = [
            Document(
                metadata=DocMetadata(
                    content=TextBlock(text=f"Test doc for {metric}"),
                    doc_id=f"doc_{metric}_1",
                    chunk_id=0,
                    total_chunks=1,
                ),
                embedding=[0.1, 0.2, 0.3, 0.4],
            ),
        ]

        await store.add(docs)
        results = await store.search(
            query_embedding=[0.1, 0.2, 0.3, 0.4],
            limit=1,
        )

        print(f"✓ {metric} metric: Score = {results[0].score:.4f}")
        store.close()
|
||||
|
||||
|
||||
async def main() -> None:
    """Run all examples."""
    print("\n" + "=" * 60)
    print("AlibabaCloud MySQL Vector Store Test Suite")
    print("=" * 60)

    try:
        await example_basic_operations()
        await example_filter_search()
        await example_distance_metrics()

        print("\n" + "=" * 60)
        print("✓ All tests completed successfully!")
        print("=" * 60)

    except Exception as e:
        print(f"\n✗ Test failed with error: {e}")
        import traceback

        traceback.print_exc()


if __name__ == "__main__":
    asyncio.run(main())
|
||||
128
examples/functionality/vector_store/milvus_lite/README.md
Normal file
@@ -0,0 +1,128 @@
|
||||
# MilvusLite Vector Store
|
||||
|
||||
This example demonstrates how to use **MilvusLiteStore** for vector storage and semantic search in AgentScope.
|
||||
It includes four test scenarios covering CRUD operations, metadata filtering, document chunking, and distance metrics.
|
||||
|
||||
## Quick Start
|
||||
|
||||
Install agentscope first, and then the MilvusLite dependency:
|
||||
|
||||
```bash
# On macOS/Linux
pip install pymilvus\[milvus_lite\]

# On Windows
pip install pymilvus[milvus_lite]
```
|
||||
|
||||
Run the example script, which demonstrates adding documents and searching with and without filters in the MilvusLite vector store:
|
||||
|
||||
```bash
python main.py
```
|
||||
|
||||
> **Note:** The script creates `.db` files in the current directory. You can delete them after testing.
|
||||
|
||||
## Usage
|
||||
|
||||
### Initialize Store
|
||||
```python
from agentscope.rag import MilvusLiteStore

store = MilvusLiteStore(
    uri="./milvus_test.db",
    collection_name="test_collection",
    dimensions=768,  # Match your embedding model
    distance="COSINE",  # COSINE, L2, or IP
)
```
|
||||
|
||||
### Add Documents
|
||||
|
||||
```python
from agentscope.rag import Document, DocMetadata
from agentscope.message import TextBlock

doc = Document(
    metadata=DocMetadata(
        content=TextBlock(type="text", text="Your document text"),
        doc_id="doc_1",
        chunk_id=0,
        total_chunks=1,
    ),
    embedding=[0.1, 0.2, ...],  # Your embedding vector
)

await store.add([doc])
```
|
||||
|
||||
### Search
|
||||
|
||||
```python
results = await store.search(
    query_embedding=[0.15, 0.25, ...],
    limit=5,
    score_threshold=0.9,  # Optional
    filter='doc_id like "prefix%"',  # Optional
)
```
|
||||
|
||||
### Delete
|
||||
|
||||
```python
# Uses the same `filter` keyword as the delete call in main.py
await store.delete(filter='doc_id == "doc_1"')
```
|
||||
|
||||
## Distance Metrics
|
||||
|
||||
| Metric | Description | Best For |
|--------|-------------|----------|
| **COSINE** | Cosine similarity | Text embeddings (recommended) |
| **L2** | Euclidean distance | Spatial data |
| **IP** | Inner Product | Recommendation systems |
|
||||
|
||||
## Filter Expressions
|
||||
|
||||
```python
# Exact match
filter='doc_id == "doc_1"'

# Pattern matching
filter='doc_id like "prefix%"'

# Numeric and logical operators
filter='chunk_id >= 0 and total_chunks > 1'
```
|
||||
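Filter strings like these are often assembled from user-supplied values. The sketch below shows one cautious way to build them; `doc_id_prefix_filter` and `all_of` are hypothetical helpers, and rejecting quotes, backslashes, and `%` (rather than escaping them) is a simplifying assumption for this example.

```python
def doc_id_prefix_filter(prefix: str) -> str:
    """Build a filter like 'doc_id like "prog%"' (hypothetical helper)."""
    # Reject characters that would change the meaning of the expression.
    if not prefix or any(ch in prefix for ch in '"\\%'):
        raise ValueError(f"unsupported prefix: {prefix!r}")
    return f'doc_id like "{prefix}%"'


def all_of(*conditions: str) -> str:
    """Join conditions with the logical 'and' operator shown above."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return " and ".join(f"({c})" for c in conditions)


print(all_of(doc_id_prefix_filter("ai"), "chunk_id >= 0"))
# (doc_id like "ai%") and (chunk_id >= 0)
```

The resulting string can be passed as the `filter` argument of `store.search`.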
|
||||
## Advanced Usage
|
||||
|
||||
### Access Underlying Client
|
||||
```python
client = store.get_client()
stats = client.get_collection_stats(collection_name="test_collection")
```
|
||||
|
||||
### Document Metadata
|
||||
- `content`: Text content (TextBlock)
|
||||
- `doc_id`: Unique document identifier
|
||||
- `chunk_id`: Chunk position (0-indexed)
|
||||
- `total_chunks`: Total chunks in document
|
||||
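To show how `chunk_id` and `total_chunks` relate when a document is split, here is a minimal sketch that cuts a text into fixed-size pieces and produces one metadata record per chunk. Plain dictionaries stand in for `DocMetadata`, and both the helper name `chunk_metadata` and the character-based splitting are assumptions for illustration only.

```python
def chunk_metadata(doc_id: str, text: str, chunk_size: int) -> list[dict]:
    """Split text into fixed-size chunks and describe each one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # An empty text still yields a single (empty) chunk.
    pieces = [
        text[i:i + chunk_size] for i in range(0, len(text), chunk_size)
    ] or [""]
    return [
        {
            "content": piece,
            "doc_id": doc_id,
            "chunk_id": index,  # 0-indexed position within the document
            "total_chunks": len(pieces),
        }
        for index, piece in enumerate(pieces)
    ]


for record in chunk_metadata("doc_1", "abcdefghij", 4):
    print(record["chunk_id"], record["total_chunks"], record["content"])
# 0 3 abcd
# 1 3 efgh
# 2 3 ij
```

All chunks of one document share the same `doc_id` and `total_chunks`, so a filter such as `doc_id == "doc_1"` selects every chunk of that document.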
|
||||
## FAQ
|
||||
|
||||
**What embedding dimension should I use?**
|
||||
Match your embedding model's output dimension (e.g., 768 for BERT-base, 1536 for OpenAI `text-embedding-ada-002`).
|
||||
|
||||
**Can I change the distance metric after creation?**
|
||||
No, create a new collection with the desired metric.
|
||||
|
||||
**How do I delete the database?**
|
||||
Delete the `.db` file specified in the `uri` parameter.
|
||||
|
||||
**Is this suitable for production?**
|
||||
MilvusLite works well for development and small-scale applications. For production at scale, consider Milvus standalone or cluster mode.
|
||||
|
||||
## References
|
||||
|
||||
- [Milvus Documentation](https://milvus.io/docs)
|
||||
- [AgentScope RAG Tutorial](https://doc.agentscope.io/tutorial/task_rag.html)
|
||||
327
examples/functionality/vector_store/milvus_lite/main.py
Normal file
@@ -0,0 +1,327 @@
|
||||
# -*- coding: utf-8 -*-
"""Example of using MilvusLiteStore in AgentScope RAG system."""
import asyncio
from agentscope.rag import (
    MilvusLiteStore,
    Document,
    DocMetadata,
)
from agentscope.message import TextBlock
|
||||
|
||||
|
||||
async def example_basic_operations() -> None:
    """The example of basic CRUD operations with MilvusLiteStore."""
    print("\n" + "=" * 60)
    print("Test 1: Basic CRUD Operations")
    print("=" * 60)

    # Initialize MilvusLiteStore with a local file
    store = MilvusLiteStore(
        uri="./milvus_test.db",
        collection_name="test_collection",
        dimensions=4,  # Small dimension for testing
        distance="COSINE",
    )

    print("✓ MilvusLiteStore initialized")

    # Create test documents with embeddings
    test_docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    text="Artificial Intelligence is the future",
                ),
                doc_id="doc_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Machine Learning is a subset of AI"),
                doc_id="doc_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Deep Learning uses neural networks"),
                doc_id="doc_3",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    # Test add operation
    await store.add(test_docs)
    print(f"✓ Added {len(test_docs)} documents to the store")

    # Test search operation
    query_embedding = [0.15, 0.25, 0.35, 0.45]
    results = await store.search(
        query_embedding=query_embedding,
        limit=2,
    )

    print(f"\n✓ Search completed, found {len(results)} results:")
    for i, result in enumerate(results, 1):
        print(f" {i}. Score: {result.score:.4f}")
        print(f" Content: {result.metadata.content}")
        print(f" Doc ID: {result.metadata.doc_id}")

    # Test search with score threshold
    results_filtered = await store.search(
        query_embedding=query_embedding,
        limit=5,
        score_threshold=0.9,
    )
    print(f"\n✓ Search with threshold (>0.9): {len(results_filtered)} results")

    # Test delete operation
    # Note: We need to use filter expression to delete by doc_id
    await store.delete(filter='doc_id == "doc_2"')
    print("\n✓ Deleted document with doc_id='doc_2'")

    # Verify deletion
    results_after_delete = await store.search(
        query_embedding=query_embedding,
        limit=5,
    )
    print(f"✓ After deletion: {len(results_after_delete)} documents remain")

    # Get client for advanced operations
    client = store.get_client()
    print(f"\n✓ Got MilvusClient: {type(client).__name__}")
|
||||
|
||||
|
async def example_filter_search() -> None:
    """The example of search with metadata filtering."""
    print("\n" + "=" * 60)
    print("Test 2: Search with Metadata Filtering")
    print("=" * 60)

    store = MilvusLiteStore(
        uri="./milvus_filter_test.db",
        collection_name="filter_collection",
        dimensions=4,
        distance="COSINE",
    )

    # Create documents with different categories
    docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Python is a programming language"),
                doc_id="prog_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    text="Java is used for enterprise applications",
                ),
                doc_id="prog_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Neural networks are used in AI"),
                doc_id="ai_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Deep learning requires GPUs"),
                doc_id="ai_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.4, 0.5, 0.6, 0.7],
        ),
    ]

    await store.add(docs)
    print(f"✓ Added {len(docs)} documents with different doc_id prefixes")

    # Search without filter
    query_embedding = [0.25, 0.35, 0.45, 0.55]
    all_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
    )
    print(f"\n✓ Search without filter: {len(all_results)} results")
    for i, result in enumerate(all_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    # Search with filter for programming docs
    prog_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        filter='doc_id like "prog%"',
    )
    filter_msg = "doc_id like 'prog%'"
    print(f"\n✓ Search with filter ({filter_msg}): {len(prog_results)}")
    for i, result in enumerate(prog_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    # Search with filter for AI docs
    ai_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        filter='doc_id like "ai%"',
    )
    filter_msg = "doc_id like 'ai%'"
    print(f"\n✓ Search with filter ({filter_msg}): {len(ai_results)}")
    for i, result in enumerate(ai_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")


async def example_multiple_chunks() -> None:
    """The example of documents with multiple chunks."""
    print("\n" + "=" * 60)
    print("Test 3: Documents with Multiple Chunks")
    print("=" * 60)

    store = MilvusLiteStore(
        uri="./milvus_chunks_test.db",
        collection_name="chunks_collection",
        dimensions=4,
        distance="COSINE",
    )

    # Create a document split into multiple chunks
    chunks = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Chapter 1: Introduction to AI"),
                doc_id="book_1",
                chunk_id=0,
                total_chunks=3,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Chapter 2: Machine Learning Basics"),
                doc_id="book_1",
                chunk_id=1,
                total_chunks=3,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Chapter 3: Deep Learning Advanced"),
                doc_id="book_1",
                chunk_id=2,
                total_chunks=3,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    await store.add(chunks)
    print(f"✓ Added document with {len(chunks)} chunks")

    # Search and verify chunk information
    query_embedding = [0.2, 0.3, 0.4, 0.5]
    results = await store.search(
        query_embedding=query_embedding,
        limit=3,
    )

    print("\n✓ Search results for multi-chunk document:")
    for i, result in enumerate(results, 1):
        chunk_info = (
            f"{result.metadata.chunk_id}/{result.metadata.total_chunks}"
        )
        print(f" {i}. Chunk {chunk_info}")
        print(f" Content: {result.metadata.content}")
        print(f" Score: {result.score:.4f}")


async def example_distance_metrics() -> None:
    """The example of different distance metrics."""
    print("\n" + "=" * 60)
    print("Test 4: Different Distance Metrics")
    print("=" * 60)

    # Test with different metrics
    metrics = ["COSINE", "L2", "IP"]

    for metric in metrics:
        print(f"\n--- Testing {metric} metric ---")
        store = MilvusLiteStore(
            uri=f"./milvus_{metric.lower()}_test.db",
            collection_name=f"{metric.lower()}_collection",
            dimensions=4,
            distance=metric,
        )

        docs = [
            Document(
                metadata=DocMetadata(
                    content=TextBlock(text=f"Test doc for {metric}"),
                    doc_id=f"doc_{metric}_1",
                    chunk_id=0,
                    total_chunks=1,
                ),
                embedding=[0.1, 0.2, 0.3, 0.4],
            ),
        ]

        await store.add(docs)
        results = await store.search(
            query_embedding=[0.1, 0.2, 0.3, 0.4],
            limit=1,
        )

        print(f"✓ {metric} metric: Score = {results[0].score:.4f}")


async def main() -> None:
    """Run all examples."""
    print("\n" + "=" * 60)
    print("MilvusLiteStore Comprehensive Test Suite")
    print("=" * 60)

    try:
        await example_basic_operations()
        await example_filter_search()
        await example_multiple_chunks()
        await example_distance_metrics()

        print("\n" + "=" * 60)
        print("✓ All tests completed successfully!")
        print("=" * 60)

    except Exception as e:
        print(f"\n✗ Test failed with error: {e}")
        import traceback

        traceback.print_exc()


if __name__ == "__main__":
    asyncio.run(main())
211
examples/functionality/vector_store/mongodb/README.md
Normal file
@@ -0,0 +1,211 @@
# MongoDB Vector Store

This example demonstrates how to use **MongoDBStore** for vector storage and semantic search in AgentScope using MongoDB's Vector Search capabilities.
It includes comprehensive test scenarios covering CRUD operations, metadata filtering, document chunking, and distance metrics.

## Quick Start

Install AgentScope first, then the MongoDB dependency:

```bash
pip install pymongo
```

**Important:** Before running the example, set the `MONGODB_HOST`
environment variable to your MongoDB connection string:

```bash
# For local MongoDB
export MONGODB_HOST="mongodb://localhost:27017/?directConnection=true"

# For MongoDB Atlas (replace with your connection string)
# export MONGODB_HOST=${YOUR_MONGODB_HOST}
```

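The example script reads the connection string with `os.getenv("MONGODB_HOST")`, so an unset variable may only surface later as a connection error. A minimal sketch of a fail-fast check is shown here; `require_mongodb_host` is a hypothetical helper written for illustration and is not part of AgentScope or `pymongo`.

```python
import os


def require_mongodb_host(env=None):
    """Return the MongoDB connection string or raise a clear error.

    Hypothetical helper: `env` defaults to the process environment and
    can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    host = env.get("MONGODB_HOST", "").strip()
    if not host:
        raise RuntimeError(
            "MONGODB_HOST is not set; export it before running main.py",
        )
    # Both URI schemes appear in this README: local and Atlas (SRV).
    if not host.startswith(("mongodb://", "mongodb+srv://")):
        raise RuntimeError(
            "MONGODB_HOST should start with mongodb:// or mongodb+srv://",
        )
    return host
```

Calling `require_mongodb_host()` once at startup and passing the result as `host=` keeps the failure close to its cause.
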
Run the example script, which showcases adding, searching, and deleting in a MongoDB vector store:

```bash
python main.py
```

> **Note:** The script connects to MongoDB Atlas or a local MongoDB instance. Make sure you have a valid MongoDB connection string.

## Prerequisites

- A MongoDB instance that supports Vector Search
- A valid MongoDB connection string (local or Atlas)

## Usage

### Initialize Store

```python
from agentscope.rag import MongoDBStore

# For MongoDB Atlas
store = MongoDBStore(
    host="mongodb+srv://username:password@cluster.mongodb.net/",
    db_name="test_db",
    collection_name="test_collection",
    dimensions=768,  # Match your embedding model
    distance="cosine",  # cosine, euclidean, or dotProduct
)

# For local MongoDB
store = MongoDBStore(
    host="mongodb://localhost:27017/?directConnection=true",
    db_name="test_db",
    collection_name="test_collection",
    dimensions=768,
    distance="cosine",
)

# To enable filtering in search, specify filter_fields:
store = MongoDBStore(
    host="mongodb://localhost:27017/?directConnection=true",
    db_name="test_db",
    collection_name="test_collection",
    dimensions=768,
    distance="cosine",
    filter_fields=["payload.doc_id", "payload.chunk_id"],  # Fields for filtering
)

# No manual initialization needed - everything is automatic!
# Database, collection, and vector search index are created automatically
# when you first call add() or search()
```

### Add Documents

```python
from agentscope.rag import Document, DocMetadata
from agentscope.message import TextBlock

doc = Document(
    metadata=DocMetadata(
        content=TextBlock(type="text", text="Your document text"),
        doc_id="doc_1",
        chunk_id=0,
        total_chunks=1,
    ),
    embedding=[0.1, 0.2, ...],  # Your embedding vector
)

await store.add([doc])
```

### Search

```python
results = await store.search(
    query_embedding=[0.15, 0.25, ...],
    limit=5,
    score_threshold=0.9,  # Optional
    filter={"payload.doc_id": {"$in": ["doc_1", "doc_2"]}},  # Optional filter
)
# Note:
# - To use filter, the field must be declared in filter_fields when creating store
# - MongoDB $vectorSearch filter supports: $gt, $gte, $lt, $lte,
#   $eq, $ne, $in, $nin, $exists, $not (NOT $regex)
```

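Since `$regex` and other operators outside the list in the note above are not accepted, it can help to check a filter before sending it. The following is a rough sketch under stated assumptions: `SUPPORTED_OPERATORS` copies only the operators named in that note, and `unsupported_operators` is a hypothetical helper, not an AgentScope or `pymongo` function. Extend the set if your MongoDB version documents additional operators.

```python
# Operators copied from the note in the Search example above.
SUPPORTED_OPERATORS = {
    "$gt", "$gte", "$lt", "$lte",
    "$eq", "$ne", "$in", "$nin", "$exists", "$not",
}


def unsupported_operators(filter_doc):
    """Return every `$`-prefixed key that is not in SUPPORTED_OPERATORS.

    Walks nested dicts and lists, so operators buried inside `$not`
    or inside a list of conditions are found as well.
    """
    found = []
    if isinstance(filter_doc, dict):
        for key, value in filter_doc.items():
            if key.startswith("$") and key not in SUPPORTED_OPERATORS:
                found.append(key)
            found.extend(unsupported_operators(value))
    elif isinstance(filter_doc, list):
        for item in filter_doc:
            found.extend(unsupported_operators(item))
    return found


ok_filter = {"payload.doc_id": {"$in": ["doc_1", "doc_2"]}}
bad_filter = {"payload.doc_id": {"$regex": "^prog"}}
print(unsupported_operators(ok_filter))
print(unsupported_operators(bad_filter))
```

A prefix match such as `doc_id like "prog%"` therefore has to be rewritten as an explicit `$in` list, which is what `main.py` does for the `prog_*` and `ai_*` documents.
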
### Delete

```python
# Delete by document IDs (no initialization needed)
await store.delete(ids=["doc_1", "doc_2"])

# Delete entire collection (use with caution)
await store.delete_collection()

# Delete entire database (use with caution)
await store.delete_database()
```

## Distance Metrics

| Metric | Description | Best For |
|--------|-------------|----------|
| **cosine** | Cosine similarity | Text embeddings (recommended) |
| **euclidean** | Euclidean distance | Spatial data |
| **dotProduct** | Inner product | Recommendation systems |

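To make the three metrics concrete, the sketch below computes the raw values in plain Python for the 4-dimensional test vectors used in `main.py`. It is illustrative only: the helper names are made up here, and the score a store returns may be a normalized form of these raw values rather than the values themselves.

```python
import math


def dot_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two vectors (smaller is closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine of the angle between two vectors (larger is closer)."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


doc = [0.1, 0.2, 0.3, 0.4]
query = [0.15, 0.25, 0.35, 0.45]

print("cosine:    ", round(cosine(doc, query), 4))
print("euclidean: ", round(euclidean(doc, query), 4))
print("dotProduct:", round(dot_product(doc, query), 4))
```

Cosine ignores vector length, which is one reason it is the usual choice for text embeddings; the inner product rewards both direction and magnitude.
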
## Advanced Usage

### Access Underlying Client

```python
client = store.get_client()
# Use MongoDB client for advanced operations
stats = await client[store.db_name].command("collStats", store.collection_name)
```

### Document Metadata

- `content`: Text content (TextBlock)
- `doc_id`: Unique document identifier
- `chunk_id`: Chunk position (0-indexed)
- `total_chunks`: Total chunks in document

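The relationship between `doc_id`, `chunk_id`, and `total_chunks` can be sketched with a small splitter. This is a simplified illustration, assuming fixed-size character chunks; `chunk_metadata` is a hypothetical helper that returns plain dicts, not AgentScope `DocMetadata` objects.

```python
def chunk_metadata(text, doc_id, chunk_size):
    """Split `text` into fixed-size pieces and describe each piece.

    Every piece shares the same `doc_id` and `total_chunks`;
    `chunk_id` is the 0-indexed position of the piece.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [
        text[start:start + chunk_size]
        for start in range(0, len(text), chunk_size)
    ]
    return [
        {
            "content": piece,
            "doc_id": doc_id,
            "chunk_id": index,
            "total_chunks": len(pieces),
        }
        for index, piece in enumerate(pieces)
    ]


for meta in chunk_metadata("abcdefgh", doc_id="book_1", chunk_size=3):
    print(meta)
```

Each dict maps onto one `DocMetadata(...)` call, with `content` wrapped in a `TextBlock` as in the Add Documents example.
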
### Vector Search Index

MongoDBStore automatically creates vector search indexes with the following configuration:

```python
{
    "fields": [
        {
            "type": "vector",
            "path": "vector",
            "similarity": "cosine",  # or euclidean, dotProduct
            "numDimensions": 768
        }
    ]
}
```

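As a rough sketch of how such a definition could be assembled from the constructor arguments, the function below builds the same dictionary and appends one `filter` entry per filter field, which is the shape MongoDB vector search index definitions use for filterable paths. `build_index_definition` is a hypothetical name, and whether MongoDBStore builds its index exactly this way is an assumption.

```python
ALLOWED_SIMILARITIES = ("cosine", "euclidean", "dotProduct")


def build_index_definition(dimensions, similarity="cosine", filter_fields=()):
    """Build a vector search index definition as a plain dict.

    `filter_fields` mirrors the constructor argument of the same name
    shown in the Initialize Store example.
    """
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"similarity must be one of {ALLOWED_SIMILARITIES}")
    fields = [
        {
            "type": "vector",
            "path": "vector",
            "similarity": similarity,
            "numDimensions": dimensions,
        },
    ]
    # One filter entry per field that search() should be able to filter on.
    fields.extend({"type": "filter", "path": path} for path in filter_fields)
    return {"fields": fields}


print(build_index_definition(768, "cosine", ["payload.doc_id"]))
```
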
## Connection Examples

### MongoDB Atlas

```python
store = MongoDBStore(
    host="<YOUR_MONGO_ATLAS_CONNECTION_STRING>",
    db_name="production_db",
    collection_name="documents",
    dimensions=1536,
    distance="cosine",
)
```

### Local MongoDB

#### Without Authentication

```python
store = MongoDBStore(
    host="mongodb://localhost:27017/?directConnection=true",
    db_name="local_db",
    collection_name="test_collection",
    dimensions=768,
    distance="cosine",
)
```

#### With Authentication

```python
store = MongoDBStore(
    host="mongodb://user:pass@localhost:27017/?directConnection=true",
    db_name="test_db",
    collection_name="test_collection",
    dimensions=768,
    distance="cosine",
)
```

## References

- [MongoDB Vector Search Documentation](https://www.mongodb.com/docs/atlas/atlas-search/vector-search/)
- [MongoDB Atlas Documentation](https://www.mongodb.com/docs/atlas/)
- [AgentScope RAG Tutorial](https://doc.agentscope.io/tutorial/task_rag.html)
351
examples/functionality/vector_store/mongodb/main.py
Normal file
@@ -0,0 +1,351 @@
# -*- coding: utf-8 -*-
"""Example of using MongoDBStore in AgentScope RAG system."""
import asyncio
import os

from agentscope.rag import (
    MongoDBStore,
    Document,
    DocMetadata,
)
from agentscope.message import TextBlock


async def example_basic_operations() -> None:
    """The example of basic CRUD operations with MongoDBStore."""
    print("\n" + "=" * 60)
    print("Test 1: Basic CRUD Operations")
    print("=" * 60)

    # Initialize MongoDBStore with MongoDB connection
    store = MongoDBStore(
        host=os.getenv("MONGODB_HOST"),
        db_name="test_db",
        collection_name="test_collection",
        dimensions=4,  # Small dimension for testing
        distance="cosine",
    )

    print("✓ MongoDBStore initialized")

    # Create test documents with embeddings
    test_docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    text="Artificial Intelligence is the future",
                ),
                doc_id="doc_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Machine Learning is a subset of AI"),
                doc_id="doc_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Deep Learning uses neural networks"),
                doc_id="doc_3",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    # Test add operation (automatically creates database, collection,
    # and index)
    await store.add(test_docs)
    print(f"✓ Added {len(test_docs)} documents to the store")

    # Test search operation (automatically waits for index to be ready)
    query_embedding = [0.15, 0.25, 0.35, 0.45]
    results = await store.search(
        query_embedding=query_embedding,
        limit=2,
    )

    print(f"\n✓ Search completed, found {len(results)} results:")
    for i, result in enumerate(results, 1):
        print(f" {i}. Score: {result.score:.4f}")
        print(f" Content: {result.metadata.content}")
        print(f" Doc ID: {result.metadata.doc_id}")

    # Test search with score threshold (also waits for index if needed)
    results_filtered = await store.search(
        query_embedding=query_embedding,
        limit=5,
        score_threshold=0.3,
    )
    print(f"\n✓ Search with threshold (>0.3): {len(results_filtered)} results")

    # Test delete operation (no initialization needed)
    # Note: MongoDBStore uses ids parameter for deletion
    await store.delete(ids=["doc_2", "doc_3", "doc_1"])
    print("\n✓ Deleted documents with specified doc_ids")

    # Verify deletion (search will wait for index if needed)
    results_after_delete = await store.search(
        query_embedding=query_embedding,
        limit=5,
    )
    print(f"✓ After deletion: {len(results_after_delete)} documents remain")

    # Get client for advanced operations
    client = store.get_client()
    print(f"\n✓ Got MongoDB Client: {type(client).__name__}")

    await store.close()


async def example_filter_search() -> None:
    """The example of search with metadata filtering."""
    print("\n" + "=" * 60)
    print("Test 2: Search with Metadata Filtering")
    print("=" * 60)

    # To use filter in search, specify filter_fields when creating the store.
    # These fields will be indexed for filtering in $vectorSearch.
    store = MongoDBStore(
        host=os.getenv("MONGODB_HOST"),
        db_name="filter_test_db",
        collection_name="filter_collection",
        dimensions=4,
        distance="cosine",
        filter_fields=["payload.doc_id"],  # Enable filtering on doc_id
    )

    # Create documents with different categories
    docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Python is a programming language"),
                doc_id="prog_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    text="Java is used for enterprise applications",
                ),
                doc_id="prog_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Neural networks are used in AI"),
                doc_id="ai_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Deep learning requires GPUs"),
                doc_id="ai_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.4, 0.5, 0.6, 0.7],
        ),
    ]

    # Add documents (automatically creates database, collection, and index)
    await store.add(docs)
    print(f"✓ Added {len(docs)} documents with different doc_id prefixes")

    # Search without filter (automatically waits for index if needed)
    query_embedding = [0.25, 0.35, 0.45, 0.55]
    all_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
    )
    print(f"\n✓ Search without filter: {len(all_results)} results")
    for i, result in enumerate(all_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    # Search with filter for programming docs
    # Note: doc_id is stored in payload.doc_id in MongoDB documents
    # MongoDB $vectorSearch filter supports: $gt, $gte, $lt, $lte, $eq, $ne,
    # $in, $nin, $exists, $not (NOT $regex)
    prog_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        filter={"payload.doc_id": {"$in": ["prog_1", "prog_2"]}},
    )
    print(f"\n✓ Search with filter (prog docs): {len(prog_results)} results")
    for i, result in enumerate(prog_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    # Search with filter for AI docs
    ai_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        filter={"payload.doc_id": {"$in": ["ai_1", "ai_2"]}},
    )
    print(f"\n✓ Search with filter (ai docs): {len(ai_results)} results")
    for i, result in enumerate(ai_results, 1):
        doc_id = result.metadata.doc_id
        score = result.score
        print(f" {i}. Doc ID: {doc_id}, Score: {score:.4f}")

    await store.close()


async def example_multiple_chunks() -> None:
    """The example of documents with multiple chunks."""
    print("\n" + "=" * 60)
    print("Test 3: Documents with Multiple Chunks")
    print("=" * 60)

    store = MongoDBStore(
        host=os.getenv("MONGODB_HOST"),
        db_name="chunks_test_db",
        collection_name="chunks_collection",
        dimensions=4,
        distance="cosine",
    )

    # Create a document split into multiple chunks
    chunks = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Chapter 1: Introduction to AI"),
                doc_id="book_1",
                chunk_id=0,
                total_chunks=3,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Chapter 2: Machine Learning Basics"),
                doc_id="book_1",
                chunk_id=1,
                total_chunks=3,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(text="Chapter 3: Deep Learning Advanced"),
                doc_id="book_1",
                chunk_id=2,
                total_chunks=3,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    # Add chunks (automatically creates database, collection, and index)
    await store.add(chunks)
    print(f"✓ Added document with {len(chunks)} chunks")

    # Search and verify chunk information (automatically waits for index if
    # needed)
    query_embedding = [0.2, 0.3, 0.4, 0.5]
    results = await store.search(
        query_embedding=query_embedding,
        limit=3,
    )

    print("\n✓ Search results for multi-chunk document:")
    for i, result in enumerate(results, 1):
        chunk_info = (
            f"{result.metadata.chunk_id}/{result.metadata.total_chunks}"
        )
        print(f" {i}. Chunk {chunk_info}")
        print(f" Content: {result.metadata.content}")
        print(f" Score: {result.score:.4f}")

    await store.close()


async def example_distance_metrics() -> None:
    """The example of different distance metrics."""
    print("\n" + "=" * 60)
    print("Test 4: Different Distance Metrics")
    print("=" * 60)

    # Test with different metrics
    metrics = ["cosine", "euclidean", "dotProduct"]

    for metric in metrics:
        print(f"\n--- Testing {metric} metric ---")
        store = MongoDBStore(
            host=os.getenv("MONGODB_HOST"),
            db_name=f"{metric}_test_db",
            collection_name=f"{metric}_collection",
            dimensions=4,
            distance=metric,
        )

        docs = [
            Document(
                metadata=DocMetadata(
                    content=TextBlock(text=f"Test doc for {metric}"),
                    doc_id=f"doc_{metric}_1",
                    chunk_id=0,
                    total_chunks=1,
                ),
                embedding=[0.1, 0.2, 0.3, 0.4],
            ),
        ]

        # Add and search (automatically creates database/collection/index
        # and waits for index)
        await store.add(docs)
        results = await store.search(
            query_embedding=[0.1, 0.2, 0.3, 0.4],
            limit=1,
        )

        print(f"✓ {metric} metric: Score = {results[0].score:.4f}")

        await store.close()


async def main() -> None:
    """Run all examples."""
    print("\n" + "=" * 60)
    print("MongoDBStore Comprehensive Test Suite")
    print("=" * 60)

    try:
        await example_basic_operations()
        await example_filter_search()
        await example_multiple_chunks()
        await example_distance_metrics()

        print("\n" + "=" * 60)
        print("✓ All tests completed successfully!")
        print("=" * 60)

    except Exception as e:
        print(f"\n✗ Test failed with error: {e}")
        import traceback

        traceback.print_exc()


if __name__ == "__main__":
    asyncio.run(main())
164
examples/functionality/vector_store/oceanbase/README.md
Normal file
@@ -0,0 +1,164 @@
# OceanBase Vector Store

This example demonstrates how to use **OceanBaseStore** for vector storage and semantic search in AgentScope.
It includes CRUD operations, metadata filtering, document chunking, and distance metric tests.

## Quick Start

Install dependencies (including `pyobvector`):

```bash
pip install -e .[full]
```

Start seekdb (a minimal OceanBase-compatible instance):

```bash
docker run -d -p 2881:2881 oceanbase/seekdb
```

Run the example script:

```bash
python main.py
```

> **Note:** The script defaults to `127.0.0.1:2881`, user `root`, database `test`.
> If you use a multi-tenant OceanBase account (e.g., `root@test`), override the defaults via environment variables.

## Usage

### Initialize Store

```python
from agentscope.rag import OceanBaseStore

store = OceanBaseStore(
    collection_name="test_collection",
    dimensions=768,
    distance="COSINE",
    uri="127.0.0.1:2881",
    user="root",
    password="",
    db_name="test",
)
```

### Add Documents

```python
from agentscope.rag import Document, DocMetadata
from agentscope.message import TextBlock

doc = Document(
    metadata=DocMetadata(
        content=TextBlock(type="text", text="Your document text"),
        doc_id="doc_1",
        chunk_id=0,
        total_chunks=1,
    ),
    embedding=[0.1, 0.2, 0.3],
)

await store.add([doc])
```

### Search

```python
results = await store.search(
    query_embedding=[0.1, 0.2, 0.3],
    limit=5,
    score_threshold=0.9,
)
```

### Filter Search

```python
client = store.get_client()
table = client.load_table(collection_name="test_collection")

results = await store.search(
    query_embedding=[0.1, 0.2, 0.3],
    limit=5,
    flter=[table.c["doc_id"].like("doc%")],
)
```

> Note: The parameter is named `flter` (without the "i") to avoid clashing with
> Python's built-in `filter`, following the underlying library's convention.

### Delete

```python
client = store.get_client()
table = client.load_table(collection_name="test_collection")

await store.delete(where=[table.c["doc_id"] == "doc_1"])
```

## Distance Metrics

| Metric | Description | Best For |
|--------|-------------|----------|
| **COSINE** | Cosine similarity | Text embeddings (recommended) |
| **L2** | Euclidean distance | Spatial data |
| **IP** | Inner product | Recommendation systems |

## Filter Expressions

Build filters using SQLAlchemy expressions and pass them via `flter`:

```python
table = store.get_client().load_table("test_collection")

filters = [
    table.c["doc_id"] == "doc_1",
    table.c["doc_id"].like("prefix%"),
    table.c["chunk_id"] >= 0,
]
```

## Advanced Usage

### Access Underlying Client

```python
client = store.get_client()
stats = client.get_collection_stats(collection_name="test_collection")
```

### Document Metadata

- `content`: Text content (TextBlock)
- `doc_id`: Unique document identifier
- `chunk_id`: Chunk position (0-indexed)
- `total_chunks`: Total chunks in document

## FAQ

**What embedding dimension should I use?**
Match your embedding model's output dimension (e.g., 768 for BERT, 1536 for OpenAI ada-002).

**Can I change the distance metric after creation?**
No, create a new collection with the desired metric.

**How do I clean up test data?**
Drop the collection via the underlying client or remove the seekdb container volume.

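The first FAQ answer (match the embedding model's output dimension) can be enforced before any vector reaches the store. The sketch below is one possible guard; `check_dimensions` is a hypothetical helper and does not exist in AgentScope or `pyobvector`.

```python
def check_dimensions(embedding, expected):
    """Return `embedding` unchanged, or raise if its length is wrong.

    `expected` should be the same number passed as `dimensions=`
    when the store was created.
    """
    actual = len(embedding)
    if actual != expected:
        raise ValueError(
            f"embedding has {actual} dimensions, store expects {expected}",
        )
    return embedding


# The README examples use 3-element placeholder vectors, which would
# not fit a store created with dimensions=768.
vector = check_dimensions([0.1, 0.2, 0.3, 0.4], expected=4)
print(len(vector))
```
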
## Environment Variables

The script supports the following environment variables to override connection settings:

```bash
export OCEANBASE_URI="127.0.0.1:2881"
export OCEANBASE_USER="root"
export OCEANBASE_PASSWORD=""
export OCEANBASE_DB="test"
```

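How these variables map to constructor arguments can be sketched as follows. The variable names and defaults are the ones listed above; `oceanbase_settings` itself is a hypothetical helper that only returns a dict, so it runs without a database or the `agentscope` package installed.

```python
import os

# Defaults taken from the Quick Start note: 127.0.0.1:2881, root, test.
DEFAULTS = {
    "OCEANBASE_URI": "127.0.0.1:2881",
    "OCEANBASE_USER": "root",
    "OCEANBASE_PASSWORD": "",
    "OCEANBASE_DB": "test",
}


def oceanbase_settings(env=None):
    """Resolve connection settings, falling back to the defaults.

    `env` defaults to the process environment; pass a dict in tests.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("OCEANBASE_URI", DEFAULTS["OCEANBASE_URI"]),
        "user": env.get("OCEANBASE_USER", DEFAULTS["OCEANBASE_USER"]),
        "password": env.get("OCEANBASE_PASSWORD", DEFAULTS["OCEANBASE_PASSWORD"]),
        "db_name": env.get("OCEANBASE_DB", DEFAULTS["OCEANBASE_DB"]),
    }


# A multi-tenant account only needs the user overridden.
print(oceanbase_settings({"OCEANBASE_USER": "root@test"}))
```

The resulting dict can be unpacked into the constructor, for example `OceanBaseStore(collection_name=..., dimensions=..., distance=..., **oceanbase_settings())`.
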
## References

- [OceanBase Vector Store](https://github.com/oceanbase/pyobvector)
- [AgentScope RAG Tutorial](https://doc.agentscope.io/tutorial/task_rag.html)
350
examples/functionality/vector_store/oceanbase/main.py
Normal file
@@ -0,0 +1,350 @@
# -*- coding: utf-8 -*-
"""Example of using OceanBaseStore in AgentScope RAG system."""
import asyncio
import os

from agentscope.rag import (
    OceanBaseStore,
    Document,
    DocMetadata,
)
from agentscope.message import TextBlock


def _create_store(
    collection_name: str,
    dimensions: int = 4,
    distance: str = "COSINE",
) -> OceanBaseStore:
    """Create an OceanBaseStore from environment variables or local defaults."""
    return OceanBaseStore(
        collection_name=collection_name,
        dimensions=dimensions,
        distance=distance,
        uri=os.getenv("OCEANBASE_URI", "127.0.0.1:2881"),
        user=os.getenv("OCEANBASE_USER", "root"),
        password=os.getenv("OCEANBASE_PASSWORD", ""),
        db_name=os.getenv("OCEANBASE_DB", "test"),
    )


async def example_basic_operations() -> None:
    """The example of basic CRUD operations with OceanBaseStore."""
    print("\n" + "=" * 60)
    print("Test 1: Basic CRUD Operations")
    print("=" * 60)

    store = _create_store(collection_name="ob_basic_collection")
    store.get_client().drop_collection("ob_basic_collection")

    print("✓ OceanBaseStore initialized")

    test_docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Artificial Intelligence is the future",
                ),
                doc_id="doc_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Machine Learning is a subset of AI",
                ),
                doc_id="doc_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Deep Learning uses neural networks",
                ),
                doc_id="doc_3",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    await store.add(test_docs)
    print(f"✓ Added {len(test_docs)} documents to the store")

    query_embedding = [0.15, 0.25, 0.35, 0.45]
    results = await store.search(
        query_embedding=query_embedding,
        limit=2,
    )

    print(f"\n✓ Search completed, found {len(results)} results:")
    for i, result in enumerate(results, 1):
        print(f" {i}. Score: {result.score:.4f}")
        print(f" Content: {result.metadata.content}")
        print(f" Doc ID: {result.metadata.doc_id}")

    results_filtered = await store.search(
        query_embedding=query_embedding,
        limit=5,
        score_threshold=0.9,
    )
    print(f"\n✓ Search with threshold (>0.9): {len(results_filtered)} results")

    client = store.get_client()
    table = client.load_table(collection_name="ob_basic_collection")
    await store.delete(where=[table.c["doc_id"] == "doc_2"])
    print("\n✓ Deleted document with doc_id='doc_2'")

    results_after_delete = await store.search(
        query_embedding=query_embedding,
        limit=5,
    )
    print(f"✓ After deletion: {len(results_after_delete)} documents remain")

    print(f"\n✓ Got MilvusLikeClient: {type(client).__name__}")


||||
async def example_filter_search() -> None:
    """Example of search with metadata filtering."""
    print("\n" + "=" * 60)
    print("Test 2: Search with Metadata Filtering")
    print("=" * 60)

    store = _create_store(collection_name="ob_filter_collection")
    client = store.get_client()
    # Drop any collection left over from a previous run.
    client.drop_collection("ob_filter_collection")

    # Two groups of documents, distinguished by their doc_id prefix
    # ("prog_" and "ai_").
    docs = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Python is a programming language",
                ),
                doc_id="prog_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Java is used for enterprise applications",
                ),
                doc_id="prog_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Neural networks are used in AI",
                ),
                doc_id="ai_1",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Deep learning requires GPUs",
                ),
                doc_id="ai_2",
                chunk_id=0,
                total_chunks=1,
            ),
            embedding=[0.4, 0.5, 0.6, 0.7],
        ),
    ]

    await store.add(docs)
    print(f"✓ Added {len(docs)} documents with different doc_id prefixes")

    # Baseline: search without any filter.
    query_embedding = [0.25, 0.35, 0.45, 0.55]
    all_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
    )
    print(f"\n✓ Search without filter: {len(all_results)} results")
    for i, result in enumerate(all_results, 1):
        print(
            f" {i}. Doc ID: {result.metadata.doc_id}, "
            f"Score: {result.score:.4f}",
        )

    # Restrict the search to doc_ids starting with "prog". The keyword
    # is spelled `flter`, as written in this example.
    table = client.load_table(collection_name="ob_filter_collection")
    prog_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        flter=[table.c["doc_id"].like("prog%")],
    )
    print("\n✓ Search with filter (doc_id like 'prog%'):")
    for i, result in enumerate(prog_results, 1):
        print(
            f" {i}. Doc ID: {result.metadata.doc_id}, "
            f"Score: {result.score:.4f}",
        )

    # Same query, restricted to doc_ids starting with "ai".
    ai_results = await store.search(
        query_embedding=query_embedding,
        limit=4,
        flter=[table.c["doc_id"].like("ai%")],
    )
    print("\n✓ Search with filter (doc_id like 'ai%'):")
    for i, result in enumerate(ai_results, 1):
        print(
            f" {i}. Doc ID: {result.metadata.doc_id}, "
            f"Score: {result.score:.4f}",
        )


async def example_multiple_chunks() -> None:
    """Example of a document split into multiple chunks."""
    print("\n" + "=" * 60)
    print("Test 3: Documents with Multiple Chunks")
    print("=" * 60)

    store = _create_store(collection_name="ob_chunks_collection")
    # Drop any collection left over from a previous run.
    store.get_client().drop_collection("ob_chunks_collection")

    # One logical document ("book_1") stored as three chunks that share
    # a doc_id and differ in chunk_id.
    chunks = [
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Chapter 1: Introduction to AI",
                ),
                doc_id="book_1",
                chunk_id=0,
                total_chunks=3,
            ),
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Chapter 2: Machine Learning Basics",
                ),
                doc_id="book_1",
                chunk_id=1,
                total_chunks=3,
            ),
            embedding=[0.2, 0.3, 0.4, 0.5],
        ),
        Document(
            metadata=DocMetadata(
                content=TextBlock(
                    type="text",
                    text="Chapter 3: Deep Learning Advanced",
                ),
                doc_id="book_1",
                chunk_id=2,
                total_chunks=3,
            ),
            embedding=[0.3, 0.4, 0.5, 0.6],
        ),
    ]

    await store.add(chunks)
    print(f"✓ Added document with {len(chunks)} chunks")

    # The query equals the second chunk's embedding.
    query_embedding = [0.2, 0.3, 0.4, 0.5]
    results = await store.search(
        query_embedding=query_embedding,
        limit=3,
    )

    print("\n✓ Search results for multi-chunk document:")
    for i, result in enumerate(results, 1):
        chunk_info = (
            f"{result.metadata.chunk_id}/{result.metadata.total_chunks}"
        )
        print(f" {i}. Chunk {chunk_info}")
        print(f" Content: {result.metadata.content}")
        print(f" Score: {result.score:.4f}")


async def example_distance_metrics() -> None:
    """Example of using different distance metrics."""
    print("\n" + "=" * 60)
    print("Test 4: Different Distance Metrics")
    print("=" * 60)

    metrics = ["COSINE", "L2", "IP"]

    # Create one collection per metric and query it with the stored vector.
    for metric in metrics:
        print(f"\n--- Testing {metric} metric ---")
        collection_name = f"ob_{metric}_collection"
        store = _create_store(
            collection_name=collection_name,
            distance=metric,
        )
        # Drop any collection left over from a previous run.
        store.get_client().drop_collection(collection_name)

        docs = [
            Document(
                metadata=DocMetadata(
                    content=TextBlock(
                        type="text",
                        text=f"Test doc for {metric}",
                    ),
                    doc_id=f"doc_{metric}_1",
                    chunk_id=0,
                    total_chunks=1,
                ),
                embedding=[0.1, 0.2, 0.3, 0.4],
            ),
        ]

        await store.add(docs)
        results = await store.search(
            query_embedding=[0.1, 0.2, 0.3, 0.4],
            limit=1,
        )

        print(f"✓ {metric} metric: Score = {results[0].score:.4f}")


async def main() -> None:
    """Run all examples."""
    print("\n" + "=" * 60)
    print("OceanBaseStore Comprehensive Test Suite")
    print("=" * 60)

    try:
        await example_basic_operations()
        await example_filter_search()
        await example_multiple_chunks()
        await example_distance_metrics()

        print("\n" + "=" * 60)
        print("✓ All tests completed successfully!")
        print("=" * 60)

    except Exception as e:
        # Report the failure and print the full traceback for debugging.
        print(f"\n✗ Test failed with error: {e}")
        import traceback

        traceback.print_exc()


if __name__ == "__main__":
    asyncio.run(main())
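As a reading aid for the scores printed by `example_distance_metrics`, the three metric names conventionally refer to cosine similarity, Euclidean (L2) distance, and inner product. The sketch below computes those textbook quantities in plain Python for the same vector the example stores and queries. The helper names are hypothetical and not part of AgentScope or pyobvector, and how `OceanBaseStore` maps each quantity onto `result.score` is not shown in this file, so the printed scores may be a transformed version of these values.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products (the "IP" metric)."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors (the "L2" metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product normalized by both vector lengths ("COSINE")."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Same vector as the stored document and the query in the example.
doc = [0.1, 0.2, 0.3, 0.4]
query = [0.1, 0.2, 0.3, 0.4]

print(f"cosine similarity: {cosine_similarity(doc, query):.4f}")
print(f"L2 distance:       {l2_distance(doc, query):.4f}")
print(f"inner product:     {inner_product(doc, query):.4f}")
```

For identical vectors the cosine similarity is 1 and the L2 distance is 0, while the inner product depends on the vector's magnitude, which is why the three metrics are not directly comparable without normalization.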