Overview
The document-api is the knowledge backend for the Rapida platform. It processes documents from PDF, Word, CSV, and other formats into searchable vector embeddings and full-text search indices. At call time, assistant-api queries this service to inject relevant knowledge context into the LLM prompt.
| | |
|---|---|
| Port | 9010 — HTTP (FastAPI / uvicorn) |
| Language | Python 3.11+, FastAPI + Celery |
| Storage | PostgreSQL (assistant_db), Redis (Celery broker), OpenSearch (vectors + text) |

Document processing is asynchronous. When a document is uploaded, the API immediately returns a document_id with status: processing. Text extraction, chunking, and embedding generation are handled by Celery workers in the background.

Components
Document Ingestion Pipeline
The processing pipeline runs as a Celery task after each upload. The stages are sequential and the document status is updated at each step.
| Stage | Library | Configurable |
|---|---|---|
| Text extraction | format-specific (see table below) | No |
| Chunking | custom splitter | CHUNK_SIZE, CHUNK_OVERLAP |
| Embeddings | sentence-transformers | EMBEDDINGS_MODEL |
| Full-text index | OpenSearch | — |
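The chunking stage can be sketched as a fixed-size character splitter with overlap, using the defaults from the tuning table below (CHUNK_SIZE=1000, CHUNK_OVERLAP=100). This is an illustrative sketch; the service's actual custom splitter may differ:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks, each overlapping
    the previous one by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2500-character document yields two full chunks and one remainder.
chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks, at the cost of some index redundancy.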
Supported File Formats
| Format | Library | What is extracted |
|---|---|---|
| PDF | PyPDF2, pdfplumber | Text content + metadata |
| Word (.docx) | python-docx | Text + paragraph structure |
| Excel (.xlsx) | openpyxl, pandas | Cell values as text |
| CSV | pandas | Row data as text |
| Markdown (.md) | built-in | Text preserving structure |
| HTML | BeautifulSoup | Cleaned text from HTML |
| Plain text (.txt) | built-in | Direct read |
| Images | pytesseract (OCR) | OCR-extracted text |
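Extraction is typically dispatched on the file extension. A minimal sketch mirroring the table above (the extractor function names here are hypothetical, not the service's actual API):

```python
from pathlib import Path

# Map file suffix to a (hypothetical) extractor, mirroring the table above.
EXTRACTORS = {
    ".pdf": "extract_pdf",      # PyPDF2 / pdfplumber
    ".docx": "extract_docx",    # python-docx
    ".xlsx": "extract_xlsx",    # openpyxl / pandas
    ".csv": "extract_csv",      # pandas
    ".md": "extract_plain",     # built-in read
    ".html": "extract_html",    # BeautifulSoup
    ".txt": "extract_plain",    # built-in read
}

def extractor_for(filename: str) -> str:
    """Return the extractor name for a filename, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix}")

print(extractor_for("report.PDF"))  # extract_pdf
```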
Embedding Models
Embeddings are generated using sentence-transformers. The model is configurable, set via EMBEDDINGS_MODEL in the config:
| Model | Dimensions | Speed | Quality | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Default, ~80 MB |
| all-mpnet-base-v2 | 768 | Medium | High | Larger model |
| all-MiniLM-L12-v2 | 384 | Medium | Good | 12-layer variant, slower than the default |
| multilingual-e5-base | 768 | Medium | Good | 100+ languages |
Audio Noise Reduction (RNNoise)
The document-api includes RNNoise, a recurrent neural network noise suppressor, for processing audio documents. When enabled, noise reduction is applied before transcription.
| Setting | Variable | Values |
|---|---|---|
| Enable/disable | RNNOISE_ENABLED | true · false |
| Suppression level | RNNOISE_LEVEL | 0.0 (off) to 1.0 (maximum) |
Semantic Search
At call time, assistant-api queries document-api with a text query. The service performs a vector similarity search and returns the top-k most relevant chunks.
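The ranking step can be sketched in plain Python as cosine similarity over the stored chunk vectors (a stand-in for the k-NN query the service actually runs against OpenSearch):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query vector."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-dimensional vectors; real embeddings are 384- or 768-dimensional.
vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], vecs, k=2))  # [0, 1]
```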
Search request
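The document does not reproduce the request body. A plausible shape, in which the endpoint path and every field name are assumptions rather than the service's confirmed API:

```python
import json

# Hypothetical body for a POST to the search endpoint; field names
# are assumptions, not taken from the actual document-api schema.
payload = {
    "query": "What is the refund policy?",
    "top_k": 5,  # number of chunks to return
}
print(json.dumps(payload, indent=2))
```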
Configuration
The document-api uses a YAML config file at docker/document-api/config.yaml combined with environment variables.
Required settings
| Variable | Required | Default | Description |
|---|---|---|---|
| postgres.host | ✅ Yes | localhost | PostgreSQL host |
| postgres.db | ✅ Yes | assistant_db | Database name |
| postgres.auth.user | ✅ Yes | rapida_user | Database user |
| postgres.auth.password | ✅ Yes | — | Database password |
| elastic_search.host | ✅ Yes | localhost | OpenSearch host |
| celery.broker | ✅ Yes | redis://localhost:6379/0 | Celery broker URL |
| celery.backend | ✅ Yes | redis://localhost:6379/0 | Celery result backend URL |
Tuning settings
| Setting | Default | Description |
|---|---|---|
| CHUNK_SIZE | 1000 | Characters per document chunk |
| CHUNK_OVERLAP | 100 | Character overlap between adjacent chunks |
| MAX_FILE_SIZE | 52428800 | Maximum upload size in bytes (50 MB) |
| EMBEDDINGS_MODEL | all-MiniLM-L6-v2 | Sentence-transformers model name |
| EMBEDDINGS_DIMENSION | 384 | Embedding vector dimension |
| CELERY_WORKERS | 4 | Number of Celery worker processes |
| RNNOISE_ENABLED | true | Enable audio noise reduction |
| RNNOISE_LEVEL | 0.5 | Noise reduction level (0.0–1.0) |
Full config file (docker/document-api/config.yaml)
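The file itself is not reproduced here. A minimal sketch assembled from the settings tables above (key nesting is inferred from the dotted names, not verified against the actual file):

```yaml
postgres:
  host: localhost
  db: assistant_db
  auth:
    user: rapida_user
    password: change-me   # required; no default

elastic_search:
  host: localhost

celery:
  broker: redis://localhost:6379/0
  backend: redis://localhost:6379/0
```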
Running
- Docker Compose
- From Source

document-api is part of the knowledge Docker Compose profile and is not started by default.

Health & Observability
| Endpoint | Purpose |
|---|---|
| GET /readiness/ | Reports whether the service is ready |
| GET /healthz/ | Liveness probe |
Troubleshooting
Document stuck in 'processing' status
The Celery worker is likely not running. Check that:
- a worker process is running and connected to the Redis broker (for example, `celery -A <app> inspect ping` from the service environment)
- Redis is reachable at the configured celery.broker URL
- the worker logs show no traceback for the failed task
Embedding generation is slow
Reduce batch size to lower memory pressure, or increase it for throughput on capable hardware:
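The document does not name the batch-size setting; as an illustration (EMBED_BATCH_SIZE is hypothetical), batching splits the chunk list into fixed-size groups before each encode call:

```python
def batched(items: list, batch_size: int):
    """Yield successive fixed-size batches; the last one may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# EMBED_BATCH_SIZE is a hypothetical name; the real setting is not
# specified in this document.
EMBED_BATCH_SIZE = 4
chunks = [f"chunk {i}" for i in range(10)]
print([len(b) for b in batched(chunks, EMBED_BATCH_SIZE)])  # [4, 4, 2]
```

Smaller batches bound peak memory per encode call; larger batches amortize model overhead and improve throughput on capable hardware.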
OpenSearch index errors
High memory usage
Next Steps
- Assistant API: how assistants use knowledge bases during calls.
- Architecture: full system topology and data flow diagrams.
- Installation Guide: deploy the full platform with Docker Compose.
- Configuration Reference: full environment variable reference.