Usage¶
This guide covers how to use es-translator to translate documents in Elasticsearch.
Quick Start¶
Translate documents from French to English:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en
Installation¶
Install with pip:
pip install es-translator
Or run it from the Docker image:
docker run -it icij/es-translator es-translator --help
Or install from source:
git clone https://github.com/icij/es-translator.git
cd es-translator
make install
Optional: Install Apertium¶
Apertium is only required if you want to use the Apertium interpreter. es-translator works out of the box with Argos (the default).
wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
sudo apt install apertium-all-dev
Basic Translation¶
Translate a Field¶
By default, es-translator translates the content field and stores results in content_translated:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en
Translate a Different Field¶
To translate a different field (e.g., title):
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--source-field title \
--target-field title_translated
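When several fields need translating, the same command can be generated once per field. The sketch below only builds argument lists from the flags shown above. The `FIELDS` list and the `<field>_translated` naming scheme are assumptions for illustration, and nothing is executed unless you pass a list to something like `subprocess.run`.

```python
import shlex
from typing import List

# Hypothetical list of fields to translate; adjust to your own mapping.
FIELDS = ["content", "title"]


def build_command(field: str) -> List[str]:
    """Return the es-translator argument list for one source field."""
    return [
        "es-translator",
        "--url", "http://localhost:9200",
        "--index", "my-index",
        "--source-language", "fr",
        "--target-language", "en",
        "--source-field", field,
        # Assumed naming scheme, mirroring content -> content_translated.
        "--target-field", f"{field}_translated",
    ]


commands = [build_command(field) for field in FIELDS]
for argv in commands:
    # Print a copy-pasteable shell line.
    # To execute instead: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```

Keeping the arguments as a list rather than a single string avoids shell quoting problems if a field name or query ever contains spaces.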
Filter Documents¶
Use Elasticsearch query strings to filter which documents to translate:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--query-string "type:article AND status:published"
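To preview which documents a filter matches before translating, the same string can be wrapped in a standard Elasticsearch `query_string` search body and sent to the index's `_search` endpoint. The helper below is a minimal sketch; whether es-translator builds exactly this body internally is an assumption.

```python
import json


def query_string_body(query_string: str) -> dict:
    """Wrap a Lucene-style query string in an Elasticsearch search body."""
    return {"query": {"query_string": {"query": query_string}}}


body = query_string_body("type:article AND status:published")
# Send this JSON to http://localhost:9200/my-index/_search to inspect the matches.
print(json.dumps(body, indent=2))
```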
Force Re-translation¶
By default, es-translator skips documents that have already been translated. To re-translate them, pass --force:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--force
Choosing an Interpreter¶
es-translator supports two translation backends:
| Feature | Argos (default) | Apertium |
|---|---|---|
| Type | Neural Machine Translation | Rule-based Translation |
| Quality | Higher quality | Good for related languages |
| Speed | Slower (ML models) | Faster |
| Offline | Yes (downloads models) | Yes (system packages) |
| Languages | ~30 languages | 40+ language pairs |
| Intermediary | Not supported | Supported |
| Installation | Automatic | Requires system packages |
Using Argos (Default)¶
Argos provides neural machine translation with automatic model downloading:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--interpreter argos
Using Apertium¶
Apertium provides rule-based translation, ideal for related languages:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language es \
--target-language pt \
--interpreter apertium
Intermediary Languages¶
When a direct translation pair isn't available, use an intermediary language so the text is translated in two steps (here Portuguese to Spanish, then Spanish to English):
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language pt \
--target-language en \
--interpreter apertium \
--intermediary-language es
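Choosing an intermediary amounts to finding a language that is reachable from the source and can itself reach the target. The sketch below illustrates that lookup over a hypothetical set of pairs. `PAIRS` is made up for the example and would need to be filled in by hand from the pairs available on your system.

```python
from typing import List, Set, Tuple

# Hypothetical (source, target) pairs; not a real listing of Apertium pairs.
PAIRS: Set[Tuple[str, str]] = {
    ("pt", "es"),
    ("es", "pt"),
    ("es", "en"),
    ("fr", "es"),
}


def find_intermediaries(source: str, target: str, pairs: Set[Tuple[str, str]]) -> List[str]:
    """List languages X for which both source->X and X->target are available."""
    reachable = {dst for (src, dst) in pairs if src == source}
    return sorted(lang for lang in reachable if (lang, target) in pairs)


if ("pt", "en") in PAIRS:
    print("Direct pair available, no intermediary needed")
else:
    print("Candidate intermediaries:", find_intermediaries("pt", "en", PAIRS))
```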
List Available Pairs¶
# Show remotely available pairs
es-translator-pairs
# Show locally installed pairs
es-translator-pairs --local
Performance Tuning¶
Parallel Processing¶
Use multiple worker processes for faster translation:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--pool-size 4
Large Datasets¶
For large datasets, increase the scroll duration (default: 5m) so that the Elasticsearch scroll context does not expire before all documents have been read:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--scan-scroll 30m
Limit Content Length¶
Set a maximum content length to prevent issues with very large documents:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--max-content-length 10M
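Values such as 10M or 1G are human-readable sizes. As a rough guide to what they mean in bytes, the sketch below parses them using binary multiples (1K = 1024 bytes). That multiplier is an assumption; es-translator's own parsing may use different units.

```python
import re

# Assumed binary multipliers; es-translator's own interpretation may differ.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value: str) -> int:
    """Convert a size such as '10M' or '1G' into a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([KMG]?)B?", value.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


for size in ("10M", "1G"):
    print(size, "=", parse_size(size), "bytes")
```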
Throttling¶
Add a delay between translations to reduce the load on Elasticsearch:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--throttle 100 # 100ms delay
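Conceptually, throttling inserts a fixed pause between consecutive units of work. The generator below is a minimal sketch of that idea, not es-translator's implementation; the document IDs are placeholders.

```python
import time


def throttled(items, delay_ms):
    """Yield items one by one, sleeping delay_ms milliseconds between them."""
    for position, item in enumerate(items):
        if position > 0:
            # No pause before the first item, only between items.
            time.sleep(delay_ms / 1000)
        yield item


for doc_id in throttled(["doc-1", "doc-2", "doc-3"], delay_ms=100):
    print("translating", doc_id)
```

Because the pause is paid between every pair of items, total run time grows linearly with the number of documents, so keep the delay as small as your cluster tolerates.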
GPU Acceleration¶
Argos supports GPU acceleration via CUDA for faster neural translation:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--device cuda
Available device options:
| Device | Description |
|---|---|
| auto | Use CUDA if available, otherwise CPU (default) |
| cuda | Force GPU usage (requires CUDA) |
| cpu | Force CPU usage |
You can also set the device via environment variable:
export ES_TRANSLATOR_DEVICE=cuda
es-translator ...
CUDA Requirements
GPU acceleration requires a CUDA-compatible GPU with the appropriate CUDA libraries installed. If --device cuda is specified and CUDA is not available, translation fails. See the "Install NVIDIA drivers on Ubuntu AWS instances" guide for setup instructions.
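The behaviour of the three device options can be summarised in a few lines. This is an illustrative sketch rather than es-translator's code: `cuda_available` stands in for a real CUDA check, and the helper names are hypothetical.

```python
from typing import Mapping


def requested_device(environ: Mapping[str, str]) -> str:
    """Read the device from the environment, defaulting to 'auto'."""
    return environ.get("ES_TRANSLATOR_DEVICE", "auto")


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map 'auto', 'cuda' or 'cpu' to the device that would actually be used."""
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested not in ("cuda", "cpu"):
        raise ValueError(f"Unknown device: {requested!r}")
    if requested == "cuda" and not cuda_available:
        # Mirrors the documented behaviour: forcing CUDA without CUDA fails.
        raise RuntimeError("CUDA was requested but is not available")
    return requested


# Unset variable on a machine without a GPU: falls back to the CPU.
print(resolve_device(requested_device({}), cuda_available=False))
# Explicit CPU request on a machine with a GPU: the CPU is still used.
print(resolve_device(requested_device({"ES_TRANSLATOR_DEVICE": "cpu"}), cuda_available=True))
```

Using auto is the safer choice for scripts that run on mixed hardware, since only an explicit cuda request can fail.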
Distributed Translation¶
For very large datasets, distribute translation across multiple servers using Celery and Redis.
Step 1: Plan the Translation¶
Pass --plan together with --broker-url to queue documents for translation:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--broker-url "redis://redis:6379" \
--plan
Step 2: Start Workers¶
Start one or more workers to process the queue:
es-translator-tasks \
--broker-url "redis://redis:6379" \
--concurrency 2
Using Docker for Workers¶
Run workers as a service with automatic restart:
docker run \
--detach \
--restart on-failure \
--name es-translator-worker \
icij/es-translator es-translator-tasks \
--broker-url "redis://redis:6379" \
--concurrency 2
Step 3: Monitor Progress¶
Use the live monitoring interface to track translation progress:
es-translator-monitor --broker-url "redis://redis:6379"
The monitor displays:
- Queue Status: Pending, active, and completed tasks
- Progress: Completion percentage, remaining tasks, and ETA
- Workers: Connected workers with per-worker throughput (tasks/sec)
- Throughput: Real-time graph of translation speed over time
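The progress figures relate to the queue counts in a simple way. The sketch below shows one plausible calculation of completion percentage, remaining tasks, and ETA from those counts; the exact formulas used by the monitor are an assumption, and the numbers are invented.

```python
from typing import Optional, Tuple


def progress(
    completed: int, pending: int, active: int, tasks_per_second: float
) -> Tuple[float, int, Optional[float]]:
    """Return (percent complete, remaining tasks, ETA in seconds or None)."""
    total = completed + pending + active
    remaining = pending + active
    percent = 100.0 * completed / total if total else 100.0
    # Without any measured throughput an ETA cannot be estimated.
    eta = remaining / tasks_per_second if tasks_per_second > 0 else None
    return percent, remaining, eta


percent, remaining, eta = progress(completed=750, pending=240, active=10, tasks_per_second=2.5)
print(f"{percent:.1f}% done, {remaining} tasks left, ETA {eta:.0f}s")
```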
Testing & Debugging¶
Dry Run¶
Test without saving to Elasticsearch:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--dry-run
Debug Logging¶
Enable verbose logging:
es-translator \
--url "http://localhost:9200" \
--index my-index \
--source-language fr \
--target-language en \
--stdout-loglevel DEBUG
CLI Reference¶
es-translator¶
Main command for translating documents.
es-translator [OPTIONS]
Options:
-u, --url TEXT Elasticsearch URL
-i, --index TEXT Elasticsearch Index
-r, --interpreter TEXT Interpreter (argos or apertium)
-s, --source-language TEXT Source language [required]
-t, --target-language TEXT Target language [required]
--intermediary-language TEXT Intermediary language for indirect translation
--source-field TEXT Field to translate (default: content)
--target-field TEXT Field for translations (default: content_translated)
-q, --query-string TEXT Filter documents with query string
-d, --data-dir PATH Directory for language models
--scan-scroll TEXT Scroll duration (default: 5m)
--dry-run Don't save to Elasticsearch
-f, --force Re-translate existing translations
--pool-size INTEGER Number of parallel workers
--pool-timeout INTEGER Worker timeout in seconds
--throttle INTEGER Delay between translations (ms)
--progressbar / --no-progressbar Show progress bar
--plan Queue for distributed translation
--broker-url TEXT Redis URL for distributed mode
--max-content-length TEXT Max content length (e.g., 10M, 1G)
--device [cpu|cuda|auto] Device for Argos translation (default: auto)
--stdout-loglevel TEXT Log level (DEBUG, INFO, WARNING, ERROR)
--help Show help
es-translator-tasks¶
Start Celery workers for distributed translation.
es-translator-tasks [OPTIONS]
Options:
--broker-url TEXT Redis URL
--concurrency INTEGER Number of concurrent workers
--stdout-loglevel TEXT Log level
--help Show help
es-translator-pairs¶
List available Apertium language pairs.
es-translator-pairs [OPTIONS]
Options:
--data-dir PATH Directory for language packs
--local Show only locally installed pairs
--help Show help
es-translator-monitor¶
Live monitoring interface for distributed translation workers.
es-translator-monitor [OPTIONS]
Options:
--broker-url TEXT Redis URL (default: redis://localhost:6379)
--refresh FLOAT Refresh interval in seconds (default: 2.0)
--history FLOAT Throughput history duration in seconds (default: 60.0)
--chart-scale [s|min|h] Throughput scale for chart (default: s)
--worker-scale [s|min|h] Throughput scale for workers table (default: min)
--worker-throughput-lifespan FLOAT Duration to average per-worker throughput (default: 30.0)
--help Show help
Throughput Scale Options¶
The --chart-scale and --worker-scale options control how throughput is displayed:
| Scale | Description | Example |
|---|---|---|
| s | Tasks per second | 2.5 tasks/s |
| min | Tasks per minute | 150.0 tasks/min |
| h | Tasks per hour | 9000.0 tasks/h |
Example: Show tasks/second in the chart, tasks/minute in the workers table:
es-translator-monitor --chart-scale s --worker-scale min
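The three scales are the same measurement expressed in different units. A short sketch of the conversion, using the example rate from the table above (the function name and output format are illustrative, not the monitor's code):

```python
# Number of seconds in each display unit.
SECONDS_PER_UNIT = {"s": 1, "min": 60, "h": 3600}


def format_throughput(tasks_per_second: float, scale: str) -> str:
    """Express a per-second rate in the requested scale."""
    rate = tasks_per_second * SECONDS_PER_UNIT[scale]
    return f"{rate:.1f} tasks/{scale}"


for scale in ("s", "min", "h"):
    print(format_throughput(2.5, scale))
```

Larger scales are easier to read when workers are slow, for example when each task takes several seconds.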
Worker Throughput Smoothing¶
The --worker-throughput-lifespan option controls how per-worker throughput is calculated. Instead of showing instantaneous rates (which can be noisy), it averages the throughput over the specified duration.
- Default: 30 seconds
- Higher values = smoother, more stable readings
- Lower values = more responsive to changes
Example: Use a 60-second window for smoother per-worker throughput:
es-translator-monitor --worker-throughput-lifespan 60
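The smoothing described above can be pictured as a sliding-window average over task completion times. The class below is a minimal sketch under that assumption, not the monitor's actual code. Timestamps are passed in explicitly to keep the example deterministic; in practice they would come from a clock such as `time.monotonic()`.

```python
from collections import deque


class WindowedThroughput:
    """Average task completions over the last `lifespan` seconds."""

    def __init__(self, lifespan: float = 30.0) -> None:
        self.lifespan = lifespan
        self.completions = deque()  # completion timestamps, oldest first

    def record(self, timestamp: float) -> None:
        """Register one finished task."""
        self.completions.append(timestamp)

    def rate(self, now: float) -> float:
        """Return tasks per second over the window ending at `now`."""
        # Discard completions older than the window.
        while self.completions and self.completions[0] < now - self.lifespan:
            self.completions.popleft()
        return len(self.completions) / self.lifespan


steady = WindowedThroughput(lifespan=30.0)
for second in range(60):
    steady.record(float(second))  # one task per second for a minute
print(steady.rate(now=60.0))

burst = WindowedThroughput(lifespan=10.0)
for second in range(55, 60):
    burst.record(float(second))  # five tasks in the last five seconds
print(burst.rate(now=60.0))
```

A longer window dilutes short bursts and pauses, which is why higher values give steadier readings at the cost of reacting more slowly.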