Using with Datashare

Datashare is ICIJ's open-source document analysis platform. es-translator was specifically designed to work with Datashare's Elasticsearch indices.

Overview

Datashare stores extracted document content in Elasticsearch. es-translator can translate this content, making documents searchable in multiple languages.

┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Datashare  │────▶│  Elasticsearch  │◀────│  es-translator  │
│  (extract)  │     │    (storage)    │     │   (translate)   │
└─────────────┘     └─────────────────┘     └─────────────────┘

Document Structure

Datashare documents in Elasticsearch have this structure:

{
  "content": "Original document text...",
  "contentTranslated": [
    {
      "content": "Translated text...",
      "source_language": "FRENCH",
      "target_language": "ENGLISH",
      "translator": "ARGOS"
    }
  ],
  "type": "Document",
  "path": "/path/to/file.pdf",
  ...
}

Basic Usage

Translate All Documents

Translate all documents from French to English:

es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated

Translate Only Documents (Skip Named Entities)

Datashare indices contain both documents and named entities. To translate only documents (here, only those detected as French), filter with --query-string:

es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated \
  --query-string "type:Document AND language:FRENCH"
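
Before starting a long run, it can help to check how many documents a query string matches. The sketch below calls Elasticsearch's _count endpoint with curl. The helper names count_url and count_matches are hypothetical (they are not part of es-translator), and the sketch assumes that --query-string accepts the same query syntax as the q parameter of _count.

```shell
# Hypothetical helper (not part of es-translator): build the _count endpoint URL.
# $1 = Elasticsearch base URL, $2 = index name. A trailing slash on $1 is dropped.
count_url() {
  printf '%s/%s/_count\n' "${1%/}" "$2"
}

# Ask Elasticsearch how many documents match a query string.
# $1 = base URL, $2 = index name, $3 = query string
# (assumed to use the same syntax as --query-string).
count_matches() {
  # --get sends the URL-encoded data as a GET query string
  curl --silent --get "$(count_url "$1" "$2")" --data-urlencode "q=$3"
}
```

For example, count_matches "http://localhost:9200" local-datashare "type:Document AND language:FRENCH" should return a JSON body with a count field. A count of zero suggests the filter, not the translator, is the reason nothing gets translated.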

Translate Specific Project

If you have multiple Datashare projects, pass the index name of the project you want to translate to --index:

es-translator \
  --url "http://localhost:9200" \
  --index my-project \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated

Handling Large Datasets

Datashare projects can contain millions of documents. Here's how to handle them efficiently.

Use Distributed Translation

For large projects, use planning mode (--plan) to queue the documents in a broker, then let one or more workers translate them in parallel:

# Step 1: Queue all documents
es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated \
  --broker-url "redis://redis:6379" \
  --query-string "type:Document AND language:FRENCH" \
  --plan

# Step 2: Start multiple workers
es-translator-tasks --broker-url "redis://redis:6379" --concurrency 4

Handle Content Length Limits

Datashare's highlighting feature has content length limits. Use --max-content-length to truncate translations:

es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated \
  --max-content-length 19G

Why 19G?

Datashare relies on Lucene's highlighter, which has internal limits on content length. The 19G value matches Datashare's expected maximum. See datashare#1184 for details.

Prevent Scroll Timeout

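For very large indices, increase the scroll duration with --scan-scroll so that the Elasticsearch scroll context does not expire while documents are still being processed: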

es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated \
  --scan-scroll 30m

Multiple Languages

Sequential Translation

To translate documents into multiple target languages, run es-translator once per target language:

# French to English
es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated

# French to Spanish
es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language es \
  --source-field content \
  --target-field contentTranslated

Translations are appended to the contentTranslated array, so multiple translations can coexist.
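
The repeated commands above can be folded into a loop. This is a minimal sketch: translate_to_all and the DRY_RUN variable are hypothetical conveniences defined here, not es-translator features, and the URL and index are the same example values used throughout this page.

```shell
# Hypothetical wrapper: run es-translator once per target language.
# Usage: translate_to_all <source-language> <target-language>...
# Set DRY_RUN=1 to print each command instead of running it.
translate_to_all() {
  source_language="$1"
  shift
  for target_language in "$@"; do
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty
    ${DRY_RUN:+echo} es-translator \
      --url "http://localhost:9200" \
      --index local-datashare \
      --source-language "$source_language" \
      --target-language "$target_language" \
      --source-field content \
      --target-field contentTranslated || return 1
  done
}
```

Calling DRY_RUN=1 translate_to_all fr en es prints the two commands without touching the index. Dropping DRY_RUN runs them one after the other and stops at the first failure.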

Troubleshooting

"Connection refused" to Elasticsearch

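If es-translator runs in Docker and Elasticsearch runs on the host, localhost inside the container refers to the container itself, not the host. Either share the host's network or address the host explicitly: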

# Use host network mode
docker run --network host icij/es-translator es-translator ...

# Or use host.docker.internal (Docker Desktop)
docker run icij/es-translator es-translator \
  --url "http://host.docker.internal:9200" ...
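
To confirm which URL is actually reachable, a quick probe with curl can help. The helper below is a hypothetical sketch, not part of es-translator. It assumes curl is available wherever you run it, and it treats any HTTP error status (including an authentication failure) as unreachable.

```shell
# Hypothetical helper: succeed only if an HTTP server answers at the given URL
# with a non-error status within 5 seconds.
# $1 = Elasticsearch URL, e.g. http://localhost:9200
es_reachable() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}
```

For instance, es_reachable "http://host.docker.internal:9200" && echo reachable. The result is only meaningful when the probe runs from the same network context as es-translator, for example inside the container if curl is installed there.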

Documents Not Being Translated

Check if documents are being filtered correctly:

# Test with dry-run and debug logging
es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated \
  --query-string "type:Document" \
  --dry-run \
  --stdout-loglevel DEBUG

Existing Translations Not Overwritten

By default, es-translator skips already-translated documents. Use --force to re-translate:

es-translator \
  --url "http://localhost:9200" \
  --index local-datashare \
  --source-language fr \
  --target-language en \
  --source-field content \
  --target-field contentTranslated \
  --force

Environment Variables

When deploying with Datashare, you can configure es-translator via environment variables:

Variable                            Description            Default
ES_TRANSLATOR_ELASTICSEARCH_URL     Elasticsearch URL      http://localhost:9200
ES_TRANSLATOR_ELASTICSEARCH_INDEX   Default index          local-datashare
ES_TRANSLATOR_REDIS_URL             Redis URL for Celery   redis://localhost:6379
ES_TRANSLATOR_INTERPRETER           Default interpreter    ARGOS
ES_TRANSLATOR_SOURCE_FIELD          Default source field   content
ES_TRANSLATOR_TARGET_FIELD          Default target field   content_translated
ES_TRANSLATOR_MAX_CONTENT_LENGTH    Max content length     19G

Note that the default target field listed here (content_translated) differs from the contentTranslated field used in the Datashare examples on this page, so set the target field explicitly when translating for Datashare.
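
As an illustration, a Datashare deployment might export a few of these variables before starting es-translator. This is a sketch only: it assumes the variables act as defaults for the corresponding command-line options, and the values are the example values from this page rather than recommendations.

```shell
# Example values taken from this page; adjust them to your deployment.
export ES_TRANSLATOR_ELASTICSEARCH_URL="http://localhost:9200"
export ES_TRANSLATOR_ELASTICSEARCH_INDEX="local-datashare"
export ES_TRANSLATOR_REDIS_URL="redis://localhost:6379"
# Use the field shown in the Datashare examples instead of the listed default.
export ES_TRANSLATOR_TARGET_FIELD="contentTranslated"
export ES_TRANSLATOR_MAX_CONTENT_LENGTH="19G"
```

With these set, the matching flags could presumably be left off the command line. Passing them explicitly remains the safer choice until you have confirmed that behaviour in your version.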