Dependency injection¶
What are dependencies?¶
We call a "lifespan dependency", or just "dependency", any object used by an activity worker which needs to live longer than the activity duration.
Common dependencies include:

- clients
- connections and connection pools
- ML models
Dependency injection makes sure that these dependencies are available in your activity code, and that resources are correctly freed when they are no longer needed.
How does datashare-python dependency injection work?¶
datashare-python's dependency injection is inspired by FastAPI's
lifespan events handling.
The idea is to:

1. provide a number of functions and/or context managers which initialize dependencies and store them in the worker thread context
2. define functions which access this context, letting you use these variables in your code
Providing dependencies¶
If you are building an automatic speech recognition worker, you might implement the following activity:
```python
from pathlib import Path

from transformers import CohereAsrForConditionalGeneration

from utils import activity_defn


@activity_defn(name="asr-transcription")
def asr_activity(audios: list[Path]) -> list:
    ml_model = CohereAsrForConditionalGeneration.from_pretrained(  # (1)!
        "CohereLabs/cohere-transcribe-03-2026", device_map="auto"
    )
    return ml_model.transcribe(audios)
```
- this is awfully heavy, we don't want to reload the model every time!
The obvious problem with this implementation is that we'll reload the model each time we receive audios to process. Ideally we'd like to have the model preloaded in memory and just run inference.
Instead, we'll define a lifespan dependency which loads the model and stores it in the worker thread context variables:
```python
from contextvars import ContextVar

from transformers import CohereAsrForConditionalGeneration

ML_MODEL: ContextVar[CohereAsrForConditionalGeneration | None] = ContextVar("ml_model")  # (1)!


def load_ml_model() -> None:
    ml_model = CohereAsrForConditionalGeneration.from_pretrained(  # (2)!
        "CohereLabs/cohere-transcribe-03-2026", device_map="auto"
    )
    ML_MODEL.set(ml_model)  # (3)!
```
- register a context variable with the `ml_model` name
- load the model
- store the model in the registered context variable
A better version of this dependency uses a context manager to make sure resources are freed when the worker no longer needs the dependency:
```python
import gc
from contextlib import contextmanager
from typing import Generator

import torch
from transformers import CohereAsrForConditionalGeneration


@contextmanager
def load_ml_model() -> Generator[None, None, None]:
    ml_model = CohereAsrForConditionalGeneration.from_pretrained(
        "CohereLabs/cohere-transcribe-03-2026", device_map="auto"
    )
    ML_MODEL.set(ml_model)
    try:
        yield  # (1)!
    finally:  # (2)!
        del ml_model
        torch.cuda.empty_cache()
        gc.collect()
        ML_MODEL.set(None)
```
- let the calling code run
- clean everything up when the caller is done
Accessing dependencies¶
Now that we've registered our dependency in the thread context, we need to update our activity to access the context
variable. We could do it directly by calling `ML_MODEL.get()`, but we can more elegantly define the
following dependency function:
```python
def lifespan_ml_model() -> CohereAsrForConditionalGeneration:
    try:
        return ML_MODEL.get()
    except LookupError as e:
        raise DependencyInjectionError("ml model") from e
```
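When the accessor runs outside of a lifespan (before the dependency was ever set), `ContextVar.get` raises a `LookupError`, which the function wraps. A self-contained sketch of this error path, with a stand-in error class (`DependencyInjectionError` defined here locally is an assumption about datashare-python's type):

```python
from contextvars import ContextVar


class DependencyInjectionError(RuntimeError):
    """Stand-in for datashare-python's error type (assumed name)."""


UNSET_MODEL: ContextVar[object] = ContextVar("unset_model")


def lifespan_unset_model() -> object:
    try:
        return UNSET_MODEL.get()  # raises LookupError: nothing was ever set
    except LookupError as e:
        raise DependencyInjectionError("unset_model") from e


try:
    lifespan_unset_model()
except DependencyInjectionError as e:
    print(type(e).__name__)  # → DependencyInjectionError
```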
Next, we'll use this function in our activity:
```python
from pathlib import Path

from utils import activity_defn

from .dependencies import lifespan_ml_model


@activity_defn(name="asr-transcription")
def asr_activity(audios: list[Path]) -> list:
    ml_model = lifespan_ml_model()  # (1)!
    return ml_model.transcribe(audios)
```
- load the cached model rather than reloading it
Worker dependency discovery¶
In order for dependencies to be discoverable by datashare-python's CLI, they need to be registered.
Under the hood, the DEPENDENCIES variable is registered as a plugin entry point:

```toml
[project.entry-points."datashare.dependencies"]
dependencies = "asr_worker.dependencies:DEPENDENCIES"
```
When running a worker using the datashare-python worker start CLI, datashare-python will look for any variable registered under
the "datashare.dependencies" key and the dependencies entry point name, and will use the dependencies registered in these variables.
You can register as many dependency sets as you want in the bound variable. You can use the variable name of your choice
for the dict registry, as long as it's bound under the "datashare.dependencies" key and the dependencies entry point name.
Selecting dependencies when running datashare-python's CLI¶
When running an activity worker using the datashare-python worker start CLI, datashare-python will auto-discover dependencies and, if the registry has a single entry, it will
automatically use this dependency set.
In case your registry contains multiple dependency sets, you can call the CLI providing the set's key (here "base") as an argument: