Data Management
This project treats large datasets as external assets, not repository contents.
Principles
Keep Git history lightweight and reviewable.
Keep tests reproducible with small synthetic fixtures.
Keep real datasets in external storage (local cache, object store, or shared filesystem).
Dataset Registry
Use scripts/data/dataset_registry.json as the source of truth for downloadable datasets.
id: stable dataset identifier
url: download URL
filename: local filename under the dataset folder
sha256: optional integrity check
extract: auto-extract archives when true
enabled: include/exclude from fetch runs
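The way these fields could drive a fetch run can be sketched as follows. This is a minimal illustration, not the logic of fetch_datasets.py: the helper names (enabled_entries, sha256_of, verify_entry) are hypothetical, entries are assumed to be plain dicts carrying the fields listed above, and treating a missing enabled field as true is an assumption.

```python
import hashlib
from pathlib import Path


def enabled_entries(entries):
    """Keep entries that take part in fetch runs.

    Assumption: an entry without an `enabled` field counts as enabled.
    """
    return [e for e in entries if e.get("enabled", True)]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(entry, dataset_dir):
    """Return True when the downloaded file matches the entry's checksum.

    `sha256` is optional in the registry, so an entry without it passes.
    """
    expected = entry.get("sha256")
    if not expected:
        return True
    actual = sha256_of(Path(dataset_dir) / entry["filename"])
    return actual == expected.lower()
```

Hashing in chunks matters here because the registry exists for datasets too large to keep in the repository.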
Fetch Datasets
# list enabled datasets
scripts/data/fetch_datasets.py --list
# fetch selected datasets to default cache (~/.cache/chatspatial/datasets)
scripts/data/fetch_datasets.py --dataset your_dataset_id
# fetch all enabled datasets to a custom location
scripts/data/fetch_datasets.py --dest /data/chatspatial
Register Local Paths (Team/Personal)
For datasets that should not be downloaded publicly:
scripts/data/register_external_dataset.py my_dataset /absolute/path/to/data.h5ad
This creates data/datasets.local.json (gitignored).
Workspace Cleanup
Clean local build/test artifacts without touching source files:
scripts/maintenance/clean_workspace.sh
Testing Contract
The default test suite must not depend on code/data.
Use fixtures that generate temporary .h5ad files under tmp_path.
Heavy dependency checks are marked @pytest.mark.slow and excluded by default.
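One way to satisfy this contract is a conftest.py along the following lines. It is a sketch under stated assumptions, not the project's actual configuration: the --runslow option and the tiny_h5ad fixture name are hypothetical, and the project may instead exclude slow tests through a marker expression in its pytest settings. The fixture assumes numpy and anndata are installed and skips the test otherwise.

```python
import pytest


def pytest_addoption(parser):
    # Hypothetical opt-in flag; slow tests stay excluded unless it is passed.
    parser.addoption("--runslow", action="store_true", default=False,
                     help="also run tests marked @pytest.mark.slow")


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "slow: heavy dependency check")


def pytest_collection_modifyitems(config, items):
    # Attach a skip marker to every slow test unless --runslow was given.
    if config.getoption("--runslow"):
        return
    skip_slow = pytest.mark.skip(reason="needs --runslow to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)


@pytest.fixture
def tiny_h5ad(tmp_path):
    """Write a small synthetic .h5ad file under tmp_path and return its path."""
    np = pytest.importorskip("numpy")
    ad = pytest.importorskip("anndata")
    rng = np.random.default_rng(0)  # fixed seed keeps the fixture reproducible
    counts = rng.poisson(1.0, size=(20, 5)).astype("float32")
    adata = ad.AnnData(counts)
    adata.obsm["spatial"] = rng.uniform(0, 100, size=(20, 2))
    path = tmp_path / "tiny.h5ad"
    adata.write_h5ad(path)
    return path
```

A test that requests tiny_h5ad gets a fresh file on every run, so nothing in the default suite reads from a real dataset directory.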