Dataset configuration
NeuRepTrace can run time-resolved decoding from JSON or YAML configs. The config
keeps dataset-specific file conventions outside Python workflow code, which makes
legacy project layouts such as PyMEGDec's Part*Data.mat files expressible as a
recipe rather than a separate package.
Commands
Validate a config and all referenced input files:
neureptrace validate-dataset-config configs/bush_meg/stimulus_decoding.yml --check-files
Print the effective config after overrides, participant expansion, and input-file resolution:
neureptrace validate-dataset-config configs/bush_meg/stimulus_decoding.yml \
  --set participants.ids=2 \
  --print-effective-config
Run the configured decoder:
neureptrace decode-from-config configs/bush_meg/stimulus_decoding.yml
Override individual values without copying the config:
neureptrace decode-from-config configs/bush_meg/stimulus_decoding.yml \
  --set participants.ids='[2,3,4]' \
  --set decoding.classifier=lda \
  --set preprocessing.pca_components=50
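Override values are parsed as JSON when possible, so numbers arrive as numbers and lists must be written as JSON. Quote lists (for example '[2,3,4]') so the shell passes them through as a single argument.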
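The override behaviour described above can be sketched in a few lines. This is an illustration only, not NeuRepTrace's implementation: the function names parse_override and apply_override are hypothetical, and falling back to a plain string when the value is not valid JSON is an assumption drawn from the "when possible" wording.

```python
import json
from typing import Any


def parse_override(text: str) -> tuple[list[str], Any]:
    """Split 'a.b=value' into a dotted key path and a parsed value."""
    key, sep, raw = text.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {text!r}")
    try:
        # '2' -> int, '[2,3,4]' -> list, 'true' -> bool
        value = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: anything that is not valid JSON (e.g. lda) stays a string.
        value = raw
    return key.split("."), value


def apply_override(config: dict, text: str) -> None:
    """Set a nested key in place, creating intermediate dicts as needed."""
    path, value = parse_override(text)
    node = config
    for part in path[:-1]:
        node = node.setdefault(part, {})
    node[path[-1]] = value


config = {
    "participants": {"ids": "1-4,6,8"},
    "decoding": {"classifier": "multiclass-svm"},
}
for item in [
    "participants.ids=[2,3,4]",
    "decoding.classifier=lda",
    "preprocessing.pca_components=50",
]:
    apply_override(config, item)
```

After the loop, participants.ids is a Python list, decoding.classifier is the string lda, and preprocessing.pca_components is an integer created under a new preprocessing key.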
Config structure
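A config has five conceptual layers (dataset, metadata, preprocessing, decoding, and outputs), alongside supporting keys such as schema_version, paths, participants, and validation: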
schema_version: neureptrace.dataset.v1
paths:
  base: cwd  # cwd | config_dir | project_root
dataset:
  name: bush_meg
  type: fieldtrip_mat
  root: ${BUSH_MEG_DATA_DIR}
  participant_file: "Part{participant}Data.mat"
  variable: data
participants:
  ids: "1-4,6,8"
validation:
  channel_policy: exact
metadata:
  columns:
    - name: stimulus_class
      index: 0
  maps:
    stimulus_class:
      1: face
      2: object
  filters:
    - column: stimulus_class
      include: [1, 2]
preprocessing:
  tmin: -0.2
  tmax: 0.6
  window_size: 0.1
  window_step: 0.05
decoding:
  label_column: stimulus_class
  group_column: participant
  classifier: multiclass-svm
outputs:
  base_dir: results/bush_meg
  summary_csv: stimulus_summary.csv
  observations_csv: stimulus_observations.csv
  provenance: true
dataset describes how files are found and read. metadata maps source
metadata into named columns, optionally recodes values, and can filter trials.
preprocessing controls windowing and normalization. decoding selects labels,
grouping, classifiers, calibration, and tuning. outputs names the generated
CSVs and provenance sidecars.
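To make the metadata layer concrete, here is a minimal sketch of how columns, filters, and maps could act on a trialinfo-style table. It is not NeuRepTrace's code: build_metadata is a hypothetical name, trialinfo is modelled as a plain list of rows, and the order of operations (filter on raw values first, then recode) is an assumption based on the example config, where include lists the raw codes 1 and 2.

```python
def build_metadata(trialinfo, metadata_cfg):
    """Return (kept_trial_indices, metadata_rows) for one file."""
    columns = metadata_cfg.get("columns", [])

    # 1. Name the source columns: pick trialinfo[:, index] for each entry.
    rows = [
        (i, {col["name"]: row[col["index"]] for col in columns})
        for i, row in enumerate(trialinfo)
    ]

    # 2. Filter trials on raw values (assumed to happen before recoding).
    for flt in metadata_cfg.get("filters", []):
        allowed = set(flt["include"])
        rows = [(i, r) for i, r in rows if r[flt["column"]] in allowed]

    # 3. Recode values; unmapped values are left unchanged.
    for name, mapping in metadata_cfg.get("maps", {}).items():
        for _, r in rows:
            r[name] = mapping.get(r[name], r[name])

    return [i for i, _ in rows], [r for _, r in rows]


metadata_cfg = {
    "columns": [{"name": "stimulus_class", "index": 0}],
    "maps": {"stimulus_class": {1: "face", 2: "object"}},
    "filters": [{"column": "stimulus_class", "include": [1, 2]}],
}
# Three made-up trials; the second has class code 3 and is filtered out.
trialinfo = [[1, 10], [3, 11], [2, 12]]
kept, rows = build_metadata(trialinfo, metadata_cfg)
```

Returning the kept trial indices alongside the rows lets the caller subset the signal array with the same selection.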
Path semantics
Input files are resolved relative to dataset.root when that key is present,
otherwise relative to the config file directory. Output paths are intentionally
separate: relative output paths are resolved against outputs.base_dir when it
is configured, or against paths.base otherwise.
paths.base supports:
- cwd: the current working directory; this is the default for output paths.
- config_dir: the directory containing the YAML/JSON config.
- project_root: the nearest parent containing pyproject.toml or .git, with the current working directory as fallback.
Defaulting output paths to the current working directory prevents a config under
configs/bush_meg/ from unexpectedly writing to configs/bush_meg/results/...
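The three paths.base options can be illustrated with pathlib. This is a sketch under stated assumptions, not the package's resolver: find_project_root and resolve_base are hypothetical names, the project_root search is assumed to start at the working directory, and the starting directory itself is assumed to count as a candidate.

```python
import tempfile
from pathlib import Path

MARKERS = ("pyproject.toml", ".git")  # markers named in the text above


def find_project_root(start: Path, fallback: Path) -> Path:
    """Return the nearest directory at or above `start` that holds a marker."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in MARKERS):
            return candidate
    return fallback.resolve()


def resolve_base(kind: str, config_path: Path, cwd: Path) -> Path:
    """Map a paths.base value to a concrete directory."""
    if kind == "cwd":
        return cwd.resolve()
    if kind == "config_dir":
        return config_path.resolve().parent
    if kind == "project_root":
        # Assumption: the search starts from the working directory.
        return find_project_root(cwd, fallback=cwd)
    raise ValueError(f"unknown paths.base value: {kind!r}")


# Demo in a throwaway tree: <tmp>/proj/pyproject.toml and <tmp>/proj/configs/bush_meg/
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "proj"
    config_dir = project / "configs" / "bush_meg"
    config_dir.mkdir(parents=True)
    (project / "pyproject.toml").touch()
    config_path = config_dir / "stimulus_decoding.yml"

    from_config_dir = resolve_base("config_dir", config_path, cwd=project)
    from_root = resolve_base("project_root", config_path, cwd=config_dir)
    expected_config_dir = config_dir.resolve()
    expected_project = project.resolve()
```

In the demo, config_dir resolves to the folder holding the config, while project_root walks up from the working directory until it finds pyproject.toml.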
Dataset types
mne_epochs
Use this when the data already exists as an MNE Epochs FIF file:
dataset:
  type: mne_epochs
  epochs: data/sub-01_epo.fif
  metadata_csv: data/sub-01_events.csv
If the epochs file already contains metadata, metadata_csv can be omitted. When
--check-files is used, both the epochs file and metadata_csv are checked.
fieldtrip_mat
Use this for FieldTrip-like MATLAB structs with trial, time, label, and
trialinfo fields:
dataset:
  type: fieldtrip_mat
  root: ${BUSH_MEG_DATA_DIR}
  participant_file: "Part{participant}Data.mat"
  variable: data
  fields:
    trial: trial
    time: time
    label: label
    trialinfo: trialinfo
participants:
  ids: "1-4,6,8"
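Participant expansion turns an ids value such as "1-4,6,8" into concrete file names through the participant_file template. The sketch below shows one plausible reading, not the package's parser: expand_participants is a hypothetical name, ranges are assumed to be inclusive, and integers or lists (as produced by --set overrides) are assumed to be accepted as well.

```python
def expand_participants(spec) -> list[int]:
    """Expand 2, [2, 3, 4], or '1-4,6,8' into a list of participant ids."""
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return [int(item) for item in spec]

    ids: list[int] = []
    for part in str(spec).split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if high < low:
                raise ValueError(f"descending range in participant spec: {part!r}")
            ids.extend(range(low, high + 1))  # assumed inclusive on both ends
        else:
            ids.append(int(part))
    return ids


ids = expand_participants("1-4,6,8")
template = "Part{participant}Data.mat"
names = [template.format(participant=p) for p in ids]
```

Each resulting name would then be joined to dataset.root to locate the file.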
For main/cue transfer or other multi-file setups, use explicit files and attach metadata to each file:
dataset:
  type: fieldtrip_mat
  root: ${BUSH_MEG_DATA_DIR}
  files:
    - path: "Part2Data.mat"
      participant: 2
      split: main
    - path: "Part2CueData.mat"
      participant: 2
      split: cue
Extra keys on each file entry become metadata columns after loading.
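As a rough illustration of that rule, the sketch below copies every key other than path onto each trial of the corresponding file. The helper name file_metadata and the trial counts are made up for the example; how NeuRepTrace actually stores these columns is not specified here.

```python
def file_metadata(entry: dict, n_trials: int) -> list[dict]:
    """Repeat a file entry's extra keys once per trial loaded from that file."""
    extra = {key: value for key, value in entry.items() if key != "path"}
    return [dict(extra) for _ in range(n_trials)]


files = [
    {"path": "Part2Data.mat", "participant": 2, "split": "main"},
    {"path": "Part2CueData.mat", "participant": 2, "split": "cue"},
]
# Hypothetical trial counts, standing in for what the loader would report.
trial_counts = {"Part2Data.mat": 3, "Part2CueData.mat": 2}

table = []
for entry in files:
    table.extend(file_metadata(entry, trial_counts[entry["path"]]))
```

With a split column attached to every trial, a main/cue transfer analysis can select training and test trials by value instead of by file name.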
Channel alignment
validation.channel_policy controls multi-file concatenation:
- exact: all files must have identical channel names in identical order.
- intersection: keep only channels present in every file, in the first file's order, and record dropped channels in provenance.
- first_dataset: keep the first file's channel set and order; later files may have extra channels, but must contain all first-file channels.
exact is the recommended default for scientific reproducibility.
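The three policies can be expressed over plain lists of channel names. This is a minimal sketch of the rules as stated above, not the package's implementation; align_channels is a hypothetical name, and returning the dropped channels as a second value is an assumption about how they might be handed to provenance.

```python
def align_channels(channel_lists, policy="exact"):
    """Return (kept_channels, dropped_channels) for a list of per-file channel lists."""
    first = list(channel_lists[0])
    rest = [list(channels) for channels in channel_lists[1:]]

    if policy == "exact":
        for position, channels in enumerate(rest, start=1):
            if channels != first:
                raise ValueError(f"file {position} differs from the first file's channels")
        return first, []

    if policy == "intersection":
        common = set(first).intersection(*(set(channels) for channels in rest))
        kept = [name for name in first if name in common]  # first file's order
    elif policy == "first_dataset":
        for position, channels in enumerate(rest, start=1):
            missing = [name for name in first if name not in set(channels)]
            if missing:
                raise ValueError(f"file {position} is missing channels: {missing}")
        kept = first
    else:
        raise ValueError(f"unknown channel policy: {policy!r}")

    # Every channel seen in some file but not kept, in first-seen order.
    dropped = []
    for channels in [first, *rest]:
        for name in channels:
            if name not in kept and name not in dropped:
                dropped.append(name)
    return kept, dropped


kept_i, dropped_i = align_channels([["A", "B", "C"], ["C", "A", "D"]], "intersection")
kept_f, dropped_f = align_channels([["A", "B"], ["B", "A", "X"]], "first_dataset")
```

Note that exact rejects a mere reordering of the same channels, which is what makes it the strictest of the three.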
Provenance
Config-driven workflows write sidecar files such as:
stimulus_decoding_summary.csv.provenance.json
The sidecar records the config path, effective config hash, input files, optional input-file SHA-256 hashes, and the fixed random seed used by current decoding workflows. Disable sidecars with:
outputs:
  provenance: false
Disable input hashing for very large files with:
outputs:
  hash_input_files: false
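For readers who want to verify a sidecar by hand, the hashing it describes can be approximated with the standard library. This is a sketch, not the package's code: the function names and the record's field names are hypothetical, the real sidecar layout may differ, and hashing a key-sorted JSON dump is only one plausible way to derive an effective config hash. The seed is passed in as a parameter because its value is not stated here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large inputs never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def config_hash(effective_config: dict) -> str:
    """Hash a canonical JSON dump so key order does not change the result."""
    canonical = json.dumps(effective_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_provenance(config_path, effective_config, input_files,
                     random_seed, hash_input_files=True) -> dict:
    """Assemble a sidecar-like record; field names are illustrative only."""
    inputs = []
    for path in input_files:
        item = {"path": str(path)}
        if hash_input_files:
            item["sha256"] = sha256_file(path)
        inputs.append(item)
    return {
        "config_path": str(config_path),
        "config_hash": config_hash(effective_config),
        "input_files": inputs,
        "random_seed": random_seed,
    }


# Demo with a small temporary file standing in for a data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example.bin"
    data_file.write_bytes(b"abc")
    record = build_provenance("config.yml", {"b": 1, "a": 2}, [data_file], random_seed=0)
    unhashed = build_provenance("config.yml", {"a": 2, "b": 1}, [data_file],
                                random_seed=0, hash_input_files=False)
```

Skipping the per-file digest, as hash_input_files: false does, avoids reading very large inputs a second time while still recording which files were used.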