# Troubleshooting

Common issues and solutions for DREAM.
## Memory Issues

### "CUDA out of memory"
**Problem:** Running out of GPU memory during training.

**Solutions:**

- Reduce batch size:

  ```python
  batch_size = 16  # instead of 32
  ```

- Reduce model size:

  ```python
  config = DREAMConfig(
      hidden_dim=128,  # instead of 256
      rank=8,          # instead of 16
  )
  ```

- Use gradient accumulation:

  ```python
  accumulation_steps = 4

  optimizer.zero_grad()
  for i, batch in enumerate(dataloader):
      loss = compute_loss(batch) / accumulation_steps
      loss.backward()
      if (i + 1) % accumulation_steps == 0:
          optimizer.step()
          optimizer.zero_grad()
  ```

- Use mixed precision:

  ```python
  from torch.cuda.amp import autocast, GradScaler

  scaler = GradScaler()

  with autocast():
      output, state = model(x)
      loss = criterion(output, target)

  scaler.scale(loss).backward()
  scaler.step(optimizer)
  scaler.update()
  ```

- Clear cache:

  ```python
  torch.cuda.empty_cache()
  ```
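The accumulation and mixed-precision fixes above compose into one loop. A minimal sketch of how they fit together, using a stand-in `nn.Linear` model and synthetic batches (the exact DREAM forward signature is not assumed here), with AMP enabled only when CUDA is actually available so the same loop also runs on CPU:

```python
import torch

# Stand-ins for the DREAM model and dataloader; substitute your own.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 4).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()

use_amp = device.type == "cuda"
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when disabled
accumulation_steps = 4

batches = [(torch.randn(16, 8), torch.randn(16, 4)) for _ in range(8)]
optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    x, y = x.to(device), y.to(device)
    with torch.cuda.amp.autocast(enabled=use_amp):
        # Divide so gradients average over the accumulated micro-batches
        loss = criterion(model(x), y) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```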
## Numerical Stability

### "Loss is NaN" or "Loss exploded"
**Problem:** Training produces NaN or Inf values.

**Causes:**

- Learning rate too high
- Time step too large
- Unstable initialization
- Gradient explosion
**Solutions:**

- Reduce learning rate:

  ```python
  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # instead of 1e-3
  ```

- Use a smaller time step:

  ```python
  config = DREAMConfig(time_step=0.05)  # instead of 0.1
  ```

- Use a larger time constant:

  ```python
  config = DREAMConfig(ltc_tau_sys=15.0)  # instead of 10.0
  ```

- Use smaller weights:

  ```python
  config = DREAMConfig(target_norm=1.5)  # instead of 2.0
  ```

- Clip gradients:

  ```python
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  ```

- Check input normalization:

  ```python
  # Ensure inputs are normalized
  x = (x - x.mean()) / x.std()
  ```
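If none of the above isolates the problem, PyTorch's anomaly detection can point at the exact operation that first produced a non-finite value. A generic sketch (the `nn.Linear` is a stand-in for the DREAM model):

```python
import torch

# Anomaly detection records the forward trace and raises during
# backward() at the op that produced NaN/Inf. It is slow; enable it
# only while debugging.
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(8, 4)  # stand-in for the DREAM model
x = torch.randn(16, 8)
output = model(x)
loss = output.pow(2).mean()

# Catch non-finite values before they propagate further
assert torch.isfinite(output).all(), "non-finite values in model output"

loss.backward()
torch.autograd.set_detect_anomaly(False)
```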
## Learning Issues

### "Model doesn't learn"

**Problem:** Loss stays constant or decreases very slowly.
**Check:**

```python
# 1. Gradients are flowing
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"No gradient for {name}")

# 2. Surprise is non-zero
print(f"Surprise: {state.avg_surprise.mean().item()}")

# 3. Input is normalized
print(f"Input mean: {x.mean().item()}, std: {x.std().item()}")

# 4. Check loss computation
print(f"Loss: {loss.item()}")
```

**Solutions:**
- Increase plasticity:

  ```python
  config = DREAMConfig(base_plasticity=0.2)  # instead of 0.1
  ```

- Reduce threshold:

  ```python
  config = DREAMConfig(base_threshold=0.3)  # instead of 0.5
  ```

- Enable LTC:

  ```python
  config = DREAMConfig(ltc_enabled=True)
  ```

- Check data loading:

  ```python
  # Verify data is correct
  print(f"Input shape: {x.shape}")
  print(f"Target shape: {y.shape}")
  print(f"Unique labels: {y.unique()}")
  ```

- Try different initialization:

  ```python
  def init_weights(m):
      if hasattr(m, 'weight') and m.weight.dim() > 1:
          torch.nn.init.xavier_uniform_(m.weight)

  model.apply(init_weights)
  ```
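A complementary sanity check is to overfit a single batch: if the model, loss, and optimizer are wired correctly, the loss on one repeated batch should fall sharply within a few hundred steps. The sketch below uses a stand-in `nn.Linear`; swap in the DREAM model and your own loss:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 2)  # stand-in for the DREAM model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
criterion = torch.nn.MSELoss()

# One fixed batch, repeated every step
x, y = torch.randn(16, 8), torch.randn(16, 2)

first_loss = None
for step in range(200):
    loss = criterion(model(x), y)
    if first_loss is None:
        first_loss = loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# If the loss does not fall well below its starting value, the problem
# is in the training setup, not the dataset.
print(f"loss: {first_loss:.4f} -> {loss.item():.4f}")
```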
### "Model converges too slowly"

**Problem:** Training takes many epochs to converge.

**Solutions:**
- Increase learning rate:

  ```python
  optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
  ```

- Increase plasticity:

  ```python
  config = DREAMConfig(base_plasticity=0.15)
  ```

- Use learning rate warmup:

  ```python
  from torch.optim.lr_scheduler import LinearLR

  warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=1000)
  ```

- Reduce sequence length:

  ```python
  # Use shorter sequences for faster iteration
  seq_len = 50  # instead of 100
  ```

- Use a larger batch size:

  ```python
  batch_size = 64  # instead of 32
  ```
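Warmup can also be chained with a decay schedule. A sketch using `SequentialLR` (the step counts are placeholders; tune them to your run length, and the `nn.Linear` stands in for the DREAM model):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 4)  # stand-in for the DREAM model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)

# Warm up for the first 1000 steps, then cosine-decay for the rest.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=1000),
        CosineAnnealingLR(optimizer, T_max=9000),
    ],
    milestones=[1000],
)

for step in range(2000):
    optimizer.step()  # real training step elided
    scheduler.step()
```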
## Performance Issues

### "Training is slow"

**Problem:** Training takes too long per epoch.

**Optimizations:**
- Use the GPU:

  ```python
  model = model.to('cuda')
  x = x.to('cuda')
  ```

- Reduce rank:

  ```python
  config = DREAMConfig(rank=8)  # instead of 16
  ```

- Disable LTC if not needed:

  ```python
  config = DREAMConfig(ltc_enabled=False)
  ```

- Use mixed precision:

  ```python
  from torch.cuda.amp import autocast

  with autocast():
      output, state = model(x)
  ```

- Increase batch size:

  ```python
  batch_size = 64  # larger batches are more efficient
  ```

- Use multiple workers for data loading:

  ```python
  loader = DataLoader(dataset, batch_size=32, num_workers=4)
  ```

- Profile to find bottlenecks:

  ```python
  with torch.profiler.profile() as prof:
      output, state = model(x)

  print(prof.key_averages().table(sort_by="cuda_time_total"))
  ```
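On PyTorch 2.x (which DREAM already requires), `torch.compile` may also speed up the forward and backward passes. Whether DREAM's stateful recurrence is fully traceable is not guaranteed, so this sketch falls back to eager mode if compilation fails:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.Tanh())  # stand-in
x = torch.randn(32, 64)

try:
    fast_model = torch.compile(model)
    y = fast_model(x)  # first call triggers compilation
except Exception:
    fast_model = model  # compiler backend unavailable: run eagerly
    y = fast_model(x)
```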
## State Management Issues

### "State shape mismatch"

**Problem:** State tensors have incorrect shapes.

**Solution:**

```python
# Ensure batch size matches
state = model.init_state(batch_size=x.shape[0])

# Check state shapes
print(f"Input shape: {x.shape}")
print(f"State h shape: {state.h.shape}")
print(f"State U shape: {state.U.shape}")
```

### "Memory leak with state"
**Problem:** GPU memory grows over time.

**Solution:**

```python
# Detach state for truncated BPTT
state = state.detach()

# Or reinitialize state periodically
if step % 100 == 0:
    state = model.init_state(batch_size)
```

## Import Errors
### "No module named 'dream'"

**Problem:** Cannot import DREAM.

**Solutions:**
- Verify installation:

  ```bash
  pip list | grep dreamnn
  ```

- Reinstall:

  ```bash
  pip install --upgrade dreamnn
  ```

- Check the Python environment:

  ```bash
  python -c "import sys; print(sys.executable)"
  pip --version
  ```

- Install from source:

  ```bash
  git clone https://github.com/karl4th/dream-nn.git
  cd dream-nn
  pip install -e .
  ```
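To see which interpreter is running and where (or whether) the `dream` package resolves on its import path, a quick diagnostic from inside Python:

```python
import importlib.util
import sys

# Confirm which Python is running; `pip` must belong to this interpreter.
print(f"Interpreter: {sys.executable}")

# Locate the installed package (PyPI name: dreamnn, import name: dream)
spec = importlib.util.find_spec("dream")
print(spec.origin if spec else "module 'dream' not found on sys.path")
```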
### "ImportError: cannot import name 'DREAM'"

**Problem:** Import statement is incorrect.

**Solution:**
```python
# Correct imports
from dream import DREAM, DREAMConfig, DREAMCell
from dream import DREAMStack, DREAMState

# Or
import dream
model = dream.DREAM(input_dim=64, hidden_dim=128)
```

## Version Compatibility
### "PyTorch version mismatch"

**Problem:** Incompatible PyTorch version.

**Solution:**

```bash
# Check PyTorch version
python -c "import torch; print(torch.__version__)"

# Upgrade if needed
pip install --upgrade torch

# DREAM requires PyTorch >= 2.0.0
```

## Debugging Tips
### Enable Debug Mode

```python
import logging
logging.basicConfig(level=logging.DEBUG)

# Add debug prints
model.train()
for batch in dataloader:
    print(f"Input shape: {batch.shape}")
    output, state = model(batch)
    print(f"Output shape: {output.shape}")
    print(f"Surprise: {state.avg_surprise.mean().item()}")
```

### Check Gradients
```python
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item():.4f}")
    else:
        print(f"{name}: no gradient")
```

### Monitor State Statistics
```python
@torch.no_grad()
def log_state(state, step):
    print(f"Step {step}:")
    print(f"  h norm: {state.h.norm().item():.4f}")
    print(f"  U norm: {state.U.norm().item():.4f}")
    print(f"  Surprise: {state.avg_surprise.mean().item():.4f}")
    print(f"  Adaptive tau: {state.adaptive_tau.mean().item():.4f}")
```

## Still Having Issues?
If you can't find a solution here:

- Check the FAQ for common questions
- Search existing issues on GitHub
- Create a new issue with:
  - A minimal reproducible example
  - The error message (full traceback)
  - Environment details (Python, PyTorch, OS versions)
  - What you've already tried
## Next Steps

- FAQ - Frequently asked questions
- Contributing - Contribute to DREAM
- Training Best Practices - Optimize training