An AI agent that autonomously runs your deep learning experiments 24/7 while you sleep.
English | 中文 | 日本語 | 한국어
2026-04-09
- Reduced token growth by resetting leader context between cycles.
- Added a lightweight fallback to avoid repeated no-progress loops.
- Hardened tool execution against path traversal and shell injection.
2026-04-08
- Added progress tracking exports for experiment monitoring.
- Supports optional Obsidian sync for a live dashboard plus daily notes.
- If no Obsidian vault is configured, progress falls back to project-local text files under workspace/progress_tracking/.
If you only want the shortest path to a working experiment loop, do this:
- Create a project folder with one file: PROJECT_BRIEF.md
- Run /auto-experiment --project /path/to/project --gpu 0
- Check progress with /experiment-status or the optional Obsidian/local text notes
Prefer AI-guided setup? Open AI_GUIDE.md in Claude / ChatGPT / Codex and let the assistant walk you through it.
| Requirement | Required | Notes |
|---|---|---|
| Python 3.10+ | Yes | Runtime |
| 1+ NVIDIA GPU | Yes | For training |
| API key | Yes | Anthropic or OpenAI |
| PROJECT_BRIEF.md | Yes | Main control file |
| Project config.yaml | Optional | Only if you want to override defaults |
| Obsidian vault | Optional | If absent, notes fall back to local text files |
The smallest project you can launch looks like this:
my-first-experiment/
├── PROJECT_BRIEF.md
└── workspace/          # auto-created
Minimal PROJECT_BRIEF.md:
# Goal
Train a ResNet-50 on CIFAR-100 to reach 80%+ accuracy.
# Codebase
Create the training code from scratch in PyTorch.
# What to Try
- Start with a basic ResNet-50 baseline.
- If accuracy < 75%, improve optimization and schedule.
- If accuracy is 75-80%, try augmentation.
- If accuracy > 80%, stop and report.
# Constraints
- Use GPU 0 only
- Max 100 epochs per run

That is enough to start. Everything else is optional refinement.
This project is for people who already know what experiment they want to run, but do not want to babysit the loop:
- edit code
- launch training
- monitor runs
- parse logs
- decide the next variation
- keep going while you sleep
It is not trying to replace the researcher. It is trying to take over the repetitive experiment-ops layer.
- It does not just launch one run. It keeps iterating.
- It does not just monitor. It reflects and decides the next step.
- It stays cheap because training-time monitoring makes zero LLM calls.
- It stays controllable because the human can override direction at any cycle.
- It now supports persistent progress notes in Obsidian or local text files.
You control the research direction through three files:
- PROJECT_BRIEF.md: stable goal, constraints, allowed search space
- HUMAN_DIRECTIVE.md: temporary redirect for the next cycle
- workspace/MEMORY_LOG.md: rolling memory of results and decisions
Common control patterns:
# Keep the search narrow
- Only tune augmentation.
- Do not change the backbone.
- Keep training budget fixed.

# Make the agent stop exploring a weak direction
- If gain stays below 0.3 points for 3 runs, stop this branch.
- Return to the last trusted baseline and try a different idea.

# Force result verification
- If a result looks unusually strong, rerun with the same seed and one new seed.
- Do not claim improvement until both reproduce.

You should never have to guess what the agent is doing.
- /experiment-status shows current goal, best result, cycle count, running status, and recent decisions
- /progress-report generates a structured summary
- /obsidian-sync refreshes persistent notes manually
- workspace/progress_tracking/ stores local text notes when no Obsidian vault is configured
If you want a dashboard outside the terminal:
obsidian:
  enabled: true
  vault_path: "~/Documents/MyObsidianVault"  # Optional
  auto_append_daily: true

If vault_path is empty, the same information is saved locally:
workspace/progress_tracking/Dashboard.txt
workspace/progress_tracking/Daily/YYYY-MM-DD.txt
Our hope is simple: science stays pure, and the human stays in the loop.
We built this framework for one reason: to take the repetitive, mechanical parts of running deep learning experiments off the researcher's plate (launching jobs, watching GPUs, parsing logs, sweeping hyperparameters) so that more of your time can go into the part that actually matters: thinking.
If you're here because you want to spend less time babysitting training runs and more time reading, reasoning, and chasing your own ideas, welcome. That's exactly who we built this for.
A gentle thought we'd love every user to share with us:
The agent is happy to run the experiments. But please let the ideas, the interpretation, and the scientific judgment remain yours. We don't see automation and academic integrity as being in tension; quite the opposite. The hours this tool gives back are meant to be reinvested in deeper thinking, not in skipping it.
So we'd kindly ask that this project not be used to fabricate results, to generate "research" with no human in the loop, or to shortcut the parts of science that depend on a human actually understanding what they're doing. That isn't the future we want to help build, and we don't think it's the one most of you want either.
Science should stay pure. The agent can run the experiments โ but the ideas, the interpretation, and the responsibility belong to the human.
We trust the people who pick up this tool to take that seriously, and we built it because we believe most of you already do. Thank you for being one of them.
You design the experiment. The agent handles the repetitive loop.
Deep Researcher Agent:
- Thinks → Reads your project brief, analyzes previous results, plans the next experiment
- Executes → Modifies code/configs, runs a dry-run, launches training on GPU
- Monitors → Watches training at zero LLM cost (just process checks + log reads)
- Reflects → Parses results, compares with baselines, decides what to try next
- Repeats → 24/7, without human intervention
You sleep 8 hours → Agent runs 3 experiment cycles
You go on vacation → Agent explores 50+ hyperparameter configs
You write your paper → Agent already has the results table ready
Not benchmarks. Real results from months of 24/7 autonomous operation across research projects.
| Metric | Result |
|---|---|
| Autonomous experiment cycles completed | 500+ |
| Best single-project improvement | 52% over baseline (across 200+ auto-run experiments) |
| Concurrent projects managed | 4 projects across 4 GPU servers |
| Longest continuous autonomous operation | 30+ days without human intervention |
| Average LLM cost per 24h cycle | ~$0.08 |
The #1 concern with running LLM agents 24/7: cost.
Most agent frameworks call the LLM every few minutes to "check progress". That's $50+/day.
Deep Researcher Agent sleeps during training and makes zero API calls; it only wakes the LLM when training finishes.
   LLM Active             Zero Cost              LLM Active
┌──────────────┐  ┌───────────────────────┐  ┌──────────────┐
│    THINK     │  │    TRAIN & MONITOR    │  │   REFLECT    │
│  (5-10 min)  │  │     (hours/days)      │  │  (5-10 min)  │
│              │  │                       │  │              │
│  • Analyze   │  │  • kill -0 $PID       │  │  • Parse     │
│  • Plan      │  │  • nvidia-smi         │  │    logs      │
│  • Code      │  │  • tail log           │  │  • Compare   │
│              │  │                       │  │  • Decide    │
│    ~$0.05    │  │        $0.00          │  │    ~$0.03    │
└──────────────┘  └───────────────────────┘  └──────────────┘
24-hour cycle with 8 hours of training: ~$0.08 in LLM calls.
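The monitoring phase really is just cheap OS checks. A minimal Python sketch of what one poll might look like (the helper names are illustrative, not the project's actual API):

```python
import os

def process_alive(pid: int) -> bool:
    """Liveness probe via signal 0: no LLM call, no side effects."""
    try:
        os.kill(pid, 0)          # does not actually deliver a signal
        return True
    except ProcessLookupError:   # PID is gone: training finished or crashed
        return False
    except PermissionError:      # PID exists but belongs to another user
        return True

def tail(log_path: str, n: int = 5) -> list[str]:
    """Grab the last n log lines for the status snapshot."""
    try:
        with open(log_path) as f:
            return f.readlines()[-n:]
    except FileNotFoundError:
        return []
```

A real monitor would also shell out to nvidia-smi for GPU utilization; the point is that none of this touches the LLM.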
┌────────────────────────────────────────────────────────┐
│   ┌──────────┐     ┌──────────┐     ┌──────────┐       │
│   │  THINK   │────▶│ EXECUTE  │────▶│ REFLECT  │──┐    │
│   │          │     │          │     │          │  │    │
│   │ Analyze  │     │ Dry-run  │     │ Evaluate │  │    │
│   │ Plan     │     │ Launch   │     │ Compare  │  │    │
│   │ Decide   │     │ Monitor  │     │ Update   │  │    │
│   └──────────┘     └──────────┘     └──────────┘  │    │
│        ▲                                          │    │
│        └──────────────────────────────────────────┘    │
│                      ↻ 24/7 Loop                       │
└────────────────────────────────────────────────────────┘
Only ONE worker runs at a time. Others idle at zero cost.
        ┌───────────────┐
        │    Leader     │  Persistent conversation
        │   (Planner)   │  within each cycle
        └───┬───┬───┬───┘
            │   │   │
     ┌──────┘   │   └──────┐
     │          │          │
┌──────────┐ ┌──────────┐ ┌──────────┐
│   Idea   │ │   Code   │ │ Writing  │
│  Agent   │ │  Agent   │ │  Agent   │
│ (4 tools)│ │ (5 tools)│ │ (3 tools)│
└──────────┘ └──────────┘ └──────────┘
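Because only one worker is ever active, dispatch can be a simple lookup that hands the leader's choice exactly one small tool set. A hypothetical sketch — the worker names match the diagram, but the tool names and `dispatch` function are invented for illustration, not the project's actual API:

```python
# Hypothetical worker registry: each worker exposes only a few tools,
# which keeps per-call schema overhead small.
WORKERS = {
    "idea":    ["search_papers", "summarize", "propose_experiment", "rank_ideas"],    # 4 tools
    "code":    ["read_file", "write_file", "dry_run", "launch_training", "kill_job"], # 5 tools
    "writing": ["read_file", "summarize", "write_report"],                            # 3 tools
}

def dispatch(choice: str) -> list[str]:
    """Hand the leader's single choice its minimal tool set; everyone else idles."""
    if choice not in WORKERS:
        raise ValueError(f"unknown worker: {choice!r}")
    return WORKERS[choice]
```

Since idle workers receive no LLM calls, the marginal cost of having three of them is zero.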
┌───────────────────────────────────────────┐
│ Tier 1: PROJECT_BRIEF.md                  │
│  • Frozen project reference               │
│  • Max 3,000 chars                        │
├───────────────────────────────────────────┤
│ Tier 2: MEMORY_LOG.md                     │
│  • Key Results (auto-compact at 1,200 ch) │
│  • Recent Decisions (rolling last 15)     │
│  • Max 2,000 chars                        │
├───────────────────────────────────────────┤
│ Total: ~5K chars / ~1,500 tokens          │
│ SAME whether running 1 day or 6 months    │
└───────────────────────────────────────────┘
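The compaction rules above are mechanical enough to sketch in a few lines of Python. This is an illustrative sketch, not the real memory.py; the function name and exact truncation policy are assumptions:

```python
def compact_memory(milestones: list[str], decisions: list[str],
                   milestone_max_chars: int = 1200,
                   max_recent: int = 15) -> str:
    """Keep the newest milestones within a fixed char budget and only the
    last N decisions, so the rendered log stays constant-size forever."""
    kept, used = [], 0
    for m in reversed(milestones):              # newest first
        if used + len(m) + 1 > milestone_max_chars:
            break                               # budget exhausted: drop older entries
        kept.append(m)
        used += len(m) + 1                      # +1 for the newline
    kept.reverse()                              # restore chronological order
    return ("## Key Results\n" + "\n".join(kept) +
            "\n\n## Recent Decisions\n" + "\n".join(decisions[-max_recent:]))
```

Whether the agent has run for a day or six months, the string handed back to the LLM stays under the Tier 2 cap.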
| # | Strategy | Savings |
|---|---|---|
| 1 | Zero-LLM monitoring during training | 90%+ of runtime is free |
| 2 | Two-Tier memory with auto-compaction | Fixed context window |
| 3 | Leader conversation persists within cycle | Brief sent once per cycle |
| 4 | Anthropic prompt caching | System/tools cached |
| 5 | Per-agent minimal tool sets (3-5 tools) | Less schema overhead |
| 6 | Slim system prompts | Fewer input tokens |
| 7 | State trimmed before sending | No bloat |
| 8 | Single worker at a time | No parallel LLM costs |
Complete beginner? Follow every step below. You'll go from zero to a running experiment agent in ~10 minutes.
Prefer AI-guided setup? Open AI_GUIDE.md in Claude Code, ChatGPT, or Codex; the AI will walk you through everything interactively.
| Requirement | Why | How to Check |
|---|---|---|
| Python 3.10+ | Runtime | python3 --version |
| Claude Code | The AI backbone | claude --version |
| 1+ NVIDIA GPU | For training | nvidia-smi |
| Anthropic API key | LLM calls | echo $ANTHROPIC_API_KEY |
Don't have an API key? Get one at console.anthropic.com and set it:
export ANTHROPIC_API_KEY="sk-ant-xxxxx"
# Add to ~/.bashrc or ~/.zshrc to make it permanent

# Clone the repo
git clone https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7.git
cd auto-deep-researcher-24x7
# Install Python dependencies
pip install -r requirements.txt
# Install 8 slash commands into Claude Code
python install.py
# Verify everything works
python -m core.loop --check

You should see:
Deep Researcher Agent - Installer
========================================
✓ /auto-experiment
✓ /experiment-status
✓ /gpu-monitor
✓ /daily-papers
✓ /paper-analyze
✓ /conf-search
✓ /progress-report
✓ /obsidian-sync
Done! 8 skills installed.
Let's say you want to train a ResNet on CIFAR-100. Create a project folder with a PROJECT_BRIEF.md:
mkdir ~/my-first-experiment
cd ~/my-first-experiment

Now write the brief; this is the most important file. It tells the agent what you want:
cat > PROJECT_BRIEF.md << 'EOF'
# Goal
Train a ResNet-50 on CIFAR-100 to reach 80%+ test accuracy.
# Codebase
The agent should create the training code from scratch using PyTorch.
- Use torchvision for the dataset (auto-download)
- Save checkpoints to ./checkpoints/
- Log metrics to ./logs/
# What to Try
- Start with a basic ResNet-50, lr=0.1, SGD, 100 epochs
- If accuracy < 75%, try cosine annealing + warmup
- If accuracy 75-80%, try adding mixup or cutout augmentation
- If accuracy > 80%, the goal is reached
# Constraints
- Use GPU 0 only
- Max 100 epochs per run
- Batch size 128
# Current Status
No experiments run yet. Starting from scratch.
EOF

Tips for writing a good brief:
- Be specific about the goal (metric + target value)
- Tell it where the code/data is (or say "create from scratch")
- List constraints (which GPU, max epochs, etc.)
- Give it a decision tree ("if X, try Y"); this guides the agent like you would guide a junior student
Option A: Through Claude Code (recommended)
Open Claude Code and type:
/auto-experiment --project ~/my-first-experiment --gpu 0
Option B: Through Python directly
python -m core.loop \
--project ~/my-first-experiment \
--gpu 0 \
  --max-cycles 5   # Stop after 5 cycles (remove for unlimited)

The agent will now do everything automatically. Here's what each cycle looks like:
=== Cycle 1 ===
[THINK] Reading PROJECT_BRIEF.md...
Goal: ResNet-50 on CIFAR-100, target 80%+
No previous experiments. Starting with baseline.
Plan: Basic ResNet-50, lr=0.1, SGD with momentum, 100 epochs.
[EXECUTE] Creating train.py...
Creating config.yaml...
Running dry-run (2 steps)... ✓ No errors
Launching training: nohup python train.py --config config.yaml
PID: 12345, Log: logs/exp001.log
[MONITOR] Training in progress... (zero LLM cost)
15:00 - PID alive, GPU 98%, Epoch 12/100, loss=2.34
15:15 - PID alive, GPU 97%, Epoch 25/100, loss=1.87
15:30 - PID alive, GPU 98%, Epoch 38/100, loss=1.54
...
17:45 - PID alive, GPU 97%, Epoch 100/100, loss=0.82
18:00 - PID terminated. Training complete.
[REFLECT] Parsing logs... test accuracy = 76.3%
Result: 76.3% - below 80% target
Brief says: "If < 75%, try cosine annealing"
76.3% > 75%, so try augmentation instead.
Decision: Add mixup augmentation, keep lr=0.1 + cosine
Milestone logged: "Exp001: ResNet-50 baseline, 76.3%"
=== Cycle 2 ===
[THINK] Best so far: 76.3% (Exp001)
Plan: Add mixup (alpha=0.2) + cosine annealing schedule
...
While the agent is running, you can check on it:
# In Claude Code:
/experiment-status --project ~/my-first-experiment
# Or check GPU usage:
/gpu-monitor

You'll see something like:
# Experiment Status - my-first-experiment
## Goal
ResNet-50 on CIFAR-100 → 80%+ accuracy
## Progress
- Cycles completed: 3
- Current best: 79.1% (Exp003: ResNet-50 + mixup + cosine)
- Status: TRAINING (PID 12389, GPU 0, running 1.5h)
## Key Results
[04-07 15:00] Exp001: ResNet-50 baseline, 76.3%
[04-07 18:30] Exp002: + cosine annealing, 77.8%
[04-07 22:00] Exp003: + mixup α=0.2, 79.1% ← best
## Current Training
Epoch 67/100 | loss: 0.71 | acc: 79.4%
Enable progress export in your project config.yaml:
obsidian:
  enabled: true
  vault_path: "~/Documents/MyObsidianVault"  # Optional
  project_subdir: "DeepResearcher/{project_name}"
  auto_append_daily: true

If vault_path is set, the agent writes:
DeepResearcher/my-first-experiment/Dashboard.md
DeepResearcher/my-first-experiment/Daily/YYYY-MM-DD.md
If vault_path is empty, it falls back to project-local files:
workspace/progress_tracking/Dashboard.txt
workspace/progress_tracking/Daily/YYYY-MM-DD.txt
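The fallback rule can be sketched as a one-branch path helper. This is an illustrative sketch assuming the paths described above; the function name is an assumption, not the project's actual API:

```python
from pathlib import Path

def progress_root(vault_path: str, project: str, workspace: Path) -> Path:
    """Write into the Obsidian vault when one is configured; otherwise fall
    back to project-local text files under workspace/progress_tracking/."""
    if vault_path:  # non-empty string means a vault is configured
        return Path(vault_path).expanduser() / "DeepResearcher" / project
    return workspace / "progress_tracking"
```

Everything downstream (Dashboard, Daily notes) is written relative to the returned root, so the two modes share one code path.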
Manual refresh:
/obsidian-sync --project ~/my-first-experiment
# or
python -m core.obsidian --project ~/my-first-experiment

Want to change direction? Three ways, from anywhere:
# Way 1: Drop a directive file (agent reads it next cycle)
echo "Stop trying ResNet. Switch to ViT-B/16, start with lr=1e-3" \
> ~/my-first-experiment/workspace/HUMAN_DIRECTIVE.md
# Way 2: Command-line flag
python -m core.loop --project ~/my-first-experiment \
--directive "Try label smoothing 0.1"
# Way 3: Edit memory directly (for permanent changes)
vim ~/my-first-experiment/workspace/MEMORY_LOG.md

Use the agent as an operator, not a replacement researcher.
Human decides:
- goal
- constraints
- forbidden directions
- when to pivot
Agent executes:
- code edits
- runs
- monitoring
- summaries
Write stable rules in PROJECT_BRIEF.md, and temporary steering in HUMAN_DIRECTIVE.md.
# HUMAN_DIRECTIVE.md
- Do not change the dataset.
- Try label smoothing 0.1 before changing the backbone.
- Stop this direction if gain stays below 0.3 for 3 runs.
- Compare against the last trusted baseline, not just the latest run.

Case 1: Safer ablation
- Only change augmentation.
- Keep model, optimizer, and training budget fixed.
- Report a clean comparison table after each run.

Case 2: Deliberate pivot
- Current ResNet line is saturated.
- Switch to ViT-B/16 only if the last 3 runs plateau.
- Before switching, write a short rationale.

Case 3: Suspicious result
- Accuracy jumped unexpectedly.
- Re-run with the same seed and one new seed.
- Do not claim improvement until both runs reproduce.

Rule of thumb: let the agent handle repetition, but keep direction, interpretation, and responsibility human.
Step 7: Mobile Monitoring with Happy Coder (Optional)
Want to check experiments from your phone? Install Happy Coder (iOS / Android):
# Install CLI (one time)
npm install -g happy-coder
# Start session through Happy instead of claude
happy
# Inside the session, launch your experiment:
/auto-experiment --project ~/my-first-experiment --gpu 0

Now on your phone you can:
- Get push notifications when experiments finish or the agent needs input
- Check results while commuting
- Send directives ("try learning rate 1e-5") from anywhere
- Switch between phone and desktop seamlessly
- All communication is end-to-end encrypted
┌───────────┐    encrypted    ┌───────────┐
│  Desktop  │ ◀─────────────▶ │   Phone   │
│  Claude   │      relay      │   Happy   │
│   Code    │                 │   Coder   │
├───────────┤                 ├───────────┤
│  Agent    │ ─ push notify ▶ │ "Try      │
│  running  │                 │  lr=1e-5" │
│   24/7    │ ◀── status ──── │ ✓ Got it  │
└───────────┘                 └───────────┘
The brief is your main lever. Here are examples for different scenarios:
Example: Fine-tuning a pretrained model
# Goal
Fine-tune ViT-B/16 (pretrained on ImageNet-21K) on Oxford Flowers-102.
Target: 95%+ test accuracy.
# Codebase
- Training script: finetune.py (already exists)
- Config: configs/vit_flowers.yaml
- Data: /data/flowers102/ (already downloaded)
- Pretrained weights: /models/vit-b16-21k.pth
# What to Try
1. First: freeze backbone, train classifier head only (10 epochs, lr=1e-2)
2. Then: unfreeze all, fine-tune end-to-end (30 epochs, lr=1e-4)
3. If stuck below 93%: try layer-wise lr decay (0.65)
4. If above 94%: try test-time augmentation
# Constraints
- GPU 0, batch size 64
- Save best checkpoint based on val accuracy

Example: Hyperparameter search
# Goal
Find the best hyperparameters for our GAN on CelebA-HQ 256x256.
Target: FID < 15.
# Codebase
- train_gan.py, configs/celeba_gan.yaml
- Data: /data/celeba_hq_256/
- Evaluation: eval_fid.py --real_dir /data/celeba_hq_256/val
# Search Space
- Learning rate: [1e-4, 2e-4, 5e-4]
- Beta1: [0.0, 0.5]
- Discriminator steps per generator step: [1, 2, 5]
- Spectral norm: [yes, no]
# Strategy
Start with lr=2e-4, beta1=0.0, d_steps=1, spectral_norm=yes (baseline).
Change ONE variable at a time. Run each for 50K steps.
Always evaluate FID after training.
# Constraints
- GPU 0-1 (can use both)
- Max 50K steps per run (~4 hours)

Example: Debugging a training issue
# Goal
Figure out why our transformer model diverges after epoch 20.
Currently: loss explodes from 0.5 to NaN around epoch 20-25.
# Codebase
- train_transformer.py, model/transformer.py
- Config: configs/base.yaml
- Logs from failed runs: logs/failed_run_001.log, logs/failed_run_002.log
# What to Investigate
1. Check gradient norms โ add gradient clipping (max_norm=1.0)
2. Try lower learning rate (current: 1e-3, try: 1e-4, 5e-5)
3. Check if it's a specific layer โ add per-layer gradient logging
4. Try warmup (1000 steps) if not already present
5. Check data โ are there any NaN/Inf in the dataset?
# Constraints
- GPU 0, run each test for 30 epochs (enough to see if it diverges)
- Log gradient norms every 100 steps

Q: How much does it cost to run?
About $0.08 per 24-hour cycle (if training takes 8 hours). The secret: zero LLM calls during training. You only pay for the THINK and REFLECT phases (~10 min each).
Q: Can it modify my existing code?
Yes. The Code Agent can read, write, and modify any file in your project. It will make changes, dry-run to verify, then launch training. It won't touch protected files (PROJECT_BRIEF.md, MEMORY_LOG.md).
Q: What if the agent goes in a wrong direction?
Drop a directive: echo "Stop. Go back to the ResNet approach" > workspace/HUMAN_DIRECTIVE.md. The agent reads it next cycle with highest priority.
Q: Can I run multiple projects at the same time?
Yes. Launch separate agent instances in different terminals/tmux sessions, each pointing to a different project and GPU.
Q: What happens if training crashes?
The monitor detects the process died, captures the error log, and passes it to REFLECT. The agent will analyze the crash, fix the code, and retry.
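The crash path described above amounts to a small outcome triage before REFLECT. A sketch with invented error signatures (the real monitor's heuristics are not documented here):

```python
def classify_run(pid_alive: bool, log_tail: str) -> str:
    """Rough outcome triage handed to REFLECT: a dead PID plus an error
    signature in the log means crash-and-retry territory."""
    if pid_alive:
        return "running"
    # Illustrative signatures; a real monitor would use broader heuristics.
    error_signatures = ("Traceback", "CUDA out of memory", "NaN")
    if any(sig in log_tail for sig in error_signatures):
        return "crashed"
    return "finished"
```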
Q: Can I use it with PyTorch / TensorFlow / JAX?
Yes. The agent works with any training framework. It just launches shell commands and reads log files โ it doesn't care what framework produces them.
All features are packaged as Claude Code slash commands. One command to install:
python install.py

After installation, you get 8 slash commands in Claude Code:
| Command | What It Does |
|---|---|
| /auto-experiment | Launch the 24/7 autonomous THINK→EXECUTE→REFLECT experiment loop |
| /experiment-status | Check running experiments: progress, metrics, cycle count, GPU usage |
| /gpu-monitor | Quick GPU status: free/busy, memory, utilization, running processes |
| Command | What It Does |
|---|---|
| /daily-papers | Daily arXiv recommendations with automatic dedup |
| /paper-analyze 2312.12345 | Deep paper analysis + extract real figures from arXiv source |
| /conf-search --venue CVPR2025 --query "motion" | Search CVPR/NeurIPS/ICML/ICLR/AAAI/ECCV... |
| /progress-report | Generate structured progress report with metrics |
| /obsidian-sync | Refresh Obsidian or local progress notes |
# Step 1: Install skills (one time)
python install.py
# Step 2: In Claude Code, launch an experiment loop
/auto-experiment --project /path/to/my_project --gpu 0
# Step 3: Check how it's going
/experiment-status --project /path/to/my_project
# Step 4: Check GPU resources
/gpu-monitor
# Step 5: Read papers while the agent trains for you
/daily-papers --topics "vision transformer, image classification"

Uninstall at any time:

python install.py --uninstall

Works with both Anthropic and OpenAI out of the box. Pick your provider:
| Tier | Anthropic (Claude) | OpenAI (Codex/GPT) | Best For |
|---|---|---|---|
| Fast | claude-sonnet-4-6 | codex-5.3 | Daily experiments, iteration |
| Strongest | claude-opus-4-6 | gpt-5.4 | Complex reasoning, architecture decisions |
Switch provider in config.yaml:
agent:
  provider: "openai"     # or "anthropic"
  model: "codex-5.3"     # or "claude-sonnet-4-6"

Or set via environment:
# For Anthropic
export ANTHROPIC_API_KEY="sk-ant-xxxxx"
# For OpenAI
export OPENAI_API_KEY="sk-xxxxx"

# config.yaml
project:
  name: "my-research"
  brief: "PROJECT_BRIEF.md"

agent:
  provider: "anthropic"          # "anthropic" or "openai"
  model: "claude-sonnet-4-6"     # See model table above
  max_cycles: -1                 # -1 = run forever
  max_steps_per_cycle: 3         # Max worker dispatches per cycle
  cooldown_interval: 300         # Smart cooldown polling (seconds)

memory:
  brief_max_chars: 3000          # Tier 1 cap
  log_max_chars: 2000            # Tier 2 cap
  milestone_max_chars: 1200      # Key results cap
  max_recent_entries: 15         # Rolling decision count

gpu:
  auto_detect: true
  reserve_last: true             # Reserve last GPU for keep-alive

monitor:
  poll_interval: 900             # Check every 15 min during training
  zero_llm: true                 # No LLM during monitoring

experiment:
  mandatory_dry_run: true        # Always dry-run before real training
  max_parallel: 1                # Concurrent experiments

|  | Deep Researcher Agent | Claude Scholar | AI Scientist | OpenHands | SWE-Agent |
|---|---|---|---|---|---|
| Runs experiments autonomously | โ | โ | โ | โ | โ |
| Zero-cost training monitoring | โ | โ | โ | โ | โ |
| GPU management | โ | โ | โ | โ | โ |
| 24/7 continuous operation | โ | โ | โ | โ | โ |
| Constant-size memory | โ | โ | โ | โ | โ |
| Paper writing | Basic | โ | โ | โ | โ |
| Knowledge management | Basic | โ | โ | โ | โ |
| General coding | โ | โ | โ | โ | โ |
Deep Researcher Agent is the only framework built for running deep learning research, not just writing about it.
auto-deep-researcher-24x7/
├── core/                    # Autonomous experiment loop engine
│   ├── loop.py              # THINK → EXECUTE → REFLECT cycle
│   ├── memory.py            # Two-Tier constant-size memory
│   ├── monitor.py           # Zero-LLM experiment monitoring
│   ├── agents.py            # Leader-Worker agent dispatch
│   └── tools.py             # Minimal per-agent tool registry
├── skills/                  # Claude Code slash commands (python install.py)
│   ├── auto-experiment/     # 24/7 autonomous experiment loop
│   ├── experiment-status/   # Check experiment progress
│   ├── gpu-monitor/         # GPU status & availability
│   ├── daily-papers/        # Daily arXiv recommendations
│   ├── paper-analyze/       # Deep paper analysis + figure extraction
│   ├── conf-search/         # Conference paper search
│   └── progress-report/     # Progress report generation
├── agents/                  # Agent prompt definitions
│   ├── leader.md            # Central decision-maker
│   ├── idea_agent.md        # Literature & hypothesis
│   ├── code_agent.md        # Experiment execution
│   └── writing_agent.md     # Reporting & writing
├── gpu/                     # GPU utilities
│   ├── detect.py            # Detection & monitoring
│   └── keeper.py            # Cloud instance keep-alive
├── examples/                # Ready-to-run demos
├── docs/                    # Docs + translations (CN/JP)
├── install.py               # Claude Code skill installer
├── config.yaml              # Default configuration
└── requirements.txt         # Dependencies
Areas where we'd love help:
- More cloud GPU platforms (AWS, GCP, Lambda Labs, RunPod)
- Experiment tracker integration (W&B, MLflow, TensorBoard)
- New research skills (visualization, result comparison)
- Metric extraction for more training frameworks
See CONTRIBUTING.md.
If you find this work useful, please cite our paper:
@article{zhang2026autodeepresearcher,
title={Deep Researcher Agent: Autonomous Deep Learning Experiment Framework},
author={Zhang, Xiangyue},
journal={arXiv preprint arXiv:2604.05854},
year={2026},
url={https://arxiv.org/abs/2604.05854}
}

Or cite the software release:
@software{auto_deep_researcher_24x7,
title={Deep Researcher Agent: Autonomous Deep Learning Experiment Framework},
author={Xiangyue Zhang},
year={2026},
url={https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7}
}

Apache 2.0; see LICENSE.
"Experiments run through the night. Results arrive at dawn."

