Research Prototype · Hybrid AI Music Pipeline

AI-Music-Generation Hybrid Framework

A research-oriented implementation combining Transformer-based symbolic planning, diffusion-based audio synthesis, and preference alignment — demonstrating an end-to-end pipeline for generating MIDI and audio outputs with a modular deep learning architecture.

3 AI Layers · ~40M Parameters · 512 REMI+ Tokens · 4 Sample Outputs
01/ Architecture

The Hybrid Three-Layer Architecture

An implementation of the hybrid three-layer AI music generation framework proposed in the paper "AI in Music Generation". The three layers are described below.

Symbolic Planning
Hierarchical Transformer
src/models/symbolic_planner.py
Small config: ~15M params · 1–3h GPU
Full config: ~80M params · 48–72h 8×A100
Encoder: 4-layer structure encoder
Decoder: 6-layer detail decoder
Cross-attention between encoder and decoder. Outputs REMI+ tokenised multi-track MIDI with 512-token vocabulary. Conditions on text prompt, style, and emotion embeddings.
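To make the token stream concrete, here is a minimal sketch of how a handful of notes could be serialised into REMI-style events. NOTE_ON, BAR_CHANGE, VELOCITY, and TEMPO appear on this page; the POSITION and DURATION tokens, the 16th-note grid, and the velocity binning are assumptions made for illustration. The project's actual 512-token REMI+ layout is not shown here and may differ.

```python
from dataclasses import dataclass

@dataclass
class Note:
    start: int      # onset in 16th-note steps from the start of the piece
    duration: int   # length in 16th-note steps
    pitch: int      # MIDI pitch, 0-127
    velocity: int   # MIDI velocity, 1-127

STEPS_PER_BAR = 16  # assumption: 4/4 time on a 16th-note grid

def notes_to_events(notes, tempo_bpm):
    """Serialise notes into a flat list of REMI-style event strings."""
    events = [f"TEMPO_{tempo_bpm}"]
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, position = divmod(note.start, STEPS_PER_BAR)
        # Emit one BAR_CHANGE per bar boundary crossed, including empty bars.
        while current_bar < bar:
            events.append("BAR_CHANGE")
            current_bar += 1
        events += [
            f"POSITION_{position}",            # hypothetical token family
            f"NOTE_ON_{note.pitch}",
            f"VELOCITY_{note.velocity // 4}",  # 32 velocity bins (assumption)
            f"DURATION_{note.duration}",       # hypothetical token family
        ]
    return events

def build_vocab(event_sequences):
    """Map each distinct event string to an integer id, in sorted order."""
    tokens = sorted({e for seq in event_sequences for e in seq})
    return {tok: i for i, tok in enumerate(tokens)}
```

Calling `notes_to_events([Note(0, 4, 60, 100)], 120)` yields a sequence that starts with a tempo token, then a bar marker, then the four per-note tokens. A planner would be trained on the integer ids that `build_vocab` assigns.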
Audio Rendering
Mel-Spectrogram Diffusion U-Net
src/models/audio_renderer.py
Parameters: ~25M
Sampling: DDIM · 50-step
Training: DDPM
Vocoder: Griffin-Lim
Cross-attention conditioned on symbolic tokens. Mel-spectrogram → WAV via Griffin-Lim algorithm. Requires FluidSynth-rendered audio pairs from Stage 1.
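The 50-step DDIM sampling mentioned above can be sketched in a few lines. This is an illustration of the standard deterministic DDIM update (eta = 0), not the code in audio_renderer.py. The 1000-step training schedule and the linear beta range are assumptions borrowed from common DDPM defaults, and the noise prediction `eps_pred` stands in for the output of the U-Net.

```python
import math

def linear_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddim_timesteps(num_train_steps=1000, num_sample_steps=50):
    """Evenly spaced subset of the training timesteps, highest first."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0), applied elementwise."""
    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the predicted noise.
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
        # Re-noise that estimate to the previous (less noisy) timestep.
        x_prev.append(math.sqrt(alpha_bar_prev) * x0_pred
                      + math.sqrt(1.0 - alpha_bar_prev) * eps)
    return x_prev
```

A sampler would loop over `ddim_timesteps()`, query the network for `eps_pred` at each step, and pass the final mel-spectrogram to the vocoder. Skipping timesteps this way is what lets a model trained with the full DDPM schedule sample in 50 steps.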
Alignment (DPO)
Direct Preference Optimisation
src/models/alignment.py
KL Penalty (β): 0.1
Preferences: Heuristic + human
Policy vs reference model fine-tuning with synthetic heuristic preference pairs. Supports human-labelled data if available. Targets the symbolic planner output.
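For reference, the standard DPO objective for a single preference pair can be written as below. This is a minimal sketch of the published loss rather than the contents of alignment.py. The function and argument names are hypothetical, and the inputs are assumed to be summed token log-probabilities of each sequence under the policy and the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy still equals the reference model, the margin is zero and the loss is log 2. The loss falls as the policy raises the likelihood of the preferred sequence relative to the rejected one, with β controlling how strongly departures from the reference are weighted.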
Signal Flow
Text Prompt + Style + Emotion
→ Symbolic Planning (REMI+ 512-token vocabulary; tokens such as NOTE_ON, BAR_CHANGE, VELOCITY, TEMPO)
→ Audio Rendering (Mel-spectrogram diffusion): Mel-Spectrogram → Griffin-Lim → WAV
→ DPO Alignment (β = 0.1 KL penalty)
→ MIDI Output + WAV Output
Architecture Table
Layer | Component | File
Symbolic | Hier. Transformer | symbolic_planner.py
Rendering | Diffusion U-Net | audio_renderer.py
Alignment | DPO | alignment.py
02/ Pipeline

End-to-End Training Pipeline

Five stages from raw data to generated music. Small configs for single-GPU prototyping; full configs scale to 8×A100.

01
Data Preparation
Downloads POP909, Lakh MIDI, MAESTRO v3. Tokenises with REMI+ vocabulary (512 tokens). Renders MIDI to WAV via FluidSynth. Generates synthetic preference pairs.
# Download + preprocess all datasets
bash 01_prepare_data.sh
02
Symbolic Model Training
Trains the Hierarchical Music Transformer from random initialisation. Small config: ~15M params, 1–3h single GPU. Full config: ~80M params, 48–72h 8×A100.
python 02_train_symbolic.py --config config/symbolic_small.yaml
# Full: torchrun --nproc_per_node=8 02_train_symbolic.py --config config/symbolic_full.yaml
03
Audio Renderer Training
Trains the Mel-Spectrogram Diffusion U-Net conditioned on symbolic tokens. Requires FluidSynth-rendered audio pairs from Stage 1. ~25M parameters.
python 03_train_renderer.py --config config/renderer_small.yaml
04
DPO Preference Alignment
Fine-tunes the symbolic model with Direct Preference Optimisation using heuristic-scored synthetic preference pairs (or human-labelled data), with a KL penalty of β = 0.1.
python 04_preference_alignment.py \
    --symbolic_ckpt checkpoints/symbolic/latest.pt \
    --pref_data   data/subsets/preference_data.json
05
Song Generation
Generates MIDI + WAV from a text prompt, style, emotion, and tempo. Supports nucleus sampling, temperature control, and an optional bypass of the neural renderer (--skip_neural_render).
python 05_generate_song.py \
    --prompt   "epic orchestral battle theme with brass and percussion" \
    --style    CLASSICAL --emotion TENSE --tempo 140
# --temperature 0.8 --top_p 0.85 --skip_neural_render
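The --temperature and --top_p flags correspond to temperature-scaled nucleus sampling. The sketch below shows the general technique on a plain list of logits. It is not taken from 05_generate_song.py; the function name and the `rng` parameter are hypothetical, and the default values simply mirror the example flags above.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.85, rng=random):
    """Sample a token index with temperature scaling and nucleus filtering."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus in proportion to the retained probabilities.
    threshold = rng.random() * mass
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if running >= threshold:
            return i
    return nucleus[-1]
```

Lower temperatures sharpen the distribution before filtering, and a smaller `top_p` shrinks the candidate set. Together they trade variety for more predictable output.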
03/ Live Demo

Live Prototype Demo

⚠ Prototype scale

Runs the project's hybrid prototype pipeline end-to-end — Symbolic Planning → Audio Rendering → DPO Alignment — generating a brand-new audio clip from your prompt, entirely in the browser. No server required.

05/ Datasets

Training Datasets

Four curated datasets spanning pop, classical, diverse genres, and evaluation metadata.

Dataset | Size | Use
POP909 | 909 MIDI files | Pop symbolic training
Lakh MIDI (clean) | ~17k files | Diverse genre symbolic
MAESTRO v3 | ~200h piano | Classical + audio pairs
MusicCaps | Metadata only | Text-to-music evaluation
06/ Transparency

Technical Depth & Limitations

📋 Prototype Disclosure — Verbatim from README

Note: The provided outputs represent sample results generated using a prototype implementation. Due to computational constraints, full-scale training was not performed, and the outputs are included to demonstrate the system pipeline and generation capability.

🖥️ Compute Requirements
Full-scale: 8× NVIDIA A100/A800 (80GB each), ~500GB storage, 48–96h training. The prototype uses single-GPU small configs.
🎵 Output Quality
Small-model outputs demonstrate that the pipeline functions; they are not commercial quality. Audio is reconstructed with the Griffin-Lim vocoder. Use --skip_neural_render for the FluidSynth fallback.
🔒 License
This project is now released under a proprietary license. All rights reserved. See LICENSE in the repository.
🌐 Live Demo Note
The live demo runs the small-config prototype pipeline in-browser to generate new audio. Output reflects prototype scale, not the full trained model weights. See Quick Start to run locally.
07/ Quick Start

Get Started

Two commands take you from a fresh clone to generated music. The first run takes ~5–10 minutes for environment setup.

Step 1 — Setup (~5–10 min)
# Run setup (installs everything)
bash setup.sh
Step 2 — Full Pipeline
# Run full pipeline (data + train + generate)
bash run_full_pipeline.sh
Custom Song Generation
conda activate ai-music-gen

python 05_generate_song.py \
    --prompt   "epic orchestral battle theme with brass and percussion" \
    --style    CLASSICAL \
    --emotion  TENSE \
    --tempo    140 \
    --bars_per_section 16 \
    --output_name my_custom_song
# Styles: POP | CLASSICAL | JAZZ | FOLK | ELECTRONIC | AMBIENT
# Emotions: HAPPY | SAD | TENSE | PEACEFUL
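The flags and allowed values listed above could be validated with an argparse definition along these lines. This is a hypothetical reconstruction from the commands shown on this page, not the parser in 05_generate_song.py. Which flags are required and every default value are assumptions.

```python
import argparse

STYLES = ["POP", "CLASSICAL", "JAZZ", "FOLK", "ELECTRONIC", "AMBIENT"]
EMOTIONS = ["HAPPY", "SAD", "TENSE", "PEACEFUL"]

def build_parser():
    """Build an illustrative parser mirroring the flags shown on this page."""
    parser = argparse.ArgumentParser(description="Generate a song (illustrative sketch)")
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--style", choices=STYLES, required=True)
    parser.add_argument("--emotion", choices=EMOTIONS, required=True)
    parser.add_argument("--tempo", type=int, required=True)
    # All defaults below are placeholders, not the project's documented values.
    parser.add_argument("--bars_per_section", type=int, default=16)
    parser.add_argument("--output_name", default="song")
    parser.add_argument("--temperature", type=float, default=0.8)
    parser.add_argument("--top_p", type=float, default=0.85)
    parser.add_argument("--skip_neural_render", action="store_true")
    return parser
```

With `choices` set, an unsupported value such as `--style ROCK` is rejected at parse time with a usage message, before any model is loaded.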
Full-Scale 8×A100
# Multi-GPU symbolic training
torchrun --nproc_per_node=8 02_train_symbolic.py --config config/symbolic_full.yaml

# Monitor via TensorBoard
tensorboard --logdir runs/