Research Prototype · Hybrid AI Music Pipeline

AI-Music-Generation Hybrid Framework

A research-oriented implementation combining Transformer-based symbolic planning, diffusion-based audio synthesis, and preference alignment — demonstrating an end-to-end pipeline for generating MIDI and audio outputs with a modular deep learning architecture.


~40M Parameters · 512 REMI+ Tokens (e.g. NOTE_ON, BAR_CHANGE, VELOCITY)

01 / Architecture

The Hybrid Three-Layer Architecture

An implementation of the hybrid three-layer AI music generation framework, following the design proposed in the paper "AI in Music Generation".

Symbolic Planning

Hierarchical Transformer

src/models/symbolic_planner.py

Small config: ~15M params · 1–3h GPU

Full config: ~80M params · 48–72h 8×A100

Encoder: 4-layer structure encoder

Decoder: 6-layer detail decoder

The detail decoder cross-attends to the structure encoder's output. The model emits REMI+-tokenised multi-track MIDI from a 512-token vocabulary and is conditioned on text prompt, style, and emotion embeddings.
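As a rough illustration of the encoder-decoder cross-attention described above, the sketch below computes single-head scaled dot-product attention in plain Python. The function names and toy vectors are hypothetical, and learned projections, multiple heads, and masking are left out for brevity; none of this is taken from src/models/symbolic_planner.py.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: decoder states, shape [n_q][d]
    keys, values: encoder states, shapes [n_k][d] and [n_k][d_v]
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this decoder state to every encoder state.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the encoder values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy data: three encoder states, one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0]]
attended = cross_attention(decoder_states, encoder_states, encoder_states)
print(attended)
```

Each output row is a convex combination of the encoder values, so the decoder state pulls most strongly from the encoder positions it is most similar to.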

Audio Rendering

Mel-Spectrogram Diffusion U-Net

src/models/audio_renderer.py

Parameters: ~25M

Sampling: DDIM · 50-step

Training: DDPM

Vocoder: Griffin-Lim

The U-Net is conditioned on the symbolic tokens through cross-attention. Generated mel-spectrograms are converted to WAV with the Griffin-Lim algorithm. Training requires the FluidSynth-rendered audio pairs from Stage 1.
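The 50-step DDIM sampling listed above can be pictured with the following sketch. It assumes 1000 DDPM training steps and a linear beta schedule from 1e-4 to 0.02, neither of which is stated in the source, and it substitutes the true noise for the U-Net's prediction so that the example runs without a model. All function names are illustrative.

```python
import math

T_TRAIN = 1000   # assumed number of DDPM training steps (not stated in the source)
N_SAMPLE = 50    # DDIM sampling steps, as listed above

def alpha_bars(t_train, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (an assumption)."""
    bars, prod = [], 1.0
    for t in range(t_train):
        beta = beta_start + (beta_end - beta_start) * t / (t_train - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def ddim_timesteps(t_train, n_sample):
    """Evenly spaced subsequence of training timesteps, highest first."""
    stride = t_train // n_sample
    return list(range(t_train - 1, -1, -stride))[:n_sample]

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic DDIM update (eta = 0) for a single scalar value."""
    x0_pred = (x_t - math.sqrt(1.0 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps_pred

bars = alpha_bars(T_TRAIN)
steps = ddim_timesteps(T_TRAIN, N_SAMPLE)

# Noise a known value, then denoise it with an oracle noise prediction.
x0_true, eps_true = 0.5, -1.2
x = math.sqrt(bars[steps[0]]) * x0_true + math.sqrt(1.0 - bars[steps[0]]) * eps_true
for i, t in enumerate(steps):
    ab_prev = bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
    x = ddim_step(x, eps_true, bars[t], ab_prev)  # oracle noise stands in for the U-Net

print(len(steps), x)
```

With a perfect noise prediction the loop recovers the original value, which is a quick sanity check that the update rule and the timestep subsequence are consistent.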

Alignment (DPO)

Direct Preference Optimisation

src/models/alignment.py

KL Penalty (β): 0.1

Preferences: Heuristic + human

Fine-tunes a policy model against a reference model using synthetic, heuristic-scored preference pairs. Human-labelled pairs can be used if available. Alignment targets the symbolic planner's output.
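For reference, the standard DPO objective for a single preference pair can be written in a few lines. This is a minimal sketch of the published loss, not the code in src/models/alignment.py; the function name and the example log-probabilities are made up for illustration, and only β = 0.1 comes from the text above.

```python
import math

BETA = 0.1  # value listed above

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) for one pair.

    Inputs are sequence log-probabilities of the chosen (w) and rejected (l)
    samples under the policy and under the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy prefers the chosen sample more than the reference does: lower loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(baseline, improved)
```

A larger β penalises drift from the reference model more sharply, because the same log-probability margin is scaled up before the sigmoid.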

Signal Flow

Text Prompt · Style · Emotion
→ Symbolic Planning (REMI+ 512-token vocabulary): NOTE_ON, BAR_CHANGE, VELOCITY, TEMPO
→ Audio Rendering (Mel-spectrogram diffusion): Mel-Spectrogram → Griffin-Lim → WAV
→ DPO Alignment (β = 0.1 KL penalty)
→ MIDI Output · WAV Output

Architecture Table

Layer	Component	File
Symbolic	Hier. Transformer	symbolic_planner.py
Rendering	Diffusion U-Net	audio_renderer.py
Alignment	DPO	alignment.py

02 / Pipeline

End-to-End Training Pipeline

Five stages from raw data to generated music. Small configs for single-GPU prototyping; full configs scale to 8×A100.

Data Preparation

Downloads POP909, Lakh MIDI, MAESTRO v3. Tokenises with REMI+ vocabulary (512 tokens). Renders MIDI to WAV via FluidSynth. Generates synthetic preference pairs.

# Download + preprocess all datasets
bash 01_prepare_data.sh
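To give a feel for the tokenisation step, here is a toy REMI-style encoder. The token names NOTE_ON, BAR_CHANGE, VELOCITY, and TEMPO appear on this page; the POSITION token, the velocity binning, the 4/4 grid, and the function itself are assumptions made for illustration and may differ from the project's actual 512-token REMI+ vocabulary.

```python
from typing import List, Tuple

Note = Tuple[float, int, int]  # (start in beats, MIDI pitch, MIDI velocity)

def tokenise(notes: List[Note], tempo_bpm: int,
             beats_per_bar: int = 4, steps_per_beat: int = 4) -> List[str]:
    """Convert notes to a flat REMI-style token list (hypothetical layout)."""
    tokens = [f"TEMPO_{tempo_bpm}"]
    current_bar = -1
    steps_per_bar = beats_per_bar * steps_per_beat
    for start, pitch, velocity in sorted(notes):
        bar = int(start // beats_per_bar)
        # Emit one BAR_CHANGE per bar boundary crossed, including empty bars.
        while current_bar < bar:
            tokens.append("BAR_CHANGE")
            current_bar += 1
        # Quantise the onset to a grid position inside the bar.
        position = int(round((start - bar * beats_per_bar) * steps_per_beat))
        position = min(position, steps_per_bar - 1)
        tokens += [
            f"POSITION_{position}",
            f"NOTE_ON_{pitch}",
            f"VELOCITY_{velocity // 8}",  # 16 velocity bins (an assumption)
        ]
    return tokens

notes = [(0.0, 60, 100), (1.0, 64, 90), (4.5, 67, 80)]
tokens = tokenise(notes, tempo_bpm=120)
print(tokens)
```

Bar and position tokens make the metrical grid explicit, which is the main difference from a plain MIDI event stream.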

Symbolic Model Training

Trains the Hierarchical Music Transformer from random initialisation. Small config: ~15M params, 1–3h single GPU. Full config: ~80M params, 48–72h 8×A100.

python 02_train_symbolic.py --config config/symbolic_small.yaml

# Full: torchrun --nproc_per_node=8 02_train_symbolic.py --config config/symbolic_full.yaml

Audio Renderer Training

Trains the Mel-Spectrogram Diffusion U-Net conditioned on symbolic tokens. Requires FluidSynth-rendered audio pairs from Stage 1. ~25M parameters.

python 03_train_renderer.py --config config/renderer_small.yaml
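The DDPM training objective used for this kind of renderer can be sketched as follows: noise a clean mel frame, then score a noise prediction with mean squared error. The helper names, the four-value frame, and the chosen noise level are hypothetical, and a real run would use the U-Net's prediction in place of the stand-ins shown here.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Forward-noise x0: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * v + math.sqrt(1.0 - alpha_bar_t) * e
           for v, e in zip(x0, eps)]
    return x_t, eps

def eps_mse(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
mel_frame = [0.2, -0.5, 0.9, 0.0]   # toy stand-in for one mel-spectrogram frame
alpha_bar = 0.5                      # hypothetical noise level for one timestep
x_t, eps = q_sample(mel_frame, alpha_bar, rng)

loss_untrained = eps_mse([0.0] * len(eps), eps)  # a model that predicts no noise
loss_perfect = eps_mse(eps, eps)                 # an oracle prediction
print(loss_untrained, loss_perfect)
```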

DPO Preference Alignment

Fine-tunes the symbolic model with Direct Preference Optimisation using heuristic-scored synthetic preference pairs (or human-labelled data), with β = 0.1 as the KL penalty.

python 04_preference_alignment.py \
    --symbolic_ckpt checkpoints/symbolic/latest.pt \
    --pref_data data/subsets/preference_data.json
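One way heuristic-scored preference pairs might be built is sketched below. The scoring rule (share of notes in C major, minus a penalty for leaps larger than an octave), the margin threshold, and the "chosen"/"rejected" record layout are all assumptions for illustration; the source does not describe the project's heuristic or the schema of preference_data.json.

```python
import json
from itertools import combinations

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of the C major scale

def heuristic_score(pitches):
    """Toy score: share of in-scale notes minus 0.1 per leap over an octave."""
    if not pitches:
        return 0.0
    in_scale = sum(p % 12 in C_MAJOR for p in pitches) / len(pitches)
    leaps = sum(abs(b - a) > 12 for a, b in zip(pitches, pitches[1:]))
    return in_scale - 0.1 * leaps

def build_pairs(candidates, min_margin=0.05):
    """Pair up candidates whose scores differ by at least min_margin."""
    pairs = []
    for a, b in combinations(candidates, 2):
        score_a, score_b = heuristic_score(a), heuristic_score(b)
        if abs(score_a - score_b) < min_margin:
            continue  # too close to call; skip ambiguous pairs
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

candidates = [
    [60, 62, 64, 65],   # stepwise, in scale
    [60, 61, 63, 66],   # mostly out of scale
    [60, 76, 59, 72],   # in scale but with large leaps
]
pairs = build_pairs(candidates)
print(json.dumps(pairs, indent=2))
```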

Song Generation

Generates MIDI + WAV from text prompt, style, emotion, and tempo. Supports nucleus sampling, temperature control, and an optional bypass of the neural renderer.

python 05_generate_song.py \
    --prompt "epic orchestral battle theme with brass and percussion" \
    --style CLASSICAL --emotion TENSE --tempo 140

# --temperature 0.8  --top_p 0.85  --skip_neural_render
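Temperature scaling and nucleus (top-p) sampling can be sketched as follows, using the 0.8 and 0.85 values from the command above as defaults. The function name and the toy logits are hypothetical, and the project's own sampler may differ in details such as tie-breaking.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.85, rng=random):
    """Sample a token index with temperature scaling and nucleus filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the nucleus in proportion to the retained probabilities.
    r = rng.random() * cumulative
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]

rng = random.Random(0)
peaked = [sample_top_p([5.0, 1.0, 0.0, -2.0], rng=rng) for _ in range(100)]
flat = [sample_top_p([0.0, 0.0, 0.0, 0.0], top_p=0.5, rng=rng) for _ in range(100)]
print(set(peaked), set(flat))
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens are needed to reach the top-p mass and sampling becomes more conservative.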

03 / Live Demo

Live Prototype Demo

⚠ Prototype scale

Runs the project's hybrid prototype pipeline end-to-end — Symbolic Planning → Audio Rendering → DPO Alignment — generating a brand-new audio clip from your prompt, entirely in the browser. No server required.


04 / Sample Outputs

Repository Audio Gallery

Four prototype-scale outputs from ./outputs/ demonstrating cross-genre generation. Each is available as a .wav file with its MIDI.

POP · HAPPY

Song 01 — Pop Happy

"Upbeat pop with bright piano melody, catchy hooks, and light percussion. Energetic and cheerful."

WAV MIDI

CLASSICAL · SAD

Song 02 — Classical Sad

"Melancholic solo piano piece with slow tempo, minor key, and expressive dynamics."

WAV MIDI

AMBIENT · PEACEFUL

Song 03 — Ambient Peaceful

"Soft ambient textures with gentle pad layers, slow evolving harmonics, no percussion."

WAV MIDI

JAZZ · TENSE

Song 04 — Jazz Tense

"Complex jazz with dissonant chords, syncopated rhythm, muted trumpet, and walking bass."

WAV MIDI

05 / Datasets

Training Datasets

Four curated datasets spanning pop, classical, diverse genres, and evaluation metadata.

Dataset	Size	Use
POP909	909 MIDI files	Pop symbolic training
Lakh MIDI (clean)	~17k files	Diverse genre symbolic
MAESTRO v3	~200h piano	Classical + audio pairs
MusicCaps	Metadata only	Text-to-music evaluation

06 / Transparency

Technical Depth & Limitations

📋 Prototype Disclosure — Verbatim from README

Note: The provided outputs represent sample results generated using a prototype implementation. Due to computational constraints, full-scale training was not performed, and the outputs are included to demonstrate the system pipeline and generation capability.

🖥️

Compute Requirements

Full-scale: 8× NVIDIA A100/A800 (80GB each), ~500GB storage, 48–96h training. Prototype uses single-GPU small configs.
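As a quick worked figure, the full-scale requirement above translates into a GPU-hour budget as follows. The calculation uses only the 8-GPU count and the 48–96h range quoted here; the variable names are illustrative.

```python
GPUS = 8                         # 8x A100/A800, as stated above
HOURS_LOW, HOURS_HIGH = 48, 96   # wall-clock training range, as stated above

gpu_hours = (GPUS * HOURS_LOW, GPUS * HOURS_HIGH)
print(f"{gpu_hours[0]}-{gpu_hours[1]} GPU-hours")
```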

🎵

Output Quality

Small-model outputs demonstrate that the pipeline works, not commercial quality. Audio is vocoded with Griffin-Lim. Pass --skip_neural_render to fall back to FluidSynth rendering.

🔒

License

🌐

Live Demo Note

Live demo runs the small-config prototype pipeline in-browser to generate new audio. Output reflects prototype scale — not the full trained model weights. See Quick Start to run locally.

07 / Quick Start

Get Started

Two commands from clone to generated music. First run ~5–10 minutes for environment setup.

Step 1 — Setup (~5–10 min)

# Run setup (installs everything)
bash setup.sh

Step 2 — Full Pipeline

# Run full pipeline (data + train + generate)
bash run_full_pipeline.sh

Custom Song Generation

conda activate ai-music-gen

python 05_generate_song.py \
    --prompt "epic orchestral battle theme with brass and percussion" \
    --style CLASSICAL \
    --emotion TENSE \
    --tempo 140 \
    --bars_per_section 16 \
    --output_name my_custom_song

# Styles: POP | CLASSICAL | JAZZ | FOLK | ELECTRONIC | AMBIENT
# Emotions: HAPPY | SAD | TENSE | PEACEFUL
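When scripting several generations, it can help to validate the style and emotion values before launching the script. The helper below is a hypothetical wrapper, not part of the repository; it only assembles the argument list using the flags and allowed values shown above, and leaves execution (for example via subprocess.run) to the caller.

```python
import shlex

STYLES = {"POP", "CLASSICAL", "JAZZ", "FOLK", "ELECTRONIC", "AMBIENT"}
EMOTIONS = {"HAPPY", "SAD", "TENSE", "PEACEFUL"}

def build_generate_command(prompt, style, emotion, tempo,
                           bars_per_section=16, output_name=None):
    """Return the argv list for 05_generate_song.py after basic validation."""
    style, emotion = style.upper(), emotion.upper()
    if style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    if tempo <= 0:
        raise ValueError("tempo must be positive")
    cmd = [
        "python", "05_generate_song.py",
        "--prompt", prompt,
        "--style", style,
        "--emotion", emotion,
        "--tempo", str(tempo),
        "--bars_per_section", str(bars_per_section),
    ]
    if output_name:
        cmd += ["--output_name", output_name]
    return cmd

cmd = build_generate_command(
    "epic orchestral battle theme with brass and percussion",
    style="classical", emotion="tense", tempo=140,
    output_name="my_custom_song",
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```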

Full-Scale 8×A100

# Multi-GPU symbolic training
torchrun --nproc_per_node=8 02_train_symbolic.py --config config/symbolic_full.yaml

# Monitor via TensorBoard
tensorboard --logdir runs/

