Audio Denoiser: Speech-Noise Separation with U-Net

A deep learning audio denoiser that applies a 2D U-Net to spectrograms to separate speech from ambient noise. It predicts an Ideal Ratio Mask, is trained with MSE loss, and reconstructs clean audio through the inverse STFT.

The Problem

Real-world speech recordings are frequently corrupted by ambient noise (hospital sounds, traffic, city noise), yet existing denoising solutions often require complex setups or produce distorted output. Few lightweight, deep learning-based tools are designed specifically to remove environmental noise from single-speaker audio while preserving natural speech quality.

Our Approach

We built a U-Net-based audio denoiser that operates in the time-frequency domain using the STFT (Short-Time Fourier Transform). The model takes log-scaled magnitude spectrograms as input and predicts an Ideal Ratio Mask (IRM). The predicted mask is multiplied with the noisy spectrogram, and the masked result is converted back to a waveform via inverse STFT. Dynamic data augmentation mixes clips from the LJ Speech dataset (13,100 clips) with hospital and ambient noise samples at runtime, and training minimizes MSE loss between the predicted mask and the IRM.
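The masking pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual code: the FFT size, hop length, Hann window, the square-root form of the IRM, and the log1p input scaling are all assumptions, and the function names are hypothetical. The oracle mask computed here from separate clean and noise signals stands in for the U-Net's prediction.

```python
import torch

# Assumed STFT settings; the writeup does not state the actual values.
N_FFT = 512
HOP = 128
EPS = 1e-8


def stft(wave: torch.Tensor) -> torch.Tensor:
    """Complex STFT of a 1-D waveform, shape (freq_bins, frames)."""
    window = torch.hann_window(N_FFT)
    return torch.stft(wave, n_fft=N_FFT, hop_length=HOP,
                      window=window, return_complex=True)


def model_input(noisy: torch.Tensor) -> torch.Tensor:
    """Log-scaled magnitude spectrogram (log1p is one possible scaling)."""
    return torch.log1p(stft(noisy).abs())


def ideal_ratio_mask(clean: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """One common IRM definition: sqrt(|S|^2 / (|S|^2 + |N|^2)), in [0, 1]."""
    s_pow = stft(clean).abs() ** 2
    n_pow = stft(noise).abs() ** 2
    return torch.sqrt(s_pow / (s_pow + n_pow + EPS))


def apply_mask(noisy: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Scale the noisy STFT by the mask and invert, reusing the noisy phase."""
    masked = stft(noisy) * mask
    window = torch.hann_window(N_FFT)
    return torch.istft(masked, n_fft=N_FFT, hop_length=HOP,
                       window=window, length=noisy.shape[-1])
```

During training, the target for the MSE loss would be `ideal_ratio_mask(clean, noise)` and the network input would be `model_input(clean + noise)`. At inference time only the noisy waveform is available, so the mask passed to `apply_mask` comes from the model.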

Results

U-Net architecture with skip connections preserves high-frequency speech details

Ideal Ratio Mask (IRM) prediction enables clean speech extraction from noisy spectrograms

Dynamic noise mixing at runtime eliminates the need for pre-paired clean/noisy datasets

16 kHz resampling and mono conversion standardize all inputs automatically

Emergency checkpointing saves progress on KeyboardInterrupt for long training runs

Training with MSE loss against the IRM target yields effective speech-noise separation
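Two of the points above, runtime noise mixing and checkpointing on KeyboardInterrupt, might look roughly like the sketch below. It is written under assumptions: mixing at a chosen signal-to-noise ratio is a common technique but the writeup does not say how mixing levels are set, and `mix_at_snr`, `train`, the checkpoint layout, and `ckpt_path` are hypothetical names rather than the project's API.

```python
import torch


def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float):
    """Mix 1-D clean speech with a random noise segment at the given SNR (dB).

    Returns (noisy, scaled_noise) so an IRM target can be computed on the fly.
    """
    n = clean.numel()
    if noise.numel() < n:
        # Loop short noise clips until they cover the speech clip.
        reps = -(-n // noise.numel())  # ceiling division
        noise = noise.repeat(reps)
    # Pick a random crop so each epoch sees a different noise segment.
    start = int(torch.randint(0, noise.numel() - n + 1, (1,)))
    noise = noise[start:start + n]

    clean_pow = clean.pow(2).mean()
    noise_pow = noise.pow(2).mean().clamp_min(1e-10)
    # Scale noise so that clean_pow / scaled_noise_pow == 10^(snr_db / 10).
    scale = torch.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    scaled_noise = noise * scale
    return clean + scaled_noise, scaled_noise


def train(model, optimizer, batches, ckpt_path: str) -> int:
    """Run MSE training; save an emergency checkpoint if interrupted."""
    step = 0
    try:
        for features, target_mask in batches:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(features), target_mask)
            loss.backward()
            optimizer.step()
            step += 1
    except KeyboardInterrupt:
        # Persist progress before re-raising so the run can be resumed.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, ckpt_path)
        raise
    return step
```

Because `mix_at_snr` returns the scaled noise alongside the mixture, a dataset can build each (noisy input, mask target) pair at load time from unpaired speech and noise files, which is what removes the need for a pre-paired corpus.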

Tech Stack

Python, PyTorch, Torchaudio, Datasets, KaggleHub, STFT/iSTFT
