Deep learning audio denoiser that applies a 2D U-Net to spectrograms to separate speech from ambient noise. The network predicts an Ideal Ratio Mask (IRM), trained with MSE loss, and clean audio is reconstructed through the inverse STFT.
Real-world speech recordings are frequently corrupted by ambient noise (hospital sounds, traffic, city noise), yet many existing denoising solutions require complex setups or produce distorted output. We did not find a lightweight, deep learning-based tool designed specifically to remove environmental noise from single-speaker audio while preserving natural speech quality.
We built a U-Net-based audio denoiser that operates in the time-frequency domain using the Short-Time Fourier Transform (STFT). The model predicts an Ideal Ratio Mask (IRM) from log-scaled magnitude spectrograms; the mask is multiplied with the noisy spectrogram, and the result is converted back to a waveform via the inverse STFT. Dynamic data augmentation mixes clean speech from the LJ Speech dataset (13,100 clips) with hospital and ambient noise samples at runtime, and an MSE loss optimizes the mask prediction.
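The mask-and-reconstruct step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it uses SciPy's `stft`/`istft`, a 512-sample frame, `log1p` for log scaling, the square-root form of the IRM, and the noisy signal's phase for reconstruction, none of which the description specifies. An oracle mask computed from known clean and noise signals stands in for the U-Net's prediction.

```python
import numpy as np
from scipy.signal import stft, istft

SR = 16_000   # sample rate used by the project
N_FFT = 512   # assumed frame length; not stated in the description
EPS = 1e-8    # avoids division by zero in silent bins


def log_magnitude(x):
    """Log-scaled magnitude spectrogram (the kind of input the model sees)."""
    _, _, Z = stft(x, fs=SR, nperseg=N_FFT)
    return np.log1p(np.abs(Z))  # log1p is an assumed scaling


def ideal_ratio_mask(clean, noise):
    """One common IRM definition: sqrt(|S|^2 / (|S|^2 + |N|^2)), in [0, 1]."""
    _, _, S = stft(clean, fs=SR, nperseg=N_FFT)
    _, _, N = stft(noise, fs=SR, nperseg=N_FFT)
    s2, n2 = np.abs(S) ** 2, np.abs(N) ** 2
    return np.sqrt(s2 / (s2 + n2 + EPS))


def apply_mask(noisy, mask):
    """Scale the noisy STFT by the mask (keeping the noisy phase) and invert."""
    _, _, Z = stft(noisy, fs=SR, nperseg=N_FFT)
    _, y = istft(Z * mask, fs=SR, nperseg=N_FFT)
    return y[: len(noisy)]  # istft output may be padded past the input length


def mse(a, b):
    """Mean squared error, usable on masks or on waveforms."""
    return float(np.mean((a - b) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(SR) / SR
    clean = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic stand-in for speech
    noise = 0.3 * rng.standard_normal(SR)       # synthetic stand-in for ambient noise
    noisy = clean + noise

    features = log_magnitude(noisy)             # what a model would receive
    target = ideal_ratio_mask(clean, noise)     # what a model would be trained to predict
    denoised = apply_mask(noisy, target)

    print("feature shape:", features.shape, "mask shape:", target.shape)
    print("waveform MSE before:", mse(noisy, clean), "after:", mse(denoised, clean))
    # Training loss for a trivial all-ones "prediction" against the target IRM
    print("mask MSE of all-ones mask:", mse(np.ones_like(target), target))
```

During training, the mask MSE shown on the last line would be computed between the network's output and the target IRM; at inference time only `apply_mask` is needed, with the predicted mask in place of the oracle one.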
U-Net architecture with skip connections preserves high-frequency speech details
Ideal Ratio Mask (IRM) prediction enables clean speech extraction from noisy spectrograms
Dynamic noise mixing at runtime eliminates the need for pre-paired clean/noisy datasets
16 kHz resampling and mono conversion standardize all inputs automatically
Emergency checkpointing saves progress on KeyboardInterrupt for long training runs
MSE loss between the predicted mask and the target IRM drives speech-noise separation
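As an illustration of the runtime noise mixing listed above, the sketch below pairs a clean clip with a random crop of a noise recording each time it is called. The function name `mix_at_snr`, the use of a target signal-to-noise ratio, and the looping of short noise clips are all assumptions made for this example; the description does not say how mixing levels are chosen.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db, rng):
    """Add a random crop of `noise` to `clean` at the requested SNR in dB.

    Returns (noisy, scaled_noise) so that a target mask can be computed
    from the clean signal and the exact noise that was added.
    """
    if len(noise) < len(clean):
        # Loop short noise recordings until they cover the clean clip
        repeats = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, repeats)

    # Pick a random window of the noise, same length as the clean clip
    start = rng.integers(0, len(noise) - len(clean) + 1)
    crop = noise[start : start + len(clean)]

    # Scale the noise so that clean_power / noise_power matches snr_db
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(crop ** 2) + 1e-12
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    scaled = gain * crop
    return clean + scaled, scaled


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(np.linspace(0, 100, 16_000))   # stand-in for a speech clip
    noise = rng.standard_normal(4_000)            # stand-in for a noise recording
    snr_db = rng.uniform(0, 15)                   # hypothetical SNR range
    noisy, scaled = mix_at_snr(clean, noise, snr_db, rng)
    print("requested SNR (dB):", snr_db, "noisy shape:", noisy.shape)
```

Because a fresh crop and level are drawn on every call, the same clean clip yields a different noisy example each epoch, which is what removes the need for a fixed set of pre-paired recordings.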