Publications

CONFERENCE (INTERNATIONAL)

Deep Multi-channel Speech Source Separation with Time-frequency Masking for Spatially Filtered Microphone Input Signal

Masahito Togami

28th European Signal Processing Conference (EUSIPCO 2020)

January 18, 2021

In this paper, we propose a multi-channel speech source separation technique that cascades unsupervised spatial filtering, which requires no deep neural network (DNN), with DNN-based speech source separation. In speech source separation, estimation of the spatial covariance matrix of each source is a crucial step. Recent studies have shown that the covariance matrix can be estimated effectively by weighting the cross-correlation of the microphone input signal with a time-frequency mask (TFM) inferred by a DNN. However, this implicitly assumes that a single source dominates each time-frequency bin; in practice, overlap of multiple speech sources degrades the estimation accuracy of the multi-channel covariance matrix. Instead, we propose a multi-channel covariance matrix estimation technique that applies the TFM to the speech signal pre-separated by the unsupervised spatial filter. The pre-filtered signal contains less overlap between speech sources, which increases the estimation accuracy of the covariance matrix. Experimental results show that the proposed multi-channel covariance matrix estimation technique is effective.
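The mask-weighted covariance estimation the abstract refers to can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation; the function name, array shapes, and the NumPy formulation are all assumptions made here for the example.

```python
import numpy as np

def masked_spatial_covariance(X, mask):
    """Estimate a per-frequency spatial covariance matrix for one source
    by weighting the cross-correlation of a multi-channel STFT with a
    time-frequency mask (TFM).

    X    : complex STFT, shape (channels, frames, freqs)  -- assumed layout
    mask : TFM in [0, 1], shape (frames, freqs)

    Returns an array of shape (freqs, channels, channels).
    """
    C, T, F = X.shape
    R = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        Xf = X[:, :, f]        # (C, T) observations at frequency bin f
        w = mask[:, f]         # (T,) mask weights for this bin
        # Mask-weighted sum of rank-1 cross-correlations x x^H,
        # normalized by the total mask weight.
        R[f] = (Xf * w) @ Xf.conj().T / max(w.sum(), 1e-8)
    return R
```

In the baseline described above, `X` would be the raw microphone STFT; in the proposed cascade it would instead be the output of the unsupervised spatial filter, so each masked bin carries less interference from the remaining sources.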

Paper: Deep Multi-channel Speech Source Separation with Time-frequency Masking for Spatially Filtered Microphone Input Signal (external link)