JOURNAL ARTICLE

Audio Representation Learning with Deep Neural Networks

Mohammad Rasool Izadi

Year: 2023
Journal: OPAL (Open@LaTrobe), La Trobe University
Publisher: La Trobe University

Abstract

In this dissertation, we examined three sequence-to-sequence representation challenges: source detection and separation, sound event detection, and disentanglement. For each challenge, we introduced distinct models and assessed their performance by conducting experiments on several datasets and by comparing these results with those of other established models. Our study spanned areas such as deep learning and representation learning, and it touched on bioacoustics, urban sounds, singing voices, and speech across a range of specific tasks.
First, we developed a source segmentation model to isolate an undetermined number of bat echolocation calls from mixed audio. This design used two interconnected models working together: the primary model detected candidate sources, while the secondary model isolated individual sources in the time-frequency domain.

Next, inspired by attention and graph neural networks, we presented a method to incorporate time-level similarities across the time domain, blending features from different layers with our adaptive affinity mixup technique. This enhancement boosted the event-F1 score of our sound event detection model by 8.2% on urban sounds.

Finally, we delved into weakly supervised disentanglement using a multi-rate latent space, putting forward a framework to represent and generate variable-length sequences from paired samples. Our method combines a straightforward swapping mechanism with variational transformers, and we provided a theoretical demonstration that swapping can attain optimal disentanglement under weak supervision. Experimental results on singing voices, speech, and images confirm that our technique consistently outperforms competing methods.

In conclusion, this dissertation offers novel approaches to sequence-to-sequence representation challenges, emphasizing the combination of state-of-the-art techniques and practical applications. Our findings advance the current understanding of sound source detection, event detection, and sequential disentanglement, and set a precedent for future research in these areas. The consistent improvements observed across tasks underscore the potential of the proposed methods in diverse audio domains, hinting at broader applications and further explorations in representation learning.
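The adaptive affinity mixup described above can be sketched in a few lines. All names and details here are hypothetical stand-ins, since the abstract does not give the model's exact formulation: compute a frame-to-frame affinity (similarity) matrix over time, use it to propagate features from one layer, and blend the result with features from another layer via a mixing weight (learned in the actual model, fixed in this sketch).

```python
import numpy as np

def affinity(feats):
    """Pairwise cosine similarity between time frames: (T, D) -> (T, T)."""
    norm = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return norm @ norm.T

def affinity_mixup(shallow, deep, alpha=0.5):
    """Blend affinity-propagated shallow features into deep features.

    `alpha` is fixed here; an adaptive variant would learn it.
    Both inputs are (T, D) frame-feature matrices.
    """
    aff = affinity(deep)
    aff = np.maximum(aff, 0.0)                           # non-negative weights
    aff = aff / (aff.sum(axis=1, keepdims=True) + 1e-8)  # row-normalize, attention-style
    return alpha * (aff @ shallow) + (1.0 - alpha) * deep

# Toy usage on random frame features.
rng = np.random.default_rng(0)
shallow = rng.normal(size=(8, 4))
deep = rng.normal(size=(8, 4))
mixed = affinity_mixup(shallow, deep, alpha=0.3)
```

With alpha=0 the deep features pass through untouched, so the mixup degrades gracefully to the unmodified model.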
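The swapping mechanism for weakly supervised disentanglement can likewise be illustrated with a toy model. Assume each sequence is encoded into a slowly varying (static) latent plus per-frame (dynamic) latents; for a pair known to share its static factor, swapping the static latents before decoding should leave both reconstructions unchanged. Everything below is a hypothetical illustration under that assumption, not the dissertation's implementation (which uses variational transformers).

```python
import numpy as np

def encode(x):
    """Toy encoder: static latent = time-average, dynamic = per-frame residual."""
    static = x.mean(axis=0)    # slow factor: one vector per sequence
    dynamic = x - static       # fast factor: one vector per frame
    return static, dynamic

def decode(static, dynamic):
    """Toy decoder: recombine the two factors."""
    return dynamic + static

def swap_reconstruct(xa, xb):
    """Swap static latents between a paired sample, then decode both."""
    sa, da = encode(xa)
    sb, db = encode(xb)
    return decode(sb, da), decode(sa, db)

# Build a pair that truly shares its static factor (zero-mean dynamics),
# with different lengths to mimic variable-length sequences.
rng = np.random.default_rng(1)
shared = rng.normal(size=4)
da = rng.normal(size=(6, 4)); da -= da.mean(axis=0)
db = rng.normal(size=(9, 4)); db -= db.mean(axis=0)
xa, xb = shared + da, shared + db
ra, rb = swap_reconstruct(xa, xb)
# Because the pair shares its static factor, swapping changes nothing.
```

In a training setup, penalizing the difference between the swapped reconstructions and the originals is what pressures the static latent to carry only the shared factor.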

Keywords:
Representation learning, Feature learning, Segmentation, Echolocation, Sound event detection, Deep learning, Deep neural networks

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.40

Topics

Computational Physics and Python Applications
Physical Sciences →  Computer Science →  Artificial Intelligence
Gene expression and cancer classification
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Molecular Biology
Big Data and Digital Economy
Physical Sciences →  Computer Science →  Information Systems

Related Documents

BOOK-CHAPTER

Feature Representation Learning in Deep Neural Networks

Dong Yu, Li Deng

Series: Signals and Communication Technology   Year: 2014   Pages: 157-175
JOURNAL ARTICLE

Deep representation-based transfer learning for deep neural networks

Tao Yang, Xia Yu, Ning Ma, Yifu Zhang, Hongru Li

Journal: Knowledge-Based Systems   Year: 2022   Vol: 253   Pages: 109526
JOURNAL ARTICLE

Neural Audio Coding with Deep Complex Networks

Jiawei Ru, Lizhong Wang, Maoshen Jia, Liang Wen, Handong Wang, Yuhao Zhao, Jing Wang

Journal: Journal of Physics: Conference Series   Year: 2024   Vol: 2759 (1)   Pages: 012005
JOURNAL ARTICLE

Unsupervised Point Cloud Representation Learning With Deep Neural Networks: A Survey

Aoran XiaoJiaxing HuangDayan GuanXiaoqin ZhangShijian LuLing Shao

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence   Year: 2023   Vol: 45 (9)   Pages: 11321-11339