Publications

A Speech Emotion Recognition Approach Using Discrete Wavelet Transform and Deep Learning Techniques in a Brazilian Portuguese Corpus

The study of emotion encompasses the cognitive processes and psychological states of the human mind. With the rapid advancement and declining cost of technology, researchers have become increasingly focused on capturing voice, gestures, facial expressions, and other expressions of emotion. In this study, we combine a Deep Learning model with the Wavelet Transform for Speech Emotion Recognition, the task of detecting and identifying emotions in informal, spontaneous speech from a Brazilian Portuguese corpus. Our approach achieves a macro F1-score of 0.566 and a ROC-AUC score of 0.7217 on the CORAA database, surpassing by up to 11% macro F1 the results of another work presented at the International Conference on Computational Processing of Portuguese Language 2022 that uses the same architecture together with transfer learning techniques. Our methodology integrates a deep learning model with advanced signal processing: we leverage a pre-trained large-scale neural network architecture tailored for audio analysis, incorporating Discrete Wavelet Transform and Mel Spectrogram features to enhance the model's performance, and we apply SpecAugment for effective data augmentation. Among the works presented at the event, our approach ranks second overall and first among methods that do not rely on open-set techniques, such as additional datasets or transfer learning during model training, while being one of the few to exceed the proposed baselines.
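The two signal-processing ingredients above, DWT sub-band features and SpecAugment-style masking, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the wavelet family, decomposition level, and mask sizes are assumed values, and the per-band log-energy summary is one common way to turn DWT coefficients into a feature vector.

```python
import numpy as np
import pywt  # PyWavelets


def dwt_features(signal, wavelet="db4", level=4):
    """Multi-level DWT; summarize each sub-band by its log-energy.

    Returns level + 1 values: one approximation band plus one
    detail band per decomposition level.
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])


def spec_augment(spec, num_freq_masks=1, num_time_masks=1,
                 max_f=8, max_t=16, rng=None):
    """SpecAugment-style augmentation: zero out randomly placed
    frequency bands and time spans of a (mel-)spectrogram."""
    rng = rng or np.random.default_rng(0)
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, max_f + 1)          # mask width
        f0 = rng.integers(0, max(1, n_mels - f))  # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, n_frames - t))
        spec[:, t0:t0 + t] = 0.0
    return spec
```

In a pipeline like the one described, the masked spectrogram and the wavelet features would be fed to the audio network; here they are only computed independently to show the shapes involved.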

Environmental Monitoring with Low-Processing Embedded AI through Sound Event Classification

In this work, we propose an embedded, low-processing Machine Learning solution designed to assist in environmental acoustic monitoring. The pre-processing stage employs the Wavelet Packet Transform, generating low-dimensional features that serve as inputs to a Gradient Boosting model for near-real-time classification of relevant sound events. We also introduce an event filter that checks whether a relevant event is occurring before sending features to the model, discarding frames until a sound event is detected. This approach makes the solution resilient to noise and wind-contaminated samples while optimizing memory, battery, and computational power usage. Finally, we ported the processing pipeline and the trained model to the C programming language and embedded them on the Nordic Thingy:53, a low-power hardware device equipped with a built-in digital Pulse Density Modulation microphone (the Vesper VM3011). To evaluate the efficacy of the proposed method, we compared it with a convolutional neural network approach using Mel-frequency cepstral coefficients, conducting tests on audio recordings of bird species found in forests of central and western Brazil, as well as samples of human activity-related sounds. The favorable classification scores, together with the embedded solution's long battery life, have the potential to greatly reduce the need for extensive environmental monitoring field surveys.
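The gated pipeline described above, a cheap event filter in front of Wavelet Packet features and a classifier, can be sketched as follows. This is a hedged prototype in Python rather than the deployed C code: the wavelet, decomposition level, frame length, and RMS threshold are illustrative assumptions, and `model` stands in for the trained Gradient Boosting classifier.

```python
import numpy as np
import pywt  # PyWavelets


def wpt_features(frame, wavelet="db2", level=3):
    """Wavelet Packet Transform features: the log-energy of each
    terminal node yields a compact vector of 2**level values."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="natural")
    return np.array([np.log(np.sum(n.data ** 2) + 1e-10) for n in nodes])


def event_gate(frame, threshold):
    """Cheap event filter: only pass frames whose RMS energy
    suggests a relevant sound event is occurring."""
    return np.sqrt(np.mean(frame ** 2)) > threshold


def classify_frame(frame, model, threshold=0.01):
    """Skip feature extraction and inference for quiet frames,
    saving battery and compute on the embedded device."""
    if not event_gate(frame, threshold):
        return None  # no event detected: frame ignored
    return model(wpt_features(frame))
```

Running the gate before feature extraction is what keeps the duty cycle low: on silent or wind-only frames the device never touches the transform or the model.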