2017

Enhancing acoustic based human trait recognition using intermediate features

Title: Enhancing acoustic based human trait recognition using intermediate features

Speaker: Nithin Thomas

Date: 28.11.2017, 11:00

Building/Room: Eichleitnerstraße 30 / 207

Contact: Master's thesis supervised by Dr. Hesam Sagha and Prof. Dr. habil. Björn Schuller at the University of Passau

Feature Set Optimisation for Multi-Lingual Emotion Recognition

Title: Feature Set Optimisation for Multi-Lingual Emotion Recognition

Speaker: Revathi Sadanand

Date: 28.11.2017, 11:00

Building/Room: Eichleitnerstraße 30 / 207

Contact: Master's thesis supervised by Dr. Hesam Sagha and Prof. Dr. habil. Björn Schuller at the University of Passau

End-to-End Audio Laughter Detection

Title: End-to-End Audio Laughter Detection

Speaker: Muhammad Mashood Tanveer

Date: 28.11.2017, 11:00

Building/Room: Eichleitnerstraße 30 / 207

Contact: Master's thesis supervised by Dr. Hesam Sagha and Prof. Dr. habil. Björn Schuller at the University of Passau

DE-ENIGMA 'Advancing Humanoid Robotics for Children on the Autism Spectrum'

There are over 5 million people with autism in the European Union. If their families are included, autism touches the lives of over 20 million Europeans. It affects the way a person communicates, understands and relates to others. People with autism often have difficulty using and understanding verbal and non-verbal language, which makes it hard for them to understand others and to interact with them. Getting the right support and therapies makes a substantial difference to people with autism. The overall aim of the DE-ENIGMA project is to realise robust, context-sensitive, multimodal and naturalistic human-robot interaction (HRI) aimed at enhancing the social imagination skills of children with autism. This extends, and contrasts considerably with, the current state of the art in machine analysis of facial, bodily, vocal and verbal behaviour as used in (commercially and otherwise) available human-centric HRI applications.

Title: DE-ENIGMA 'Advancing Humanoid Robotics for Children on the Autism Spectrum'

Speaker: Ms. Alice Baird

Date: 21.11.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg

An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech

The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance to the other tested acoustic feature representations in noise-type-matched train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
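
The pipeline above lends itself to a compact illustration. The following sketch extracts such deep spectrum features under stated assumptions: a pre-trained torchvision VGG16 stands in for the image CNN used in the study, the spectrogram is assumed to be already rendered as an RGB image, and the file name and helper function are hypothetical.

```python
# Minimal sketch: deep-spectrum-style features from a spectrogram image,
# assuming a pre-trained torchvision VGG16 (not necessarily the CNN of the study).
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()
# Keep the classifier up to (but excluding) the final class layer;
# its 4096-dimensional activations serve as the feature vector.
feature_head = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_spectrum_features(spectrogram_png: str) -> np.ndarray:
    """Map a spectrogram plot (RGB image) to a 4096-d feature vector."""
    img = preprocess(Image.open(spectrogram_png).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv = vgg.features(img)                    # convolutional feature maps
        flat = torch.flatten(vgg.avgpool(conv), 1)  # flatten for the FC layers
        feats = feature_head(flat)                  # activations of the FC stack
    return feats.squeeze(0).numpy()
```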

Title: An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech

Speaker: Shahin Amiriparian

Date: 21.11.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg / TUM

Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio

This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, which are treated as sequences of time-dependent frequency vectors. Then, we extract the learnt representations of the spectrograms from a fully connected layer between the encoder and decoder units and use them as feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. An accuracy of 88.0% is achieved on the official development set of the challenge – a relative improvement of 17.7% over the challenge baseline.
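
As an illustration of the approach, the following is a minimal sketch of a recurrent sequence-to-sequence autoencoder over mel-spectrogram frames; the layer sizes, the use of PyTorch, and the simplified teacher-forcing scheme are illustrative assumptions and not the configuration used for the challenge submission.

```python
# Minimal sketch of a recurrent sequence-to-sequence autoencoder over
# mel-spectrogram frames; layer sizes are illustrative, not those of the paper.
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, n_mels: int = 128, hidden: int = 256, repr_dim: int = 512):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # Fully connected bottleneck between encoder and decoder: its
        # activations are taken as the learnt audio representation.
        self.bottleneck = nn.Linear(hidden, repr_dim)
        self.to_decoder = nn.Linear(repr_dim, hidden)
        self.decoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.output = nn.Linear(hidden, n_mels)

    def forward(self, mel):                       # mel: (batch, frames, n_mels)
        _, h = self.encoder(mel)                  # h: (1, batch, hidden)
        rep = torch.tanh(self.bottleneck(h[-1]))  # (batch, repr_dim)
        h0 = self.to_decoder(rep).unsqueeze(0)    # initialise the decoder state
        # Simplified teacher forcing: decoder sees the input frames shifted by one.
        dec_in = torch.roll(mel, shifts=1, dims=1)
        out, _ = self.decoder(dec_in, h0.contiguous())
        return self.output(out), rep              # reconstruction + representation

model = Seq2SeqAutoencoder()
mel = torch.randn(4, 500, 128)                    # 4 dummy clips, 500 frames each
recon, representation = model(mel)
loss = nn.functional.mse_loss(recon, mel)         # train to reconstruct spectrograms
```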

 

Title: Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio

Speaker: Shahin Amiriparian

Date: 14.11.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg / TUM

Feature Selection in Multimodal Continuous Emotion Prediction

Advances in affective computing have been made by combining information from different modalities, such as audio, video, and physiological signals. However, increasing the number of modalities also increases the dimensionality of the associated feature vectors, leading to higher computational cost and possibly lower prediction performance. In this regard, we present a comparative study of feature reduction methodologies for continuous emotion recognition. We compare dimensionality reduction by principal component analysis, filter-based feature selection using canonical correlation analysis and correlation-based feature selection, as well as wrapper-based feature selection with sequential forward selection and competitive swarm optimisation. These approaches are evaluated on the AV+EC-2015 database using support vector regression. Our results demonstrate that the wrapper-based approaches typically outperform the other methodologies, while pruning a large number of irrelevant features.
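
To make the contrast between projection-based and wrapper-based reduction concrete, the following is a minimal scikit-learn sketch on synthetic data; the regressor, feature counts, and data are stand-ins, not the AV+EC-2015 setup or the exact selection algorithms evaluated in the study.

```python
# Minimal sketch comparing projection-based reduction (PCA) with a wrapper-based
# method (sequential forward selection around an SVR) on synthetic data.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=40, n_informative=10, noise=0.1)

# Projection onto 10 principal components before regression.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=10), SVR())
print("PCA + SVR R^2:", cross_val_score(pca_model, X, y, cv=5).mean())

# Wrapper: greedily add the features that most improve the SVR itself.
sfs_model = make_pipeline(
    StandardScaler(),
    SequentialFeatureSelector(SVR(), n_features_to_select=10,
                              direction="forward", cv=3),
    SVR(),
)
print("SFS + SVR R^2:", cross_val_score(sfs_model, X, y, cv=5).mean())
```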

 

Title: Feature Selection in Multimodal Continuous Emotion Prediction

Speaker: Shahin Amiriparian

Date: 14.11.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg / TUM

CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms

The adage that there is no data like more data is not new in affective computing; however, with recent advances in deep learning technologies, such as end-to-end learning, the need for extracting big data is greater than ever. Multimedia resources available on social media represent a wealth of data more than large enough to satisfy this need. However, an often prohibitive amount of effort has been required to source and label such instances. As a solution, we introduce Cost-efficient Audio-visual Acquisition via Social-media Small-world Targeting (CAS2T) for efficient large-scale big data collection from online social media platforms. Our system is based on a unique combination of small-world modelling, unsupervised audio analysis, and semi-supervised active learning. Such an approach facilitates rapid training on entirely new tasks sourced in their entirety from social multimedia. We demonstrate the high capability of our methodology via collection of original datasets containing a range of naturalistic, in-the-wild examples of human behaviours.
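
A rough sketch of the small-world targeting idea is given below: starting from a few seed clips, related-media links are followed breadth-first so that a large candidate pool is reached within a few hops. The function related_items() is hypothetical (a real system would query a platform API), and the unsupervised audio analysis and active-learning stages of CAS2T are not shown.

```python
# Minimal sketch of the small-world idea behind the targeting step: starting
# from a few seed clips, repeatedly follow "related media" links so that a
# large pool of candidate clips is reached within a few hops.
# related_items() is hypothetical; a real system would call a platform API.
from collections import deque

def small_world_crawl(seeds, related_items, max_items=10_000, max_hops=4):
    """Breadth-first expansion over related-media links."""
    visited, queue = set(seeds), deque((s, 0) for s in seeds)
    while queue and len(visited) < max_items:
        item, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for nxt in related_items(item):          # neighbours in the media graph
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, hops + 1))
    return visited                               # candidates for audio analysis
```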

Title: CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms

Speaker: Shahin Amiriparian

Date: 14.11.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg / TUM

The SEILS dataset: Symbolically Encoded Scores in Modern Ancient Notation for Computational Musicology

The automatic analysis of notated Renaissance music is restricted by a shortfall in codified repertoire. Thousands of scores have been digitised by music libraries across the world, but the absence of symbolically codified information makes these inaccessible for computational evaluation. Optical Music Recognition (OMR) has made great progress in addressing this issue; however, early notation is still an ongoing challenge for OMR. To this end, we present the Symbolically Encoded “Il Lauro Secco” (SEILS) dataset, a new dataset of codified scores for use within computational musicology. We focus on a collection of Italian madrigals from the 16th century, a polyphonic secular a cappella genre characterised by strong musical-linguistic synergies. Thirty madrigals for five unaccompanied voices are presented in modern and early notation, in a variety of digital formats: Lilypond, MusicXML, MIDI, and Finale (a total of 150 symbolically codified scores). Given the musical and poetic value of the chosen repertoire, we aim to promote synergies between computational musicology and linguistics.
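
As a usage hint, the following sketch loads one of the symbolically encoded scores, assuming the MusicXML flavour of the dataset and the music21 library; the file path is hypothetical.

```python
# Minimal sketch of loading one of the symbolically encoded madrigals,
# assuming the MusicXML format and the music21 library (hypothetical path).
from music21 import converter

score = converter.parse("seils/il_lauro_secco/madrigal_01.xml")
print("Number of voices:", len(score.parts))      # madrigals here have five voices
for part in score.parts:
    notes = list(part.flatten().notes)            # all notes/chords in this voice
    print(part.partName, "-", len(notes), "notes")
```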

Title: The SEILS dataset: Symbolically Encoded Scores in Modern Ancient Notation for Computational Musicology

Speaker: Emilia Parada-Cabaleiro

Date: 07.11.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg

Web and mobile based intervention to enhance health

Title: Web and mobile based intervention to enhance health

Speaker: Dr. Eva Maria Rathner

Date: 03.11.2017, 15:00

Building/Room: Eichleitnerstraße 30 / 207

Contact: Department of Clinical Psychology and Psychotherapy, University of Ulm

Wavelets Revisited for the Classification of Acoustic Scenes

We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features perform comparably to the typically used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of models trained on wavelets and on typical acoustic features reaches the best averaged 4-fold cross-validation accuracies of 83.2% and 82.6% with SVMs and GRNNs, respectively; both significantly outperform the baseline (74.8%) on the official development set (p<0.001, one-tailed z-test).
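
For illustration, the sketch below computes simple wavelet features (sub-band log-energies from a discrete wavelet decomposition) and feeds them to an SVM on dummy data; the wavelet family, decomposition depth, and feature functionals are assumptions and differ from the feature set used in the paper.

```python
# Minimal sketch of wavelet features (sub-band log-energies from a discrete
# wavelet decomposition) classified with an SVM; configuration is illustrative.
import numpy as np
import pywt
from sklearn.svm import SVC

def wavelet_features(signal, wavelet="db4", level=5):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # One log-energy value per sub-band (approximation + detail bands).
    return np.array([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])

# Dummy data: 20 "scenes", 1 s of audio at 16 kHz each, two classes.
rng = np.random.default_rng(0)
X = np.stack([wavelet_features(rng.standard_normal(16000)) for _ in range(20)])
y = np.repeat([0, 1], 10)
clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))
```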

Title: Wavelets Revisited for the Classification of Acoustic Scenes

Speaker: Kun Qian

Date: 24.10.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg / TUM

Deep Sequential Image Features for Acoustic Scene Classification

For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from the short-time Fourier transform and scalograms of the audio scenes using convolutional neural networks. To the best of our knowledge, this is the first time that bump and Morse scalograms have been investigated for acoustic scene classification in such a context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, 'deep features' are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed separately into gated recurrent neural networks for classification. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80.9%, an increase of 6.1% over the official baseline (p<.001 by one-tailed z-test).
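
The final fusion step can be illustrated compactly: for each instance, the decision of the subsystem whose two highest class posteriors are furthest apart is kept. The sketch below assumes dummy posterior matrices in place of the three trained subsystems.

```python
# Minimal sketch of margin-sampling fusion: per instance, keep the decision of
# the subsystem whose top-1 and top-2 class posteriors are furthest apart.
# The probabilities here are dummies standing in for the subsystem outputs.
import numpy as np

def margin_fusion(prob_list):
    """prob_list: list of (n_instances, n_classes) posterior arrays."""
    probs = np.stack(prob_list)                        # (n_models, n, c)
    sorted_p = np.sort(probs, axis=-1)
    margins = sorted_p[..., -1] - sorted_p[..., -2]    # top-1 minus top-2
    best_model = margins.argmax(axis=0)                # per-instance winner
    n = probs.shape[1]
    return probs[best_model, np.arange(n)].argmax(axis=-1)

preds = margin_fusion([np.random.dirichlet(np.ones(15), 4) for _ in range(3)])
print(preds)                                           # fused class labels
```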

 

Title: Deep Sequential Image Features for Acoustic Scene Classification

Speaker: Zhao Ren

Date: 24.10.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg

Computer Vision for Human Facial Expression analysis

Title: Computer Vision for Human Facial Expression analysis

Speaker: Michel Valstar

Date: 18.10.2017

Building/Room: Eichleitnerstraße 30 / 306

Contact: University of Nottingham

VoicePlay – An Affective Sports Game Operated by Speech Emotion Recognition based on the Component Process Model

Title: VoicePlay – An Affective Sports Game Operated by Speech Emotion Recognition based on the Component Process Model

Speaker: Gerhard Hagerer

Date: 17.10.2017

Building/Room: Eichleitnerstraße 30 / 306

Contact: University of Augsburg

Sentiment Analysis Using Image-based Deep Spectrum Features

We test the suitability of our novel deep spectrum feature representation for performing speech-based sentiment analysis. Deep spectrum features are formed by passing spectrograms through a pre-trained image convolutional neural network (CNN) and have been shown to capture useful emotion information in speech; however, their usefulness for sentiment analysis is yet to be investigated. Using a data set of movie reviews collected from YouTube, we compare deep spectrum features combined with the bag-of-audio-words (BoAW) paradigm with a state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) based BoAW system when performing a binary sentiment classification task. Key results presented indicate the suitability of both features for the proposed task. The deep spectrum features achieve an unweighted average recall of 74.5%. The results provide further evidence for the effectiveness of deep spectrum features as a robust feature representation for speech analysis.
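
For context, the sketch below illustrates the bag-of-audio-words idea used for comparison: frame-level descriptors are quantised against a learnt codebook and each clip becomes a normalised histogram of codeword counts. MFCC descriptors, the codebook size, and the use of librosa and scikit-learn are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of the bag-of-audio-words (BoAW) idea: quantise frame-level
# descriptors against a learnt codebook and represent each clip as a
# normalised histogram of codeword counts. Configuration is illustrative.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (frames, 13)

def boaw_histograms(paths, codebook_size=64):
    frames = [mfcc_frames(p) for p in paths]
    codebook = KMeans(n_clusters=codebook_size, n_init=4).fit(np.vstack(frames))
    hists = []
    for f in frames:
        words = codebook.predict(f)                          # codeword per frame
        h = np.bincount(words, minlength=codebook_size).astype(float)
        hists.append(h / h.sum())                            # normalised histogram
    return np.stack(hists)

# Usage (hypothetical file names):
# X = boaw_histograms(["review_01.wav", "review_02.wav"])
```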

 

Title: Sentiment Analysis Using Image-based Deep Spectrum Features

Speaker: Shahin Amiriparian

Date: 17.10.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg / TUM

From Hard to Soft: Towards more Human-like Emotion Recognition by Modelling the Perception Uncertainty

Over the last decade, automatic emotion recognition has become well established. The gold-standard target is thereby usually calculated based on multiple annotations from different raters. All related efforts assume that the emotional state of a human subject can be identified by a 'hard' category or a unique value. This assumption tries to ease the human observer's subjectivity when observing patterns such as the emotional state of others. However, as the number of annotators cannot be infinite, uncertainty remains in the emotion target even if it is calculated from several, yet few, human annotators. The common procedure of using this same emotion target in the learning process thus inevitably introduces noise in the form of an uncertain learning target. In this light, we propose a 'soft' prediction framework to provide a more human-like and comprehensive prediction of emotion. In our novel framework, we provide an additional target to indicate the uncertainty of human perception based on the inter-rater disagreement level, in contrast to the traditional framework, which merely produces one single prediction (category or value). To exploit the dependency between the emotional state and the newly introduced perception uncertainty, we implement a multi-task learning strategy. To evaluate the feasibility and effectiveness of the proposed soft prediction framework, we perform extensive experiments on a time- and value-continuous spontaneous audiovisual emotion database, including late fusion results. We show that the soft prediction framework with multi-task learning of the emotional state and its perception uncertainty significantly outperforms the individual tasks in both the arousal and valence dimensions.
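
A minimal sketch of the multi-task idea follows: a shared recurrent trunk with two regression heads, one for the emotion value and one for the inter-rater disagreement used as the perception-uncertainty target. The architecture, PyTorch framing, and loss weighting are illustrative assumptions, not the model used in the study.

```python
# Minimal sketch of multi-task 'soft' prediction: one shared recurrent trunk
# with two heads, regressing the emotion value and the inter-rater
# disagreement (perception uncertainty). Sizes and weights are illustrative.
import torch
import torch.nn as nn

class SoftEmotionModel(nn.Module):
    def __init__(self, n_features=88, hidden=64):
        super().__init__()
        self.trunk = nn.LSTM(n_features, hidden, batch_first=True)
        self.emotion_head = nn.Linear(hidden, 1)      # e.g. arousal or valence
        self.uncertainty_head = nn.Linear(hidden, 1)  # inter-rater disagreement

    def forward(self, x):                             # x: (batch, time, features)
        h, _ = self.trunk(x)
        return self.emotion_head(h), self.uncertainty_head(h)

model = SoftEmotionModel()
x = torch.randn(8, 100, 88)                           # dummy feature sequences
gold_emotion = torch.randn(8, 100, 1)                 # mean over raters
gold_uncertainty = torch.rand(8, 100, 1)              # e.g. std. dev. over raters
pred_e, pred_u = model(x)
loss = nn.functional.mse_loss(pred_e, gold_emotion) + \
       0.5 * nn.functional.mse_loss(pred_u, gold_uncertainty)  # joint objective
```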

 

Title: From Hard to Soft: Towards more Human-like Emotion Recognition by Modelling the Perception Uncertainty

Speaker: Jing Han

Date: 17.10.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg

The Perception of Emotion in the Singing Voice

With the increased usage of internet-based services and the mass of digital content now available online, the organisation of such content has become a major topic of interest both commercially and within academic research. Adding emotional understanding of the content is a relevant parameter not only for music classification within digital libraries but also for improving users' experiences, via services including automated music recommendation. Despite the singing voice being well known for the natural communication of emotion, it is still unclear which specific musical characteristics of this signal are involved in such affective expressions. The presented study investigates which musical parameters of singing relate to the emotional content by evaluating the perception of emotion in electronically manipulated a cappella audio samples. A group of 24 individuals participated in a perception test evaluating the emotional dimensions of arousal and valence for 104 sung instances. Key results indicate that the rhythmic-melodic contour is potentially related to the perception of arousal, whereas musical syntax and tempo can alter the perception of valence.

Title: The Perception of Emotion in the Singing Voice

Speaker: Emilia Parada-Cabaleiro

Date: 17.10.2017

Building/Room: Eichleitnerstraße 30 / 207

Contact: University of Augsburg
