Datasets created by members of the APTLY lab can be found here:

  1. The Fake-or-Real Dataset
  2. The BaHaMe Dataset

The Fake-or-Real Dataset

The Fake-or-Real (FoR) dataset is a collection of more than 195,000 utterances of real human and computer-generated speech. The dataset can be used to train classifiers to detect synthetic speech.

The dataset aggregates data from the latest TTS solutions (such as Deep Voice 3 and Google Wavenet TTS) as well as a variety of real human speech, including the Arctic dataset, the LJSpeech dataset, the VoxForge dataset and our own speech recordings.

The dataset is published in four versions: for-original, for-norm, for-2sec and for-rerec.

The first version, named for-original, contains the files as collected from the speech sources, without any modification other than class balancing.

The second version, called for-norm, contains the same files, but balanced in terms of gender and class, and normalized in terms of sample rate, volume and number of channels.
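As a minimal sketch of the kind of per-file normalization described above (the exact pipeline, channel layout and target levels used for for-norm are assumptions here), the following downmixes interleaved stereo samples to mono and peak-normalizes the volume:

```python
# Sketch of two normalization steps similar in spirit to for-norm:
# downmixing to one channel and peak-normalizing volume.
# The concrete parameters (target peak, channel count) are assumptions.

def downmix_to_mono(samples, channels):
    """Average interleaved integer samples down to a single channel."""
    if channels == 1:
        return list(samples)
    return [
        sum(samples[i:i + channels]) // channels
        for i in range(0, len(samples), channels)
    ]

def peak_normalize(samples, target_peak=32000):
    """Scale 16-bit samples so the loudest one reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [int(s * target_peak / peak) for s in samples]

stereo = [1000, 2000, -4000, 4000, 0, 0]    # 3 interleaved L/R frames
mono = downmix_to_mono(stereo, channels=2)  # -> [1500, 0, 0]
loud = peak_normalize([100, -200, 50])      # -> [16000, -32000, 8000]
```

Resampling to a common sample rate would normally be done with an audio library rather than by hand; it is omitted here to keep the sketch self-contained.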

The third one, named for-2sec, is based on the second one, but with each file truncated to 2 seconds.
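Truncation of the kind used for for-2sec can be sketched with Python's standard-library wave module (the file paths and the assumption that only the leading 2 seconds are kept are illustrative):

```python
import wave

# Sketch of the kind of truncation behind for-2sec: keep at most the
# first 2 seconds of a WAV file. Paths here are hypothetical examples.

def truncate_wav(src_path, dst_path, seconds=2.0):
    """Copy src_path to dst_path, keeping at most `seconds` of audio."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_frames = min(src.getnframes(), int(params.framerate * seconds))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # writeframes fixes the frame count on close
        dst.writeframes(frames)

# Usage (hypothetical paths):
# truncate_wav("utterance.wav", "utterance_2sec.wav")
```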

The last version, named for-rerec, is a re-recorded version of the for-2sec dataset, simulating a scenario in which an attacker sends an utterance through a voice channel (e.g. a phone call or a voice message).

The BaHaMe Dataset

The BaHaMe dataset consists of audio files in WAV format generated from MIDI files gathered from the BitMidi website. The total length of the songs is approximately 150 minutes, and each song is a three-part arrangement: a melodic part, a harmonic part that plays chords as accompaniment, and a bass part. Four different arrangements (named A1–A4) are created from each MIDI file, each generated by assigning a different instrument to each part. For each arrangement there is audio for the sub-arrangements of the melody only (M), melody and harmony (MH), and all three parts (MHB).
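The arrangement/sub-arrangement grid described above can be enumerated as follows (the "song01" identifier and the file-naming scheme are hypothetical illustrations, not the dataset's actual convention):

```python
# Enumerate the stems described above: four arrangements (A1-A4), each
# rendered as melody only (M), melody and harmony (MH), and all three
# parts (MHB). The filename pattern below is a hypothetical example.

ARRANGEMENTS = ["A1", "A2", "A3", "A4"]
SUB_ARRANGEMENTS = ["M", "MH", "MHB"]

def stems_for_song(song_id):
    """Return the 12 expected audio-file names for one source MIDI."""
    return [
        f"{song_id}_{arr}_{sub}.wav"
        for arr in ARRANGEMENTS
        for sub in SUB_ARRANGEMENTS
    ]

stems = stems_for_song("song01")  # 4 arrangements x 3 sub-arrangements
```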