numpy_datasets.timeseries

numpy_datasets.timeseries.audiomnist.load([path]) digit recognition
numpy_datasets.timeseries.univariate_timeseries.load([path])
numpy_datasets.timeseries.dcase_2019_task4.load([path]) synthetic data for polyphonic event detection
numpy_datasets.timeseries.groove_MIDI.load([path]) The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming.
numpy_datasets.timeseries.speech_commands.load([path])
numpy_datasets.timeseries.picidae.load([path])
numpy_datasets.timeseries.esc.load([path]) ESC-10/50: Environmental Sound Classification
numpy_datasets.timeseries.warblr.load([path]) Binary audio classification, presence or absence of a bird.
numpy_datasets.timeseries.gtzan.load([path]) music genre classification
numpy_datasets.timeseries.irmas.load([path]) music instrument classification
numpy_datasets.timeseries.vocalset.load([path]) singer/technique/vowel of singing voices
numpy_datasets.timeseries.freefield1010.load([path]) Audio binary classification, presence or absence of bird songs.
numpy_datasets.timeseries.birdvox_70k.load([path]) a dataset for avian flight call detection in half-second clips
numpy_datasets.timeseries.birdvox_dcase_20k.load([path]) Binary bird detection classification
numpy_datasets.timeseries.seizures_neonatal.load([path]) A dataset of neonatal EEG recordings with seizures annotations
numpy_datasets.timeseries.sonycust.load([path]) multilabel urban sound classification
numpy_datasets.timeseries.TUTacousticscenes2017.load([path]) Acoustic Scene classification

Detailed description

numpy_datasets.timeseries.audiomnist.load(path=None)[source]
digit recognition
https://github.com/soerenab/AudioMNIST

A simple audio/speech dataset consisting of recordings of spoken digits in wav files at 48kHz. The recordings are trimmed so that they have near minimal silence at the beginnings and ends.

FSDD is an open dataset, which means it will grow over time as data is contributed. In order to enable reproducibility and accurate citation the dataset is versioned using Zenodo DOI as well as git tags.

Current status

  • 4 speakers
  • 2,000 recordings (50 of each digit per speaker)
  • English pronunciations
numpy_datasets.timeseries.univariate_timeseries.load(path=None)[source]
Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • train_images (array)
  • train_labels (array)
  • valid_images (array)
  • valid_labels (array)
  • test_images (array)
  • test_labels (array)
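
A minimal usage sketch, assuming the six arrays are returned as a single tuple in the documented order (the import path follows the module names listed on this page):

    from numpy_datasets.timeseries import univariate_timeseries

    # With no argument the loader looks for the data under $DATASET_PATH and
    # downloads it there if it is not already present; pass path="/some/dir"
    # to use another location.
    (train_images, train_labels,
     valid_images, valid_labels,
     test_images, test_labels) = univariate_timeseries.load()

    print(train_images.shape, train_labels.shape)
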
numpy_datasets.timeseries.speech_commands.load(path=None)[source]
numpy_datasets.timeseries.picidae.load(path=None)[source]
Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • wavs (array) – the waveforms in the time amplitude domain
  • labels (array) – binary values indicating the presence or absence of a bird
  • flag (array) – the Xeno-Canto ID
numpy_datasets.timeseries.esc.load(path=None)[source]

ESC-10/50: Environmental Sound Classification

https://github.com/karolpiczak/ESC-50#download

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories:

  • Animals
  • Natural soundscapes & water sounds
  • Human, non-speech sounds
  • Interior/domestic sounds
  • Exterior/urban noises

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.

Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • wavs (array) – the waveforms as a numpy array (matrix) with the first dimension indexing the clips and the second dimension time
  • fine_labels (array) – the fine-grained class labels (50 different classes) as an integer vector
  • coarse_labels (array) – the labels of the 5 major categories as an integer vector
  • folds (array) – the fold as an integer from 1 to 5 specifying how to split the data; a fold should never be split between train and test, since the same recording (different subparts) would then appear in both, optimistically biasing the results
  • esc10 (array) – a boolean vector specifying whether the corresponding datum (wav, label, …) belongs to the ESC-10 subset; to load ESC-10, simply load ESC-50 and use this vector to extract only the ESC-10 data, as in the sketch below
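
A minimal usage sketch of the folds and esc10 vectors, assuming load returns the five documented arrays as a tuple in the order listed above:

    from numpy_datasets.timeseries import esc

    wavs, fine_labels, coarse_labels, folds, esc10 = esc.load()

    # ESC-10 is a subset of ESC-50: extract it with the boolean vector.
    wavs_10, labels_10 = wavs[esc10], fine_labels[esc10]

    # Fold-aware split: keep whole folds on either side so that subparts of
    # the same source recording never end up in both train and test.
    test_fold = 5
    train_mask, test_mask = folds != test_fold, folds == test_fold
    x_train, y_train = wavs[train_mask], fine_labels[train_mask]
    x_test, y_test = wavs[test_mask], fine_labels[test_mask]
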
numpy_datasets.timeseries.warblr.load(path=None)[source]

Binary audio classification, presence or absence of a bird.

Warblr comes from a UK bird-sound crowdsourcing research spinout called Warblr. From this initiative we have 10,000 ten-second smartphone audio recordings from around the UK, totalling around 44 hours of audio. The audio will be published by Warblr under a Creative Commons licence. It covers a wide distribution of UK locations and environments, and includes weather noise, traffic noise, human speech and even human bird imitations. It is directly representative of the data that is collected from a mobile crowdsourcing initiative.

numpy_datasets.timeseries.gtzan.load(path=None)[source]

music genre classification

This dataset was used for the well known paper in genre classification “Musical genre classification of audio signals” by G. Tzanetakis and P. Cook in IEEE Transactions on Audio and Speech Processing 2002.

Unfortunately the database was collected gradually and very early on in my research so I have no titles (and obviously no copyright permission etc). The files were collected in 2000-2001 from a variety of sources including personal CDs, radio, microphone recordings, in order to represent a variety of recording conditions. Nevertheless I have been providing it to researchers upon request mainly for comparison purposes etc. Please contact George Tzanetakis (gtzan@cs.uvic.ca) if you intend to publish experimental results using this dataset.

There are some practical and conceptual issues with this dataset, described in “The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use” by B. Sturm on arXiv 2013.

numpy_datasets.timeseries.irmas.load(path=None)[source]

music instrument classification

ref https://zenodo.org/record/1290750#.WzCwSRyxXMU

This dataset includes musical audio excerpts with annotations of the predominant instrument(s) present. It was used for the evaluation in the following article:

Bosch, J. J., Janer, J., Fuhrmann, F., & Herrera, P. “A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals”, in Proc. ISMIR (pp. 559-564), 2012

Please Acknowledge IRMAS in Academic Research

IRMAS is intended to be used for training and testing methods for the automatic recognition of predominant instruments in musical audio. The instruments considered are: cello, clarinet, flute, acoustic guitar, electric guitar, organ, piano, saxophone, trumpet, violin, and human singing voice. This dataset is derived from the one compiled by Ferdinand Fuhrmann in his PhD thesis, with the difference that we provide audio data in stereo format, the annotations in the testing dataset are limited to specific pitched instruments, and the excerpts differ in number and length.

numpy_datasets.timeseries.vocalset.load(path=None)[source]

singer/technique/vowel of singing voices

source: https://zenodo.org/record/1442513#.W7OaFBNKjx4

We present VocalSet, a singing voice dataset consisting of 10.1 hours of monophonic recorded audio of professional singers demonstrating both standard and extended vocal techniques on all 5 vowels. Existing singing voice datasets aim to capture a focused subset of singing voice characteristics, and generally consist of just a few singers. VocalSet contains recordings from 20 different singers (9 male, 11 female) and a range of voice types. VocalSet aims to improve the state of existing singing voice datasets and singing voice research by capturing not only a range of vowels, but also a diverse set of voices on many different vocal techniques, sung in contexts of scales, arpeggios, long tones, and excerpts.

Parameters:path (str (optional)) – a string where to load the data and download if not present

Returns:
  • singers (list) – the list of singers as strings, 9 males and 11 females, as in male1, male2, …
  • genders (list) – the list of genders of the singers, as in male, male, female, …
  • vowels (list) – the vowels being pronounced
  • data (list) – the list of waveforms, not all of equal length (see the padding sketch below)
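
Because the waveforms have different lengths, a common preprocessing step is to zero-pad them into a single matrix. A hedged sketch, assuming load returns the four documented lists as a tuple in the order listed above:

    import numpy as np
    from numpy_datasets.timeseries import vocalset

    singers, genders, vowels, data = vocalset.load()

    # Zero-pad every clip to the length of the longest one so the list of
    # variable-length waveforms can be stacked into one (n_clips, max_len) array.
    max_len = max(len(w) for w in data)
    batch = np.zeros((len(data), max_len), dtype=np.float32)
    for i, w in enumerate(data):
        batch[i, :len(w)] = w
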
numpy_datasets.timeseries.freefield1010.load(path=None)[source]

Audio binary classification, presence or absence of bird songs. freefield1010 is a collection of over 7,000 excerpts from field recordings around the world, gathered by the FreeSound project and then standardised for research. This collection is very diverse in location and environment, and for the BAD Challenge we have newly annotated it for the presence/absence of birds.

numpy_datasets.timeseries.birdvox_70k.load(path=None)[source]

a dataset for avian flight call detection in half-second clips

Version 1.0, April 2018.

Created By

Vincent Lostanlen (1, 2, 3), Justin Salamon (2, 3), Andrew Farnsworth (1), Steve Kelling (1), and Juan Pablo Bello (2, 3).

(1): Cornell Lab of Ornithology (CLO) (2): Center for Urban Science and Progress, New York University (3): Music and Audio Research Lab, New York University

https://wp.nyu.edu/birdvox

Description

The BirdVox-70k dataset contains 70k half-second clips from 6 audio recordings in the BirdVox-full-night dataset, each about ten hours in duration. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA during the fall of 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10.

Andrew Farnsworth used the Raven software to pinpoint every avian flight call in time and frequency. He found 35402 flight calls in total. He estimates that about 25 different species of passerines (thrushes, warblers, and sparrows) are present in this recording. Species are not labeled in BirdVox-70k, but it is possible to tell apart thrushes from warblers and sparrows by looking at the center frequencies of their calls. The annotation process took 102 hours.

The dataset can be used, among other things, for the research, development and testing of bioacoustic classification models, including the reproduction of the results reported in [1].

For details on the hardware of ROBIN recording units, we refer the reader to [2].

[1] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection. Proc. IEEE ICASSP, 2018.

[2] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.

@inproceedings{lostanlen2018icassp,
  title     = {BirdVox-full-night: a dataset and benchmark for avian flight call detection},
  author    = {Lostanlen, Vincent and Salamon, Justin and Farnsworth, Andrew and Kelling, Steve and Bello, Juan Pablo},
  booktitle = {Proc. IEEE ICASSP},
  year      = {2018},
  publisher = {IEEE},
  venue     = {Calgary, Canada},
  month     = {April},
}

Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • wavs (array(70804, 12000)) – the waveforms in the time amplitude domain
  • labels (array(70804,)) – binary values indicating the presence or absence of an avian flight call
  • recording (array(70804,)) – the recording (file) number from which each clip was extracted (see the splitting sketch below)
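
A hedged sketch of a recording-aware split, assuming load returns the three documented arrays as a tuple in the order listed above: holding out all clips from one of the six original recordings keeps train and test disjoint with respect to sensor and night.

    import numpy as np
    from numpy_datasets.timeseries import birdvox_70k

    wavs, labels, recording = birdvox_70k.load()

    # Hold out every half-second clip coming from one full-night recording.
    held_out = np.unique(recording)[0]
    train_mask, test_mask = recording != held_out, recording == held_out
    x_train, y_train = wavs[train_mask], labels[train_mask]
    x_test, y_test = wavs[test_mask], labels[test_mask]
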
numpy_datasets.timeseries.birdvox_dcase_20k.load(path=None)[source]

Binary bird detection classification

The dataset is 16.5 GB compressed.

BirdVox-DCASE-20k: a dataset for bird audio detection in 10-second clips

Version 2.0, March 2018.

link

Description

The BirdVox-DCASE-20k dataset contains 20,000 ten-second audio recordings. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA during the fall of 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10.

Out of these 20,000 recordings, 10,017 (50.09%) contain at least one bird vocalization (either song, call, or chatter).

The dataset is a derivative work of the BirdVox-full-night dataset [1], containing almost as much data but formatted into ten-second excerpts rather than ten-hour full night recordings.

In addition, the BirdVox-DCASE-20k dataset is provided as a development set in the context of the “Bird Audio Detection” challenge, organized by DCASE (Detection and Classification of Acoustic Scenes and Events) and the IEEE Signal Processing Society.

The dataset can be used, among other things, for the development and evaluation of bioacoustic classification models.

We refer the reader to [1] for details on the distribution of the data and [2] for details on the hardware of ROBIN recording units.

[1] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J.P. Bello. “BirdVox-full-night: a dataset and benchmark for avian flight call detection”, Proc. IEEE ICASSP, 2018.

[2] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.

Data Files

The wav folder contains the recordings as WAV files, sampled at 44.1 kHz, with a single channel (mono). The original sample rate was 24 kHz.

The name of each wav file is a random 128-bit UUID (Universal Unique IDentifier) string, which is randomized with respect to the origin of the recording in BirdVox-full-night, both in terms of time (UTC hour at the start of the excerpt) and space (location of the sensor).

The origin of each 10-second excerpt is known by the challenge organizers, but not disclosed to the participants.

Please Acknowledge BirdVox-DCASE-20k in Academic Research

When BirdVox-DCASE-20k is used for academic research, we would highly appreciate it if scientific publications of works partly based on this dataset cite the following publication:

V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello. “BirdVox-full-night: a dataset and benchmark for avian flight call detection”, Proc. IEEE ICASSP, 2018.

The creation of this dataset was supported by NSF grants 1125098 (BIRDCAST) and 1633259 (BIRDVOX), a Google Faculty Award, the Leon Levy Foundation, and two anonymous donors.

Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • wavs (array) – the waveforms in the time amplitude domain
  • labels (array) – binary values indicating the presence or absence of a bird vocalization
  • recording (array) – the file number from which the sample has been extracted
numpy_datasets.timeseries.seizures_neonatal.load(path=None)[source]

A dataset of neonatal EEG recordings with seizures annotations

source: https://zenodo.org/record/2547147

Neonatal seizures are a common emergency in the neonatal intensive care unit (NICU). There are many questions yet to be answered regarding the temporal/spatial characteristics of seizures from different pathologies, response to medication, effects on neurodevelopment and optimal detection. This dataset contains EEG recordings from human neonates and the visual interpretation of the EEG by human experts. Multi-channel EEG was recorded from 79 term neonates admitted to the NICU at the Helsinki University Hospital. The median recording duration was 74 minutes (IQR: 64 to 96 minutes). EEGs were annotated by three experts for the presence of seizures. An average of 460 seizures were annotated per expert in the dataset; 39 neonates had seizures by consensus and 22 were seizure free by consensus. The dataset can be used as a reference set of neonatal seizures, for the development of automated methods of seizure detection and other EEG analysis, as well as for the analysis of inter-observer agreement.

Parameters:path (str (optional)) – a string where to load the data and download if not present

Returns:
  • annotations (list) – the list of multichannel binary vectors indicating the presence or absence of seizure, with 3 channels corresponding to the 3 expert annotations (see the consensus sketch below)
  • waveforms (list) – the list of (channels, TIME) multichannel EEGs
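
A hedged sketch of turning the three expert channels into a single consensus label track, assuming load returns the two documented lists in the order above and that each annotation is a (3, TIME) binary array:

    import numpy as np
    from numpy_datasets.timeseries import seizures_neonatal

    annotations, waveforms = seizures_neonatal.load()

    # One recording: rows are the three experts, columns are time steps.
    expert_votes = np.asarray(annotations[0])

    # "Seizure by consensus" keeps only the time steps where all three experts
    # agree; a majority vote (at least 2 of 3) is a more permissive alternative.
    consensus = expert_votes.min(axis=0)
    majority = (expert_votes.sum(axis=0) >= 2).astype(int)
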
numpy_datasets.timeseries.sonycust.load(path=None)[source]

multilabel urban sound classification

Reference at https://zenodo.org/record/3233082

Description

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring. The audio was recorded from the SONYC acoustic sensor network. Volunteers on the Zooniverse citizen science platform tagged the presence of 23 classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into 8 coarse-grained classes. The recordings are split into three subsets: training, validation, and test. These sets are disjoint with respect to the sensor from which each recording came. For increased reliability, three volunteers annotated each recording, and members of the SONYC team subsequently created a set of ground-truth tags for the validation set using a two-stage annotation procedure in which two annotators independently tagged and then collectively resolved any disagreements. For more details on the motivation and creation of this dataset see the DCASE 2019 Urban Sound Tagging Task website.

Audio data

The provided audio has been acquired using the SONYC acoustic sensor network for urban noise pollution monitoring. Over 50 different sensors have been deployed in New York City, and these sensors have collectively gathered the equivalent of 37 years of audio data, of which we provide a small subset. The data was sampled by selecting the nearest neighbors on VGGish features of recordings known to have classes of interest. All recordings are 10 seconds and were recorded with identical microphones at identical gain settings. To maintain privacy, the recordings in this release have been distributed in time and location, and the time and location of the recordings are not included in the metadata.

Labels

There are fine and coarse labels; each coarse class groups the following fine classes:

  • engine (1: small-sounding-engine, 2: medium-sounding-engine, 3: large-sounding-engine, X: engine-of-uncertain-size)
  • machinery-impact (1: rock-drill, 2: jackhammer, 3: hoe-ram, 4: pile-driver, X: other-unknown-impact-machinery)
  • non-machinery-impact (1: non-machinery-impact)
  • powered-saw (1: chainsaw, 2: small-medium-rotating-saw, 3: large-rotating-saw, X: other-unknown-powered-saw)
  • alert-signal (1: car-horn, 2: car-alarm, 3: siren, 4: reverse-beeper, X: other-unknown-alert-signal)
  • music (1: stationary-music, 2: mobile-music, 3: ice-cream-truck, X: music-from-uncertain-source)
  • human-voice (1: person-or-small-group-talking, 2: person-or-small-group-shouting, 3: large-crowd, 4: amplified-speech, X: other-unknown-human-voice)
  • dog (1: dog-barking-whining)
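
The fine-to-coarse grouping above can be written down as a small lookup table. This is only a sketch transcribed from the taxonomy listed here; the exact label strings produced by sonycust.load are an assumption and may differ.

    # Coarse class -> fine classes, transcribed from the taxonomy above
    # (the "X" entries are the *-of-uncertain-* / other-unknown fine classes).
    COARSE_TO_FINE = {
        "engine": ["small-sounding-engine", "medium-sounding-engine",
                   "large-sounding-engine", "engine-of-uncertain-size"],
        "machinery-impact": ["rock-drill", "jackhammer", "hoe-ram",
                             "pile-driver", "other-unknown-impact-machinery"],
        "non-machinery-impact": ["non-machinery-impact"],
        "powered-saw": ["chainsaw", "small-medium-rotating-saw",
                        "large-rotating-saw", "other-unknown-powered-saw"],
        "alert-signal": ["car-horn", "car-alarm", "siren", "reverse-beeper",
                         "other-unknown-alert-signal"],
        "music": ["stationary-music", "mobile-music", "ice-cream-truck",
                  "music-from-uncertain-source"],
        "human-voice": ["person-or-small-group-talking", "person-or-small-group-shouting",
                        "large-crowd", "amplified-speech", "other-unknown-human-voice"],
        "dog": ["dog-barking-whining"],
    }

    # Invert the mapping to go from a fine tag to its coarse class.
    FINE_TO_COARSE = {fine: coarse
                      for coarse, fines in COARSE_TO_FINE.items()
                      for fine in fines}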
