Fork of Liu Feng's CoverHunter project. Goals: Make it run, and run fast, on any platform. Document it better. And build it out as a useful toolset for music research generally.
See https://ar5iv.labs.arxiv.org/html/2306.09025 for the July 2023 research paper that accompanied the original CoverHunter code. From their abstract:
Cover song identification (CSI) focuses on finding the same music with different versions in reference anchors given a query track. In this paper, we propose a novel system named CoverHunter that overcomes the shortcomings of existing detection schemes by exploring richer features with refined attention and alignments. [...] Experiments on several standard CSI datasets show that our method significantly improves over state-of-the-art methods [...].
The CoverHunterMPS project also has longer-term goals to expand the utility of the CoverHunter model to address a wide range of musicological questions and needs, such as:
- Identify known repertoire items in new, unfamiliar audio (basic CSI)
- Discover and describe how to adapt training hyperparameters for specific musical cultures.
- Make it easy for ethnomusicologists to train this model for specific musical cultures.
- Adapt this model to go beyond CSI to learn and classify audio using other musical categories such as rhythms, styles, tunings, forms, etc.
- Modify, confirm, or debunk established but currently merely subjectively defined musical concepts within specific musical cultures.
Collaborators are welcome at any time! That includes:
- Python co-authors
- Neural network designers
- Data scientists
- Data source contributors (music data)
- Musicologists (posing valuable challenges to tackle)
- Anyone interested in learning and practicing in the above fields
Get started by participating in the Issues or Discussions tabs here on GitHub. Or contact Alan Ng directly (such as by using the Feedback Form link at the bottom of https://www.irishtune.info/public/MLdata.htm).
- GPU-equipped computer. CPU-only hardware should work but will be very slow. Tested platforms:
- Apple computer with an Apple M-series chip
- Other computer with an Nvidia GPU (including free cloud options like Google Colab)
- python3 (minimum version 3.10, tested on 3.11)
Either:

- Clone this repo or download it to a folder on your computer, and run the following OS command-line commands from that folder. These commands assume you have a Unix/Linux/macOS environment, but Windows equivalents exist - see issue #10.
- Or run this project in Google Colab, using this Colab notebook: https://colab.research.google.com/drive/1HKVT3_0ioRPy7lrKzikGXXGysxZrHpOr?usp=sharing

Then:

- The requirements.txt file contains the Python dependencies of the project. Run `python -m pip install -r requirements.txt` to install them, or run `make virtualenv` to install the requirements in a virtualenv (the python3-venv package must be installed).
- Install the `sox` package and its libraries. In some distributions, those libraries come in a separate package, like `libsox-fmt-all`.
Follow the example of the prepared Covers80 dataset included with the original CoverHunter. Directions here are for using that prepared data. See also the "dataset.txt format" heading below.
- Download and extract the contents of the `covers80.tgz` file from http://labrosa.ee.columbia.edu/projects/coversongs/covers80/
- Abandon the 2-level folder structure that came inside the covers80.tgz file, flattening it so all the .mp3 files are in the same folder. One way to do this is: in Terminal, go to the extracted `coversongs` folder as the current directory, then run `cd covers32k && mv */* .; rm -r */`
- Convert all the provided .mp3 files to .wav format. One way to do this is: `setopt EXTENDED_GLOB; for f in *.mp3; do sox "$f" -r 16000 "${f%%.mp3}.wav" && rm "$f"; done`
- Move all these new .wav files to a new folder called `wav_16k` in the project's `data/covers80` folder.
- You can delete the rest of the downloaded `covers80.tgz` contents.
Background explanation: Covers80 is a small, widely used dataset of modern, Western pop music intended only for benchmarking purposes, so that the accuracy of different approaches to solving the problem of CSI can be compared against each other. It is far too small to be useful for neural-network training, but it is a rare example of a published, stable collection of audio files. This makes it easy for you to get started, so you can confirm you have a working setup of this project without having to have your own set of audio files and their metadata ready. You might even end up using Covers80 yourself as a benchmarking test to see how well your own training project handles modern, Western pop music in comparison to published Covers80 benchmarks from other CSI projects.
You must run this before proceeding to the Train step, and you can't run it without first doing the Data Preparation step above. See "Input and Output Files" below for more information about what happens here. In summary, this step generates some data augmentation (artificial variants of the real music you provide that help the neural network generalize across the various ways that humans might perform any musical work), converts all of that audio (original and artificial) to CQT arrays (basically a type of spectrogram), and does some plain old data wrangling to prepare the metadata that the training script will need.
To use the Covers80 example you prepared above, next run this from the project root folder:
python3 -m tools.extract_csi_features data/covers80/
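For orientation, here is a rough sketch of what the CQT conversion step does, using librosa for illustration. This is not necessarily the library or the exact parameters that tools.extract_csi_features uses; see the data preparation hyperparameters below for the real fmin, n_bins, bins_per_octave, and hop-size settings.

```python
# Illustrative sketch only: turn one 16 kHz wav file into a CQT array,
# roughly analogous to what the feature-extraction step produces.
# librosa is used here for clarity; the project's own extraction code
# and parameters (see hparams.yaml) are authoritative.
import librosa
import numpy as np

y, sr = librosa.load(
    "data/covers80/wav_16k/annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.wav",
    sr=16000,
)
hop_length = int(0.04 * sr)  # 0.04 s per CQT slice = 25 slices per second
cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                  fmin=32.0, n_bins=96, bins_per_octave=12)
cqt_db = librosa.amplitude_to_db(np.abs(cqt))

# Stored arrays in this project use [time, frequency] ordering:
print(cqt_db.T.shape)
```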
Training is the core of the work that your computer will do to learn how to differentiate the various works and performances of music you give it. When successfully done, it will have constructed a general model which only needs to be trained once for your targeted musical culture. It will then be able to distinguish this culture's musical works from each other even when it never encountered those particular works during training. See the "Reference embeddings" section below regarding what life will look like once you reach that end goal of training.
Before you get to that ideal future state, the core of the work that you will do will be preparing your training data and discovering the optimal training hyperparameters. See the "Training Hyperparameters" section below.
The training script's output consists of checkpoint files and embedding vectors, described below in the "Training Checkpoint Output" section.
Note: Don't use the `torchrun` launch command offered in original CoverHunter. At least in a single-computer Apple Silicon context, it is not only irrelevant, it actually slows down performance. In my tests on an MPS computer, `torchrun` slowed down tools.train performance by about 20%.
The original CoverHunter project included a prepared configuration to run a training session on the Covers80 dataset, and this is now located in the 'training/covers80' subfolder of this project. See the "Background explanation" above in the Data Preparation section about what to expect from using Covers80 for training. In particular, their test configuration used the same dataset for both training and validation, so results looked fabulously accurate and were essentially meaningless except that you could confirm that your setup is working. This fork added a train/validate/test data-splitting function in the extract_csi_features tool, along with corresponding new data-preparation hyperparameters, so you can choose to try more realistic training - in which the model validates its learning against data it has not seen before - even if you only have Covers80 data to play with.
Optionally edit the training hyperparameters in the `hparams.yaml` configuration file in the folder `training/covers80/config` before starting a training run. For example, if you run into memory limits, start by decreasing the batch size from 64 to 32.
The one required command-line parameter for the training script is to specify the path where the training hyperparameters are available and where the model output will go, like this:
python -m tools.train training/covers80/
This fork also added an optional `--runid` parameter so you can distinguish your training runs in TensorBoard in case you are experimenting:
python -m tools.train training/covers80/ --runid 'first try'
To see the TensorBoard live visualization of the model's progress during training, run this in a separate terminal window, from the root of the project folder, and then use the URL listed in the output to watch the TensorBoard:
tensorboard --logdir=training/covers80/logs
After you use the tools.train script to confirm your data is usable with CoverHunterMPS, and perhaps to do some basic experimentation, you may be motivated to experiment with a wide range of training hyperparameters to discover the optimal settings for your data that will lead you to better training metrics. You should be able to use your knowledge of its unique musical characteristics to make some educated guesses on how to diverge from the default CoverHunter hyperparameters, which were optimized for Western pop music.
Step 1: Study the explanations in the Training Hyperparameters section below to make some hypotheses about alternative hyperparameter values to try with your data.
Step 2: Add your hypotheses as specific hyperparameter values to try in the hp_tuning.yaml file in the model's training folder, following the comments and examples there.
Step 3: Launch training with `model_dir` as the one required parameter:
python -m tools.train_tune training/covers80
This script will not retain any model checkpoints from the training runs, but it does create separate log files for each run that you can monitor and study in TensorBoard.
If you are running on a CUDA platform, the `make_deterministic()` function in tools.train_tune may have significant performance disadvantages for you. Consider whether you'd rather comment out that line and instead run enough different random seeds to compensate for non-deterministic training behavior, so that you can reliably compare results between different hyperparameter settings.
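For context, deterministic setup in PyTorch typically looks something like the sketch below. This is a generic illustration, not the project's actual `make_deterministic()` code; forcing deterministic kernels is usually the part that costs CUDA performance.

```python
# Generic sketch of deterministic PyTorch setup (not the project's exact code).
import random
import numpy as np
import torch

def make_deterministic_example(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Forcing deterministic kernels is typically what slows down CUDA runs:
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```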
This script evaluates your trained model by providing the standard mAP (mean average precision) and MR1 (mean rank of the first correctly identified cover) metrics, plus an optional t-SNE clustering plot (compare Fig. 3 in the CoverHunter paper).
1. Have a pre-trained CoverHunter model's output checkpoint files available. You only need your best set (typically your highest-numbered one). If you use original CoverHunter's pre-trained model from https://drive.google.com/file/d/1rDZ9CDInpxQUvXRLv87mr-hfDfnV7Y-j/view, unzip it and move it to a folder that you specify in step 3 below.
2. Run your query data through `extract_csi_features.py`. In the `hparams.yaml` file for the feature extraction, turn off all augmentation. See `data/covers80_testset/hparams.yaml` for an example configuration that treats covers80 as the query data:

   python3 -m tools.extract_csi_features data/covers80_testset

   The important output from that is `full.txt` and the `cqt_feat` subfolder's contents.
3. Run the evaluation script. This example assumes you are using the trained model you created in `training/covers80` and you want to use all the optional features I added in this fork:

   python3 -m tools.eval_testset training/covers80 data/covers80_testset/full.txt data/covers80_testset/full.txt -plot_name="training/covers80/tSNE.png" -dist_name='distmatrix' -test_only_labels='data/covers80/test-only-work-ids.txt'
See the "Training checkpoint output" section below for a description of the embeddings saved by the eval_for_map_with_feat()
function called in this script. They are saved in a new subfolder of the pretrained_model
folder named embed_NN_tmp
where NN is the highest-numbered epoch subfolder in the pretrained_model
folder.
CoverHunter only shared an evaluation example for the case when query and reference data are identical, presumably to do a self-similarity evaluation of the model. But there is an optional 4th parameter, `query_in_ref_path`, that would be relevant if query and reference are not identical. See the "query_in_ref" heading below under "Input and Output Files."
The optional `plot_name` argument is a path or just a filename where you want to save the t-SNE plot output. If you provide just a filename, `model_dir` will be used as the path. See the example plot below. Note that your query and reference files must be identical to generate a t-SNE plot (to do a self-similarity evaluation).
The optional `test_only_labels` argument is a path to the text file generated by `extract_csi_features.py` if its hyperparameters asked for some work_ids to be reserved exclusively for the test dataset. The t-SNE plot will then mark those for you, so you can see how well your model can cluster classes (work_ids) it has never seen before.
This figure shows the results of training from scratch on the covers80 dataset with a train/val/test split of 8:1:1 and 3 classes (work_ids) reserved exclusively for the test dataset.
The optional `dist_name` argument is a path where you want to save the distance matrix and reference labels so that you can study the results separately, such as by making custom t-SNE plots.
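For example, here is a minimal sketch of a custom t-SNE plot built from a saved distance matrix. The file names and the .npy format are assumptions; check tools.eval_testset for what `dist_name` actually writes.

```python
# Hypothetical post-hoc analysis of a saved distance matrix and labels.
# File names and formats are assumptions; verify what dist_name produces.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

dist = np.load("path/to/distance_matrix.npy")   # square (N x N) distance matrix
labels = np.load("path/to/ref_labels.npy")      # one work label per row, length N

tsne = TSNE(n_components=2, metric="precomputed", init="random", random_state=0)
points = tsne.fit_transform(dist)

# Map labels (numeric or string) to integer color indices
color_index = {lab: i for i, lab in enumerate(np.unique(labels))}
colors = [color_index[lab] for lab in labels]

plt.scatter(points[:, 0], points[:, 1], c=colors, cmap="tab20", s=10)
plt.title("Custom t-SNE of model distance matrix")
plt.savefig("custom_tsne.png", dpi=150)
```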
The default value for the optional `marks` argument is 'markers', which makes the `plot_name` output differentiate works by using standard matplotlib markers in various colors and shapes. The alternative value is 'ids', which uses the `work_id` numbers defined by extract_csi_features instead of matplotlib markers.
Once you have tuned your data and your hyperparameters for optimal training results, you may be ready to train a model that knows all of your data, without reserving any data for validation and test sets. The tools/train_prod.py script uses stratified K-fold cross-validation to dynamically generate validation sets from your dataset so that the model is exposed to all works and perfs equally. It concludes with one final training run on the entire dataset, in which the dataset you specify in `test_path` serves as the validation dataset (for early-stopping purposes). This final validation set should consist entirely of unseen perfs, even if some or all of the works are represented in the training data.
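To illustrate the general idea, here is a minimal sketch of the stratified K-fold technique only; it is not the project's actual fold-generation code.

```python
# Minimal sketch of stratified K-fold splitting by work_id, the general
# technique train_prod.py relies on. See tools/train_prod.py for the real
# fold generation and training loop.
import numpy as np
from sklearn.model_selection import StratifiedKFold

perfs = np.array([f"perf_{i}" for i in range(9)])
work_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # class label per perf

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(perfs, work_ids)):
    print(f"fold {fold}: train={list(perfs[train_idx])} val={list(perfs[val_idx])}")
```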
Use the `full.txt` output from `extract_csi_features.py` for your `train_path`, with `val_data_split`, `val_unseen`, `test_data_split`, and `test_data_unseen` all set to 0. Prepare the `training/covers80/hparams_prod.yaml` file following the instructions in the comment header of `train_prod.py`. An example `hparams_prod.yaml` is provided for using covers80 for testing purposes.
You may need to experiment with learning rates and other hyperparameters for the somewhat different situation of training on your full dataset, especially if your hyperparameter-tuning work used significantly smaller datasets. Also consider experimenting with the hard-coded learning-rate strategy used for the folds after the first one, which is configured within `train_prod.py` in the `cross_validate()` function. Look for the comment line "# different learning-rate strategy for all folds after the first."
Launch training with:
python -m tools.train_prod training/covers80/ --runid='test of production training'
TensorBoard will show each fold as a separate run, but within a continuous progression of epochs. You can safely interrupt production training for any reason; re-launching it with the same command will resume from the last fold and checkpoint that this script automatically saved.
After you have trained a model and are satisfied with its quality based on the metrics you saw during training and from the evaluation script, it's time to use your model to generate reference embeddings. An embedding is a numerical representation generated by your trained model of any audio sample, essentially identifying the audio in a high-dimensional conceptual space that differentiates works from each other based on the knowledge the neural network learned from your training data. Your trained model can even generate embeddings for recordings that were not used in training, assuming the new recordings fit well inside the same musical culture and vocabulary as the one you used in training.
Reference embeddings, then, are the complete set of embeddings for all of the recorded performances you would like to be already known to your final inference solution. These points in space, like stars in a galaxy, can then be compared with a new embedding from new audio, and by measuring the distance between the new embedding to all the reference embeddings, you can locate the new audio in that galaxy, by learning who the nearest neighbors are.
Example for covers80:
python -m tools.make_embeds data/covers80 training/covers80
See comments at the top of the make_embeds script for more details.
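Conceptually, the comparison described above amounts to a nearest-neighbor search over the reference embeddings. The sketch below is only an illustration with hypothetical file names; tools.identify implements the real lookup.

```python
# Conceptual sketch of nearest-neighbor search over reference embeddings.
# File names are hypothetical; tools.make_embeds and tools.identify define
# the actual storage format and lookup logic.
import numpy as np

ref_embeds = np.load("reference_embeddings.npy")    # shape (num_refs, 128)
ref_work_ids = np.load("reference_work_ids.npy")    # shape (num_refs,)
query = np.load("query_embedding.npy")              # shape (128,)

# Cosine similarity between the query and every reference embedding
ref_norm = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
similarity = ref_norm @ query_norm

for rank, idx in enumerate(np.argsort(-similarity)[:10], start=1):
    print(f"{rank}. work_id={ref_work_ids[idx]} similarity={similarity[idx]:.3f}")
```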
Now that you have reference embeddings and the trained model to generate new embeddings for any new audio, you can use the `identify` script to identify any music you give it. See the high-level explanation of how this works in the "Generate reference embeddings" section above. See the comments at the top of tools.identify for documentation of the parameters.
Example for covers80:
python -m tools.identify data/covers80 training/covers80 query.wav -top=10
To interpret the output, use the data/covers80/work_id.map text file to see which `work_id` goes with which `work`. Good news: even the bare-bones demo of training from scratch on covers80 shows that CoverHunter does a good job of identifying versions (covers) of those 80 pop songs.
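For convenience, here is a small sketch of loading that mapping in Python (work_id.map is a two-column, space-separated text file, as described under "Input and Output Files" below):

```python
# Small helper sketch: map work_id numbers back to their "work" strings
# using the two-column, space-separated work_id.map file.
def load_work_id_map(path: str = "data/covers80/work_id.map") -> dict:
    id_to_work = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            work, work_id = line.rsplit(" ", 1)
            id_to_work[int(work_id)] = work
    return id_to_work

if __name__ == "__main__":
    mapping = load_work_id_map()
    print(mapping.get(0, "unknown work_id"))
```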
Future goal and call for help: How do we take this command-line solution for inference and productionize it for broader use outside the context of the specific machine where this CoverHunterMPS project was installed?
CoverHunter did not include an implementation of the coarse-to-fine alignment training described in the research paper. (Liu Feng confirmed to me that his employer considers it proprietary technology). See issue #1. But it did include this script which apparently could be useful as part of an implementation we could build ourselves. The command to launch the alignment script that CoverHunter included is:
python3 -m tools.alignment_for_frame pretrained_model data/covers80/full.txt data/covers80/alignment.txt
Arguments to pass to the script:
- Folder containing a pretrained model. For example, if you use original CoverHunter's model from https://drive.google.com/file/d/1rDZ9CDInpxQUvXRLv87mr-hfDfnV7Y-j/view, unzip it and move it to a folder that you rename to `pretrained_model` at the top level of your project folder. That folder in turn must contain a `checkpoints` subfolder that contains the do_000[epoch] and g_000[epoch] checkpoint files.
- The output from tools/extract_csi_features.py or an equivalent script. The metadata file like full.txt must include `work_id` values for each `perf` (unlike the raw `dataset.txt` file that CoverHunter provided for covers80).
- The `alignment.txt` file will receive the output of this script.
There are two different hparams.yaml files, each used at different stages.
The hparams.yaml file located in the folder you provide on the command line to tools.extract_csi_features.py is used only by that script.
key | value |
---|---|
add_noise | Original CoverHunter provided this example: `{ prob: 0.75, sr: 16000, chunk: 3, name: "cqt_with_asr_noise", noise_path: "dataset/asr_as_noise/dataset.txt" }`. However, the CoverHunter repo did not include whatever is supposed to be in the "dataset/asr_as_noise/dataset.txt" file, nor does the CoverHunter research paper describe it. If that path does not exist in your project folder structure, then tools.extract_csi_features will just skip the stage of adding noise augmentation. At least for training successfully on Covers80, noise augmentation doesn't seem to be needed. |
aug_speed_mode | List of ratios used in tools.extract_csi_features for speed augmentation of your raw training data. Example: [0.8, 0.9, 1.0, 1.1, 1.2] means use 80%, 90%, 100%, 110%, and 120% speed variants of your original audio data. |
bins_per_octave | See `fmin` and `n_bins`. If your musical culture uses a scale that does not fit in the Western standard 12-semitone scale, set this to a higher number. Default 12. |
device | 'mps' or 'cuda', corresponding to your GPU hardware and PyTorch library support. 'cpu' is not currently implemented but could be if needed. Original CoverHunter used CPU for this stage but was much slower. |
fmin | The lowest frequency you want the CQT arrays to include. Set this to just below the lowest pitch used in the musical culture you are teaching the model. Consider only the pitches relevant to the work-identification skill you want it to learn. For example, in some cultures, bass accompaniment is not relevant for work identification. Default is 32. |
n_bins | The number of frequency bins you want the CQT arrays to include. For example, if you set bins_per_octave to 12, then set n_bins to 12 times the number of octaves above fmin that are relevant to this culture's work-identification skill. Be sure to also set the input_dim training hyperparameter to match this number. Default is 96. |
val_data_split | percent of training data to reserve for validation expressed as a fraction of 1. Example for 10%: 0.1 |
val_unseen | percent of work_ids from training data to reserve exclusively for validation expressed as a fraction of 1. Example for 2%: 0.02 |
test_data_split | percent of training data to reserve for test expressed as a fraction of 1. Example for 10%: 0.1 |
test_data_unseen | percent of work_ids from training data to reserve exclusively for test expressed as a fraction of 1. Example for 2%: 0.02 |
The hparams.yaml file located in the "config" subfolder of the path you provide on the command line to tools.train.py uses all the other parameters listed below during training.
key | value |
---|---|
covers80:<br>`query_path`<br>`ref_path`<br>`every_n_epoch_to_test` | Test dataset(s) used for automated model evaluation purposes during training. "covers80" was the only example provided with the original CoverHunter. For an example of a different culture's test set, see https://www.irishtune.info/public/MLdata.htm. Note that `ref_path` and `query_path` are set to the same data in order to do a self-similarity evaluation, testing how well the model can cluster samples (perfs) relative to their known classes (works). You can add as many test datasets as you want. Each will be displayed as separate results in the TensorBoard visualization during training. New testsets must be added to the src/trainer.py script in the list where ALL_TEST_SETS is defined.<br>Subparameters for covers80:<br>`query_path`: "data/covers80/full.txt"<br>`ref_path`: "data/covers80/full.txt"<br>`every_n_epoch_to_test`: How many epochs to wait between each test of the current model against this testset. |
test_path | Compare `train_path` and `val_path`. This dataset is used in each epoch to run the same validation calculation as with the `val_path`. Presumably one should include both classes and samples that were excluded from both `train_path` and `val_path`. |
train_path | path to a JSON file containing metadata about the data to be used for model training (See full.txt below for details) |
val_path | Path to a JSON file containing metadata about the data to be used for model validation. Compare test_path above. Presumably one should include a balanced distribution of samples that are not included in the train_path dataset, but do include samples for the classes represented in the train_path dataset. (See full.txt below for details) |
key | value |
---|---|
chunk_frame | List of 3 numbers used with `mean_size` that describe the duration of each chunk, measured as a count of CQT features. CoverHunter's covers80 config used [1125, 900, 675]. Here the word "chunk" apparently refers to the chunks described in the time-domain pooling strategy part of the CoverHunter paper, not the chunks discussed in their coarse-to-fine alignment strategy. See also `chunk_s`. In our experiments, the 5:4:3 ratio that CoverHunter used is significantly better than a variety of alternative ratios we tried. However, in Irish traditional music, which has shorter time structures than Western pop music, we achieved better results using shorter durations than [1125, 900, 675]. |
chunk_s | Duration of the first-listed (longest) `chunk_frame` entry in seconds. You have to manually calculate `chunk_s = chunk_frame[0] / (CQT slices per second) * mean_size`; with the standard hop size of 0.04 s that rate is 25 slices per second, so, for example, with `chunk_frame[0]` = 1125 and `mean_size` = 3 you get 1125 / 25 * 3 = 135. Couldn't the script just calculate this itself using the CQT hop size? |
cqt: hop_size: | Fine-grained time resolution, measured as the duration in seconds of each CQT spectrogram slice of the audio data (the inverse of the CQT slice rate, not of the audio sample rate). CoverHunter's provided setting is 0.04, with a comment "1s has 25 frames", but this meaning of "frame" is not the same meaning of "frame" as used more appropriately in `chunk_frame`. The intended meaning here would conventionally be described as a rate of 25 CQT slices per second. The value 25 is hard-coded as an assumption into CoverHunter in various places. |
data_type | "cqt" (default) or "raw" or "mel". It remains unknown whether the CoverHunter team actually implemented or tested anything but CQT-based training. |
mean_size | See chunk_s above. An integer used in combination with chunk_frame to define the length of the chunks. |
mode | "random" (default) or "defined". Changes behavior of AudioFeatDataset related to how it cuts each audio sample into chunks. "random" is described in CoverHunter code as "cut chunk from feat from random start". "defined" is described as "cut feat with 'start/chunk_len' info from line." We observed better training results using "defined" when working with datasets that are very consistently trimmed so that CSI-relevant audio always starts right at the beginning of the recording. "random" would be better when CSI-irrelevant audio may be present at the start of many of your audio data samples. |
m_per_class | From CoverHunter code comments: "m_per_class must divide batch_size without any remainder" and: "At every iteration, this will return m samples per class. For example, if dataloader's batch-size is 100, and m = 5, then 20 classes with 5 samples iter will be returned." |
spec_augmentation | Spectral augmentation settings, used to generate temporary data augmentation on the fly during training. CoverHunter settings were:<br>`random_erase`: { `prob`: 0.5, `erase_num`: 4 }<br>`roll_pitch`: { `prob`: 0.5, `shift_num`: 12 } |
spec_augmentation : random_erase | During each epoch, each CQT array may have a rectangular block of its array values replaced with the value -80 (a low amplitude signal). The size of the block is defined as 25% of the height of the frequency bins and 10% of the width of the time bins. prob specifies the probability of calling the erase method for this feature in this epoch, between 0 and 1. erase_num specifies the quantity of such blocks that will be erased if the erase method is called. region_size specifies the size of each erased block, as (width, height) as fractions of the CQT array size. Default is "[.25, .1]" |
spec_augmentation : roll_pitch | During each epoch, each CQT array may be shifted pitch-wise. CoverHunter's original method, left as the default here, was to rotate the entire array in the frequency dimension, with the overflowing content wrapped around to the opposite end of the spectrum. For example, if shifted an octave up, then the top octave's CQT content would be presented as the bottom octave of content. prob specifies the probability of doing this for this feature in this epoch, between 0 and 1. shift_num specifies the number of frequency CQT bins by which the array will be shifted. method accepts 3 values: 1) "default" is the original CoverHunter method 2) "low_melody" is an alternative approach added for CoverHunterMPS to accommodate musical cultures in which CSI-significant melodic content may appear in the bottom frequency range of the CQT array. Since trimming CQT arrays to eliminate irrelevant harmonic and percussive content in the bottom octaves has proven beneficial, this feature can be significantly useful. In this case, instead of rotating the entire array either up or down, the array is shifted upwards either 1 x or 2 x shift_num bins, and overflowing high-frequency content is simply discarded, instead of being copied to the bottom rows of the array. 3) "flex_melody" generalizes the "low_melody" approach by using loudness to estimate where the tonal center is, in order to avoid shifting the melody off the "edge" of the spectrogram either too low or too high. |
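As a rough illustration of the random_erase idea described above, here is a simplified sketch. It is not the project's actual augmentation code, and the (time fraction, frequency fraction) reading of region_size is an assumption; check the source for the exact convention.

```python
# Simplified sketch of the "random_erase" spectral augmentation: replace
# rectangular blocks of a CQT array with -80 (a quiet amplitude).
# Illustrative only; the project's real implementation may differ, and the
# (time fraction, frequency fraction) reading of region_size is an assumption.
import numpy as np

def random_erase_sketch(cqt: np.ndarray, erase_num: int = 4,
                        region_size=(0.25, 0.1), rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    time_bins, freq_bins = cqt.shape  # this project stores [time, frequency]
    block_t = max(1, int(time_bins * region_size[0]))
    block_f = max(1, int(freq_bins * region_size[1]))
    out = cqt.copy()
    for _ in range(erase_num):
        t0 = rng.integers(0, time_bins - block_t + 1)
        f0 = rng.integers(0, freq_bins - block_f + 1)
        out[t0:t0 + block_t, f0:f0 + block_f] = -80.0
    return out

# Example usage with a saved CQT feature file:
# feat = np.load("data/covers80/cqt_feat/sp_0.8-cover80_00000146_71_0.cqt.npy")
# augmented = random_erase_sketch(feat)
```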
key | value |
---|---|
adam_b1 and adam_b2 | See https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html for documentation of the two "beta" parameters used by the AdamW optimizer that the CoverHunter authors chose. Our experiments showed these can have a strong impact. Note that the CoverHunter default values of .8 and .99 are not the usual default AdamW values, for unknown reasons. We recommend experimenting with these values. |
batch_size | Usual "batch size" meaning in the field of machine learning. An important parameter to experiment with. Original CoverHunter's preset batch size of 16 was no longer able to succeed at the covers80 training task after @alanngnet fixed an important logic error in extract_csi_features.py. Now only batch size 32 or larger works for covers80. Be sure to consider adjusting learning_rate and lr_decay whenever you change batch_size based on general deep-learning best practices and your own experimentation. |
device | 'mps' or 'cuda', corresponding to your GPU hardware and PyTorch library support. Theoretically 'cpu' could work but untested and probably of no value. |
early_stopping_patience | how many epochs to wait for validation loss to improve before early stopping |
learning_rate | The initial value for how much variability to allow the model during each learning step. See `lr_decay`. Default = .001. |
lr_decay | Learning-rate decay - see `learning_rate`. Default = .9975, but for small data sets, such as during testing and tuning work, we found lower values like .99 more appropriate. |
min_lr | Minimum learning rate, below which lr_decay is ignored. Default = 0.0001. |
key | value |
---|---|
embed_dim | 128 |
encoder | # model-encode<br>Subparameters:<br>`attention_dim`: 256 # "the hidden units number of position-wise feed-forward"<br>`output_dims`: 128<br>`num_blocks`: 6 # number of decoder blocks |
input_dim | The "vertical" (frequency) dimension size of the CQT arrays you provide to the model. Set this to the same value you used for n_bins in the data preparation hyperparameters. Default is 96. |
A JSON formatted or tab-delimited key:value text file (see format defined in the utils.py::line_to_dict() function) expected by extract_csi_features.py that describes the training audio data, with one line per audio file.
key | value |
---|---|
perf | Unique identifier. Abbreviation for "performance." CoverHunter originally used "utt" throughout, borrowing the term "utterance" from speech-recognition ML work which is where much of their code was adapted from. Example "cover80_00000000_0_0". |
wav | relative path to the raw audio file. Example: "data/covers80/wav_16k/annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.wav" |
dur_s | duration of the audio file in seconds. Example 316.728 |
work | title of the work. Example "A_Whiter_Shade_Of_Pale" The _add_work_id() function in extract_csi_features assumes that this string is a unique identifier for the work (so it can't handle musically distinct works that happen to have the same title). Advice: Use a unique, stable identifier that applies across the entire musical culture in which you will be training. For example in Irish traditional music, use the irishtune.info TuneID number. |
version | Not used by CoverHunter. Example from covers80: "annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.mp3", which would have been the original audio file source for that perf. |
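If you are assembling your own dataset.txt, a minimal sketch like the one below may help. The folder layout and the way the work title is derived from the filename are hypothetical placeholders; adapt them to your own metadata.

```python
# Hypothetical sketch: build a dataset.txt with one JSON line per audio file,
# using the keys documented above. The folder path and the "work" derivation
# are placeholders you will need to adapt to your own collection.
import json
import wave
from pathlib import Path

wav_dir = Path("data/myculture/wav_16k")  # hypothetical folder of 16 kHz .wav files
with open("data/myculture/dataset.txt", "w", encoding="utf-8") as out:
    for i, wav_path in enumerate(sorted(wav_dir.glob("*.wav"))):
        with wave.open(str(wav_path), "rb") as w:
            dur_s = w.getnframes() / w.getframerate()
        record = {
            "perf": f"myculture_{i:08d}",  # unique performance identifier
            "wav": str(wav_path),
            "dur_s": round(dur_s, 3),
            "work": wav_path.stem,         # placeholder: use a unique, stable work identifier
            "version": wav_path.name,
        }
        out.write(json.dumps(record) + "\n")
```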
full.txt is the JSON-formatted training data catalog for tools.train.py, generated by tools.extract_csi_features. In case you do your own data prep instead of using tools.extract_csi_features, here's the structure of full.txt.
key | value |
---|---|
perf | See dataset.txt. Except in this context, for each original perf, extract_csi_features generates additional artificial variants, which each get their own perf identifier. |
wav | (see dataset.txt) |
dur_s | (see dataset.txt) |
work | (see dataset.txt) |
version | (see dataset.txt) |
feat | path to the CQT features of this perf stored as .npy array. Example: "data/covers80/cqt_feat/sp_0.8-cover80_00000146_71_0.cqt.npy" |
feat_len | output of len(np.load(feat)). Example: 9198 |
work_id | internal, arbitrary unique identifier for the work. This is what teaches the model which perfs (performances) are considered by humans to be the "same work." Example: 0 |
version_id | internal, arbitrary unique identifier for each artificially augmented variant of the original perf (performance). Example: 0 |
Text file crosswalk between "work" (unique identifying string per work) and the "work_id" number arbitrarily assigned to each "work" by the extract_csi_features.py script. Not used by any scripts in this project currently, but definitely useful as a reference for human interpretation of training results.
filename | comments |
---|---|
cqt_feat subfolder | Contains the Numpy array files of the CQT data for each file listed in full.txt. Needed by train.py. Also used each time you run extract_csi_features.py to save time in creating CQT data by skipping CQT generation for samples already represented in this folder. |
data.init.txt | Copy of dataset.txt after sorting by perf and de-duping. Not used by train.py |
test.txt | A subset of full.txt generated by the _split_data_by_work_id() function intended for use by train.py as the test dataset. |
test-only-work-ids.txt | Text file listing one work_id per line for each work_id that the train/val/test splitting function held out from train/val to be used exclusively in the test dataset. This file can be used by eval_testset.py to mark those samples in the t-SNE plot. |
full.txt | See above detailed description. Contains the entire dataset you provided in the input file. |
work_id.map | Text file, with 2 columns per line, separated by a space, sorted alphabetically by the first column. First column is a distinct "work" string taken from dataset.txt. Second column is the work_id value assigned to that "work." |
sp_aug subfolder | Contains the sox-modified wav speed variants of the raw training .wav files, at the speeds defined in hparams.yaml. Not used by train.py. Also used each time you run extract_csi_features.py to save time in creating speed variants by skipping speed augmentation for samples already represented in this folder. |
sp_aug.txt | Copy of data.init.txt but with addition of 1 new row for each augmented variant created in sp_aug/*.wav. Not used by train.py. |
train.txt | A subset of full.txt generated by the _split_data_by_work_id() function intended for use by train.py as the train dataset. |
val.txt | A subset of full.txt generated by the _split_data_by_work_id() function intended for use by train.py as the val dataset. |
Original CoverHunter also generated the following files, but they were not used by their published codebase, so I commented out those functions:
filename | comments |
---|---|
work_id_num.map | Text file, not used by train.py, maybe not by anything else? |
work_name_num.map | Text file, not used by train.py, maybe not by anything else? |
The structure of the CQT arrays as handled within this project is: [time bins ordered from start to end, frequency bins ordered from low to high frequencies]
Note that to visualize these arrays in traditional spectrogram form, with time on the x axis and frequency on the y axis, the CQT arrays must be transposed, for example by using the NumPy `.T` transpose attribute.
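For example, a quick way to inspect one of the saved CQT arrays (the file name comes from the full.txt example above; substitute one of your own):

```python
# Quick visual check of one saved CQT array. Note the transpose (.T) so that
# time runs along the x axis and frequency along the y axis.
import numpy as np
import matplotlib.pyplot as plt

cqt = np.load("data/covers80/cqt_feat/sp_0.8-cover80_00000146_71_0.cqt.npy")
print(cqt.shape)  # (time bins, frequency bins)

plt.imshow(cqt.T, origin="lower", aspect="auto", cmap="magma")
plt.xlabel("time bins")
plt.ylabel("frequency bins")
plt.show()
```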
Using the default configuration, training saves checkpoints after each epoch in the training/covers80 folder.
The `checkpoints` subfolder gets two files per epoch: do_000000NN and g_000000NN, where NN = epoch number. The do_ files contain the AdamW optimizer state. The g_ files contain the model's state dictionary. ("g" might be an abbreviation for "generator," given that a transformer architecture is involved?)
The `eval_for_map_with_feat()` function, called at the end of each epoch, also saves data in a separate new subfolder for each epoch, named epoch_NN_covers80. This in turn gets a `query_embed` subfolder containing the model-generated embeddings for every sample in the training data, plus the embeddings for time-chunked sections of those samples, named with a suffix of ...__start-N.npy, where N is the timecode in seconds of where the chunk starts. The saved embeddings are 1-dimensional arrays containing 128 double-precision (float64) values between -1 and 1. The epoch_NN_covers80 folder also gets an accompanying file `query.txt` (with an identical copy as `ref.txt`), which is a text file listing the attributes of every training sample represented in the `query_embed` subfolder, following the same format as described above for `full.txt`.
The file you can prepare for the `tools/eval_testset.py` script to pass as the 4th parameter `query_in_ref_path` (CoverHunter did not provide an example file or documentation) assumes:
- JSON or tab-delimited key:value format
- The only line contains a single key "query_in_ref" with a value that is itself a list of tuples, where each tuple represents a mapping between an index in the query input file and an index in the reference input file.
This mapping is only used by the `_generate_dist_matrix()` function. That function explains: "List[(idx, idy), ...], means query[idx] is in ref[idy] so we skip that when computing mAP." idx and idy are the index numbers sequentially assigned to each perf in the order they appear in the query and ref data sources.
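For example (with hypothetical indices, and assuming the same JSON-per-line convention used by the other metadata files), such a file could be written like this:

```python
# Hypothetical example of writing a query_in_ref file: here query perf 0 is
# asserted to be present in reference perf 2, and query perf 5 in reference 7.
import json

mapping = {"query_in_ref": [(0, 2), (5, 7)]}
with open("data/covers80_testset/query_in_ref.txt", "w", encoding="utf-8") as f:
    f.write(json.dumps(mapping) + "\n")
```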
Hand-made visualization of how core functions of this project interact with each other. Also includes additional beginner-friendly or verbose code-commenting that I didn't add to the project code. Not regularly maintained, but still useful for getting oriented in this project's code:
https://miro.com/app/board/uXjVNkDkn70=/
Unit tests are in progress, currently only with partial code coverage. Run them from the repository root using:
python3 -m unittest -c tests/test_*.py
or if you installed the project in a virtualenv:
make tests
As a contribution to the CSI community, where the SHS100K dataset has been used as a standard training dataset for many years, including for the CoverHunter research paper, here is a histogram showing the distribution of works vs. performances in SHS100K.
This figure may be helpful as a reference for comparing the distribution of works vs. performances in datasets you want to use with CoverHunterMPS, knowing that CoverHunter was able to train successfully given this distribution.
To help you understand this visualization of the SHS100K dataset, here are some example data points from it: The most common work ("Summertime") is represented by 387 performances, and there are over 300 works having only a single performance. The most common count of performances per work is 6.